How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, Distilabel, and 🌤️Lighteval to generate an evaluation dataset and evaluate models. https://github.com/argilla-io/argilla-cookbook/blob/main/domain-eval/README.md
benchmarks
meituan-longcat/LARYBench • Updated Apr 30 • 2.61k • 18
llamaindex/ParseBench • Benchmark • Updated Apr 19 • 169k • 19.1k • 97
nvidia/QCalEval • Viewer • Updated Apr 13 • 243 • 1.08k • 19
allenai/olmOCR-bench • Benchmark • Updated Feb 19 • 5.82k • 245
RULER Datasets Falcon-H1-3B-Base
RULER Datasets
lighteval/RULER-131072-Falcon-H1-3B-Base • Viewer • Updated Jun 18, 2025 • 6.5k • 41
lighteval/RULER-65536-Falcon-H1-3B-Base • Viewer • Updated Jun 18, 2025 • 6.5k • 40
lighteval/RULER-32768-Falcon-H1-3B-Base • Viewer • Updated Jun 18, 2025 • 6.5k • 20
lighteval/RULER-16384-Falcon-H1-3B-Base • Viewer • Updated Jun 18, 2025 • 6.5k • 25