igormolybog's Collections: evals
- Holistic Evaluation of Text-To-Image Models (arXiv: 2311.04287)
- MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (arXiv: 2311.07463)
- Trusted Source Alignment in Large Language Models (arXiv: 2311.06697)
- DiLoCo: Distributed Low-Communication Training of Language Models (arXiv: 2311.08105)
- Instruction-Following Evaluation for Large Language Models (arXiv: 2311.07911)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv: 2311.12022)
- GAIA: a benchmark for General AI Assistants (arXiv: 2311.12983)
- Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models (arXiv: 2312.04724)
- Evaluation of Large Language Models for Decision Making in Autonomous Driving (arXiv: 2312.06351)
- PromptBench: A Unified Library for Evaluation of Large Language Models (arXiv: 2312.07910)
- TrustLLM: Trustworthiness in Large Language Models (arXiv: 2401.05561)
- OLMo: Accelerating the Science of Language Models (arXiv: 2402.00838)
- Can Large Language Models Understand Context? (arXiv: 2402.00858)
- Design2Code: How Far Are We From Automating Front-End Engineering? (arXiv: 2403.03163)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (arXiv: 2403.14624)
- Long-context LLMs Struggle with Long In-context Learning (arXiv: 2404.02060)
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation (arXiv: 2410.05363)
- LongGenBench: Long-context Generation Benchmark (arXiv: 2410.04199)
- GLEE: A Unified Framework and Benchmark for Language-based Economic Environments (arXiv: 2410.05254)