QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents • Paper • 2606.32034 • Published 5 days ago • 10
Great Models Think Alike and this Undermines AI Oversight • Paper • 2502.04313 • Published Feb 6, 2025 • 32
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities • Paper • 2412.06745 • Published Dec 9, 2024 • 6