Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle Paper • 2606.09376 • Published 16 days ago • 6
SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces Paper • 2606.01317 • Published 24 days ago • 1 • 3
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories Paper • 2606.02060 • Published 23 days ago • 55 • 9
FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search Paper • 2606.00660 • Published 25 days ago • 8 • 2
Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems Paper • 2606.00090 • Published May 23 • 6 • 3