Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories • Paper • 2606.02060 • Published 11 days ago • 54
CoVEBench: Can Video Editing Models Handle Complex Instructions? • Paper • 2606.08415 • Published 5 days ago • 47
SWE-Explore: Benchmarking How Coding Agents Explore Repositories • Paper • 2606.07297 • Published 7 days ago • 110