AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios Paper • 2602.23166 • Published 8 days ago • 28
How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities Paper • 2603.02578 • Published 4 days ago • 22
RubricBench: Aligning Model-Generated Rubrics with Human Standards Paper • 2603.01562 • Published 4 days ago • 51
K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model Paper • 2602.19128 • Published 12 days ago • 6
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning Paper • 2602.11089 • Published 23 days ago • 18
CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion Paper • 2602.10999 • Published 23 days ago • 10
When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning Paper • 2602.10560 • Published 23 days ago • 29
Code2World: A GUI World Model via Renderable Code Generation Paper • 2602.09856 • Published 24 days ago • 198
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning Paper • 2602.08234 • Published 26 days ago • 69
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models Paper • 2602.02185 • Published Feb 2 • 123
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models Paper • 2601.22060 • Published Jan 29 • 154
ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas Paper • 2601.21558 • Published Jan 29 • 59
daVinci-Dev: Agent-native Mid-training for Software Engineering Paper • 2601.18418 • Published Jan 26 • 126
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model Paper • 2601.15892 • Published Jan 22 • 53