The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs Paper • 2506.18403 • Published Jun 23, 2025 • 3
ReCode: Updating Code API Knowledge with Reinforcement Learning Paper • 2506.20495 • Published Jun 25, 2025 • 10
SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution Paper • 2507.23348 • Published Jul 31, 2025 • 12
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering Paper • 2509.09614 • Published Sep 11, 2025 • 7
LongCodeZip: Compress Long Context for Code Language Models Paper • 2510.00446 • Published Oct 1, 2025 • 108
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 40
ReCode: Unify Plan and Action for Universal Granularity Control Paper • 2510.23564 • Published Oct 27, 2025 • 123
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation Paper • 2511.02778 • Published Nov 4, 2025 • 104
WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation Paper • 2511.06251 • Published Nov 9, 2025 • 14
MemGovern: Enhancing Code Agents through Learning from Governed Human Experiences Paper • 2601.06789 • Published Jan 11 • 82
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development Paper • 2601.11077 • Published Jan 16 • 67
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model Paper • 2601.15892 • Published Jan 22 • 55
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces Paper • 2601.11868 • Published Jan 17 • 37
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding Paper • 2602.01785 • Published Feb 2 • 97
SWE-Universe: Scale Real-World Verifiable Environments to Millions Paper • 2602.02361 • Published Feb 2 • 61
Code2World: A GUI World Model via Renderable Code Generation Paper • 2602.09856 • Published Feb 10 • 201
K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model Paper • 2602.19128 • Published Feb 22 • 7
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale Paper • 2602.23866 • Published Feb 27 • 91
CodePercept: Code-Grounded Visual STEM Perception for MLLMs Paper • 2603.10757 • Published Mar 11 • 15
InCoder-32B: Code Foundation Model for Industrial Scenarios Paper • 2603.16790 • Published Mar 17 • 312
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models Paper • 2604.04707 • Published Apr 6 • 204
Automating Database-Native Function Code Synthesis with LLMs Paper • 2604.06231 • Published Apr 2 • 17
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems Paper • 2604.14228 • Published Apr 14 • 25
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models Paper • 2604.18224 • Published Apr 20 • 22
A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression Paper • 2604.19572 • Published Apr 21 • 23
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows Paper • 2604.28139 • Published Apr 30 • 42
ClawGym: A Scalable Framework for Building Effective Claw Agents Paper • 2604.26904 • Published Apr 29 • 54
Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution Paper • 2605.15301 • Published May 14 • 22
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution Paper • 2605.18401 • Published May 18 • 130
π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows Paper • 2605.14678 • Published May 19 • 108
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation Paper • 2605.27366 • Published May 26 • 29
LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents Paper • 2605.29559 • Published 30 days ago • 17
Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution Paper • 2606.06492 • Published 23 days ago • 92
When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents Paper • 2606.05806 • Published 23 days ago • 23
SWE-Explore: Benchmarking How Coding Agents Explore Repositories Paper • 2606.07297 • Published 22 days ago • 119
DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch Paper • 2606.10728 • Published 18 days ago • 34
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields Paper • 2606.11042 • Published 18 days ago • 21
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks Paper • 2606.12344 • Published 17 days ago • 68
CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks? Paper • 2606.15300 • Published 14 days ago • 13
FastContext: Training Efficient Repository Explorer for Coding Agents Paper • 2606.14066 • Published 15 days ago • 91
Guava: An Effective and Universal Harness for Embodied Manipulation Paper • 2606.18363 • Published 11 days ago • 28
CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents Paper • 2606.22883 • Published 5 days ago • 33
Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence Paper • 2606.15932 • Published 11 days ago • 31
The Verification Horizon: No Silver Bullet for Coding Agent Rewards Paper • 2606.26300 • Published 3 days ago • 30