MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents Paper • 2508.13186 • Published Aug 14, 2025 • 19
MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams Paper • 2508.06851 • Published Aug 9, 2025
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles Paper • 2508.16072 • Published Aug 22, 2025 • 4
Dialogue as Discovery: Navigating Human Intent Through Principled Inquiry Paper • 2510.27410 • Published Oct 31, 2025
SVBench: Evaluation of Video Generation Models on Social Reasoning Paper • 2512.21507 • Published Dec 25, 2025 • 8
Yume-1.5: A Text-Controlled Interactive World Generation Model Paper • 2512.22096 • Published Dec 26, 2025 • 61
ProSoftArena: Benchmarking Hierarchical Capabilities of Multimodal Agents in Professional Software Environments Paper • 2601.02399 • Published Dec 30, 2025
MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences Paper • 2601.07251 • Published Jan 12, 2026 • 11
World Craft: Agentic Framework to Create Visualizable Worlds via Text Paper • 2601.09150 • Published Jan 14, 2026 • 19
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces Paper • 2602.14337 • Published Feb 15, 2026 • 13
WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG Paper • 2603.23497 • Published 4 days ago • 84
UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge Paper • 2405.14554 • Published May 23, 2024
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper • 2411.18499 • Published Nov 27, 2024 • 18
Multi-Sourced Compositional Generalization in Visual Question Answering Paper • 2505.23045 • Published May 29, 2025
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models Paper • 2504.05782 • Published Apr 8, 2025 • 3
SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model Paper • 2505.22126 • Published May 28, 2025 • 3
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models Paper • 2408.02718 • Published Aug 5, 2024 • 62
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy Paper • 2503.06542 • Published Mar 9, 2025 • 7