Abstract
A compact open-weight multimodal reasoning model is presented that achieves competitive performance through careful architecture design, high-quality data curation, and a hybrid approach combining direct answering with chain-of-thought reasoning.
We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to offer the research community practical insight into building smaller, efficient multimodal reasoning models, and to share the results of these learnings as an open-weight model that performs well on common vision and language tasks and excels at scientific and mathematical reasoning and at understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller, open-weight multimodal models to achieve competitive performance with significantly less compute and fewer tokens at both training and inference time. The most substantial improvements come from systematic filtering, error correction, and synthetic augmentation, reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, since accurate perception is a prerequisite for high-quality reasoning. Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.
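To make the dynamic-resolution point concrete, here is a minimal sketch of one common way such encoders are fed: the image is split into as many fixed-size tiles as its native resolution supports, so fine detail (small UI text, plot labels) survives for the vision encoder. The tile size, tile budget, and function name below are illustrative assumptions, not the paper's documented recipe.

```python
from PIL import Image

TILE = 448  # assumed encoder input size; the paper's actual size may differ

def dynamic_tiles(img: Image.Image, max_tiles: int = 16) -> list[Image.Image]:
    """Resize to the nearest tile grid, then crop into TILE x TILE patches."""
    w, h = img.size
    cols = max(1, min(round(w / TILE), max_tiles))
    rows = max(1, min(round(h / TILE), max(1, max_tiles // cols)))
    img = img.resize((cols * TILE, rows * TILE))
    return [
        img.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows)
        for c in range(cols)
    ]
```

A high-resolution screenshot thus yields many tiles (and tokens), while a small thumbnail yields one, which is why perception quality scales with input resolution rather than being capped by a fixed downsample.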
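The hybrid mode-token idea can likewise be sketched in a few lines: an explicit control token at the start of the prompt selects between direct answering and chain-of-thought. The token strings (`<think>`, `<answer>`) and prompt template here are illustrative assumptions; the model's actual control tokens may be named differently.

```python
def build_prompt(question: str, reasoning: bool) -> str:
    """Prepend an explicit mode token so one checkpoint serves both styles."""
    mode = "<think>" if reasoning else "<answer>"  # assumed token names
    return f"{mode}\nUser: {question}\nAssistant:"

# Simple perception queries get fast direct answers...
print(build_prompt("Which button is highlighted in this screenshot?", reasoning=False))
# ...while harder math/science problems trigger chain-of-thought.
print(build_prompt("From the plotted data, estimate the reaction rate constant.", reasoning=True))
```

Because both modes are mixed in training with these markers, a single model can trade latency for reasoning depth per request instead of shipping two separate checkpoints.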
Community
This is an automated message from Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods (2026)
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice (2026)
- CoReTab: Improving Multimodal Table Understanding with Code-driven Reasoning (2026)
- VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration (2026)
- On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs (2026)
- DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning (2026)
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model (2026)