SciIR: A Large-scale Training Dataset and Benchmark for Scientific Image Reasoning Generation
Abstract
Scientific image generation faces challenges in semantic alignment and logical reasoning, prompting the creation of the SciIR-82k dataset and the SciIR-Bench evaluation framework to improve scientific reasoning capabilities in text-to-image models.
While Text-to-Image (T2I) models have shown remarkable success in generating photorealistic visual content, they still struggle with the rigorous semantic alignment and logical reasoning required for scientific imagery. Inspired by Peirce's Semiotic Triad, we introduce Scientific Image Reasoning (SciIR), a comprehensive resource for training and evaluating scientific image generation. We formalize scientific reasoning into three core dimensions: Entity Structure (Icon), Scientific Process (Index), and Scientific Law (Symbol). To overcome the scarcity of training data for scientific image generation, we construct SciIR-82k, a large-scale dataset containing over 80,000 high-quality scientific image-text pairs from cutting-edge publications. The dataset is hierarchically organized according to the three semiotic dimensions and incorporates a Scientific Reasoning Chain-of-Thought (Sci-RCoT) to explicitly model the underlying visual logic. For evaluation, we propose SciIR-Bench, which aligns with the same three semiotic levels and employs an Atomic Checklist to convert outcome-oriented scientific accuracy into process-oriented, verifiable, fine-grained questions. Our extensive experiments reveal significant deficiencies in current models' scientific reasoning capabilities. Furthermore, by fine-tuning on SciIR-82k, we develop the Qwen-Image-SciIR model, which raises the final SciIR-Bench score from 35% to 43%, laying a solid foundation for future advances in scientific image generation.
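To make the Atomic Checklist idea concrete, here is a minimal sketch of how a set of yes/no checks on one generated image could be turned into scores. The abstract does not say how SciIR-Bench aggregates checklist answers, so the unweighted pass rate used here is an assumption, and the names `ChecklistItem`, `image_score`, and `dimension_scores`, as well as the sample questions, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One verifiable yes/no question about a generated image (hypothetical schema)."""
    dimension: str  # "icon" (Entity Structure), "index" (Scientific Process), "symbol" (Scientific Law)
    question: str
    passed: bool    # verdict from a judge, e.g. a human or a vision-language model


def image_score(items):
    """Unweighted fraction of checklist questions that passed (assumed aggregation)."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(item.passed for item in items) / len(items)


def dimension_scores(items):
    """Pass rate per semiotic dimension, for dimensions that have at least one item."""
    verdicts = defaultdict(list)
    for item in items:
        verdicts[item.dimension].append(item.passed)
    return {dim: sum(v) / len(v) for dim, v in verdicts.items()}


# Illustrative checklist for a prompt about a plant cell undergoing osmosis.
checklist = [
    ChecklistItem("icon", "Is a cell wall drawn outside the membrane?", True),
    ChecklistItem("icon", "Is a central vacuole present?", True),
    ChecklistItem("index", "Do arrows show water moving into the cell?", True),
    ChecklistItem("index", "Does the vacuole appear enlarged afterwards?", False),
    ChecklistItem("symbol", "Is the flow directed toward higher solute concentration?", False),
]

overall = image_score(checklist)
per_dimension = dimension_scores(checklist)
print(f"overall pass rate: {overall:.2f}")
print(per_dimension)
```

Splitting one holistic "is this image scientifically correct?" judgment into small questions like these lets each failure be traced to a specific dimension, which is the process-oriented view the benchmark describes.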
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models (2026)
- WeGenBench: A Multidimensional Diagnostic Benchmark towards Text-to-Image Model Optimization (2026)
- TextSculptor: Training and Benchmarking Scene Text Editing (2026)
- Towards Characterizing Scientific Image Utility and Upgradability (2026)
- VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence (2026)
- DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning (2026)
- DiCoBench: Benchmarking Multi-Image Fine-Grained Perception via Differential and Commonality Visual Cues (2026)
🎉 Official resources for the ECCV 2026 paper are now available!
📄 Paper: https://arxiv.org/abs/2606.30124
📦 Code: https://github.com/MAIR-Lab-HUST/SciIR
🤗 Dataset: https://huggingface.co/datasets/MAIR-Lab-HUST/SciIR-82k
SciIR introduces a large-scale training dataset and benchmark for Scientific Image Reasoning Generation, aiming to push text-to-image models beyond visual plausibility toward scientific correctness.
Built upon the semiotic triad of Entity Structure, Scientific Process, and Scientific Law, SciIR-82k provides 80K+ high-quality scientific image-text pairs with reasoning-chain supervision, while SciIR-Bench evaluates scientific accuracy through fine-grained atomic checklists.
By fine-tuning on SciIR-82k, Qwen-Image-SciIR improves the SciIR-Bench score from 35% to 43%, laying a foundation for more faithful and reasoning-aware scientific image generation.
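As a quick reading aid for the reported scores, the short sketch below works out the absolute and relative gain. It assumes that 35% and 43% are both overall SciIR-Bench scores on the same scale; the variable names are illustrative.

```python
baseline = 35.0   # reported SciIR-Bench score before fine-tuning, in percent
finetuned = 43.0  # reported score of Qwen-Image-SciIR, in percent

absolute_gain = finetuned - baseline              # percentage points
relative_gain = absolute_gain / baseline * 100.0  # percent of the baseline score

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}% over the baseline")
```

Under that assumption, the improvement is 8 percentage points, or roughly a 23% relative increase over the baseline.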
If you find our work useful, please consider starring the repository, using the dataset, and citing our paper. Thanks for your support! ⭐🚀