From Pixels to Words -- Towards Native One-Vision Models at Scale Paper • 2605.28820 • Published 22 days ago • 73
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration Paper • 2605.03042 • Published May 4 • 134
GenLIP Collection Model weights of paper "Let ViT Speak: Generative Language-Image Pre-training" • 6 items • Updated May 5 • 6
CutClaw: Agentic Hours-Long Video Editing via Music Synchronization Paper • 2603.29664 • Published Mar 31 • 50
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence Paper • 2602.08683 • Published Feb 9 • 52
StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation Paper • 2512.09363 • Published Dec 10, 2025 • 74
Article Why Did MiniMax M2 End Up as a Full Attention Model? MiniMax-AI • Oct 30, 2025 • 80
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation Paper • 2510.08673 • Published Oct 9, 2025 • 128
Article Introducing Idefics2: A Powerful 8B Vision-Language Model for the community +1 Leyo, HugoLaurencon, VictorSanh • Apr 15, 2024 • 191