Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action Article • nvidia • 10 days ago • 75
Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents Paper • 2509.26539 • Published Sep 30, 2025 • 10
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19, 2025 • 58
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms Paper • 2410.18967 • Published Oct 24, 2024 • 1
DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation Paper • 2503.10618 • Published Mar 13, 2025 • 19
STIV: Scalable Text and Image Conditioned Video Generation Paper • 2412.07730 • Published Dec 10, 2024 • 74
Improve Vision Language Model Chain-of-thought Reasoning Paper • 2410.16198 • Published Oct 21, 2024 • 26
Aria: An Open Multimodal Native Mixture-of-Experts Model Paper • 2410.05993 • Published Oct 8, 2024 • 111