Haotian Zhang

haotiz

AI & ML interests

Vision and Language

Recent Activity

upvoted a paper 7 days ago

Cosmos 3: Omnimodal World Models for Physical AI

Paper • 2606.02800 • Published 11 days ago • 113

liked a model 10 days ago

nvidia/Cosmos3-Super

65B • Updated 5 days ago • 38.2k • 166

upvoted a collection 10 days ago

Cosmos3

Collection

Omnimodal World Models for Physical AI • 15 items • Updated 4 days ago • 109

upvoted an article 10 days ago

Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action

Article • nvidia • 10 days ago • 75

liked a model 10 days ago

nvidia/Cosmos3-Nano

16B • Updated 10 days ago • 45.8k • 221

liked a dataset 6 months ago

nvidia/PhysicalAI-Autonomous-Vehicle-Cosmos-Drive-Dreams

Updated Jun 15, 2025 • 7.54k • 54

upvoted a paper 8 months ago

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Paper • 2509.26539 • Published Sep 30, 2025 • 10

authored 2 papers 9 months ago

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper • 2509.16197 • Published Sep 19, 2025 • 58

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Paper • 2410.18967 • Published Oct 24, 2024 • 1

upvoted 2 papers 9 months ago

AToken: A Unified Tokenizer for Vision

Paper • 2509.14476 • Published Sep 17, 2025 • 37

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper • 2509.16197 • Published Sep 19, 2025 • 58

liked a model about 1 year ago

reducto/RolmOCR

Image-Text-to-Text • 8B • Updated Apr 2, 2025 • 210k • 586

upvoted a paper about 1 year ago

DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

Paper • 2503.10618 • Published Mar 13, 2025 • 19

upvoted a paper over 1 year ago

STIV: Scalable Text and Image Conditioned Video Generation

Paper • 2412.07730 • Published Dec 10, 2024 • 74

authored a paper over 1 year ago

Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21, 2024 • 26

upvoted 2 papers over 1 year ago

Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21, 2024 • 26

Aria: An Open Multimodal Native Mixture-of-Experts Model

Paper • 2410.05993 • Published Oct 8, 2024 • 111

authored a paper over 1 year ago

MM-Ego: Towards Building Egocentric Multimodal LLMs

Paper • 2410.07177 • Published Oct 9, 2024 • 22

upvoted 2 papers over 1 year ago

Pixtral 12B

Paper • 2410.07073 • Published Oct 9, 2024 • 69

MM-Ego: Towards Building Egocentric Multimodal LLMs

Paper • 2410.07177 • Published Oct 9, 2024 • 22
