From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Haiwen Diao
Paranioar
AI & ML interests
Vision-and-Language, Parameter-efficient Transfer Learning, Multi-modal Large Language Model
Recent Activity
authored a paper 1 day ago
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
authored a paper 1 day ago
VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
authored a paper 1 day ago
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture