WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Papers
How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning
Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
Benchmarking Vision Language Models for Cultural Understanding
VisMin (Visual Minimal-Change) is a controlled benchmark, released together with models fine-tuned on the VisMin training set, e.g., VisMin-CLIP and VisMin-Idefics2.
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Official artifacts for the paper "The Promise of RL for Autoregressive Image Editing" (EARL).