arxiv:2602.20161

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device


AI-generated summary

A compact vision-language-diffusion model called Mobile-O enables efficient unified multimodal understanding and generation on mobile devices through specialized architecture design and optimized training methodology.

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
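
The quadruplet post-training format named in the abstract (generation prompt, image, question, answer) can be pictured as a single record that serves both tasks. The sketch below is illustrative only: the class name, field names, and the two helper methods are assumptions made for this example, not the schema used by the Mobile-O authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quadruplet:
    """One post-training sample (hypothetical schema, not the authors')."""
    generation_prompt: str  # text used to condition image generation
    image_path: str         # path to the image shared by both tasks
    question: str           # visual-understanding question about the image
    answer: str             # reference answer to the question

    def generation_pair(self):
        # View of the sample as a text-to-image example.
        return (self.generation_prompt, self.image_path)

    def understanding_triple(self):
        # View of the same sample as a visual question-answering example.
        return (self.image_path, self.question, self.answer)


# Made-up sample values, for illustration only.
sample = Quadruplet(
    generation_prompt="a red bicycle leaning against a brick wall",
    image_path="samples/bicycle.png",
    question="What color is the bicycle?",
    answer="Red",
)
```

Because both views share one image, a single sample can supply a training signal for generation and for understanding, which is one plausible reading of how the format improves both capabilities jointly.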

Community

TL;DR: Introducing Mobile-O

  • What it is: A compact, unified multimodal model that brings both visual understanding and image generation directly to mobile devices.
  • The Breakthrough: It eliminates cloud dependency for multimodal AI. Using a novel "Mobile Conditioning Projector," it achieves high efficiency with minimal compute.
  • Real-World Speed: It can generate a 512x512 image in about 3 seconds natively on an iPhone.
  • The Benchmarks: Despite its small size, it outperforms existing unified models like Show-O and JanusFlow in both visual understanding and generation (scoring 74% on GenEval), while running 6x to 11x faster.
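
The abstract attributes the projector's low cost to depthwise-separable convolutions. As a rough illustration of why that choice saves compute, the sketch below compares weight counts for a standard convolution and a depthwise-separable one. The channel and kernel sizes are arbitrary values picked for the example, not Mobile-O's actual configuration, and bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 256, 256, 3
standard = standard_conv_params(c_in, c_out, k)         # 589824
separable = depthwise_separable_params(c_in, c_out, k)  # 67840
print(f"standard: {standard}, separable: {separable}")
```

For these example sizes the separable layer needs roughly one ninth of the weights, which is the general effect the design relies on; the real savings in Mobile-O depend on its actual layer shapes.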

iOS App: https://apps.apple.com/us/app/mobile-o/id6759238106

