arxiv:2605.24830

Macaron-A2UI: A Model for Generative UI in Personal Agents

Published on May 24

Submitted by Andrew Chen on May 26

#3 Paper of the day

Mind Lab

Authors:

Pony Ma

Abstract

Generative UI models enable personal agents to synthesize dynamic interfaces with lightweight executable actions for enhanced interaction beyond text-only formats.

AI-generated summary

As personal agents evolve to handle complex, user-centric tasks, static plain-text chat is rapidly becoming a bottleneck. Generative UI emerges as the necessary new interface layer, dynamically synthesizing the right controls, options, and state from the interaction context in real time. We present Macaron-A2UI, a model for Generative UI in personal agents. Our goal is to move beyond text-only interaction by enabling agents to generate natural language together with lightweight, executable UI actions for information collection, preference refinement, confirmation, and multi-goal organization. We build a large-scale Generative UI corpus from heterogeneous dialogue sources, introduce A2UI-Bench for controlled evaluation, and train 30B, 235B and 754B models with parameter-efficient LoRA-based supervised fine-tuning followed by reward-driven reinforcement learning. The best Macaron-A2UI model reaches 75.6 overall on A2UI-Bench without explicit schema hints, surpassing the strongest full-schema frontier baseline. We release the models, benchmark, and evaluation protocol to support future work on Generative UI for personal agents.
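
To make the idea of "natural language together with lightweight, executable UI actions" concrete, here is a minimal sketch of what one agent turn could look like as data, plus a basic validity check. This is not the paper's A2UI format, which this page does not describe: the class names, field names, and the four action kinds (loosely named after the uses listed in the abstract) are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical action kinds, loosely named after the four uses in the abstract.
ALLOWED_KINDS = {"collect_info", "refine_preference", "confirm", "organize_goals"}


@dataclass
class UIAction:
    kind: str                                          # assumed to be one of ALLOWED_KINDS
    label: str                                         # text shown on the control
    options: list[str] = field(default_factory=list)   # choices offered, if any


@dataclass
class AgentTurn:
    text: str                                          # natural-language part of the reply
    actions: list[UIAction] = field(default_factory=list)


def validate_turn(turn: AgentTurn) -> list[str]:
    """Return a list of problems; an empty list means the turn passes these checks."""
    problems = []
    if not turn.text.strip():
        problems.append("empty text")
    for i, action in enumerate(turn.actions):
        if action.kind not in ALLOWED_KINDS:
            problems.append(f"action {i}: unknown kind {action.kind!r}")
        if not action.label.strip():
            problems.append(f"action {i}: empty label")
    return problems


# Example turn: a sentence of text plus one selection control.
turn = AgentTurn(
    text="Which evening works for dinner?",
    actions=[UIAction("collect_info", "Pick a day", ["Friday", "Saturday"])],
)
print(validate_turn(turn))
```

A check of this kind is only a stand-in for what "action validity" might mean; a benchmark such as A2UI-Bench presumably scores much more than structural well-formedness.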

Community

anchen1011

Paper submitter about 18 hours ago

Macaron-A2UI: A Model for Generative UI in Personal Agents

zzy-hugging

about 12 hours ago

Interesting work!

avahal

42 minutes ago

The fact that you can hit 75.6 on A2UI-Bench without explicit schema hints is pretty striking. That schema-light training recipe, with LoRA SFT followed by reward-driven RL, basically lets the model learn to generate executable UI alongside natural language. I’d love to see an ablation where you cut the RL reward model entirely and rely only on supervised fine-tuning; my hunch is that RL is doing most of the heavy lifting for action validity and safety. Edge cases where controls differ across apps or safety policies kick in could expose brittleness in the generated widgets. Btw, arxivlens had a solid breakdown that helped me parse the method details: https://arxivlens.com/PaperView/Details/macaron-a2ui-a-model-for-generative-ui-in-personal-agents-495-62505cf9 Do you plan to publish an ablation on RL vs. SFT and test true cross-app robustness in a follow-up?