EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation
Abstract
A data-free framework aligns video generative model outputs with vision-language model constraints for improved robotic manipulation, achieving higher success rates through constraint-guided selection and trajectory optimization.
Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
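The two-stage pipeline in the abstract can be illustrated with a small, self-contained sketch. Everything below (the toy constraints, `score_rollout`, `refine_trajectory`, and the L-BFGS-B refinement) is an illustrative assumption for exposition, not the paper's actual implementation or API.

```python
# Minimal sketch of constraint-guided rollout selection followed by
# constraint-based trajectory optimization. All names and constraints
# here are hypothetical placeholders, not EmboAlign's real interface.
import numpy as np
from scipy.optimize import minimize

# A "constraint" is modeled as a callable mapping a trajectory
# (T x 3 array of keypoint / end-effector positions) to a violation cost.
def above_table(traj, table_z=0.0):
    # Penalize any waypoint that dips below the (assumed) table plane.
    return np.sum(np.maximum(table_z - traj[:, 2], 0.0))

def ends_near_goal(traj, goal=np.array([0.4, 0.0, 0.1])):
    # Penalize distance between the final waypoint and an assumed goal pose.
    return np.linalg.norm(traj[-1] - goal)

constraints = [above_table, ends_near_goal]

def score_rollout(traj, constraints):
    """Stage 1: total constraint violation of a retargeted VGM rollout."""
    return sum(c(traj) for c in constraints)

def select_rollout(rollouts, constraints):
    """Keep the most physically plausible candidate from a batch."""
    return min(rollouts, key=lambda r: score_rollout(r, constraints))

def refine_trajectory(init_traj, constraints, smooth_weight=0.1):
    """Stage 2: refine the selected rollout under the same constraint set."""
    shape = init_traj.shape

    def objective(flat):
        traj = flat.reshape(shape)
        violation = sum(c(traj) for c in constraints)
        smoothness = np.sum(np.diff(traj, axis=0) ** 2)  # discourage jerky motion
        return violation + smooth_weight * smoothness

    result = minimize(objective, init_traj.ravel(), method="L-BFGS-B")
    return result.x.reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for a batch of retargeted VGM rollouts (8 candidates, horizon 20).
    rollouts = [rng.normal(scale=0.2, size=(20, 3)) for _ in range(8)]
    best = select_rollout(rollouts, constraints)
    refined = refine_trajectory(best, constraints)
    print("selected rollout violation:", score_rollout(best, constraints))
    print("refined trajectory violation:", score_rollout(refined, constraints))
```

In the paper the constraints come from a VLM prompted with the language instruction and scene observation; the toy geometric costs above merely stand in for that constraint set.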
Community
Excited to share our new paper: EmboAlign!
EmboAlign aligns video generation with compositional constraints for zero-shot robotic manipulation.
Our key idea:
VLMs provide structured spatial reasoning that complements VGMs. We use VLM-generated task constraints in two stages:
• Constraint-guided rollout selection
• Constraint-based trajectory optimization
On 6 real-robot manipulation tasks, EmboAlign improves the success rate by 43.3 points over the strongest baseline, with no task-specific training data.
Paper: https://lnkd.in/gFvQg6He
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning (2026)
- TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion (2026)
- Grounding Generated Videos in Feasible Plans via World Models (2026)
- V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks (2026)
- UniManip: General-Purpose Zero-Shot Robotic Manipulation with Agentic Operational Graph (2026)
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation (2026)
- Scaling World Model for Hierarchical Manipulation Policies (2026)