EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation
Abstract
A data-free framework aligns video generative model outputs with vision-language model constraints for improved robotic manipulation, achieving higher success rates through constraint-guided selection and trajectory optimization.
Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
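The two-stage pipeline in the abstract can be illustrated with a small, self-contained sketch. Everything below (the toy constraints, `score_rollout`, `refine_trajectory`, and the L-BFGS-B refinement) is an illustrative assumption for exposition, not the paper's actual implementation or API.

```python
# Minimal sketch of constraint-guided rollout selection followed by
# constraint-based trajectory optimization. All names and constraints
# here are hypothetical placeholders, not EmboAlign's real interface.
import numpy as np
from scipy.optimize import minimize

# A "constraint" is modeled as a callable mapping a trajectory
# (T x 3 array of keypoint / end-effector positions) to a violation cost.
def above_table(traj, table_z=0.0):
    # Penalize any waypoint that dips below the (assumed) table plane.
    return np.sum(np.maximum(table_z - traj[:, 2], 0.0))

def ends_near_goal(traj, goal=np.array([0.4, 0.0, 0.1])):
    # Penalize distance between the final waypoint and an assumed goal pose.
    return np.linalg.norm(traj[-1] - goal)

constraints = [above_table, ends_near_goal]

def score_rollout(traj, constraints):
    """Stage 1: total constraint violation of a retargeted VGM rollout."""
    return sum(c(traj) for c in constraints)

def select_rollout(rollouts, constraints):
    """Keep the most physically plausible candidate from a batch."""
    return min(rollouts, key=lambda r: score_rollout(r, constraints))

def refine_trajectory(init_traj, constraints, smooth_weight=0.1):
    """Stage 2: refine the selected rollout under the same constraint set."""
    shape = init_traj.shape

    def objective(flat):
        traj = flat.reshape(shape)
        violation = sum(c(traj) for c in constraints)
        smoothness = np.sum(np.diff(traj, axis=0) ** 2)  # discourage jerky motion
        return violation + smooth_weight * smoothness

    result = minimize(objective, init_traj.ravel(), method="L-BFGS-B")
    return result.x.reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for a batch of retargeted VGM rollouts (8 candidates, horizon 20).
    rollouts = [rng.normal(scale=0.2, size=(20, 3)) for _ in range(8)]
    best = select_rollout(rollouts, constraints)
    refined = refine_trajectory(best, constraints)
    print("selected rollout violation:", score_rollout(best, constraints))
    print("refined trajectory violation:", score_rollout(refined, constraints))
```

In the paper the constraints come from a VLM prompted with the language instruction and scene observation; the toy geometric costs above merely stand in for that constraint set.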
Community
Excited to share our new paper: EmboAlign!
EmboAlign aligns video generation with compositional constraints for zero-shot robotic manipulation.
Our key idea:
VLMs provide structured spatial reasoning that complements VGMs. We use VLM-generated task constraints in two stages:
• Constraint-guided rollout selection
• Constraint-based trajectory optimization
On 6 real-robot manipulation tasks, EmboAlign improves the success rate by 43.3 points over the strongest baseline, with no task-specific training data.
Paper: https://lnkd.in/gFvQg6He
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning (2026)
- TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion (2026)
- Grounding Generated Videos in Feasible Plans via World Models (2026)
- V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks (2026)
- UniManip: General-Purpose Zero-Shot Robotic Manipulation with Agentic Operational Graph (2026)
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation (2026)
- Scaling World Model for Hierarchical Manipulation Policies (2026)