arxiv:2606.07723

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

Published on Jun 5

· Submitted by

Siyi Chen on Jun 9

NVIDIA

Upvote

Authors:

Abstract

VoLoAgent enables physical orchestration by integrating vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Open-vocabulary long-horizon manipulation requires robots to reason over flexible instructions and complex multi-object scenes while adaptively planning, executing, monitoring, and recovering from failures. We address these demands with a closed agent loop in which a VLM orchestrates heterogeneous robot capabilities as interruptible tools. Unlike in virtual AI agents, the timing of decisions, actions and tool calls is important in a physical world that does not pause for reasoning. We refer to this setting as Physical Orchestration, and propose VoLoAgent, a VLM that plans, monitors, and recovers by treating a VLA/WAM as an interruptible tool it steers mid-rollout alongside vision models and action primitives. To evaluate these long-horizon capabilities, we introduce RoboVoLo, a high-fidelity benchmark for open-vocabulary long-horizon manipulation across common sense, memory/state tracking, complex references, and world knowledge, with both task-level success and failure-mode diagnostics. Experiments show VoLoAgent substantially outperforms single VLA/VLM or tool-based systems, with validation on real-robot experiments. Project page: https://chicychen.github.io/VoLo/