FALCON | From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors (ICLR 2026)

Zhengshen Zhang Hao Li Yalun Dai Zhengbang Zhu Lei Zhou
Chenchen Liu Dong Wang Francis E. H. Tay Sijin Chen
Ziwei Liu Yuxiao Liu*† Xinghang Li* Pan Zhou*

*Corresponding Author †Project Lead

ByteDance Seed
National University of Singapore Nanyang Technological University
Tsinghua University Singapore Management University

🚀 Introduction

Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head of a VLA model, enabling robust spatial understanding and SOTA performance across diverse manipulation tasks without disrupting vision-language alignment. See our paper (arXiv:2510.17439) for details.

🤗 Model Zoo

We provide the following model weights and config files used in our paper:

Model Name	VLA Model	Embodied Spatial Model	Note
FALCON-FC-CALVIN-ABC	falcon-esm-fc-calvin-abc-pt	esm-1b	Fine-tuned on CALVIN ABC with RGB inputs to the ESM; Tab. 4 and 5.
FALCON-FC-CALVIN-ABC-WDepth	falcon-esm-fc-calvin-abc-wdepth-pt	esm-1b	Fine-tuned on CALVIN ABC with RGB-D inputs to the ESM; Tab. 5.
FALCON-3DPC-FC-CALVIN-ABC	falcon-3dpc-fc-calvin-abc-pt	improved DP3 encoder	Fine-tuned on CALVIN ABC with point cloud inputs to the iDP3 encoder; Tab. 5, Kosmos-VLA (w/ rgb-d).
FALCON-LSTM-CALVIN-ABC	falcon-lstm-calvin-abc-pt	esm-1b	Fine-tuned on CALVIN ABC with RGB inputs to the ESM; Tab. 1.
FALCON-LSTM-CALVIN-ABCD	falcon-lstm-calvin-abcd-pt	esm-1b	Fine-tuned on CALVIN ABCD with RGB inputs to the ESM; Tab. 1.
FALCON-FC-SimplerEnv-Bridge	falcon-fc-simpler-bridge-pt	esm-1b	Pretrained on OXE, then fine-tuned on the Bridge dataset with RGB inputs to the ESM; Tab. 2.
FALCON-FC-SimplerEnv-Fractal	falcon-fc-simpler-fractal-pt	esm-1b	Pretrained on OXE, then fine-tuned on the Fractal dataset with RGB inputs to the ESM; Tab. 3.
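
When scripting against several of these checkpoints, it can help to keep the table in a small lookup. The following is a minimal sketch, not part of the FALCON codebase: `MODEL_ZOO` and `checkpoint_path` are hypothetical names, and the sketch assumes each set of VLA weights is stored locally in a directory of the same name under `checkpoints/`.

```python
from pathlib import PurePosixPath

# Hypothetical registry transcribed from the Model Zoo table:
# model name -> (VLA weights, embodied spatial model).
MODEL_ZOO = {
    "FALCON-FC-CALVIN-ABC": ("falcon-esm-fc-calvin-abc-pt", "esm-1b"),
    "FALCON-FC-CALVIN-ABC-WDepth": ("falcon-esm-fc-calvin-abc-wdepth-pt", "esm-1b"),
    "FALCON-3DPC-FC-CALVIN-ABC": ("falcon-3dpc-fc-calvin-abc-pt", "improved DP3 encoder"),
    "FALCON-LSTM-CALVIN-ABC": ("falcon-lstm-calvin-abc-pt", "esm-1b"),
    "FALCON-LSTM-CALVIN-ABCD": ("falcon-lstm-calvin-abcd-pt", "esm-1b"),
    "FALCON-FC-SimplerEnv-Bridge": ("falcon-fc-simpler-bridge-pt", "esm-1b"),
    "FALCON-FC-SimplerEnv-Fractal": ("falcon-fc-simpler-fractal-pt", "esm-1b"),
}


def checkpoint_path(model_name, root="checkpoints"):
    """Return the assumed local checkpoint directory for a Model Zoo entry."""
    if model_name not in MODEL_ZOO:
        raise KeyError(f"unknown model {model_name!r}; choose from {sorted(MODEL_ZOO)}")
    vla_weights, _spatial_model = MODEL_ZOO[model_name]
    # Assumption: weights live at <root>/<VLA weights name>.
    return str(PurePosixPath(root) / vla_weights)
```

The returned string can then be assigned to `configs['model_load_path']`; adjust `root` if your checkpoints are stored elsewhere.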

📦 Usage

FALCON predicts actions from vision and language inputs. It supports several VLA structures, multi-view input, and multi-sensory input (RGB, RGB-D, point cloud). Taking FALCON-FC-CALVIN-ABC as an example:

import torch
import json, functools, copy
from PIL import Image
from falcon.train.base_trainer import BaseTrainer
from falcon.data.data_utils import preprocess_image, get_text_function
from falcon.model.policy_head.esm_utils.vggt.utils.load_fn import load_and_preprocess_images_square_new

with open('configs/falcon-esm-fc-calvin-abc.json', 'r') as f:
    configs = json.load(f)
pretrained_path = 'checkpoints/falcon-esm-fc-calvin-abc-pt'
configs['model_load_path'] = pretrained_path

model = BaseTrainer.from_checkpoint(configs)

image_fn = functools.partial(
    preprocess_image,
    image_processor=model.model.image_processor,
    model_type=configs["model"],
)
text_fn = get_text_function(model.model.tokenizer, configs["model"])
prompt = "Task: pull the handle to open the drawer"
text_tensor, attention_mask = text_fn([prompt])

# MAX_STEPS, get_from_side_camera and get_from_wrist_camera are placeholders
# for your own rollout length and camera interface.
esm_target_size = 224  # target size for the ESM input

for step in range(MAX_STEPS):
    image: Image.Image = get_from_side_camera(...)
    # keep an untouched copy of the raw frame for the ESM branch
    image_vggt = copy.deepcopy(image)
    # preprocess for the VLA backbone and add a leading dimension
    image = image_fn([image]).unsqueeze(0)

    # preprocess the raw frame for the ESM
    image_vggt_x, _ = load_and_preprocess_images_square_new([image_vggt], target_size=esm_target_size)
    image_vggt_x = image_vggt_x.unsqueeze(0)

    input_dict = {
        "rgb": image,
        "text": text_tensor,
        "text_mask": attention_mask,
        "rgb_vggt": image_vggt_x,
    }

    # if a wrist camera is available
    wrist_image: Image.Image = get_from_wrist_camera(...)
    wrist_image = image_fn([wrist_image]).unsqueeze(0)
    input_dict["hand_rgb"] = wrist_image

    with torch.no_grad():
        action = model.inference_step(input_dict)["action"]
    print(action)
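
The loop above marks the wrist camera as optional but always reads from it. One way to keep that branch truly optional is to assemble the inputs in a small helper. This is a minimal sketch: `build_input_dict` is a hypothetical helper, not part of the FALCON codebase, and whether `inference_step` accepts inputs without `hand_rgb` depends on the model configuration, which is an assumption here. The dictionary keys are the ones used in the example above.

```python
def build_input_dict(rgb, text, text_mask, rgb_vggt, hand_rgb=None):
    """Assemble inference inputs, adding the wrist view only when one is given."""
    input_dict = {
        "rgb": rgb,            # preprocessed side-camera image
        "text": text,          # tokenized instruction
        "text_mask": text_mask,
        "rgb_vggt": rgb_vggt,  # side-camera image preprocessed for the ESM
    }
    if hand_rgb is not None:
        # Only set the wrist view when a wrist camera is actually available.
        input_dict["hand_rgb"] = hand_rgb
    return input_dict
```

Inside the loop, this would be called as `build_input_dict(image, text_tensor, attention_mask, image_vggt_x, hand_rgb=wrist_image)`, passing `hand_rgb=None` when no wrist camera is present.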

🤗 FAQs

If you encounter any issues, feel free to open an issue or reach out through discussions. We appreciate your feedback and contributions! 🚀

🖊️ Citation

If you find this project useful in your research, please consider citing:

@article{zhang2025spatial,
  title={From spatial to actions: Grounding vision-language-action model in spatial foundation priors},
  author={Zhang, Zhengshen and Li, Hao and Dai, Yalun and Zhu, Zhengbang and Zhou, Lei and Liu, Chenchen and Wang, Dong and Tay, Francis EH and Chen, Sijin and Liu, Ziwei and others},
  journal={arXiv preprint arXiv:2510.17439},
  year={2025}
}

🪪 License

All FALCON checkpoints, as well as our codebase, are released under the Apache-2.0 License.

❤️ Acknowledgement

FALCON is built with reference to the code of the following projects: RoboVLMs, Microsoft Kosmos-2, VGGT, and ManiUniCon. Thanks for their awesome work!

Paper: From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors (arXiv:2510.17439, published Oct 20, 2025)
