Multitask DiT Policy

Multitask Diffusion Transformer (DiT) Policy is an evolution of the original Diffusion Policy architecture, which leverages a large DiT with text and vision conditioning for multitask robot learning. This implementation supports both diffusion and flow matching objectives for action generation, enabling robots to perform diverse manipulation tasks conditioned on language instructions.

Model Overview

The model uses:

  • CLIP Vision Encoder: Processes RGB images from multiple camera views
  • CLIP Text Encoder: Encodes language task instructions (frozen weights with learnable projection)
  • Diffusion Transformer: Predicts action sequences conditioned on observations and language
  • Two Objectives: Supports both diffusion (DDPM/DDIM) and flow matching for action generation

This model is exciting because it achieves dexterity competitive with multi-billion-parameter VLAs while using only ~450M parameters and significantly less training.
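For intuition, here is a minimal sketch of the conditioning path (not the actual LeRobot implementation): camera frames are encoded with a CLIP vision encoder, the instruction with a CLIP text encoder, and both are projected and concatenated into a conditioning sequence for the diffusion transformer. The projection layers, model name, and shapes below are illustrative placeholders.

# Illustrative sketch of the conditioning path; not the LeRobot implementation.
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer, CLIPVisionModel

model_name = "openai/clip-vit-base-patch16"
vision_encoder = CLIPVisionModel.from_pretrained(model_name)
text_encoder = CLIPTextModel.from_pretrained(model_name)
image_processor = CLIPImageProcessor.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(model_name)

# Two dummy camera frames and one language instruction.
frames = [np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8) for _ in range(2)]
pixels = image_processor(images=frames, return_tensors="pt").pixel_values
tokens = tokenizer(["pick up the red cube"], return_tensors="pt", padding=True)

with torch.no_grad():
    img_tokens = vision_encoder(pixel_values=pixels).last_hidden_state  # (2, 197, 768)
    txt_tokens = text_encoder(**tokens).last_hidden_state               # (1, n_tok, 512)

# Project both modalities into the DiT hidden size and concatenate them
# into a single conditioning sequence (placeholder projections).
hidden_dim = 512
img_proj = torch.nn.Linear(img_tokens.shape[-1], hidden_dim)
txt_proj = torch.nn.Linear(txt_tokens.shape[-1], hidden_dim)
cond = torch.cat([img_proj(img_tokens).flatten(0, 1).unsqueeze(0), txt_proj(txt_tokens)], dim=1)
print(cond.shape)  # (1, num_conditioning_tokens, hidden_dim)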

Installation Requirements

Multitask DiT Policy has additional dependencies. Install them with:

pip install lerobot[multi_task_dit]

This will install all necessary dependencies including the HuggingFace Transformers library for CLIP models.

Usage

To use Multitask DiT in your LeRobot configuration, specify the policy type as:

policy.type=multi_task_dit

Training

Basic Training Command

Here’s a basic command for training Multitask DiT on your dataset:

lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=32 \
  --steps=5000 \
  --save_freq=500 \
  --log_freq=100 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true

Recommended Hyperparameters and Dataset Details (30Hz Control Frequency)

For reliable performance, start with these suggested default hyperparameters:

lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=320 \
  --steps=30000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_train_timesteps=100 \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true

Key Parameters:

  • Batch Size: 192-320 - if your GPU can fit this batch size, you will get the best training dynamics
  • Horizon: 32 - number of action steps to predict, ~1.0 sec at 30Hz
  • n_action_steps: 24 - ~0.8 seconds at 30Hz
  • Objective: diffusion - start with diffusion and experiment with flow matching if generation quality is poor
  • Training Steps: >30k steps recommended for a single task

Training Configuration Parameters

Objective Selection

Choose between diffusion and flow matching:

# Diffusion objective (default)
--policy.objective=diffusion \
--policy.noise_scheduler_type=DDPM \  # or "DDIM"
--policy.num_train_timesteps=100 \
--policy.num_inference_steps=10 \  # For faster inference
--policy.beta_schedule=squaredcos_cap_v2 \  # Noise schedule type
--policy.prediction_type=epsilon \  # "epsilon" (predict noise) or "sample" (predict clean)
--policy.clip_sample=true \  # Clip samples during denoising
--policy.clip_sample_range=1.0  # Clipping range [-x, x]

# Flow matching objective
--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \  # or "uniform"; beta sampling generally performs better in practice
--policy.num_integration_steps=100 \
--policy.integration_method=euler \  # or "rk4"
--policy.sigma_min=0.0  # Minimum noise in flow interpolation path
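For intuition on the flow-matching objective, the sketch below shows one standard formulation (the exact interpolation and sampling conventions used by the LeRobot implementation may differ): during training the network regresses the velocity of a straight path from noise to the action chunk, and at inference that velocity field is integrated from noise to actions with Euler steps. `model` stands in for any network with the signature model(x_t, t, cond) -> velocity.

# Hedged sketch of a flow-matching objective and Euler integration;
# not the exact LeRobot implementation.
import torch

def flow_matching_loss(model, actions, cond, sigma_min=0.0):
    """actions: (B, horizon, action_dim) clean action chunks."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # t ~ U(0, 1)
    # Interpolate between noise (t=0) and data (t=1).
    x_t = (1 - (1 - sigma_min) * t) * noise + t * actions
    target_velocity = actions - (1 - sigma_min) * noise
    pred_velocity = model(x_t, t.flatten(), cond)
    return torch.nn.functional.mse_loss(pred_velocity, target_velocity)

@torch.no_grad()
def sample_actions(model, cond, horizon, action_dim, num_steps=100):
    """Integrate the learned velocity field from noise to an action chunk."""
    x = torch.randn(1, horizon, action_dim)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        x = x + dt * model(x, t, cond)  # forward Euler step
    return x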

Transformer Architecture

Adjust model capacity based on dataset size:

# Small datasets (< 100 examples)
--policy.num_layers=4 \
--policy.hidden_dim=512 \
--policy.num_heads=8  # should ideally be hidden_dim // 64

# Medium datasets (100-5k examples) - default
--policy.num_layers=6 \
--policy.hidden_dim=512 \
--policy.num_heads=8  # should ideally be hidden_dim // 64

# Large datasets (> 5k examples)
--policy.num_layers=8 \
--policy.hidden_dim=512 \
--policy.num_heads=8   # should ideally be hidden_dim // 64

Positional Encoding Options:

The model supports two positional encoding methods for action sequences:

# Rotary Position Embedding (RoPE) - default, recommended
--policy.use_rope=true \
--policy.rope_base=10000.0  # Base frequency for RoPE

# Absolute positional encoding
--policy.use_positional_encoding=true  # Disables RoPE when true
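For reference, a minimal rotary-embedding sketch with the default base frequency is shown below; the in-repo implementation may differ in pairing convention and in where the rotation is applied.

# Minimal RoPE sketch applied to query/key tensors; illustrative only.
import torch

def rope_rotate(x, base=10000.0):
    """x: (batch, seq_len, num_heads, head_dim) with even head_dim."""
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (s, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat(
        [x1 * cos[None, :, None, :] - x2 * sin[None, :, None, :],
         x1 * sin[None, :, None, :] + x2 * cos[None, :, None, :]],
        dim=-1,
    )

q = torch.randn(2, 32, 8, 64)   # (batch, horizon, heads, head_dim)
print(rope_rotate(q).shape)     # same shape, positions encoded via rotation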

Other Transformer Parameters:

--policy.dropout=0.1  # Dropout rate for DiT blocks (0.0-1.0)
--policy.timestep_embed_dim=256  # Timestep embedding dimension

Vision Encoder Configuration

# Use a different CLIP model for more expressivity at the cost of inference time;
# experiment with larger or smaller models depending on task complexity and dataset size
--policy.vision_encoder_name=openai/clip-vit-large-patch14

# Use a separate vision encoder per camera
# This may be useful when cameras have significantly different characteristics, but
# be wary of increased VRAM footprint.
--policy.use_separate_rgb_encoder_per_camera=true

# Image preprocessing
--policy.image_resize_shape=[XXX,YYY] \ # you may need to resize your images for inference speed ups
--policy.image_crop_shape=[224,224] \
--policy.image_crop_is_random=true  # Random during training, center at inference
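A rough equivalent of this preprocessing in plain torchvision is sketched below (random crop during training, center crop at inference). The shapes and the (height, width) ordering are placeholder assumptions and may not match the config's convention.

# Sketch of resize + crop preprocessing; illustrative only.
import torch
from torchvision import transforms

resize_shape, crop_shape = (240, 320), (224, 224)  # placeholder (height, width)
train_tf = transforms.Compose([transforms.Resize(resize_shape), transforms.RandomCrop(crop_shape)])
eval_tf = transforms.Compose([transforms.Resize(resize_shape), transforms.CenterCrop(crop_shape)])

frame = torch.rand(3, 480, 640)   # dummy RGB frame (C, H, W)
print(train_tf(frame).shape)      # torch.Size([3, 224, 224])
print(eval_tf(frame).shape)       # torch.Size([3, 224, 224])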

Text Encoder Configuration

# Use a different CLIP text encoder
# As with the vision encoder, experiment with larger or smaller models depending on
# task complexity and dataset size
--policy.text_encoder_name=openai/clip-vit-large-patch14

Learning Rate Configuration

The vision encoder uses a separate learning rate multiplier; 1/10th of the base learning rate is the suggested starting point:

--policy.optimizer_lr=2e-5 \
--policy.vision_encoder_lr_multiplier=0.1  # Vision encoder LR = 0.1 * optimizer_lr
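In plain PyTorch, this kind of split corresponds to optimizer parameter groups, sketched below; the module names are placeholders, not the policy's actual parameter names.

# Sketch of a separate vision-encoder learning rate via parameter groups.
import torch

policy = torch.nn.ModuleDict(
    {"vision_encoder": torch.nn.Linear(8, 8), "transformer": torch.nn.Linear(8, 8)}
)
base_lr = 2e-5
vision_params = [p for n, p in policy.named_parameters() if n.startswith("vision_encoder")]
other_params = [p for n, p in policy.named_parameters() if not n.startswith("vision_encoder")]
optimizer = torch.optim.AdamW(
    [
        {"params": other_params, "lr": base_lr},
        {"params": vision_params, "lr": 0.1 * base_lr},  # vision_encoder_lr_multiplier=0.1
    ]
)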

Training Tuning Guidelines

1. Flow Matching with Beta Sampling

The original diffusion implementation here is based on the work described in TRI’s LBM paper.

Additionally, we have implemented a flow-matching objective, which is described at a high level in Boston Dynamics’ blog post.

Consider testing the flow-matching objective and evaluating performance differences for your task:

--policy.objective=flow_matching \
--policy.timestep_sampling_strategy=beta \
--policy.timestep_sampling_alpha=1.5 \
--policy.timestep_sampling_beta=1.0 \
--policy.timestep_sampling_s=0.999

This hasn’t been shown to be a silver bullet across every use case, but it occasionally results in smoother and more consistent actions.
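If you want to see what beta-distributed timestep sampling does, one common convention is sketched below; the exact mapping of alpha, beta, and s used by the implementation may differ.

# Hedged sketch of beta-distributed flow-matching timesteps vs. uniform sampling.
import torch

def sample_timesteps_beta(batch_size, alpha=1.5, beta=1.0, s=0.999):
    tau = torch.distributions.Beta(alpha, beta).sample((batch_size,))
    return s * (1.0 - tau)  # biases sampling toward one end of the noise-to-data path

def sample_timesteps_uniform(batch_size):
    return torch.rand(batch_size)

t = sample_timesteps_beta(4096)
print(t.min().item(), t.mean().item(), t.max().item())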

2. Number of Transformer Layers

Match model capacity to your dataset size:

  • Small datasets (< 100 examples): Reduce to 4 layers
  • Large datasets (> 5k examples): Increase to 8 layers

3. horizon Tuning

The model can be sensitive to the horizon you choose. Start with around a 1 second horizon based on your control frequency:

  • 30 Hz frequency: horizon=30
  • 10 Hz frequency: horizon=10

Then experiment with increasing from there. The horizon determines how far into the future the model predicts actions.

4. n_action_steps Sensitivity

The model can also be very sensitive to n_action_steps. Start with it being around 0.8 seconds based on your control frequency and tune from there:

  • Lower values: More reactive but potentially less stable for long-horizon tasks
  • Higher values: Better for long-horizon execution, but the policy runs open-loop for longer, leaving less room to recover from mistakes
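A quick rule of thumb for both parameters, following the ~1.0 s horizon and ~0.8 s n_action_steps guidance above:

# Starting points for horizon and n_action_steps from the control frequency.
def chunking_defaults(control_hz: float) -> tuple[int, int]:
    horizon = round(1.0 * control_hz)         # ~1.0 s of predicted actions
    n_action_steps = round(0.8 * control_hz)  # ~0.8 s executed per chunk
    return horizon, n_action_steps

print(chunking_defaults(30))  # (30, 24)
print(chunking_defaults(10))  # (10, 8)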

Inference Tuning

For faster inference, use DDIM with fewer sampling steps:

--policy.noise_scheduler_type=DDIM \
--policy.num_inference_steps=10
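LeRobot's original Diffusion Policy builds on the diffusers schedulers; assuming the same holds here, the sketch below shows how the flags above map onto a DDIM scheduler that denoises in 10 steps instead of 100.

# Sketch assuming diffusers-backed schedulers; values mirror the flags above.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    num_train_timesteps=100,
    beta_schedule="squaredcos_cap_v2",
    prediction_type="epsilon",
    clip_sample=True,
    clip_sample_range=1.0,
)
scheduler.set_timesteps(num_inference_steps=10)
print(scheduler.timesteps)  # 10 denoising steps instead of 100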

Resuming Training

To resume training from a checkpoint:

lerobot-train \
  --config_path=./outputs/multitask_dit_training/checkpoints/last/pretrained_model/train_config.json \
  --resume=true

The checkpoint directory should contain model.safetensors and config.json files (saved automatically during training). When resuming, the configuration is loaded from the checkpoint, so you don’t need to specify other parameters.

Common Failure Modes and Debugging

Training these models can be finicky. Here are common failure modes and debugging approaches:

Idling / No Motion

The model may “collapse” during inference, resulting in static or no motion. This can occur when:

  1. Insufficient training data: If you only have 20-50 examples, try to roughly double your dataset size. Once you have above 300 examples, if you’re still seeing this, the task may be too complex.

  2. Multiple similar tasks: When your dataset contains multiple similar tasks (e.g., picking up 2 different objects), the model may rely too heavily on language conditioning which might not be rich enough.

Debugging tips:

  • Increase dataset size (double until you get to over 300 examples)
  • Train for longer, up to 100k steps, even when the loss flatlines
  • Check that the model is receiving the correct language instructions, or increase the diversity of instructions

Executing the Wrong Task

Sometimes the robot will completely ignore your instruction and perform some other task. This generally only happens if you have trained on multiple tasks.

Potential causes:

  • Language instruction ambiguity
  • Insufficient task-specific training data
  • Model confusion between similar tasks in the multitask dataset

Debugging tips:

  • Verify language instruction specificity, especially if descriptions are similar between multiple tasks
  • Check task distribution in your training dataset and add weighting to the failing/ignored task
  • Consider task-specific fine-tuning

Training Instability

If training loss is unstable or diverging:

  • Try adjusting learning rate between 1e-5 and 3e-4
  • Increase batch size if possible
  • Check that your dataset normalization is correct
  • Verify image preprocessing is working correctly

Performance Considerations

GPU Requirements

  • Inference: At least an RTX 5070 Ti (or equivalent GPU) is recommended for reasonable inference speed
  • Training: A GPU with enough VRAM to fit batch sizes of >64 is ideal; the exact requirement will vary depending on the number of image observations, etc.

Batch Size Recommendations

  • Minimum: 64 (less than this may result in unstable training)
  • Recommended: 256-320 (best performance, requires larger GPU)

Example: Training on Custom Dataset

Here’s a complete example training on a custom dataset:

lerobot-train \
  --dataset.repo_id=YOUR_DATASET \
  --output_dir=./outputs/multitask_dit_training \
  --batch_size=320 \
  --steps=30000 \
  --save_freq=1000 \
  --log_freq=100 \
  --eval_freq=1000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
  --policy.n_action_steps=24 \
  --policy.objective=diffusion \
  --policy.noise_scheduler_type=DDPM \
  --policy.num_layers=6 \
  --policy.hidden_dim=512 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.image_resize_shape=[320,240] \
  --policy.image_crop_shape=[224,224] \
  --policy.repo_id="HF_USER/multitask-dit-your-robot" \
  --wandb.enable=true \
  --wandb.project=multitask_dit

LIBERO Results

The command below trains Multitask DiT on the LIBERO benchmark:

python -m lerobot.scripts.lerobot_train \
  --dataset.repo_id=HuggingFaceVLA/libero \
  --policy.type=multi_task_dit \
  --policy.push_to_hub=false \
  --output_dir="./outputs/multitask_dit_libero" \
  --job_name="multitask-dit-libero" \
  --wandb.enable=true \
  --wandb.project=multitask_dit_libero \
  --dataset.image_transforms.enable=true \
  --dataset.image_transforms.max_num_transforms=4 \
  --dataset.image_transforms.tfs='{"brightness":{"type":"ColorJitter","kwargs":{"brightness":[0.75,1.25]}},"contrast":{"type":"ColorJitter","kwargs":{"contrast":[0.6,1.4]}},"saturation":{"type":"ColorJitter","kwargs":{"saturation":[0.8,1.2]}},"hue":{"type":"ColorJitter","kwargs":{"hue":[-0.05,0.05]}},"sharpness":{"type":"SharpnessJitter","kwargs":{"sharpness":[0.6,1.4]}},"rotation":{"type":"RandomRotation","kwargs":{"degrees":[-5,5]}},"translation":{"type":"RandomAffine","kwargs":{"degrees":0,"translate":[0.1,0.1]}}}' \
  --dataset.video_backend=torchcodec \
  --policy.use_amp=true \
  --policy.horizon=48 \
  --policy.n_obs_steps=2 \
  --policy.use_rope=true \
  --policy.use_positional_encoding=false \
  --policy.hidden_dim=768 \
  --policy.num_layers=8 \
  --policy.num_heads=12 \
  --policy.dropout=0.1 \
  --policy.timestep_embed_dim=256 \
  --policy.objective=diffusion \
  --policy.optimizer_lr=3e-4 \
  --policy.optimizer_weight_decay=0 \
  --policy.scheduler_warmup_steps=0 \
  --policy.vision_encoder_name=openai/clip-vit-base-patch16 \
  --policy.image_resize_shape=[256,256] \
  --policy.image_crop_is_random=true \
  --policy.text_encoder_name=openai/clip-vit-base-patch16 \
  --policy.vision_encoder_lr_multiplier=0.1 \
  --policy.device=cuda \
  --num_workers=8 \
  --save_freq=4000 \
  --log_freq=100 \
  --steps=100000 \
  --batch_size=320

Results:

LIBERO Spatial   LIBERO Object   LIBERO Goal   LIBERO 10   Average
87.0             98.2            93.8          83.2        90.6

References

For more details on the technical implementation and architecture, see TRI’s LBM paper and the Boston Dynamics flow-matching blog post referenced above.
