LeRobot documentation

Adding a New Benchmark

This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.

A benchmark in LeRobot is a set of Gymnasium environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard gym.Env interface. The lerobot-eval CLI then runs evaluation uniformly across all benchmarks.

Existing benchmarks at a glance

Before diving in, here is what is already integrated:

| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
| --- | --- | --- | --- | --- | --- |
| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |

Use src/lerobot/envs/libero.py and src/lerobot/envs/metaworld.py as reference implementations.

How it all fits together

Data flow

During evaluation, data moves through four stages:

1. gym.Env  ──→  raw observations (numpy dicts)

2. Preprocessing  ──→  standard LeRobot keys + task description
   (preprocess_observation in envs/utils.py, env.call("task_description"))

3. Processors  ──→  env-specific then policy-specific transforms
   (env_preprocessor, policy_preprocessor)

4. Policy  ──→  select_action()  ──→  action tensor
   then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step()

Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed).
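
The ordering of the stages can be illustrated with a small sketch. It is not LeRobot's evaluation loop: `run_one_step` and `tracer` are hypothetical names, every stage is passed in as a plain callable, and the task-description lookup from stage 2 is left out. The point is only the order in which the transforms are applied, including the reversed order on the action path.

```python
from typing import Any, Callable

Transform = Callable[[Any], Any]


def run_one_step(
    raw_obs: Any,
    preprocess_observation: Transform,
    env_preprocessor: Transform,
    policy_preprocessor: Transform,
    select_action: Transform,
    policy_postprocessor: Transform,
    env_postprocessor: Transform,
) -> Any:
    """Apply the stages in order and return the action destined for env.step()."""
    obs = preprocess_observation(raw_obs)   # stage 2: standard LeRobot keys
    obs = env_preprocessor(obs)             # stage 3a: env-specific transforms
    obs = policy_preprocessor(obs)          # stage 3b: policy-specific transforms
    action = select_action(obs)             # stage 4: policy inference
    action = policy_postprocessor(action)   # reverse path: policy side first
    return env_postprocessor(action)        # then env side


# Stand-in callables that only record the order in which they run.
calls: list[str] = []


def tracer(name: str) -> Transform:
    def _fn(x: Any) -> Any:
        calls.append(name)
        return x

    return _fn


stage_names = [
    "preprocess_observation", "env_preprocessor", "policy_preprocessor",
    "select_action", "policy_postprocessor", "env_postprocessor",
]
result = run_one_step({"pixels": "raw"}, *(tracer(n) for n in stage_names))
```

After the call, `calls` lists the stages in the order shown in the diagram above.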

Environment structure

make_env() returns a nested dict of vectorized environments:

dict[str, dict[int, gym.vector.VectorEnv]]
#    ^suite       ^task_id

A single-task env (e.g. PushT) looks like {"pusht": {0: vec_env}}. A multi-task benchmark (e.g. LIBERO) looks like {"libero_spatial": {0: vec0, 1: vec1, ...}, ...}.
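
As a rough illustration of how this nested structure is consumed, the sketch below flattens it into `(suite, task_id, vec_env)` triples. `iter_tasks` is a hypothetical helper, not a LeRobot function, and plain strings stand in for the vectorized environments.

```python
from typing import Any, Iterator


def iter_tasks(envs: dict[str, dict[int, Any]]) -> Iterator[tuple[str, int, Any]]:
    """Yield (suite, task_id, vec_env) in a deterministic order."""
    for suite in sorted(envs):
        for task_id in sorted(envs[suite]):
            yield suite, task_id, envs[suite][task_id]


# Strings stand in for gym.vector.VectorEnv instances.
single_task = {"pusht": {0: "vec_env"}}
multi_task = {
    "libero_spatial": {0: "vec0", 1: "vec1"},
    "libero_object": {0: "vec2"},
}

flat_single = list(iter_tasks(single_task))
flat_multi = list(iter_tasks(multi_task))
```

Sorting is a choice made for this sketch so that results are reproducible; the real evaluation loop may iterate in insertion order.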

How evaluation runs

All benchmarks are evaluated the same way by lerobot-eval:

1. make_env() builds the nested {suite: {task_id: VectorEnv}} dict.
2. eval_policy_all() iterates over every suite and task.
3. For each task, it runs n_episodes rollouts via rollout().
4. Results are aggregated hierarchically: episode, task, suite, overall.
5. Metrics include pc_success (success rate), avg_sum_reward, and avg_max_reward.
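
The hierarchical aggregation can be sketched as follows. This is an assumption-laden illustration, not the code in lerobot-eval: the record fields (`success`, `sum_reward`, `max_reward`), the helper names, the choice to express pc_success as a percentage, and the choice to pool episodes at each level are all made up for the example.

```python
from collections import defaultdict
from statistics import mean


def summarize(episodes: list[dict]) -> dict[str, float]:
    """Compute the three metrics over a list of per-episode records."""
    return {
        "pc_success": 100.0 * mean(1.0 if e["success"] else 0.0 for e in episodes),
        "avg_sum_reward": mean(e["sum_reward"] for e in episodes),
        "avg_max_reward": mean(e["max_reward"] for e in episodes),
    }


def aggregate(episodes: list[dict]) -> dict:
    """Group episodes by task and by suite, then summarize each level."""
    per_task = defaultdict(list)
    per_suite = defaultdict(list)
    for e in episodes:
        per_task[(e["suite"], e["task_id"])].append(e)
        per_suite[e["suite"]].append(e)
    return {
        "per_task": {k: summarize(v) for k, v in per_task.items()},
        "per_suite": {k: summarize(v) for k, v in per_suite.items()},
        "overall": summarize(episodes),
    }


episodes = [
    {"suite": "suite_a", "task_id": 0, "success": True, "sum_reward": 2.0, "max_reward": 1.0},
    {"suite": "suite_a", "task_id": 0, "success": False, "sum_reward": 0.0, "max_reward": 0.0},
    {"suite": "suite_a", "task_id": 1, "success": True, "sum_reward": 4.0, "max_reward": 1.0},
    {"suite": "suite_b", "task_id": 0, "success": True, "sum_reward": 2.0, "max_reward": 1.0},
]
report = aggregate(episodes)
```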

The critical piece: your env must return info["is_success"] on every step() call. This is how the eval loop knows whether a task was completed.

What your environment must provide

LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.

Env attributes

Your gym.Env must set these attributes:

| Attribute | Type | Why |
| --- | --- | --- |
| `_max_episode_steps` | `int` | `rollout()` uses this to cap episode length |
| `task_description` | `str` | Passed to VLA policies as a language instruction |
| `task` | `str` | Fallback identifier if `task_description` is not set |

Success reporting

Your step() and reset() must include "is_success" in the info dict:

info = {"is_success": True}   # or False
return observation, reward, terminated, truncated, info
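
A small check along these lines can catch a missing key early. It is a sketch only: `check_info`, `check_reset_result`, and `check_step_result` are hypothetical helpers, and the strict `bool` type check is an assumption that you may need to relax if your simulator reports success as a NumPy boolean.

```python
from typing import Any


def check_info(info: Any, source: str) -> None:
    """Raise if the info dict does not report success as a boolean."""
    if not isinstance(info, dict) or "is_success" not in info:
        raise KeyError(f'{source}() must return an info dict containing "is_success"')
    if not isinstance(info["is_success"], bool):
        raise TypeError(f'{source}(): info["is_success"] must be a bool')


def check_reset_result(result: tuple) -> None:
    """reset() returns (observation, info)."""
    _obs, info = result
    check_info(info, "reset")


def check_step_result(result: tuple) -> None:
    """step() returns (observation, reward, terminated, truncated, info)."""
    _obs, _reward, _terminated, _truncated, info = result
    check_info(info, "step")


# Both calls pass silently because the info dicts are well formed.
check_reset_result(({"pixels": None}, {"is_success": False}))
check_step_result(({"pixels": None}, 0.0, False, False, {"is_success": True}))
```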

Observations

The simplest approach is to map your simulator’s outputs to the standard keys that preprocess_observation() already understands. Do this inside your gym.Env (e.g. in a _format_raw_obs() helper):

| Your env should output | LeRobot maps it to | What it is |
| --- | --- | --- |
| `"pixels"` (single array) | `observation.image` | Single camera image, HWC uint8 |
| `"pixels"` (dict) | `observation.images.<cam>` | Multiple cameras, each HWC uint8 |
| `"agent_pos"` | `observation.state` | Proprioceptive state vector |
| `"environment_state"` | `observation.env_state` | Full environment state (e.g. PushT) |
| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g. LIBERO) |

If your simulator uses different key names, you have two options:

- Recommended: Rename them to the standard keys inside your gym.Env wrapper.
- Alternative: Write an env processor to transform observations after preprocess_observation() runs (see step 3 below).
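
The recommended option might look like the sketch below. The simulator keys (`front_rgb`, `wrist_rgb`, `qpos`) and the function signature are invented for illustration, and strings stand in for the HWC uint8 image arrays; only the output keys `"pixels"` and `"agent_pos"` come from the table above.

```python
from typing import Any


def format_raw_obs(
    raw: dict[str, Any],
    camera_keys: dict[str, str],
    state_key: str,
) -> dict[str, Any]:
    """Rename simulator outputs to the "pixels" / "agent_pos" convention.

    camera_keys maps a simulator image key to the camera name to expose.
    """
    pixels = {cam: raw[sim_key] for sim_key, cam in camera_keys.items()}
    return {"pixels": pixels, "agent_pos": raw[state_key]}


# Hypothetical simulator output; strings stand in for HWC uint8 arrays.
raw = {"front_rgb": "front-image", "wrist_rgb": "wrist-image", "qpos": [0.1, 0.2]}
obs = format_raw_obs(
    raw,
    camera_keys={"front_rgb": "front", "wrist_rgb": "wrist"},
    state_key="qpos",
)
```

With a dict under `"pixels"`, each camera would end up under `observation.images.<cam>` according to the mapping table.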

Actions

Actions are continuous numpy arrays in a gym.spaces.Box. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their input_features / output_features config.

Feature declaration

Each EnvConfig subclass declares two dicts that tell the policy what to expect:

- features: maps feature names to PolicyFeature(type, shape) (e.g. action dim, image shape).
- features_map: maps raw observation keys to LeRobot convention keys (e.g. "agent_pos" to "observation.state").
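
To make the relationship between the two dicts concrete, here is a sketch that re-keys declared features through the map. `FeatureSpec` is a hypothetical stand-in for PolicyFeature, the type strings and shapes are illustrative, the literal key `"action"` is assumed to match the ACTION constant, and `to_policy_features` is not a LeRobot function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical stand-in for PolicyFeature(type, shape)."""

    type: str
    shape: tuple[int, ...]


# Keys are raw env keys; values in features_map are LeRobot convention keys.
features = {
    "action": FeatureSpec(type="ACTION", shape=(4,)),
    "agent_pos": FeatureSpec(type="STATE", shape=(4,)),
    "pixels": FeatureSpec(type="VISUAL", shape=(480, 480, 3)),
}
features_map = {
    "action": "action",
    "agent_pos": "observation.state",
    "pixels": "observation.image",
}


def to_policy_features(
    features: dict[str, FeatureSpec], features_map: dict[str, str]
) -> dict[str, FeatureSpec]:
    """Re-key the declared features using features_map."""
    missing = set(features) - set(features_map)
    if missing:
        raise KeyError(f"features_map has no entry for: {sorted(missing)}")
    return {features_map[raw_key]: spec for raw_key, spec in features.items()}


policy_features = to_policy_features(features, features_map)
```

The sketch covers only the renaming; any shape conversion the real pipeline performs is out of scope here.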

Step by step

At minimum, you need two files: a **gym.Env wrapper** and an **EnvConfig subclass** with a `create_envs()` override. Everything else is optional or documentation. No changes to `factory.py` are needed.

Checklist

| File | Required | Why |
| --- | --- | --- |
| `src/lerobot/envs/<benchmark>.py` | Yes | Wraps the simulator as a standard gym.Env |
| `src/lerobot/envs/configs.py` | Yes | Registers your benchmark and its `create_envs()` for the CLI |
| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms |
| `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys |
| `pyproject.toml` | Yes | Declares benchmark-specific dependencies |
| `docs/source/<benchmark>.mdx` | Yes | User-facing documentation page |
| `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar |

1. The gym.Env wrapper (src/lerobot/envs/<benchmark>.py)

Create a gym.Env subclass that wraps the third-party simulator:

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MyBenchmarkEnv(gym.Env):
    metadata = {"render_modes": ["rgb_array"], "render_fps": <fps>}

    def __init__(self, task_suite, task_id, ...):
        super().__init__()
        self.task = <task_name_string>
        self.task_description = <natural_language_instruction>
        self._max_episode_steps = <max_steps>
        self.observation_space = spaces.Dict({...})
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        super().reset(seed=seed)  # seeds self.np_random, per the Gymnasium API
        ...  # return (observation, info); info must contain {"is_success": False}

    def step(self, action: np.ndarray):
        ...  # return (obs, reward, terminated, truncated, info); info must contain {"is_success": <bool>}

    def render(self):
        ...  # return RGB image as numpy array

    def close(self):
        ...  # release simulator resources

GPU-based simulators (e.g. MuJoCo with EGL rendering): If your simulator allocates GPU/EGL contexts during __init__, defer that allocation to a _ensure_env() helper called on first reset()/step(). This avoids inheriting stale GPU handles when AsyncVectorEnv spawns worker processes. See LiberoEnv._ensure_env() for the pattern.
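
The deferred-allocation pattern can be sketched without any simulator. `FakeSimulator` and `LazyEnv` below are hypothetical stand-ins, not the LiberoEnv code; the class counter exists only to show that nothing is constructed until the first reset().

```python
class FakeSimulator:
    """Hypothetical stand-in for a simulator that allocates a GPU/EGL context."""

    instances_created = 0

    def __init__(self) -> None:
        FakeSimulator.instances_created += 1

    def reset(self) -> dict:
        return {"pixels": None}


class LazyEnv:
    def __init__(self) -> None:
        # Only store configuration here; do not touch the simulator yet.
        self._env = None

    def _ensure_env(self) -> None:
        """Create the simulator on first use, inside the process that uses it."""
        if self._env is None:
            self._env = FakeSimulator()

    def reset(self) -> tuple[dict, dict]:
        self._ensure_env()
        return self._env.reset(), {"is_success": False}


env = LazyEnv()
created_after_init = FakeSimulator.instances_created
env.reset()
env.reset()
created_after_resets = FakeSimulator.instances_created
```

Constructing `LazyEnv` creates no simulator, and repeated resets reuse the single instance created on first use. A real wrapper would call `_ensure_env()` from step() as well.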

Also provide a factory function that returns the nested dict structure:

from typing import Any


def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    gym_kwargs: dict | None = None,
    env_cls: type | None = None,
) -> dict[str, dict[int, Any]]:
    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
    ...

See create_libero_envs() (multi-suite, multi-task) and create_metaworld_envs() (difficulty-grouped tasks) for reference.
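
One possible shape for such a factory is sketched below. Everything specific in it is an assumption: the `TASKS` table, the comma-separated suite syntax for `task`, and the `make_vec_env` callback (which stands in for whatever builds a vectorized env) are invented for illustration and do not mirror the LIBERO or Meta-World factories.

```python
from typing import Any, Callable

# Hypothetical task table: suite name -> task names, indexed by task_id.
TASKS: dict[str, list[str]] = {
    "suite_a": ["open_drawer", "close_drawer"],
    "suite_b": ["pick_cube"],
}


def create_envs_sketch(
    task: str,
    n_envs: int,
    make_vec_env: Callable[[str, int, int], Any],
) -> dict[str, dict[int, Any]]:
    """Build {suite_name: {task_id: vec_env}} for a comma-separated suite list."""
    suites = [name.strip() for name in task.split(",") if name.strip()]
    unknown = [name for name in suites if name not in TASKS]
    if unknown:
        raise ValueError(f"Unknown suite(s): {unknown}")
    return {
        suite: {
            task_id: make_vec_env(suite, task_id, n_envs)
            for task_id in range(len(TASKS[suite]))
        }
        for suite in suites
    }


# A tuple stands in for the vectorized env that a real factory would build.
envs = create_envs_sketch(
    "suite_a, suite_b",
    n_envs=2,
    make_vec_env=lambda suite, task_id, n: (suite, task_id, n),
)
```

Failing fast on unknown suite names keeps typos on the command line from silently producing an empty evaluation.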

2. The config (src/lerobot/envs/configs.py)

Register a config dataclass so users can select your benchmark with --env.type=<name>. Each config owns its environment creation and processor logic via two methods:

- create_envs(n_envs, use_async_envs): returns {suite: {task_id: VectorEnv}}. The base class default uses gym.make() for single-task envs. Multi-task benchmarks override this.
- get_env_processors(): returns (preprocessor, postprocessor). The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.

@EnvConfig.register_subclass("<benchmark_name>")
@dataclass
class MyBenchmarkEnvConfig(EnvConfig):
    task: str = "<default_task>"
    fps: int = <fps>
    obs_type: str = "pixels_agent_pos"
    render_mode: str = "rgb_array"  # read by gym_kwargs below

    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
    })
    features_map: dict[str, str] = field(default_factory=lambda: {
        ACTION: ACTION,
        "agent_pos": OBS_STATE,
        "pixels": OBS_IMAGE,
    })

    def __post_init__(self):
        ...  # populate features based on obs_type

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}

    def create_envs(self, n_envs: int, use_async_envs: bool = True):
        """Override for multi-task benchmarks or custom env creation."""
        from lerobot.envs.<benchmark> import create_<benchmark>_envs
        return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)

    def get_env_processors(self):
        """Override if your benchmark needs observation/action transforms."""
        from lerobot.processor import PolicyProcessorPipeline
        from lerobot.processor.env_processor import MyBenchmarkProcessorStep
        return (
            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
            PolicyProcessorPipeline(steps=[]),
        )

Key points:

- The register_subclass name is what users pass on the CLI (--env.type=<name>).
- features tells the policy what the environment produces.
- features_map maps raw observation keys to LeRobot convention keys.
- No changes to factory.py are needed: the factory delegates to cfg.create_envs() and cfg.get_env_processors() automatically.

3. Env processor (optional: src/lerobot/processor/env_processor.py)

Only needed if your benchmark requires observation transforms beyond what preprocess_observation() handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from get_env_processors() in your config (see step 2):

@dataclass
@ProcessorStepRegistry.register(name="<benchmark>_processor")
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # your transforms here
        return processed

    def transform_features(self, features):
        return features  # update if shapes change

    def observation(self, observation):
        return self._process_observation(observation)

See LiberoProcessorStep for a full example (image rotation, quaternion-to-axis-angle conversion).

4. Dependencies (pyproject.toml)

Add a new optional-dependency group:

mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]

Pinning rules:

- Always pin benchmark packages to exact versions for reproducibility (e.g. metaworld==3.0.0).
- Add platform markers when needed (e.g. ; sys_platform == 'linux').
- Pin fragile transitive deps if known (e.g. gymnasium==1.1.0 for Meta-World).
- Document constraints in your benchmark doc page.
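
If you want to check the pinning rule mechanically, a rough sketch such as the following could work. The regular expression is a simplification rather than a full requirement-specifier parser, `unpinned` is a hypothetical helper, and `my-sim-pkg` is an invented package name.

```python
import re

# Matches "name==version" with optional extras and an optional environment marker.
_EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^\s;,*]+\s*(;.*)?$")


def unpinned(requirements: list[str]) -> list[str]:
    """Return the requirement strings that are not pinned with '=='."""
    return [req for req in requirements if not _EXACT_PIN.match(req.strip())]


group = [
    "metaworld==3.0.0",
    "gymnasium==1.1.0",
    "my-benchmark-pkg>=1.2",  # a range, not an exact pin
    "my-sim-pkg==0.4.1 ; sys_platform == 'linux'",
]
loose = unpinned(group)
```

Self-referencing extras such as "lerobot[scipy-dep]" carry no version and would be reported too, so exclude them before running a check like this.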

Users install with:

pip install -e ".[mybenchmark]"

5. Documentation (docs/source/<benchmark>.mdx)

Write a user-facing page following the template in the next section. See docs/source/libero.mdx and docs/source/metaworld.mdx for full examples.

6. Table of contents (docs/source/_toctree.yml)

Add your benchmark to the “Benchmarks” section:

- sections:
    - local: libero
      title: LIBERO
    - local: metaworld
      title: Meta-World
    - local: envhub_isaaclab_arena
      title: NVIDIA IsaacLab Arena Environments
    - local: <your_benchmark>
      title: <Your Benchmark Name>
  title: "Benchmarks"

Verifying your integration

After completing the steps above, confirm that everything works:

1. Install: run pip install -e ".[mybenchmark]" and verify that the dependency group installs cleanly.
2. Smoke-test env creation: call make_env() with your config in Python, check that the returned dict has the expected {suite: {task_id: VectorEnv}} shape, and check that reset() returns observations with the right keys.
3. Run a full eval: run lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --policy.path=<any_compatible_policy> to exercise the full pipeline end-to-end. (batch_size defaults to auto-tuning based on CPU cores; pass --eval.batch_size=1 to force a single environment.)
4. Check success detection: verify that info["is_success"] flips to True when the task is actually completed. This is what the eval loop uses to compute success rates.

Writing a benchmark doc page

Each benchmark .mdx page should include:

- Title and description: 1-2 paragraphs on what the benchmark tests and why it matters.
- Links: paper, GitHub repo, project website (if available).
- Overview image or GIF.
- Available tasks: table of task suites with counts and brief descriptions.
- Installation: pip install -e ".[<benchmark>]" plus any extra steps (env vars, system packages).
- Evaluation: recommended lerobot-eval command with n_episodes for reproducible results. batch_size defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable.
- Policy inputs and outputs: observation keys with shapes, action space description.
- Recommended evaluation episodes: how many episodes per task is standard.
- Training: example lerobot-train command.
- Reproducing published results: link to pretrained model, eval command, results table (if available).

See docs/source/libero.mdx and docs/source/metaworld.mdx for complete examples.
