Isaac Lab Reinforcement Learning: A Robot-Training Tutorial

Every robotics team eventually discovers the same uncomfortable truth: when a policy looks perfect in simulation and then falls apart the moment it touches real hardware, the cause is almost never the algorithm. The PPO update rule did not fail you. The failure trace almost always points back to reward shaping that was too narrow, or domain randomization that was too thin: a mismatch between the distribution of worlds your policy saw during training and the distribution it encountered on the floor. Isaac Lab reinforcement learning is built specifically to address those variables at scale, running thousands of physics-accurate parallel environments on a single GPU so you can iterate on reward functions and randomization ranges fast enough to actually close that gap.

This tutorial walks through Isaac Lab end to end: where it sits in the NVIDIA simulation stack, how to stand up an environment using either of its two design workflows, how to wire in an RL library, how to design rewards and randomization schedules that produce robust policies, and how to get those policies off the GPU and onto a robot. Along the way we make the thesis explicit: most sim-to-real failures are reward-shaping and randomization problems, not algorithm problems. Code snippets are illustrative and labeled as such; exact API surface may shift across minor releases.

What this covers: Isaac Lab’s position in the simulation stack; installation and project layout; Direct vs Manager-based environment design with code; reward function construction and failure modes; domain randomization with the EventManager; the RSL-RL and skrl training loops; ONNX export and real-robot deployment; trade-offs and a practical pre-deployment checklist.


Context: Isaac Lab vs Isaac Sim vs Isaac Gym

Understanding where Isaac Lab sits in NVIDIA’s simulation ecosystem saves a significant amount of confusion when reading documentation written at different points in the stack’s evolution.

Isaac Sim is the full Omniverse-based physics simulation application. It provides photorealistic rendering, USD-based scene description, the PhysX 5 physics engine, and a Python scripting layer. Isaac Sim is the runtime that Isaac Lab runs on top of — it handles GPU-accelerated rigid body simulation, articulation physics, sensor simulation (cameras, lidar, IMUs), and USD asset loading.

Isaac Lab is a Python framework that lives above Isaac Sim. It provides the abstractions you actually interact with when training robot policies: environment classes, task registries, observation and action managers, reward computation, event scheduling for domain randomization, and built-in wrappers for multiple RL libraries. Isaac Lab is the successor to two earlier frameworks: the original Isaac Gym (a standalone OpenAI-Gym-compatible simulator that predated Omniverse and is now deprecated) and OmniIsaacGymEnvs (the transitional Omniverse-based gym that bridged Isaac Gym and Isaac Lab). Both legacy frameworks are no longer receiving active development. Isaac Lab consolidates their functionality with cleaner abstractions and full Isaac Sim integration. If you find Stack Overflow answers or GitHub issues referencing omni.isaac.gym, VecEnvBase, or OmniIsaacGymEnvs, assume they are describing the deprecated path.

The practical implication is that your entry point is always Isaac Lab’s isaaclab.sh launcher script (or isaaclab.bat on Windows), not bare Isaac Sim. The launcher manages the Python environment, Isaac Sim version pinning, and extension loading. You write Python that imports from isaaclab.* and never interact with the Omniverse Kit application layer directly unless you are building a custom extension.

The relationship is worth visualizing. Isaac Sim provides the physics and rendering substrate. Isaac Lab provides the RL scaffolding. Your task code — the environment definition, reward function, and training config — sits on top of Isaac Lab.


System Overview and Setup

Hardware and software prerequisites

Isaac Lab requires an NVIDIA GPU with Ampere architecture or newer (RTX 30xx / A-series or better) and CUDA 12.x. A single RTX 4090 can run approximately 4,096 parallel environments for a locomotion task; an A100 or H100 scales that further. System RAM should be at least 32 GB because Isaac Sim loads USD assets into CPU memory before staging them on the GPU. Ubuntu 22.04 LTS is the primary supported host OS; Windows support exists but is less tested in CI.

Installation

Isaac Lab ships as a Git repository. The canonical installation path (illustrative; pin to the release tag matching your Isaac Sim version):

# Illustrative — check the Isaac Lab GitHub for the exact release tag
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab

# Install Isaac Sim as a pip package (requires NVIDIA Python index)
pip install isaacsim --extra-index-url https://pypi.nvidia.com

# Install Isaac Lab in editable mode
./isaaclab.sh --install

The --install flag installs Isaac Lab’s Python package and all RL library extensions (rsl_rl, skrl, RL-Games, Stable Baselines3 wrappers). After installation, all training commands are invoked through isaaclab.sh -p (which invokes the Isaac-Sim-aware Python interpreter) rather than bare python.

Project layout

A clean Isaac Lab project follows this structure:

my_robot_task/
├── my_robot_task/
│   ├── __init__.py
│   ├── envs/
│   │   ├── __init__.py
│   │   ├── my_task_env.py          # Environment class
│   │   └── my_task_env_cfg.py      # Environment config dataclass
│   └── tasks/
│       └── __init__.py             # Task registry entry point
├── scripts/
│   └── train.py                    # Training entry point
├── configs/
│   └── rsl_rl_ppo_cfg.py           # RL algorithm config
└── setup.py

The separation between env.py (runtime logic) and env_cfg.py (dataclass configuration) is a hard convention in Isaac Lab. Configuration lives in Python dataclasses, not YAML, which enables type checking and programmatic override without parsing overhead.
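
To make the "programmatic override" point concrete, here is a minimal sketch that uses the standard library's dataclasses as a stand-in for Isaac Lab's configclass. The class and field names (TaskCfg, RewardWeights) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical reward weights, typed so a checker can catch mistakes
    track_lin_vel_xy: float = 1.0
    joint_torques: float = -1e-4


@dataclass(frozen=True)
class TaskCfg:
    # Hypothetical task config; stands in for an Isaac Lab env config
    num_envs: int = 4096
    episode_length_s: float = 20.0
    rewards: RewardWeights = field(default_factory=RewardWeights)


base_cfg = TaskCfg()

# Programmatic override: derive a small debug variant without touching
# the base config and without any YAML parsing step.
debug_cfg = replace(
    base_cfg,
    num_envs=64,
    rewards=replace(base_cfg.rewards, joint_torques=-1e-3),
)

print(base_cfg)
print(debug_cfg)
```

Because the override is ordinary Python, a sweep over reward weights becomes a loop that builds config objects rather than a set of hand-edited files.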

The Isaac Lab training pipeline

Figure 1: The Isaac Lab training pipeline. Isaac Sim drives the physics; Isaac Lab vectorizes environments and buffers; the RL algorithm consumes observations and produces actions; checkpoints are saved for evaluation and deployment.


Building the Environment and Reward

Two design workflows: Direct and Manager-based

Isaac Lab offers two philosophically different ways to write an environment. The choice affects how much of the boilerplate Isaac Lab handles for you.

Manager-based environments decompose the environment into named subsystems — an ObservationManager, ActionManager, RewardManager, TerminationManager, and EventManager — each configured via dataclass entries. The framework iterates over all registered terms at each step, accumulates reward components, applies action transformations, and checks termination conditions automatically. Manager-based environments are verbose at the configuration layer but are nearly boilerplate-free at the runtime layer. They are the recommended approach for new tasks that fit standard manipulation or locomotion patterns.

Direct environments give you a single Python class where you implement _get_observations(), _get_rewards(), _get_dones(), and _reset_idx() explicitly. There is no automatic term iteration. This is closer to how Isaac Gym worked and is better suited to tasks with unusual observation structures, hybrid physics modes, or bespoke termination logic that does not map cleanly onto the manager abstraction.

The following is an illustrative skeleton of each style. Neither is a drop-in runnable snippet; they show the structural difference.

Manager-based environment config (illustrative)

# my_task_env_cfg.py — illustrative
import isaaclab.envs as envs_cfg
from isaaclab.utils import configclass
from isaaclab.managers import (
    ObservationGroupCfg, ObservationTermCfg,
    RewardTermCfg, TerminationTermCfg, EventTermCfg,
    SceneEntityCfg,  # used by the termination term below
)
import my_robot_task.envs.mdp as mdp

@configclass
class MyTaskEnvCfg(envs_cfg.ManagerBasedRLEnvCfg):
    # --- scene ---
    # MySceneCfg is your own InteractiveSceneCfg subclass (robot, terrain, sensors), defined elsewhere
    scene: MySceneCfg = MySceneCfg(num_envs=4096, env_spacing=2.5)

    # --- observations ---
    @configclass
    class ObservationsCfg:
        @configclass
        class PolicyObs(ObservationGroupCfg):
            joint_pos = ObservationTermCfg(func=mdp.joint_pos_rel)
            joint_vel = ObservationTermCfg(func=mdp.joint_vel_rel)
            base_lin_vel = ObservationTermCfg(func=mdp.base_lin_vel)
            base_ang_vel = ObservationTermCfg(func=mdp.base_ang_vel)
            projected_gravity = ObservationTermCfg(func=mdp.projected_gravity)
            velocity_command = ObservationTermCfg(func=mdp.generated_commands,
                                                  params={"command_name": "base_velocity"})
        policy: PolicyObs = PolicyObs()
    observations: ObservationsCfg = ObservationsCfg()

    # --- rewards ---
    @configclass
    class RewardsCfg:
        # Positive: track commanded velocity
        track_lin_vel_xy = RewardTermCfg(
            func=mdp.track_lin_vel_xy_exp,
            weight=1.0,
            params={"std": 0.25},
        )
        # Negative: penalise jerky actions and joint torques
        action_rate = RewardTermCfg(func=mdp.action_rate_l2, weight=-0.01)
        joint_torques = RewardTermCfg(func=mdp.joint_torques_l2, weight=-1e-4)
        # Negative: penalise base tilt (deviation from a flat orientation)
        flat_orientation = RewardTermCfg(func=mdp.flat_orientation_l2, weight=-0.5)
    rewards: RewardsCfg = RewardsCfg()

    # --- terminations ---
    @configclass
    class TerminationsCfg:
        time_out = TerminationTermCfg(func=mdp.time_out, time_out=True)
        base_contact = TerminationTermCfg(
            func=mdp.illegal_contact,
            params={"sensor_cfg": SceneEntityCfg("contact_forces", body_names="base"),
                    "threshold": 1.0},
        )
    terminations: TerminationsCfg = TerminationsCfg()

    # episode length
    episode_length_s: float = 20.0

Direct environment (illustrative)

# my_task_direct_env.py — illustrative
import torch
from isaaclab.envs import DirectRLEnv, DirectRLEnvCfg

class MyDirectTaskEnv(DirectRLEnv):
    cfg: DirectRLEnvCfg  # typed reference to config

    def _setup_scene(self) -> None:
        # Load articulation, ground plane, sensors
        ...

    def _get_observations(self) -> dict:
        obs = torch.cat([
            self._robot.data.joint_pos,
            self._robot.data.joint_vel,
            self._robot.data.root_lin_vel_b,
            self._robot.data.root_ang_vel_b,
        ], dim=-1)
        return {"policy": obs}

    def _get_rewards(self) -> torch.Tensor:
        # All operations are batched over num_envs
        vel_error = self._vel_command - self._robot.data.root_lin_vel_b[..., :2]
        vel_reward = torch.exp(-torch.sum(vel_error ** 2, dim=-1) / 0.25)
        torque_penalty = torch.sum(self._robot.data.applied_torque ** 2, dim=-1)
        return vel_reward - 1e-4 * torque_penalty

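    def _get_dones(self) -> tuple[torch.Tensor, torch.Tensor]:
        timed_out = self.episode_length_buf >= self.max_episode_length
        # net_forces_w_history has shape (num_envs, history, bodies, 3).
        # Reduce to force magnitude, then take the peak over the history window.
        # Assumes the sensor only tracks bodies that must not touch the ground.
        forces = self._contact_sensor.data.net_forces_w_history
        peak_force = torch.norm(forces, dim=-1).amax(dim=1)  # (num_envs, bodies)
        terminated = (peak_force > 1.0).any(dim=-1)          # (num_envs,)
        return terminated, timed_out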

    def _reset_idx(self, env_ids: torch.Tensor) -> None:
        super()._reset_idx(env_ids)
        # Re-sample initial joint poses for specified envs
        ...

Reward design: where sim-to-real is won or lost

Reward design is not a footnote in Isaac Lab tutorials — it is the primary engineering decision. The thesis of this post is that most sim-to-real failures stem from reward shaping and randomization errors rather than from choosing the wrong RL algorithm. Here is why that claim holds up.

When you write a velocity-tracking reward as a simple L2 error (-||v_cmd - v_actual||²), the policy learns to minimize that exact signal. On a perfectly modelled simulator, this produces fine-looking locomotion. But on real hardware, where motor models are imperfect and ground contact dynamics differ from the simulator, the policy has learned to exploit simulation-specific characteristics rather than to solve the underlying locomotion problem. The fix is not to switch from PPO to SAC — it is to shape the reward so that the policy cannot exploit simulation artifacts.

Practical reward design principles, each grounded in failure modes observed across sim-to-real deployments:

Use exponential kernels instead of unbounded error penalties. A term like exp(-||error||² / σ²) is bounded in (0, 1]: it peaks at 1 when the error is zero and decays smoothly toward 0 as the error grows, so it can never swamp the other reward terms. Linear or quadratic penalties grow without limit far from the target, which lets a single term dominate the total reward early in training and can make early termination look attractive. The trade-off is that the kernel's gradient fades once the error is several σ wide, so σ, which controls the width of the “success zone”, is a hyperparameter worth tuning.
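
The two reward shapes are easy to compare numerically. The sketch below is plain Python with hypothetical helper names; it evaluates a scalar error rather than the batched tensors an environment would use, and the σ values are arbitrary examples.

```python
import math


def exp_kernel(error: float, sigma: float) -> float:
    """Bounded tracking reward: 1.0 at zero error, decaying toward 0."""
    return math.exp(-(error ** 2) / (sigma ** 2))


def quadratic_penalty(error: float) -> float:
    """Unbounded penalty: 0.0 at zero error, growing without limit."""
    return -(error ** 2)


errors = [0.0, 0.25, 0.5, 1.0, 2.0]  # e.g. velocity tracking error in m/s

for sigma in (0.25, 0.5):
    values = [round(exp_kernel(e, sigma), 3) for e in errors]
    print(f"exp kernel, sigma={sigma}: {values}")

print("quadratic penalty:", [quadratic_penalty(e) for e in errors])
```

Widening σ raises the reward at every nonzero error, which is the "success zone" knob described above; the quadratic penalty has no such bound.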

Penalise mechanism cost explicitly. Joint torques, action rate (the L2 norm of successive action differences), and base angular acceleration are cheap to compute and expensive to ignore. A policy that reaches the goal with high torques and jerky actions will fail on hardware where actuator bandwidth is limited. Adding weight * sum(torque²) with a small negative weight (magnitude typically 1e-4 to 1e-3) forces the policy to find efficient trajectories.

Separate task reward from style reward. Stack reward components in two groups: task-completion terms (velocity tracking, goal reaching) that define success, and style terms (smoothness, energy, posture) that define how success is achieved. Scale the task group so it dominates early training; let the style group tighten the policy as training progresses. Mixing them at fixed weights from step zero causes the policy to sacrifice task progress for style compliance.
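
One way to express "task first, style later" is a schedule that scales the whole style group. The sketch below is an assumption, not an Isaac Lab feature: the linear ramp, the floor value, and the iteration boundaries are hypothetical choices.

```python
def style_scale(iteration: int, floor: float = 0.1,
                ramp_start: int = 500, ramp_end: int = 1500) -> float:
    """Scale for the style group: small early, ramping linearly to 1.0."""
    if iteration <= ramp_start:
        return floor
    if iteration >= ramp_end:
        return 1.0
    progress = (iteration - ramp_start) / (ramp_end - ramp_start)
    return floor + (1.0 - floor) * progress


def total_reward(task_terms: dict, style_terms: dict, iteration: int) -> float:
    """Task terms count in full; style terms are scaled by the schedule."""
    task = sum(task_terms.values())
    style = sum(style_terms.values())
    return task + style_scale(iteration) * style


task = {"track_lin_vel_xy": 0.8}
style = {"joint_torques": -0.15, "action_rate": -0.05}
for it in (0, 1000, 2000):
    print(it, round(style_scale(it), 2), round(total_reward(task, style, it), 3))
```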

Avoid terminal bonuses early. A large reward spike at episode success is a sparse signal that an untrained policy almost never reaches, and when it does, the spike can reinforce whatever pathological shortcut happened to trigger it. Prefer dense shaping for the first several million environment steps, then introduce sparse bonuses once the task-completion rate exceeds a threshold.

Curriculum over time limits. Starting with short episodes (2–4 seconds) and extending the episode length as the policy improves produces faster initial learning and prevents the policy from learning to “give up” — padding returns with time-out behavior rather than succeeding.

The reward architecture lives entirely in the RewardsCfg dataclass (manager-based) or the _get_rewards() method (direct). Changing a weight requires editing a single number in the config and rerunning training — there is no need to retrain from scratch if you keep a checkpoint.

The RL interaction loop

Figure 2: The RL training interaction loop. The policy network consumes the observation vector, produces an action, the physics engine steps, and the environment computes reward and checks termination. All operations are GPU-batched across the full environment count.


Training, Domain Randomization, and Sim-to-Real

Connecting an RL library: RSL-RL and skrl

Isaac Lab ships first-party wrappers for four RL libraries: RSL-RL (the library used in ETH Zurich’s ANYmal locomotion work and widely used in quadruped research), RL-Games (the library that powers NVIDIA’s internal benchmarks), skrl (a modular library designed for flexibility and readability), and Stable Baselines3 (CPU-based; used mainly for comparison). The wrapper converts the Isaac Lab VecEnv-compatible interface into whatever format the RL library expects.

Training with RSL-RL (illustrative)

# Illustrative launch command — adapt paths to your project
./isaaclab.sh -p scripts/train.py \
    --task MyRobotTask \
    --num_envs 4096 \
    --headless \
    --logger tensorboard

The RSL-RL config is a Python dataclass. An illustrative PPO configuration:

# configs/rsl_rl_ppo_cfg.py — illustrative
from isaaclab.utils import configclass

@configclass
class MyTaskPPOCfg:
    seed: int = 42
    device: str = "cuda:0"
    num_steps_per_env: int = 24        # horizon per rollout
    max_iterations: int = 3000         # total training iterations
    save_interval: int = 100
    experiment_name: str = "my_robot_task"
    run_name: str = ""
    logger: str = "tensorboard"        # or "wandb", "neptune"

    @configclass
    class PolicyCfg:
        class_name: str = "ActorCritic"
        init_noise_std: float = 1.0
        actor_hidden_dims: list = [512, 256, 128]
        critic_hidden_dims: list = [512, 256, 128]
        activation: str = "elu"

    @configclass
    class AlgorithmCfg:
        class_name: str = "PPO"
        value_loss_coef: float = 1.0
        use_clipped_value_loss: bool = True
        clip_param: float = 0.2
        entropy_coef: float = 0.005
        num_learning_epochs: int = 5
        num_mini_batches: int = 4
        learning_rate: float = 1e-3
        schedule: str = "adaptive"     # LR adapts to KL divergence
        gamma: float = 0.99
        lam: float = 0.95              # GAE lambda
        desired_kl: float = 0.01
        max_grad_norm: float = 1.0

    policy: PolicyCfg = PolicyCfg()
    algorithm: AlgorithmCfg = AlgorithmCfg()

The schedule: "adaptive" setting is worth highlighting: RSL-RL adjusts the learning rate based on the KL divergence between successive policies. In high-environment-count runs (4k+), this adaptive schedule typically stabilizes training better than a fixed learning rate because the effective batch size is very large and gradient noise is low.
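
The batch-size arithmetic and the adaptive rule can be sketched in a few lines. The sample counts follow directly from the config above. The adaptation function is an assumption modeled on a common KL-adaptive scheme: the 2x and 0.5x bands, the 1.5 factor, and the learning-rate bounds are illustrative and should be checked against the RSL-RL source.

```python
def adapt_learning_rate(lr: float, kl: float, desired_kl: float = 0.01,
                        factor: float = 1.5,
                        lr_min: float = 1e-5, lr_max: float = 1e-2) -> float:
    """Shrink the LR when the policy moved too far, grow it when it barely moved."""
    if kl > 2.0 * desired_kl:
        return max(lr_min, lr / factor)
    if 0.0 < kl < 0.5 * desired_kl:
        return min(lr_max, lr * factor)
    return lr


# Samples gathered per PPO iteration with the config shown above
num_envs, num_steps_per_env, num_mini_batches = 4096, 24, 4
samples_per_iteration = num_envs * num_steps_per_env
mini_batch_size = samples_per_iteration // num_mini_batches
print(samples_per_iteration, mini_batch_size)

lr = 1e-3
for kl in (0.05, 0.03, 0.01, 0.002):
    lr = adapt_learning_rate(lr, kl)
    print(f"kl={kl}: lr={lr:.6f}")
```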

Training with skrl (illustrative)

skrl offers a more modular approach where you explicitly construct the agent, policy, value function, and memory objects. This is advantageous when you want to mix components — for example, using a recurrent policy (GRU or LSTM) with a standard PPO update:

# scripts/train_skrl.py — illustrative
import torch
from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG
from skrl.models.torch import GaussianMixin, DeterministicMixin, Model
from skrl.memories.torch import RandomMemory
from skrl.trainers.torch import SequentialTrainer

class MyPolicy(GaussianMixin, Model):
    def __init__(self, obs_space, action_space, device):
        Model.__init__(self, obs_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions=True)
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_space.shape[0], 512),
            torch.nn.ELU(),
            torch.nn.Linear(512, 256),
            torch.nn.ELU(),
            torch.nn.Linear(256, action_space.shape[0]),
        )
        self.log_std = torch.nn.Parameter(torch.zeros(action_space.shape[0]))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), self.log_std, {}

# Instantiate env, wrap with skrl's Isaac Lab wrapper
# (import path and wrapper name as in recent skrl releases; verify in current skrl docs)
from skrl.envs.wrappers.torch import wrap_env
env = wrap_env(isaac_lab_env, wrapper="isaaclab")

memory = RandomMemory(memory_size=24, num_envs=env.num_envs, device="cuda:0")
cfg = PPO_DEFAULT_CONFIG.copy()
cfg["rollouts"] = 24            # rollout horizon; keep equal to memory_size
cfg["learning_rate"] = 1e-3
cfg["mini_batches"] = 4
cfg["learning_epochs"] = 5

agent = PPO(
    models={"policy": MyPolicy(...), "value": MyValue(...)},
    memory=memory,
    cfg=cfg,
    observation_space=env.observation_space,
    action_space=env.action_space,
    device="cuda:0",
)

trainer = SequentialTrainer(cfg={"timesteps": 50_000_000}, env=env, agents=agent)
trainer.train()

Domain randomization

Domain randomization is Isaac Lab’s primary mechanism for narrowing the reality gap. The principle is that if the policy learns to succeed across a wide distribution of simulated worlds — each with different masses, frictions, motor lags, and visual appearances — then the real world is just one more sample from a distribution the policy has already handled.

Figure 3: Domain randomization inputs. The EventManager samples from configured distributions at each reset (or on a schedule), producing a unique physical and visual configuration per environment per episode.

In Isaac Lab, domain randomization is managed by the EventManager, which schedules terms at three lifecycle moments: startup (applied once at scene load), reset (applied each time an environment is reset), and interval (applied during episodes at time intervals drawn from a configured range). Each term is a Python callable that receives the environment and optionally a set of environment IDs.

An illustrative EventManager configuration showing both physics and initial-state randomization:

# In your env config dataclass — illustrative
@configclass
class EventsCfg:
    # Material randomization at startup: this term writes material properties
    # through CPU tensors, so applying it on every reset is costly
    randomize_robot_physics = EventTermCfg(
        func=mdp.randomize_rigid_body_material,
        mode="startup",
        params={
            "asset_cfg": SceneEntityCfg("robot", body_names=".*"),
            "static_friction_range": (0.6, 1.2),
            "dynamic_friction_range": (0.4, 0.9),
            "restitution_range": (0.0, 0.1),
            "num_buckets": 64,
        },
    )
    randomize_robot_mass = EventTermCfg(
        func=mdp.randomize_rigid_body_mass,
        mode="reset",
        params={
            "asset_cfg": SceneEntityCfg("robot"),
            # "scale" multiplies the nominal mass, so ±15% is (0.85, 1.15)
            "mass_distribution_params": (0.85, 1.15),
            "operation": "scale",
        },
    )
    # Initial state randomization
    reset_robot_joints = EventTermCfg(
        func=mdp.reset_joints_by_offset,
        mode="reset",
        params={
            "asset_cfg": SceneEntityCfg("robot"),
            "position_range": (-0.1, 0.1),   # radians
            "velocity_range": (-0.05, 0.05),  # rad/s
        },
    )
    # Payload randomization (add a random load to the base)
    add_base_mass = EventTermCfg(
        func=mdp.randomize_rigid_body_mass,
        mode="reset",
        params={
            "asset_cfg": SceneEntityCfg("robot", body_names="base"),
            "mass_distribution_params": (0.0, 3.0),  # kilograms added
            "operation": "add",
        },
    )

events: EventsCfg = EventsCfg()

Choosing randomization ranges requires care. Ranges that are too narrow produce policies that fail on any hardware tolerances outside the training distribution — a common mistake when copying default configs from the Isaac Lab example tasks without inspecting the physical units. Ranges that are too wide make the learning problem significantly harder and require more training iterations to converge. A practical approach is to start with the nominal physics parameters and widen the ranges incrementally, checking that the policy’s reward curve still converges at each stage. The mass randomization range of ±15% shown above is a reasonable starting point for a ground robot; aerial robots or tasks with tight dynamic constraints may require tighter ranges early in training.
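
The incremental widening described above can be sketched as a small helper that produces one range per stage. The helper name, the linear widening rule, and the stage count are assumptions; you would paste each stage's range into the relevant event term and confirm convergence before moving on.

```python
def widened_range(nominal: float, max_fraction: float,
                  stage: int, num_stages: int) -> tuple[float, float]:
    """Symmetric range around nominal, widened linearly from 0 to max_fraction."""
    if not 0 <= stage <= num_stages:
        raise ValueError("stage must lie in [0, num_stages]")
    fraction = max_fraction * stage / num_stages
    return (nominal * (1.0 - fraction), nominal * (1.0 + fraction))


# Mass scale factor: start at nominal, end at the ±15% range used above
stages = [widened_range(1.0, 0.15, s, 3) for s in range(4)]
for s, (low, high) in enumerate(stages):
    print(f"stage {s}: mass scale in ({low:.3f}, {high:.3f})")
```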

Observation noise is a separate and often overlooked form of randomization. Real sensors have noise floors, quantization, and latency. Adding Gaussian noise to joint position and velocity observations during training makes the policy more tolerant of imperfect state estimation on hardware:

# In ObservationTermCfg (illustrative)
from isaaclab.utils.noise import GaussianNoiseCfg
joint_pos = ObservationTermCfg(
    func=mdp.joint_pos_rel,
    noise=GaussianNoiseCfg(std=0.01),  # 0.01 radians RMS
)
joint_vel = ObservationTermCfg(
    func=mdp.joint_vel_rel,
    noise=GaussianNoiseCfg(std=0.05),  # 0.05 rad/s RMS
)

Action delay is another commonly missed detail. Suppose the simulation steps physics at 200 Hz and queries the policy every 4 physics steps (50 Hz effective control). Even if the hardware control loop runs at the same rates, the simulated latency profile does not match most real setups, where communication between the inference process and the motor controller adds delay. Adding a random action delay term (typically 1–3 control steps) during training helps:

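# Hypothetical term: apply_action_noise_with_delay is an illustrative name,
# not a confirmed Isaac Lab API. Implement the delay in your own mdp module.
randomize_action_delay = EventTermCfg(
    func=mdp.apply_action_noise_with_delay,
    mode="reset",
    params={"delay_range": (1, 3)},   # in control steps
)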
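
Since the term name above is only illustrative, it helps to see what the delay itself amounts to. The sketch below is a framework-independent FIFO delay line for a single environment; the class name is hypothetical, and a real implementation would be batched over environments on the GPU.

```python
import random
from collections import deque


class ActionDelayBuffer:
    """Returns the action issued `delay_steps` control steps ago."""

    def __init__(self, delay_steps: int, default_action: list[float]):
        if delay_steps < 0:
            raise ValueError("delay_steps must be non-negative")
        # Pre-fill so the first outputs are the default (e.g. hold pose)
        self._queue = deque(list(default_action) for _ in range(delay_steps))

    def step(self, action: list[float]) -> list[float]:
        self._queue.append(list(action))
        return self._queue.popleft()


rng = random.Random(0)
delay = rng.randint(1, 3)  # re-sampled per episode, inclusive bounds
buffer = ActionDelayBuffer(delay, default_action=[0.0])
applied = [buffer.step([float(t)])[0] for t in range(1, 6)]
print(f"delay={delay}, applied actions={applied}")
```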

Sim-to-real deployment flow

After training converges — a judgment call based on reward plateau, curriculum completion, and manual inspection of rollout videos — the deployment path involves three steps: export, runtime setup, and hardware integration.

Figure 4: Sim-to-real deployment flow. The trained checkpoint is exported to ONNX or TorchScript, loaded into an inference runtime on the robot’s compute board, and integrated with the hardware abstraction layer and a safety monitor.

Exporting the policy

RSL-RL saves checkpoints as raw PyTorch state_dict files. To deploy on a robot without a full Isaac Lab installation, export to TorchScript (which can run without a Python interpreter) or ONNX (which can run on TensorRT or ONNX Runtime):

# scripts/export_policy.py — illustrative
import torch
from rsl_rl.runners import OnPolicyRunner

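# Load the trained runner
runner = OnPolicyRunner(env, train_cfg, log_dir=None, device="cpu")
runner.load(checkpoint_path)

# get_inference_policy() returns a plain callable, which torch.jit.script and
# torch.onnx.export cannot consume. Export the actor nn.Module instead.
# Assumption: the attribute path depends on the rsl_rl version (older releases
# use runner.alg.actor_critic); verify against your installed version.
# This exports the bare actor; any observation normalizer must ship separately.
actor = runner.alg.policy.actor.eval()

# Export to TorchScript
scripted = torch.jit.script(actor)
scripted.save("policy.pt")

# Export to ONNX (obs_dim is the size of the policy observation vector)
dummy_obs = torch.zeros(1, obs_dim)
torch.onnx.export(
    actor, dummy_obs, "policy.onnx",
    input_names=["obs"], output_names=["actions"],
    dynamic_axes={"obs": {0: "batch_size"}},
    opset_version=17,
)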

On-robot inference loop (illustrative)

# robot_controller.py — illustrative ROS2-compatible structure
import onnxruntime as ort
import numpy as np

class PolicyRunner:
    def __init__(self, model_path: str, obs_dim: int):
        # List CPU as a fallback in case the CUDA provider is unavailable
        self.session = ort.InferenceSession(
            model_path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
        )
        self.obs_dim = obs_dim

    def compute_action(self, obs: np.ndarray) -> np.ndarray:
        # obs shape: (1, obs_dim)
        outputs = self.session.run(
            None, {"obs": obs.astype(np.float32)}
        )
        # Clip to the normalized action range; joint-limit and rate checks
        # belong in the downstream safety monitor
        actions = np.clip(outputs[0], -1.0, 1.0)
        return actions

The critical implementation detail is observation normalization consistency. When empirical observation normalization is enabled in the training configuration, the training stack tracks a running mean and standard deviation over observations and normalizes each observation before feeding it to the policy. That running normalizer must be exported along with the policy weights and applied identically during inference. Failing to apply the normalizer is one of the most common causes of apparently-working sim policies that produce random outputs on hardware, because the input distribution the policy expects is completely different from the raw sensor values it receives.
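
A minimal sketch of the export-and-reapply step, assuming the normalizer is a simple per-dimension running mean and standard deviation. The class below is a stand-in written for this example, not the normalizer of any particular RL library, and JSON is just one convenient transport format.

```python
import json
import math


class RunningNormalizer:
    """Per-dimension running mean/variance (Welford's algorithm)."""

    def __init__(self, dim: int, eps: float = 1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps

    def update(self, obs: list[float]) -> None:
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def export(self) -> dict:
        # Population variance; eps keeps the divisor away from zero
        std = [math.sqrt(m2 / max(self.count, 1)) + self.eps for m2 in self.m2]
        return {"mean": list(self.mean), "std": std}


def normalize(obs: list[float], stats: dict) -> list[float]:
    """Apply exported statistics exactly as during training."""
    return [(x - m) / s for x, m, s in zip(obs, stats["mean"], stats["std"])]


# Training side: accumulate statistics, then serialize next to the weights
normalizer = RunningNormalizer(dim=2)
for sample in ([0.0, 10.0], [2.0, 30.0]):
    normalizer.update(sample)
payload = json.dumps(normalizer.export())

# Robot side: load the statistics and normalize raw sensor values
stats = json.loads(payload)
print(normalize([2.0, 30.0], stats))
```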

For ROS2 integration on Jetson Orin hardware, the inference node typically runs as a lifecycle node subscribing to joint state topics and publishing joint position or velocity commands. See our ROS2 Jazzy and Jetson Orin robotics deployment tutorial for the full deployment stack.


Trade-offs and What Goes Wrong

Direct vs Manager-based: when each breaks

The Manager-based workflow is powerful precisely because it decouples configuration from logic. But this decoupling becomes a liability when your task requires interdependencies between reward terms that the manager’s flat accumulation model does not support — for example, a reward term that must read the output of another reward term, or a termination condition that depends on a running average computed inside the observation. In those cases, Direct is not a concession; it is the correct choice.

The Manager-based approach also has higher configuration verbosity. A medium-complexity locomotion task will produce an environment config file that runs to several hundred lines. This is not inherently bad — the config is the specification and the logic is hidden in named mdp.* functions — but it makes debugging harder because the execution path is not visible in the config file.

The reward hacking problem in GPU-scale training

Running 4,096 parallel environments with a well-tuned PPO configuration generates hundreds of millions of environment steps per hour. This speed is a double-edged sword: the policy converges much faster than in CPU-based simulation, but it also finds and exploits reward loopholes much faster. Behavior that would not appear until tens of millions of steps on a CPU simulator can appear within minutes of GPU-scale training.

Common reward hacking patterns in locomotion training:

  • Velocity reward with no stability penalty: the policy learns to fall forward at high speed, satisfying the velocity command while ignoring all standing behavior. Fix: add an orientation penalty and terminate on base contact.
  • Energy penalty without a lower bound: the policy discovers that staying motionless minimizes energy cost while triggering the time-out termination for a zero reward, which is better than a negative reward. Fix: add a small positive survival bonus, or use a multiplicative reward structure.
  • Contact penalty without a force threshold: the policy learns to hover slightly above the ground, avoiding contact penalties but unable to move. Fix: threshold contact penalties on force magnitude.
  • Observation history without matching hardware latency: the policy learns to use future information that the history buffer inadvertently provides (through indexing errors). Fix: unit-test the history buffer indexing explicitly.
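
The last fix in the list lends itself to a concrete test. The sketch below uses a hypothetical single-environment history buffer and pushes frames whose value is their own timestamp, so any leak of future information shows up as a failed assertion.

```python
from collections import deque


class ObservationHistory:
    """Fixed-length history, ordered oldest to newest."""

    def __init__(self, length: int, fill: int):
        self._frames = deque([fill] * length, maxlen=length)

    def push(self, frame: int) -> None:
        self._frames.append(frame)

    def stacked(self) -> list[int]:
        return list(self._frames)


def check_no_future_leak(length: int = 3, steps: int = 10) -> bool:
    history = ObservationHistory(length, fill=-1)
    for t in range(steps):
        history.push(t)  # the frame's value is its own timestamp
        frames = history.stacked()
        assert frames[-1] == t, "newest frame must be the current step"
        assert all(f <= t for f in frames), "history contains a future frame"
        assert frames == sorted(frames), "frames must be oldest to newest"
    return True


print(check_no_future_leak())
```

The same idea carries over to a batched GPU buffer: write step indices into the observation slot, roll the buffer, and assert on the indices rather than on real sensor values.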

Sim-to-real failure taxonomy

Based on published sim-to-real deployment reports (including the Isaac Lab technical report, arXiv:2511.04831, and the Real-is-Sim work on digital twin bridging), failures fall into three categories ranked by frequency:

Category 1 — Reward and termination mismatch (most common): The policy learned to succeed under conditions that do not match the real task. Sub-categories include velocity reward without energy penalty (produces energetically infeasible gaits), termination conditions that are too lenient in sim (the policy never learned to recover from near-falls), and episode length that is too short (the policy learned short-horizon behavior that compounds in longer real runs).

Category 2 — Randomization gap (common): The randomization ranges covered the sim’s default parameters but not the real hardware’s actual parameter range. Motor friction, joint damping, and ground contact compliance are the three most commonly misidentified parameters. Measuring actual hardware parameters with system identification and then setting randomization ranges that bound those measurements — rather than guessing — closes most of this gap.
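
Turning system-identification measurements into a randomization range can be as simple as padding the observed spread. The helper below is a sketch under assumptions: the 20% margin, the floor at zero, and the sample friction values are hypothetical, not measured data.

```python
def range_from_measurements(measurements: list[float], margin: float = 0.2,
                            floor: float = 0.0) -> tuple[float, float]:
    """Range that bounds all measurements, padded by a fraction of their spread."""
    if not measurements:
        raise ValueError("need at least one measurement")
    low, high = min(measurements), max(measurements)
    # With a single measurement there is no spread, so pad relative to the value
    pad = margin * (high - low) if high > low else margin * abs(low)
    return (max(floor, low - pad), high + pad)


# Hypothetical friction coefficients from three identification runs
friction_samples = [0.62, 0.70, 0.81]
friction_range = range_from_measurements(friction_samples)
print(friction_range)
```

The resulting tuple is what would go into a parameter such as static_friction_range, replacing a guessed interval with one that is known to contain the hardware.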

Category 3 — Algorithm failure (rare): The RL algorithm genuinely failed to find a good policy. This is uncommon for standard locomotion tasks with well-shaped rewards. If you are seeing category 3 behavior, it is almost always masking a category 1 or category 2 problem.

This taxonomy supports the original thesis: address reward shaping and randomization first before attributing failure to the algorithm.

Observation space design pitfalls

The observation space defines what the policy is allowed to know. A common mistake is including privileged information in the policy observation that is not available on real hardware — for example, ground truth body velocity from the physics engine, exact contact forces, or the randomized physics parameters themselves. A policy trained with privileged observations will appear to perform perfectly in simulation and will behave erratically on hardware where those signals are unavailable.

The standard solution is the teacher-student distillation pattern: train a teacher policy with privileged observations (which trains faster and to higher reward), then distill it into a student policy that observes only what the hardware sensors can provide. A lighter-weight relative is the asymmetric actor-critic setup, which Isaac Lab supports through the two-observation-group convention: define a critic observation group with privileged information and a policy observation group with hardware-available signals. The critic uses the full group during training; only the policy, which never saw the privileged inputs, is exported.

For a deep dive into what domain randomization parameters NVIDIA’s Isaac Sim tooling exposes and how to configure them, see our Isaac Sim 4.5 domain randomization tutorial.


Practical Recommendations

The following recommendations are aimed at teams starting their first Isaac Lab project or debugging a policy that trains well in simulation but fails on hardware. They are ordered by impact, not by workflow stage.

First, instrument your reward terms individually before training at scale. Add per-term reward logging from the first training run. When the total reward curve looks good but the policy behaves oddly, per-term curves reveal which component is being maximized at the expense of the others. RSL-RL logs the named scalars it finds in the environment’s extras["log"] dictionary, which Isaac Lab’s RewardManager populates automatically with per-term episode values in manager-based environments.
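Once per-term values are logged, a dominance check takes only a few lines. The sketch below assumes you have already collected per-term episode sums into a plain dictionary; the term names and numbers are hypothetical. Shares are computed on absolute magnitudes so that penalties (negative terms) are counted as well.

```python
def reward_shares(term_sums):
    """Fraction of the total absolute reward magnitude contributed by each term."""
    total = sum(abs(v) for v in term_sums.values())
    if total == 0:
        return {name: 0.0 for name in term_sums}
    return {name: abs(v) / total for name, v in term_sums.items()}


def dominant_terms(term_sums, limit=0.8):
    """Names of terms whose share of the total magnitude exceeds `limit`."""
    return [name for name, share in reward_shares(term_sums).items() if share > limit]


# Hypothetical per-episode sums taken from the training log at convergence.
episode_terms = {
    "track_lin_vel": 18.0,
    "track_ang_vel": 0.9,
    "joint_torque_penalty": -0.6,
    "action_rate_penalty": -0.5,
}

for name, share in reward_shares(episode_terms).items():
    print(f"{name}: {share:.1%}")
print("dominant:", dominant_terms(episode_terms))
```

A flagged term is not automatically a bug, but it is a prompt to look at whether the remaining terms are still shaping behavior at all.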

Second, run a “sim-to-sim” transfer check before touching hardware. Train in Isaac Lab, then run rollouts in a second simulator (PyBullet or MuJoCo) if one is available for your robot. If the policy transfers cleanly between simulators, most of the implementation is correct. If it does not, the observation normalization, action scaling, or physics parameter matching is likely wrong.

Third, treat action scaling as a first-class design decision. Isaac Lab policies typically output actions in a normalized range (roughly -1 to 1), which a joint action term scales and offsets into joint targets for the actuator’s PD controller. The scale factor determines how far the policy can move a joint target per control step. A scale that is too large produces jerky behavior that does not transfer; too small makes the task infeasible. Match the scale to your hardware’s maximum safe joint velocity at the control frequency.
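One way to reason about the scale is to look at the worst-case jump in the joint target between two consecutive control steps. The sketch below assumes a position target of the form default + scale * action with actions clipped to [-1, 1]; the 50 Hz control rate and 20 rad/s velocity limit are hypothetical. Because a PD loop does not track a target step instantly, the resulting bound is conservative and is best read as a sanity check rather than a hard rule.

```python
def joint_target(default_pos, action, scale):
    """Map a normalized action to a joint position target (radians)."""
    clipped = max(-1.0, min(1.0, action))
    return default_pos + scale * clipped


def worst_case_target_rate(scale, control_dt):
    """Target change per second if the action flips from -1 to +1 in one step."""
    return 2.0 * scale / control_dt


def max_scale_for(v_max, control_dt):
    """Largest scale whose worst-case target rate stays within v_max (rad/s)."""
    return 0.5 * v_max * control_dt


control_dt = 0.02       # hypothetical 50 Hz policy rate
v_max_hardware = 20.0   # hypothetical maximum safe joint velocity in rad/s

print("worst-case rate at scale 0.25:", worst_case_target_rate(0.25, control_dt))
print("conservative max scale:", max_scale_for(v_max_hardware, control_dt))
```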

Fourth, use performance-based curriculum advancement rather than step-based. Advance curriculum stages (extending episode length, tightening reward thresholds, or adding harder terrain) only when the policy’s mean episode reward exceeds a threshold, not after a fixed number of training steps. Step-based schedules are fragile because the number of steps needed to master a stage shifts with the seed, the environment count, and any change to the reward; reward-based advancement adapts to the policy’s actual progress.
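A reward-gated curriculum can be sketched without any framework support. The class below is a hypothetical helper, not an Isaac Lab API: it advances one stage when the moving average of the mean episode reward over a window of training iterations exceeds that stage's threshold, then clears the window so the next stage is judged only on fresh data.

```python
from collections import deque


class RewardGatedCurriculum:
    """Advance stages when a moving average of episode reward clears a threshold."""

    def __init__(self, thresholds, window=50):
        self.thresholds = list(thresholds)  # thresholds[i] gates stage i -> i + 1
        self.window = window
        self.stage = 0
        self._recent = deque(maxlen=window)

    def update(self, mean_episode_reward):
        """Record one training iteration's mean episode reward; return the stage."""
        self._recent.append(mean_episode_reward)
        if self.stage >= len(self.thresholds):
            return self.stage  # already at the final stage
        if len(self._recent) == self.window:
            average = sum(self._recent) / self.window
            if average > self.thresholds[self.stage]:
                self.stage += 1
                self._recent.clear()  # judge the new stage on new data only
        return self.stage


# Hypothetical thresholds and a short window to keep the example readable.
curriculum = RewardGatedCurriculum(thresholds=[10.0, 20.0], window=3)
for reward in [5, 5, 5, 12, 12, 12, 25, 25, 25]:
    stage = curriculum.update(reward)
print("final stage:", stage)
```

The window length trades responsiveness for stability: a short window advances quickly but can be fooled by a lucky batch of episodes.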

Fifth, validate the normalizer export before every hardware test. Serialize the running mean and standard deviation along with the policy checkpoint. A short check in your deployment script catches normalization bugs before the robot moves: normalize a set of known-good observations with the exported statistics and assert that the result has a per-dimension mean close to zero and a standard deviation close to one.
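A normalizer check of this kind might look like the following sketch. It assumes the mean and standard deviation vectors have already been loaded from whatever file you saved next to the checkpoint, and that you have a handful of recorded known-good observations; the tolerances and sample values are hypothetical and should be tuned to your data.

```python
from statistics import fmean, pstdev


def normalize(obs, mean, std, eps=1e-8):
    """Apply the exported statistics to one observation vector."""
    return [(o - m) / (s + eps) for o, m, s in zip(obs, mean, std)]


def check_normalizer(known_good_obs, mean, std, mean_tol=0.5, std_lo=0.5, std_hi=2.0):
    """Assert that normalized observations are roughly zero-mean and unit-variance."""
    assert len(mean) == len(std) == len(known_good_obs[0]), "dimension mismatch"
    normalized = [normalize(o, mean, std) for o in known_good_obs]
    for dim, column in enumerate(zip(*normalized)):
        mu, sigma = fmean(column), pstdev(column)
        assert abs(mu) < mean_tol, f"dim {dim}: normalized mean {mu:.3f}"
        assert std_lo < sigma < std_hi, f"dim {dim}: normalized std {sigma:.3f}"
    return True


# Hypothetical two-dimensional observations recorded from a healthy run.
known_good = [[0.9, -2.0], [1.1, 0.0], [1.0, 2.0]]
# Hypothetical statistics as they would be loaded from the exported file.
saved_mean = [1.0, 0.0]
saved_std = [0.1, 1.6]

print("normalizer ok:", check_normalizer(known_good, saved_mean, saved_std))
```

If the statistics were never exported and the deployment code silently falls back to zero mean and unit variance, the first dimension above fails the mean check, which is the failure this guard is meant to catch.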

Pre-deployment checklist

  • Reward terms logged individually; no single term accounts for more than 80% of total reward at convergence
  • Randomization ranges verified against actual hardware system-identification measurements
  • Observation space contains only hardware-available signals in the exported policy group
  • Observation normalizer exported and validated against known-good observations
  • Action scaling verified at the hardware’s maximum safe joint velocity
  • Policy tested in headless rollout mode at 10x the deployment episode length
  • Safety monitor implemented in the hardware controller (joint limits, base contact, communication timeout)
  • Export format (TorchScript or ONNX) tested on the target compute board before lab integration
  • Curriculum stage at deployment matches the task complexity of the real environment
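The safety-monitor item in the checklist can be prototyped as a pure function before it is ported to the hardware controller. The sketch below is hypothetical throughout: the limit values, the 100 ms timeout, and the function signature are placeholders, and a production monitor would normally run in the real-time control loop rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class SafetyLimits:
    joint_lower: list  # radians, one entry per joint
    joint_upper: list  # radians, one entry per joint
    command_timeout_s: float = 0.1


def safety_violations(joint_pos, base_contact, now_s, last_command_s, limits):
    """Return a list of reasons to stop; an empty list means it is safe to continue."""
    reasons = []
    bounds = zip(joint_pos, limits.joint_lower, limits.joint_upper)
    for index, (q, low, high) in enumerate(bounds):
        if not low <= q <= high:
            reasons.append(f"joint {index} out of limits")
    if base_contact:
        reasons.append("base contact")
    if now_s - last_command_s > limits.command_timeout_s:
        reasons.append("command timeout")
    return reasons


limits = SafetyLimits(joint_lower=[-1.0, -1.0], joint_upper=[1.0, 1.0])
print(safety_violations([0.0, 0.5], False, now_s=1.00, last_command_s=0.95, limits=limits))
print(safety_violations([0.0, 1.5], True, now_s=2.00, last_command_s=1.00, limits=limits))
```

Passing timestamps in as arguments, rather than reading the clock inside the function, keeps the logic easy to unit-test offline.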

FAQ

What GPU do I need to run Isaac Lab reinforcement learning training effectively?

An NVIDIA RTX 4090 is the practical minimum for serious training runs, supporting roughly 4,096 parallel locomotion environments. An A100 or H100 scales to higher environment counts and reduces wall-clock time for long training campaigns. The memory requirement is dominated by the USD scene loading and parallel physics buffers, not the policy network itself. CPU-only training is possible with small environment counts using Isaac Lab’s CPU physics mode but is impractically slow for anything beyond proof-of-concept.

How does Isaac Lab reinforcement learning compare to Isaac Gym and OmniIsaacGymEnvs?

Isaac Gym and OmniIsaacGymEnvs are both deprecated. Isaac Lab is the successor, built on Isaac Sim and Omniverse rather than on Isaac Gym’s standalone simulator. The key architectural difference is that Isaac Lab’s Manager-based workflow decouples environment logic into named subsystems, making it significantly easier to compose tasks and share reward terms across environments. Performance on comparable tasks is similar or better due to improved vectorization.

Why does my policy train successfully in Isaac Lab but fail on real hardware?

The most frequent causes, in order of likelihood: the observation normalizer was not exported or not applied during inference; the observation space contains privileged information (ground-truth velocity, exact contact forces) unavailable on hardware; the randomization ranges do not cover the real hardware’s parameter space; or the reward function was satisfied by a strategy that is physically infeasible on hardware (such as exploiting simulation-specific contact dynamics). Start by checking these four before investigating the RL algorithm.

What is the difference between Direct and Manager-based environments in Isaac Lab?

Direct environments give you full control over observation, reward, and termination computation in explicit Python methods — closer to the Isaac Gym style. Manager-based environments decompose these into named ObservationManager, RewardManager, TerminationManager, and EventManager components, each configured via dataclass entries, with the framework handling iteration and accumulation. Manager-based is recommended for standard tasks; Direct is better when your task requires logic that does not map cleanly onto the manager’s flat term model.

How many training steps are typically needed to achieve a deployable locomotion policy?

For a quadruped velocity-tracking task starting from scratch, convergence typically requires between 100 million and 500 million environment steps with PPO at standard hyperparameters, though this range varies significantly with reward design quality and environment count. At 4,096 parallel environments, 100 million steps corresponds to roughly 24,400 simulation steps per environment, or about 1,000 PPO iterations at a rollout length of 24 steps per environment (a common setting in RSL-RL locomotion configurations), and can complete in a few hours on an A100. Policies trained with broader domain randomization often require more steps to converge. These figures are based on the Isaac Lab technical report (arXiv:2511.04831) and ETH Zurich locomotion research; your task may differ substantially.
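The step budget is easy to work out by hand, and doing so makes clear that total environment steps, steps per environment, and PPO iterations are three different quantities. The rollout length of 24 steps per environment per iteration used below is an assumption; substitute the value from your own training configuration.

```python
def ppo_budget(total_env_steps, num_envs, steps_per_env_per_iter):
    """Convert a total step budget into per-environment steps and PPO iterations."""
    steps_per_env = total_env_steps / num_envs
    iterations = steps_per_env / steps_per_env_per_iter
    return steps_per_env, iterations


# 100 million total steps, 4,096 environments, assumed rollout length of 24.
steps_per_env, iterations = ppo_budget(100_000_000, 4096, 24)
print(f"steps per environment: {steps_per_env:,.0f}")  # 24,414
print(f"PPO iterations: {iterations:,.0f}")            # 1,017
```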

Can Isaac Lab reinforcement learning be used for manipulation tasks, not just locomotion?

Yes. Isaac Lab ships example tasks for dexterous manipulation, in-hand object reorientation, and arm trajectory following. The same Direct vs Manager-based choice applies. Manipulation tasks tend to require more careful reward shaping around contact — contact-rich manipulation is notoriously hard to randomize because the contact mechanics in simulation diverge significantly from real object compliance. For manipulation, the teacher-student distillation pattern (teacher with tactile force observations, student with only proprioception) is particularly important.


Further Reading

For a deeper look at how NVIDIA’s foundation model strategy connects to Isaac Lab’s training pipeline, see our coverage of NVIDIA Isaac GR00T N1.5 and Cosmos, which contextualizes Isaac Lab’s role in the broader generalist robot training stack.

The Isaac Sim 4.5 domain randomization tutorial covers the Isaac Sim-level randomization API in detail, including visual randomization (texture, lighting, material) that complements the physics randomization described here.

For deploying trained policies to production robotics hardware, our ROS2 Jazzy and Jetson Orin warehouse robotics tutorial walks through the full inference and communication stack.

Primary references:
