Isaac GR00T N1.5 vs Cosmos: Robot Foundation Models Compared
The NVIDIA Isaac GR00T N1.5 vs Cosmos comparison is the one robotics teams in 2026 keep getting wrong, because both stacks ship under the “physical AI foundation model” banner but solve orthogonal problems. GR00T N1.5 is an open Vision-Language-Action (VLA) policy that runs on the robot and outputs joint commands. Cosmos is a family of world foundation models that synthesize, transform, and reason about physical scenes, primarily as a data and simulation engine and secondarily as a runtime predictor. The thesis of this post is that they are not competitors: they are two axes of the same pipeline. The right architectural question is not “which one do I pick?” but “where in my training and runtime loop does each one earn its compute budget?” This post breaks the comparison into model purpose, training recipe, action vs scene representation, on-robot footprint, the GR00T-Dreams data engine, and license and ecosystem, and ends with a decision matrix and three reference architectures you can actually deploy.
Architecture at a glance





Why this comparison matters now
NVIDIA stacked three announcements at Computex 2026 that forced the comparison: Isaac GR00T N1.5 (the second open release of the humanoid foundation model with a new reasoning head and an upgraded action expert), GR00T-Dreams (the single-image-to-synthetic-motion-data blueprint), and a new tier of Cosmos checkpoints — Predict 2, Transfer 2, and Reason 1 — explicitly positioned as the “data factory” feeding GR00T. Teams who treat these as alternatives end up paying twice for capabilities that are designed to compose. Teams who chain them correctly can train a humanoid policy in days on data that previously took fleet-months of teleoperation.
The economic stakes are also non-trivial. Real humanoid teleop in 2026 still costs roughly 80 to 120 USD per labeled hour at fleet scale once you account for operator wages, robot wear, supervision overhead, and quality assurance. A 1000-hour task collection therefore lands between 80,000 and 120,000 USD before any training compute. Cosmos dream generation on a managed cluster is closer to 6 USD per equivalent hour of motion data after filtering, or about 6,000 USD for the same 1000 hours. That 13x to 20x cost compression is real, provided the policy that ships actually works. The rest of this article is about how to make sure it does.
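The arithmetic behind that compression is easy to check. The sketch below uses only the figures quoted in this section (80 to 120 USD per teleop hour, about 6 USD per dreamed hour, a 1000-hour collection); the function and variable names are illustrative.

```python
# Cost comparison using only the figures quoted in the paragraph above.
TELEOP_USD_PER_HOUR = (80.0, 120.0)   # low and high estimates for real teleop
DREAM_USD_PER_HOUR = 6.0              # Cosmos dream data, after filtering
HOURS = 1000                          # size of the task collection


def collection_cost(usd_per_hour: float, hours: int = HOURS) -> float:
    """Total data cost in USD, before any training compute."""
    return usd_per_hour * hours


teleop_low, teleop_high = (collection_cost(r) for r in TELEOP_USD_PER_HOUR)
dream_total = collection_cost(DREAM_USD_PER_HOUR)

# Compression ratio: how many times cheaper the dreamed data is.
ratio_low = teleop_low / dream_total
ratio_high = teleop_high / dream_total

print(f"teleop: {teleop_low:,.0f} to {teleop_high:,.0f} USD")
print(f"dreams: {dream_total:,.0f} USD")
print(f"compression: {ratio_low:.1f}x to {ratio_high:.1f}x")
```

At these rates the ratio works out to roughly 13x to 20x. It says nothing about whether the cheaper data trains an equally good policy.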
The pillar concepts behind both stacks are covered in the broader physical AI vision-language-action models reference architecture. This post zooms into the specific GR00T vs Cosmos split.
Context: VLA policies, world models, and where each came from
A VLA policy is a model that takes language, vision, and proprioception in, and emits robot actions out. A world model takes a scene plus an action and emits the next frame of that scene. Foundation-scale VLAs (RT-2, Octo, OpenVLA, Pi-0, GR00T) and foundation-scale world models (DreamerV3, GAIA-1, Sora-class video diffusion, Cosmos) emerged in parallel from 2023 onward but were trained on different data and serve different roles in the autonomy stack.
GR00T originated at NVIDIA Research and was first published at GTC March 2025 as Isaac GR00T N1, a 2.2B-parameter open VLA built on the Eagle-2 VLM and a Diffusion Transformer action head with a System 1 / System 2 split: fast reactive control plus slower reasoning. The N1.5 update at Computex 2026 froze the VLM backbone (for better language grounding), introduced FLARE-style future-latent alignment so the policy can train against video-only data, and shipped a meaningfully bigger action expert. The model is licensed under NVIDIA’s Open Model License and weights are on Hugging Face under nvidia/GR00T-N1.5-3B.
Cosmos was unveiled at CES January 2025 as a “world foundation model” family. It is trained on 20+ million hours of curated driving, robotics, and physical-interaction video. Cosmos has three branches: Predict (autoregressive and diffusion video generators that take past frames plus an action prompt and roll the world forward), Transfer (ControlNet-style structure-conditioned generators that re-render a sim scene with photoreal textures), and Reason (a multimodal LLM fine-tuned on physical-world QA and trajectory planning). All three are also under the NVIDIA Open Model License, with checkpoints from 4B up to 14B parameters.
For deep context on related on-robot policies and the Jetson hardware that hosts them, see the VLA foundation models walkthrough covering Pi-0 and RT-2 and the Jetson Thor humanoid robot reference architecture.
The names are the source of half the confusion. “Foundation model” in 2026 has come to mean any large pre-trained model with broad coverage, but the two stacks sit on different rungs of the foundation-model ladder. GR00T sits on the rung previously occupied by RT-2 and Pi-0: a policy backbone that downstream teams fine-tune for a specific embodiment and task family. Cosmos sits on the rung previously occupied by GAIA-1 and dashcam diffusion models: a generative model of the visual world that downstream teams use as data infrastructure. Treating them as the same kind of object is the first analytical mistake. The second is treating either as production-ready out of the box without task-specific fine-tuning, which NVIDIA does not claim and neither stack delivers.
NVIDIA Isaac GR00T N1.5 vs Cosmos: the core distinction
GR00T N1.5 is a robot policy — it consumes scenes and emits actions. Cosmos is a world model — it consumes scenes plus actions and emits scenes. That single sentence is the difference, and every other architectural divergence (training data, footprint, latency budget, license intent) follows from it. Their relationship is dual: GR00T’s outputs are Cosmos’s inputs, and Cosmos’s outputs become GR00T’s training data.

The layered view above places the two stacks side by side. On the GR00T side the layers are sensor encoders, VLM backbone, action expert (a Diffusion Transformer), action chunk decoder, and the robot’s low-level controller. On the Cosmos side the layers are video tokenizer, world transformer (Predict) or ControlNet (Transfer) or reasoning LLM (Reason), and a detokenizer that emits either RGB video frames or a structured plan.
Model purpose, one sentence each
GR00T N1.5: given an instruction, the last second of camera frames, and proprioception, output the next 16 joint-velocity commands at 30 Hz.
Cosmos Predict: given the last 9 frames and an action conditioning vector, output the next 57 frames.
Cosmos Transfer: given a synthetic Isaac Sim scene plus depth, segmentation, and edge maps, output a photoreal video of the same scene from the same camera.
Cosmos Reason: given an image and a question about it, output a chain-of-thought answer plus a structured plan.
Different I/O contracts. Different deployment surfaces. Different metrics. Lumping them together because both are 3-to-14B-parameter transformers is the mistake.
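Those I/O contracts can be read as a pair of type signatures. The sketch below is a toy, not either model's API: the 1-D "scene", the proportional policy, and the integrator world model are stand-ins invented for illustration. It shows only how a scene-to-action function and a (scene, action)-to-scene function compose.

```python
from typing import Callable, List

Scene = float    # toy stand-in for camera frames: a 1-D object position
Action = float   # toy stand-in for a joint command: a 1-D velocity

Policy = Callable[[Scene], Action]               # GR00T-like: scene -> action
WorldModel = Callable[[Scene, Action], Scene]    # Cosmos-like: (scene, action) -> scene


def toy_policy(scene: Scene, goal: float = 1.0, gain: float = 0.5) -> Action:
    """Proportional controller that pushes the scene toward the goal."""
    return gain * (goal - scene)


def toy_world_model(scene: Scene, action: Action) -> Scene:
    """Integrator: the next scene is the current scene moved by the action."""
    return scene + action


def rollout(policy: Policy, world: WorldModel, start: Scene, steps: int) -> List[Scene]:
    """Chain the two: the policy's output feeds the world model's input."""
    scenes = [start]
    for _ in range(steps):
        action = policy(scenes[-1])
        scenes.append(world(scenes[-1], action))
    return scenes


trajectory = rollout(toy_policy, toy_world_model, start=0.0, steps=5)
print(trajectory)
```

The imagined trajectory is the kind of object that can be turned back into training data for the policy, which is the loop the rest of this post describes.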
Training data and recipe
GR00T N1.5 was trained on a data pyramid that NVIDIA describes as three tiers: real teleoperated humanoid trajectories at the base (thousands of hours across LeRobot, Open X-Embodiment subsets, and partner-collected data from Agility, 1X, Boston Dynamics, Fourier, Unitree), simulation rollouts in Isaac Lab in the middle, and large-scale internet video at the top. The N1.5 training adds a fourth tier explicitly: Cosmos-synthesized neural trajectories. The recipe is supervised behavior cloning with a flow-matching objective on the action expert and a FLARE auxiliary loss that aligns the policy’s latent at time t with a future latent at time t+k, predicted from observed video. The published training compute on the model card is approximately 240,000 H100-hours.
Cosmos was trained on roughly 20 million hours of pre-curated video (NVIDIA’s published figure), filtered for “dynamic, physically grounded” content — driving, manipulation, indoor navigation, sports, industrial robotics. The Predict checkpoints use a 3D causal VAE with patchified spatiotemporal tokens. Predict-2 is a diffusion transformer in the 14B class. Transfer is a ControlNet-style adapter on top of Predict. Reason is bootstrapped from a Llama-3-class backbone post-trained with a 5M-sample physical-world QA set.
Two consequences follow. GR00T sees orders of magnitude less data, but the samples that supervise its action expert are all action-labeled. Cosmos sees orders of magnitude more data, but most of it has no action labels: actions, where present, are inferred or come from driving CAN logs. They are duals, not substitutes.
The dataset pyramid for GR00T N1.5 is worth itemizing because it determines what the policy can and cannot do out of the box. The published mix on the model card is approximately: 4,500 hours of real teleop across 8 humanoid embodiments, 6,200 hours of Isaac Lab simulation rollouts, 38,000 hours of cross-embodiment data from Open X-Embodiment and LeRobot, and 280,000 hours of internet video used through the FLARE auxiliary objective. Note that the action-labeled fraction (the first three buckets) is roughly 49,000 hours, and the video-only fraction (the FLARE bucket) is about 6x larger. The policy “sees” more video than it has actions for, by design.
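The totals in that breakdown can be reproduced directly. This is a small check using only the hours quoted above; the dictionary keys are illustrative labels, not names from the model card.

```python
# Hours per bucket, as quoted in the paragraph above.
action_labeled_hours = {
    "real_teleop": 4_500,
    "isaac_lab_sim": 6_200,
    "cross_embodiment": 38_000,
}
video_only_hours = 280_000  # internet video, used only through FLARE

labeled_total = sum(action_labeled_hours.values())
video_to_labeled = video_only_hours / labeled_total
labeled_share = labeled_total / (labeled_total + video_only_hours)

print(f"action-labeled: {labeled_total:,} h")
print(f"video-only / action-labeled: {video_to_labeled:.2f}x")
print(f"action-labeled share of all hours: {labeled_share:.1%}")
```

The action-labeled buckets sum to 48,700 hours, and they make up a bit under 15% of all training hours.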
Cosmos’s training mix is split differently: roughly 60% driving and dashcam footage, 20% indoor manipulation and household activity, 10% industrial robotics and warehouse video, 10% sports and human motion. The driving bias is why Cosmos Predict is unusually strong at street scenes and somewhat weaker at fine-grained manipulation. Teams targeting manipulation should expect to fine-tune Cosmos Predict on their domain before using it as a dream generator.
Action vs scene representation
GR00T’s output is an action chunk: a tensor of shape [16, D] where D is the embodiment-specific joint count (28 for a typical humanoid, 7 for a robot arm) and 16 is the prediction horizon. The action expert is trained per-embodiment, so swapping from a Franka arm to a Unitree humanoid means a new action head but the same VLM backbone.
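To make the shape concrete, here is a minimal sketch of the [16, D] chunk for the two embodiments mentioned above. The `Embodiment` class and helper names are hypothetical, and plain lists stand in for tensors so the example has no dependencies.

```python
from dataclasses import dataclass
from typing import List, Tuple

HORIZON = 16  # actions per chunk, from the text


@dataclass(frozen=True)
class Embodiment:
    name: str
    dof: int  # D: number of joints the action head must command


def empty_chunk(embodiment: Embodiment) -> List[List[float]]:
    """Allocate a zero-filled action chunk of shape [HORIZON, D]."""
    return [[0.0] * embodiment.dof for _ in range(HORIZON)]


def chunk_shape(chunk: List[List[float]]) -> Tuple[int, int]:
    """Return (horizon, D) for a rectangular chunk."""
    return (len(chunk), len(chunk[0]))


humanoid = Embodiment("humanoid", dof=28)
arm = Embodiment("arm", dof=7)

for emb in (humanoid, arm):
    print(emb.name, chunk_shape(empty_chunk(emb)))
```

Only the second dimension changes between embodiments, which is why the action head is the part that gets swapped.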
Cosmos’s output is a scene: either pixel-space video frames at 720p (Predict) or a structured plan as text (Reason). It has no notion of “joint” — actions enter Cosmos only as conditioning, typically a 6-DoF camera or end-effector pose trajectory, not joint angles. This means Cosmos cannot drive a robot directly. It can advise a planner, predict outcomes, or generate training data.
On-robot latency and footprint
GR00T N1.5 ships in a 3B-parameter configuration that fits on a single Jetson Thor module (128 GB unified LPDDR5X, ~2070 sparse FP4 TFLOPS; see the Jetson Thor architecture coverage). With FP8 quantization and action chunking, NVIDIA’s published end-to-end policy latency on Thor is in the 30–40 ms range per 16-action chunk, which sustains 30 Hz closed-loop control comfortably.
Cosmos Predict-2 at 14B parameters in FP8 needs about 14 GB for weights alone (one byte per parameter), plus tens of GB of activation memory for a 57-frame rollout. Inference on a single H100 takes wall-clock seconds per rollout, not milliseconds. Cosmos was not designed to run on the robot. It runs in the cloud or on a workstation. The only Cosmos variant that has any hope of on-robot operation is Cosmos Reason at the 4B tier as an episodic “thinker”, and even that is closer to a 1 Hz reasoner than a 30 Hz controller.
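Weight memory follows directly from parameter count and bytes per parameter. This is a minimal sketch that counts weights only and ignores activations, caches, and runtime overhead; the helper name is illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "bf16": 2.0, "fp32": 4.0}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for name, params in (("GR00T N1.5", 3.0), ("Cosmos Predict-2", 14.0)):
    sizes = ", ".join(f"{p}: {weight_gb(params, p):.1f} GB" for p in ("fp8", "bf16"))
    print(f"{name} ({params:.0f}B) -> {sizes}")
```

A 14B model is about 14 GB at FP8 and about 28 GB at BF16. The activation memory for a long video rollout comes on top of that.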
That latency gap is the single biggest decision driver. If your problem is “I need to make a decision in 33 ms”, GR00T is the only answer here. If your problem is “I need to generate 10,000 hours of training data overnight”, Cosmos is the only answer here.
Memory and bandwidth, not just FLOPs
A subtle point worth surfacing: the bottleneck for GR00T at the edge is not raw FLOPS but the unified-memory bandwidth between the VLM hidden state and the action expert. Thor’s 128 GB LPDDR5X with 273 GB/s bandwidth is comfortable for the 3B model at FP8, but adding a second model (say, an audio model for voice commands) starts thrashing the cache. Cosmos at 14B in FP8 needs more like 1.5 TB/s of HBM bandwidth to hit usable wall-clock — which is why it lives on H100 / H200 / B200, not Thor. The memory hierarchy, not the math throughput, is what forces Cosmos off the robot.
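A rough way to see the bandwidth argument: in a memory-bound regime each forward pass has to stream the weights at least once, so weight bytes divided by memory bandwidth gives a floor on per-pass time. The sketch below assumes batch size 1, no weight reuse across passes, and no compute cost. It uses the 273 GB/s and 1.5 TB/s figures from the text and the 35 sampling steps quoted elsewhere in this post.

```python
def min_pass_ms(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on one forward pass if every weight byte is read once."""
    return weight_gb / bandwidth_gb_s * 1000.0


THOR_GB_S = 273.0     # Jetson Thor LPDDR5X bandwidth, from the text
HBM_GB_S = 1500.0     # the ~1.5 TB/s HBM figure, from the text

groot_fp8_gb = 3.0    # 3B parameters at 1 byte each
cosmos_fp8_gb = 14.0  # 14B parameters at 1 byte each
SAMPLING_STEPS = 35   # Predict-2 denoising steps quoted elsewhere in this post

cosmos_thor_s = min_pass_ms(cosmos_fp8_gb, THOR_GB_S) * SAMPLING_STEPS / 1000.0
cosmos_hbm_s = min_pass_ms(cosmos_fp8_gb, HBM_GB_S) * SAMPLING_STEPS / 1000.0

print(f"GR00T on Thor, one pass:  {min_pass_ms(groot_fp8_gb, THOR_GB_S):.1f} ms")
print(f"Cosmos on Thor, one pass: {min_pass_ms(cosmos_fp8_gb, THOR_GB_S):.1f} ms")
print(f"Cosmos on Thor, {SAMPLING_STEPS} steps: {cosmos_thor_s:.2f} s")
print(f"Cosmos on HBM, {SAMPLING_STEPS} steps:  {cosmos_hbm_s:.2f} s")
```

These are floors, not predictions. Real video rollouts are much slower because they are also compute-bound over many spatiotemporal tokens, but the ratio between the two memory systems carries over.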

Deep dive: the GR00T-Dreams data engine
GR00T-Dreams is the blueprint NVIDIA published at Computex 2026 that turns a single image plus a language instruction into a complete action-labeled training trajectory. It is the operational glue between Cosmos and GR00T, and it is the most underexposed piece of this comparison. The blueprint runs in four stages: Cosmos Predict generates a video from the seed image conditioned on the instruction; an Inverse Dynamics Model recovers actions from consecutive frames; Cosmos Reason filters the trajectory for physical plausibility; and the surviving samples are used to fine-tune GR00T N1.5 into a post-trained checkpoint.
Stage 1: dream the video
Cosmos Predict-2 takes the seed image (a real or rendered photo of the target workspace) and a language instruction such as “pick up the red block and place it in the bin”. The model rolls 57 frames at 720p, around 5 seconds of robot-eye video. NVIDIA’s published numbers from the GR00T-Dreams blueprint repo claim approximately 14 seconds per dream on a single H100 at FP8 with batch size 1. At fleet scale, a 1000-H100 DGX cluster generates roughly 6 million dreams per day.
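# Pseudocode for the GR00T-Dreams stage 1 dream generation step,
# adapted from the published nvidia-cosmos/cosmos-predict2 examples.
# The pipeline class and call signature are illustrative; check the repo
# for the exact import path and argument names before running.
import torch
from PIL import Image
from cosmos_predict2 import CosmosPredict2Pipeline

# Load the 14B Video2World checkpoint in FP8 and move it to the GPU.
pipe = CosmosPredict2Pipeline.from_pretrained(
    "nvidia/Cosmos-Predict2-14B-Video2World",
    torch_dtype=torch.float8_e4m3fn,
).to("cuda")

# Seed image of the target workspace plus the language instruction.
seed_image = Image.open("workspace_red_block.png").convert("RGB")
instruction = "Pick up the red block and place it in the bin on the right."

# Roll the world forward: 57 frames, 35 denoising steps.
dream = pipe(
    image=seed_image,
    prompt=instruction,
    num_frames=57,
    guidance_scale=7.0,
    num_inference_steps=35,
)
# dream.frames is a tensor [57, 3, 720, 1280]
dream.save("dream_001.mp4")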
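The throughput figures quoted for this stage follow from simple division. The sketch below assumes perfect linear scaling across GPUs and one dream in flight per GPU, which a real cluster will not quite achieve.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def dreams_per_day(seconds_per_dream: float, gpus: int) -> float:
    """Dreams per day, assuming perfect scaling and one dream per GPU at a time."""
    return SECONDS_PER_DAY / seconds_per_dream * gpus


per_gpu = dreams_per_day(14.0, gpus=1)
cluster = dreams_per_day(14.0, gpus=1000)
print(f"one H100:   {per_gpu:,.0f} dreams/day")
print(f"1000 H100s: {cluster:,.0f} dreams/day")
```

At 14 seconds per dream a single H100 produces a little over 6,000 dreams a day, so 1000 of them land near the 6 million figure before any filtering.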
Stage 2: recover actions with an inverse dynamics model
The dreamed video has no action labels. NVIDIA trains an Inverse Dynamics Model (IDM) jointly with GR00T’s VLM backbone: given consecutive frames f_t and f_{t+1}, the IDM predicts the action that produced the transition. For a humanoid this is a 28-dim joint-velocity vector. The IDM is trained on the real teleoperated data already in the GR00T pyramid. Running it across all 57 frames of a dream produces a 56-step action trajectory.
The IDM is much smaller than the policy — a few hundred million parameters — because it only has to solve a one-step regression, not a sequence-modeling task. It runs at hundreds of frames per second on a single H100.
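The bookkeeping is easy to see with a toy. The sketch below is not NVIDIA's IDM, which is a learned network operating on images. It replaces frames with known joint-position vectors and recovers velocities by finite differences, purely to show how 57 frames yield 56 actions. The 30 Hz frame interval and all names are assumptions.

```python
from typing import List

Vector = List[float]


def toy_inverse_dynamics(f_t: Vector, f_next: Vector, dt: float) -> Vector:
    """Recover the joint-velocity 'action' that moves state f_t to f_next in dt seconds."""
    return [(b - a) / dt for a, b in zip(f_t, f_next)]


def label_trajectory(frames: List[Vector], dt: float) -> List[Vector]:
    """Apply the one-step model to every consecutive frame pair."""
    return [toy_inverse_dynamics(a, b, dt) for a, b in zip(frames, frames[1:])]


DT = 1.0 / 30.0            # assumed frame interval
NUM_FRAMES, DOF = 57, 28   # dream length and humanoid joint count from the text

# Synthetic "dream": every joint moves at a constant 0.3 rad/s.
frames = [[0.3 * DT * t] * DOF for t in range(NUM_FRAMES)]
actions = label_trajectory(frames, DT)

print(len(actions), len(actions[0]))
```

Because each label depends only on one frame pair, the work is trivially parallel across a dream, which fits the high frame rates described above.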
Stage 3: filter with Cosmos Reason
Generated dreams are not all useful. Cosmos hallucinates. A block can teleport. Hands can pass through tables. A second pass by Cosmos Reason scores each trajectory on a set of structured physical-plausibility questions: did the gripper close on a real object? Did the object follow a continuous trajectory? Was the goal achieved? Dreams that score below a threshold are dropped. NVIDIA’s published numbers from the GR00T N1.5 model card indicate filtering retention rates around 35–50% depending on the task family.
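The filtering step reduces to scoring and thresholding. In the sketch below the per-question scores are hard-coded placeholders for what a reasoning model would return. The question keys, the min-aggregation rule, and the 0.6 threshold are assumptions chosen for illustration, not the blueprint's actual scoring scheme.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredDream:
    dream_id: str
    answers: Dict[str, float]  # question -> score in [0, 1] from the reasoning model


def plausibility(dream: ScoredDream) -> float:
    """Aggregate per-question scores; the minimum makes any single failure decisive."""
    return min(dream.answers.values())


def filter_dreams(dreams: List[ScoredDream], threshold: float) -> List[ScoredDream]:
    """Keep only dreams whose aggregate score meets the threshold."""
    return [d for d in dreams if plausibility(d) >= threshold]


dreams = [
    ScoredDream("d1", {"grasp_real_object": 0.9, "continuous_motion": 0.8, "goal_reached": 0.7}),
    ScoredDream("d2", {"grasp_real_object": 0.9, "continuous_motion": 0.2, "goal_reached": 0.9}),
    ScoredDream("d3", {"grasp_real_object": 0.4, "continuous_motion": 0.9, "goal_reached": 0.1}),
    ScoredDream("d4", {"grasp_real_object": 0.8, "continuous_motion": 0.7, "goal_reached": 0.6}),
]
kept = filter_dreams(dreams, threshold=0.6)
retention = len(kept) / len(dreams)
print([d.dream_id for d in kept], f"retention={retention:.0%}")
```

Taking the minimum rather than the mean means a dream with one teleporting block is dropped even if the other two questions score well (d2 here).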
Stage 4: behavior cloning fine-tune
The surviving (instruction, dream, action) triples are mixed into a behavior cloning fine-tune of GR00T N1.5 at roughly a 30% blend with real teleop data. NVIDIA reports a 40% pick-success improvement on novel objects when GR00T N1.5 is fine-tuned with GR00T-Dreams data versus a pure-teleop baseline of equal compute on the Isaac Lab evaluation suite.
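One way to implement such a blend is to fix the dream share per batch. The sketch below reads "30% blend" as 30% of fine-tuning samples being dreams, which is an interpretation rather than something the source spells out. The function names and toy datasets are hypothetical.

```python
import random
from typing import List, Sequence


def dreams_needed(real_count: int, dream_fraction: float) -> int:
    """Dream samples required so dreams make up `dream_fraction` of the mix."""
    return round(real_count * dream_fraction / (1.0 - dream_fraction))


def sample_batch(real: Sequence, dreams: Sequence, batch_size: int,
                 dream_fraction: float, rng: random.Random) -> List:
    """Draw a batch with a fixed number of dream samples, then shuffle."""
    n_dream = round(batch_size * dream_fraction)
    batch = rng.choices(dreams, k=n_dream) + rng.choices(real, k=batch_size - n_dream)
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
real = [("real", i) for i in range(700)]
dreams = [("dream", i) for i in range(dreams_needed(len(real), 0.30))]
batch = sample_batch(real, dreams, batch_size=64, dream_fraction=0.30, rng=rng)
n_dream_in_batch = sum(1 for source, _ in batch if source == "dream")
print(len(dreams), n_dream_in_batch, len(batch))
```

Fixing the count per batch, rather than sampling from a pooled dataset, keeps the ratio stable even when far more dreams than real trajectories are available.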
The architectural payoff is dramatic: a single seed image per task, on the order of an hour of GPU time per task fine-tune, in place of weeks of human teleoperation. The downside is that dreams can encode a model’s hallucinations as policy behavior — which leads us into the trade-off section.
What FLARE adds in N1.5 that N1 did not have
The other meaningful change between GR00T N1 and N1.5 is FLARE — Future LAtent Representation alignment. The original N1 could only learn from action-labeled trajectories. FLARE lets N1.5 learn from action-free video as well, by training the policy to predict a future latent that matches what a separate video encoder produces from the actual observed future frames. The auxiliary loss is added on top of the flow-matching action loss with a small weight (NVIDIA’s published recipe uses lambda ~ 0.1). The practical effect is that GR00T N1.5 can ingest YouTube-class video of humans performing tasks and extract usable representational signal for the policy, even though no joint commands are ever recovered from those clips. This closes part of the gap that Cosmos addresses with the dreaming pipeline, but the two techniques are additive rather than redundant — FLARE improves the backbone’s representation, GR00T-Dreams improves the action head’s coverage.
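Written out, the combined objective described above has the following shape. The weighted-sum form and the value of lambda come from the text. The cosine-distance form of the alignment term and the symbols used in it are assumptions for illustration, not the published definition.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{flow}} + \lambda \, \mathcal{L}_{\text{align}}, \qquad \lambda \approx 0.1

\mathcal{L}_{\text{align}} = 1 - \cos\!\big( \hat{z}_{t+k},\; \phi(o_{t+k}) \big)
```

Here \hat{z}_{t+k} is the future latent the policy predicts at time t, o_{t+k} is the frame actually observed k steps later, and \phi is the separate video encoder. The alignment term needs no action labels, which is what lets action-free video contribute to training.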
What the action expert actually does
The action expert in N1.5 is a Diffusion Transformer that conditions on the VLM’s last hidden state plus the proprioception encoding and denoises an action chunk over 8 inference steps. Compared to N1’s 4-step head, the doubled step count adds about 8 ms of latency but recovers measurable success-rate gains on contact-rich tasks (NVIDIA reports +6.5% on the Isaac Lab manipulation suite). The chunk size of 16 was chosen to align with a ~530 ms open-loop horizon at 30 Hz, which is short enough that drift is bounded but long enough that the SoC can idle between bursts. Teams that lower the chunk size to 4 to reduce open-loop risk pay roughly 4x the inference cost per second of robot operation and lose the thermal headroom that lets Thor run the model at sustained duty cycle.
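The chunk-size trade-off is mostly arithmetic. The sketch below assumes each chunk is executed in full before the next inference and that per-chunk latency does not change with chunk size. The 35 ms latency is the midpoint of the 30 to 40 ms range quoted in this post.

```python
CONTROL_HZ = 30.0


def open_loop_ms(chunk_size: int, control_hz: float = CONTROL_HZ) -> float:
    """Time the robot executes one chunk before the policy looks again."""
    return chunk_size / control_hz * 1000.0


def inferences_per_second(chunk_size: int, control_hz: float = CONTROL_HZ) -> float:
    """Policy calls per second if each chunk is executed in full before re-planning."""
    return control_hz / chunk_size


def duty_cycle(chunk_size: int, latency_ms: float, control_hz: float = CONTROL_HZ) -> float:
    """Fraction of wall-clock time the SoC spends on inference."""
    return latency_ms / open_loop_ms(chunk_size, control_hz)


for size in (16, 4):
    print(f"chunk={size:2d}  horizon={open_loop_ms(size):6.1f} ms  "
          f"calls/s={inferences_per_second(size):.2f}  "
          f"duty={duty_cycle(size, latency_ms=35.0):.0%}")
```

Under these assumptions a 16-step chunk keeps the SoC busy under 7% of the time, while a 4-step chunk pushes that above 25%, which is where the lost thermal headroom comes from.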

Numbers worth memorizing
A few headline figures from the published model cards and the Computex 2026 keynote anchor every conversation about this stack. GR00T N1.5 has 3 billion parameters, of which approximately 2.1 billion are the frozen Eagle-2 VLM backbone and 0.9 billion are the action expert and projection heads. The training compute is 240,000 H100-hours, the action chunk is 16 steps at 30 Hz (about 530 ms of open-loop), and the published end-to-end policy latency on Jetson Thor in FP8 is 30 to 40 ms per chunk. Cosmos Predict-2 ships in 4B and 14B variants. The 14B variant generates 57 frames at 720p in roughly 14 seconds on a single H100 in FP8 with 35 sampling steps. Cosmos Reason 1 ships at 4B and 11B, with the 11B variant scoring approximately 65% on the published Cosmos-Reason benchmark of physical-plausibility QA. Memorize these numbers. They are the basis of every realistic deployment plan you will write for the next twelve months.
A useful sanity check: if you find a vendor pitching a deployment that requires GR00T to run at less than 30 ms per chunk on hardware below Jetson Thor, or Cosmos Predict-2 at frame rates that imply sub-second per-dream throughput on a single H100, ask for the benchmark log. Both published numbers sit close to the current frontier, and claims that beat them should be treated as marketing until a measurement says otherwise.
Chaining patterns: three reference architectures
The Isaac GR00T N1.5 vs Cosmos pairing supports three distinct integration patterns. Most teams will use exactly one. A small number of advanced teams running heterogeneous fleets will use all three.
Pattern A: Cosmos as offline data engine, GR00T at runtime
This is the canonical GR00T-Dreams pipeline. Cosmos lives in the cluster generating dreams overnight. GR00T runs on the robot at 30 Hz. The robot never sees Cosmos. Best fit for teams who have one or two task families and want to scale training data without scaling teleop fleets.
Pattern B: Sim-to-real bridge with Cosmos Transfer
Cosmos Transfer takes a synthetic Isaac Sim scene plus its depth, segmentation, and edge maps and re-renders it with photoreal textures and lighting. This is the next-gen replacement for domain randomization. The pipeline is: Isaac Sim generates 100,000 episodes of synthetic demonstrations with a procedural policy; Cosmos Transfer photorealistically restyles every frame; the restyled episodes feed the GR00T fine-tune. Best fit for teams who already have a strong Isaac Sim or USD-based digital-twin pipeline and want to close the sim-to-real gap without collecting more real data.
Pattern C: Cosmos Reason as a slow planner on the robot
This pattern uses Cosmos Reason at the 4B-parameter tier as a System 2 planner that runs at 1 Hz on a server connected to the robot or on a high-tier Jetson Thor. Reason consumes the camera feed plus the user instruction and emits a plan in structured JSON. GR00T N1.5 consumes that plan plus the camera feed and emits actions at 30 Hz. The plan refreshes every second; actions refresh every 33 ms. Best fit for long-horizon tasks where the policy alone cannot maintain context — multi-step assembly, cluttered tabletop tasks, kitchen tasks.
# Reference deployment YAML for Pattern C on a humanoid with one Jetson Thor on-robot
# and one A6000 workstation off-robot.
on_robot:
  device: jetson-thor-128gb
  models:
    - name: groot-n1_5
      precision: fp8
      ckpt: nvidia/GR00T-N1.5-3B
      rate_hz: 30
      action_horizon: 16
off_robot:
  device: workstation-rtx-a6000
  link: zenoh-tcp-1gbe
  models:
    - name: cosmos-reason1-4b
      precision: bf16
      ckpt: nvidia/Cosmos-Reason1-4B
      rate_hz: 1
      max_new_tokens: 256
contract:
  reason_to_policy: plan_json
  policy_to_reason: vlm_thumbnail_every_1s
  fallback_on_link_loss: policy_continues_on_last_plan_for_5s
Across all three patterns the underlying contract is the same: GR00T owns the millisecond loop, Cosmos owns either the training corpus or the second-scale plan. They never compete for the same compute slot.
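For Pattern C, the fallback rule in the deployment contract (keep using the last plan for 5 seconds after link loss) can be sketched as a small selection function called from the 30 Hz loop. The `Plan` class and `select_plan` are hypothetical names. Returning `None` to signal a safe stop after the window expires is an assumption, since the contract does not say what happens next.

```python
from dataclasses import dataclass
from typing import Optional

PLAN_MAX_AGE_S = 5.0  # from fallback_on_link_loss in the deployment contract


@dataclass
class Plan:
    payload: dict       # structured plan as emitted by the planner
    received_at: float  # seconds, on the robot's monotonic clock


def select_plan(last_plan: Optional[Plan], now: float,
                max_age_s: float = PLAN_MAX_AGE_S) -> Optional[dict]:
    """Return the plan the 30 Hz policy should condition on, or None to stop safely."""
    if last_plan is None:
        return None
    if now - last_plan.received_at > max_age_s:
        return None  # link lost for too long: the stale plan is no longer trusted
    return last_plan.payload


plan = Plan(payload={"steps": ["grasp red block", "place in bin"]}, received_at=10.0)
print(select_plan(plan, now=10.5))   # fresh plan
print(select_plan(plan, now=14.9))   # link down, still inside the 5 s window
print(select_plan(plan, now=15.1))   # window expired
```

Keeping this check on the robot side means the millisecond loop never blocks on the network: it only reads a timestamp.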
A worked example: red-block pick on a Unitree H1
To make the chaining concrete, consider a published task family from the GR00T-Dreams blueprint: red-block pick on a Unitree H1. The team has 8 H1s, 40 hours of real teleop data across all 8 robots, and a Computex demo deadline. The pipeline they actually ran (per the NVIDIA developer-blog walkthrough) was: render 100 variations of the workspace in Isaac Sim with USD assets; capture one seed image per variation; for each seed, generate 50 Cosmos dreams with the same instruction; run the IDM to extract 56-step trajectories; filter with Cosmos Reason at a 0.6 threshold; mix the surviving ~2,300 dreams per task with the 40 hours of real teleop at a 30% blend; fine-tune GR00T N1.5’s action expert for 8 hours on 16 H100s. End-to-end wall clock was 38 hours. End-to-end real human time on the data side was about 2 hours of setup. Success rate on a held-out set of unseen workspaces (different lighting, different bin placement) climbed from 41% with teleop-only fine-tune to 73% with the Dreams-augmented fine-tune.
That headline gain — almost double the success rate on held-out workspaces — is what is driving the rapid adoption of the chained pattern across humanoid teams.
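The numbers in the worked example hang together, which is worth verifying before reusing them in a plan. The sketch below uses only figures from this post; the 14 seconds per dream is the single-H100 figure quoted for Predict-2, and treating it as the per-dream cost here is an assumption.

```python
SEEDS, DREAMS_PER_SEED = 100, 50
SECONDS_PER_DREAM = 14.0           # single-H100 figure quoted for Predict-2
SURVIVORS = 2_300                  # dreams left after Cosmos Reason filtering
FINE_TUNE_HOURS, FINE_TUNE_GPUS = 8, 16
TELEOP_ONLY, WITH_DREAMS = 0.41, 0.73

generated = SEEDS * DREAMS_PER_SEED
retention = SURVIVORS / generated
dream_gpu_hours = generated * SECONDS_PER_DREAM / 3600
fine_tune_gpu_hours = FINE_TUNE_HOURS * FINE_TUNE_GPUS
relative_gain = WITH_DREAMS / TELEOP_ONLY

print(f"dreams generated: {generated:,}")
print(f"retention after filtering: {retention:.0%}")
print(f"dreaming: {dream_gpu_hours:.1f} H100-hours")
print(f"fine-tune: {fine_tune_gpu_hours} H100-hours")
print(f"success-rate ratio: {relative_gain:.2f}x")
```

The 46% retention sits inside the 35–50% range quoted for the filter, and the fine-tune, not the dreaming, dominates the GPU budget.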
Trade-offs and failure modes
The Isaac GR00T N1.5 vs Cosmos pairing is powerful but each side has documented failure modes you have to design around. Treating either model as oracular is the fastest way to ship a humanoid that fails strangely.
Where GR00T N1.5 breaks
Out-of-distribution objects. A policy trained on red blocks rarely generalizes to translucent red bottles. FLARE helps but does not eliminate the problem. Embodiment transfer is also limited: NVIDIA’s published cross-embodiment numbers show meaningful degradation when moving from the 28-DoF humanoid spec to a 7-DoF arm without re-fine-tuning the action expert. And the action chunking strategy that gives GR00T its latency advantage is also its biggest open-loop liability — if the world changes mid-chunk, the robot only notices at the next chunk boundary.
The published evaluation suites are also generous. GR00T N1.5 is benchmarked on Isaac Lab’s MimicGen tasks and the LeRobot demo suite. Real-world data from independent partners has been thinner, and the policy’s success rate on contact-r
