NVIDIA GB300 NVL72: Blackwell Ultra Architecture Explained
Rack-scale AI clusters are no longer optional infrastructure — they’re table stakes for 2026. The NVIDIA GB300 NVL72 system, shipping now across hyperscale labs, represents the largest coherent memory fabric ever built, combining 72 Blackwell Ultra GPUs, 36 Grace CPUs, and NVLink 5 switching in a single 120 kW thermal envelope. This is the NVIDIA GB300 NVL72 architecture that’s reshaping how teams train 100B+ parameter models and run multi-trillion-token inference workloads. If you’re architecting a new data center or optimizing a GPU cluster for next-generation LLMs, understanding GB300 isn’t optional. This post breaks down the silicon, the fabric, the cooling, and the trade-offs.
What this post covers: the GB300 silicon step-up from Blackwell, the NVL72 rack topology and NVLink 5 coherence model, thermal and power challenges, real-world deployment fit, and a decision tree for when GB300 makes sense versus alternatives.
What GB300 NVL72 Actually Is
NVIDIA announced the GB300 family of Blackwell Ultra GPUs at GTC 2025, with first deployments arriving in Q1 2026 across major cloud providers and research labs. The NVL72 reference system is the full-rack configuration: 18 compute trays, each housing four Blackwell Ultra GPUs and two Grace CPUs, plus 9 NVLink switch trays providing fully-connected fabric at 1.8 TB/s per GPU.
The GB300 sits one tier above the GB200 in the Blackwell stack. Where GB200 offers 192 GB HBM3e per GPU and focuses on mainstream data center scaling, GB300 delivers 288 GB HBM3e — a 50% capacity jump — plus higher FP4 (4-bit floating point) throughput, a third-generation Transformer Engine for sparsity, and a decompression engine for on-the-fly model weight decompression. The upgrade targets workloads where memory bandwidth and model weight capacity are bottlenecks: long-context retrieval, sparse expert models (MoE), and mixed-precision inference at trillion-parameter scale.
The NVL72 is not just 72 random GPUs linked by InfiniBand. Every GPU in the rack sees every other GPU as local memory via NVLink 5; there is no PCIe egress for GPU-to-GPU data. This coherent fabric is critical: a 100-billion-parameter model can be sharded across all 72 GPUs without crossing a slower network boundary, and inter-GPU gradient exchange during backprop flows through the ~130 TB/s collective fabric. Early customers (xAI, Lambda Labs research pods, and a handful of OEM integrators) started deploying these racks in March 2026.
GB300 NVL72 Rack-Scale Reference Architecture
The NVL72 architecture is a precision exercise in co-design: compute, memory, switching, power, and cooling are tightly coupled.

Architecture overview: The rack is a 42U form factor. Inside sit 18 horizontal compute trays (20-inch form factor, stacked into 9 double-wide slots) and 9 NVLink switch trays. Each compute tray carries four B300 GPUs, arranged in a quad topology, and two Grace CPUs with 192 GB of LPDDR5X system memory each. The Grace CPUs are not coprocessors; they handle the OS, system software, and a share of data movement. Each NVSwitch tray, meanwhile, is a switching fabric tray: two NVSwitch 5 chips per tray, each switching 7.2 TB/s across its NVLink 5 ports. Inside the rack the fabric is fully connected, with any GPU reaching any other GPU through a single switch hop; scale-out beyond the rack goes over external 400G Ethernet NICs.
Compute Density and Memory Per Tray
Each compute tray packs 4 × B300 (totaling 1.152 TB of HBM3e), 2 × Grace with LPDDR5X system memory (384 GB across the tray), and a PCIe 6.0 root complex for storage and NIC attach. Within each B300 package, the two dies are joined by a 10 TB/s bidirectional NV-HBI (die-to-die) link, and 18 external NVLink 5 ports connect each B300 into the switch fabric. Per-tray power draw is ~6.7 kW sustained, meaning the full compute tier (18 trays) consumes approximately 120 kW. NVLink 5 PHYs are integrated directly on the GPU die, not in separate controllers, which cuts latency and improves coherence.
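As a quick sanity check, the per-tray and per-rack totals quoted above fall straight out of the per-GPU figures:

```python
# Sanity-check the per-tray and per-rack figures quoted above.
gpus_per_tray, trays_per_rack = 4, 18
hbm_per_gpu_gb = 288
tray_power_kw = 6.7

print(gpus_per_tray * hbm_per_gpu_gb)             # 1152 GB of HBM3e per tray
print(gpus_per_tray * trays_per_rack)             # 72 GPUs per rack
print(round(tray_power_kw * trays_per_rack, 1))   # ~120.6 kW for the compute tier
```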
NVLink 5 Fabric and Switch Topology
The 9 switch trays provide a non-blocking, fully-connected fabric for the 72 GPUs. Each GPU has 18 external NVLink 5 links (100 GB/s each, 1.8 TB/s per GPU) plus its intra-package NV-HBI link, giving a theoretical maximum of 72 × 1.8 TB/s = 129.6 TB/s aggregate. The fabric is flat: each GPU's 18 links fan out across the 18 NVSwitch 5 chips, so any GPU-to-GPU path crosses exactly one switch, and unlike a traditional Clos network every hop speaks NVLink 5's point-to-point coherence protocol with no extra packetization or transport-protocol overhead. All-reduce operations across all 72 GPUs complete in a few microseconds, which is critical for distributed training.
The fully-connected property is a force multiplier for LLM training. A model using Fully Sharded Data Parallel (FSDP) can shard across all 72 GPUs without worrying about slow cross-node network hops. The model-parallel dimension maps to GPU pairs, data parallel to groups of 8 or 16, and everything happens in-fabric.
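As a rough illustration of what that looks like in practice, here is a minimal FSDP sketch; the model, sizes, and launch details are placeholders rather than a tuned GB300 recipe, and the point is simply that parameter, gradient, and optimizer-state exchange stays on the NVLink fabric.

```python
"""Minimal FSDP sharding sketch (illustrative, not a tuned GB300 recipe).
Launch with: torchrun --nproc_per_node=<gpus> fsdp_sketch.py"""
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Stand-in transformer stack; a real 100B+ parameter model would be built here.
    model = nn.Sequential(*[nn.TransformerEncoderLayer(d_model=4096, nhead=32)
                            for _ in range(8)]).cuda()

    # FSDP shards parameters, gradients, and optimizer state across all ranks.
    sharded = FSDP(model, device_id=local_rank)
    optim = torch.optim.AdamW(sharded.parameters(), lr=1e-4)

    x = torch.randn(128, 8, 4096, device="cuda")   # (seq, batch, d_model)
    loss = sharded(x).pow(2).mean()
    loss.backward()        # gradient reduce-scatter runs over the fabric
    optim.step()

if __name__ == "__main__":
    main()
```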
Liquid Cooling and Power Architecture
The NVL72 uses direct liquid cooling (DLC), not air cooling. Facility chilled water at ~15°C feeds a Coolant Distribution Unit (CDU) at the base of the rack. The CDU isolates that primary (facility) loop from a secondary loop that feeds cold plates attached to every B300 GPU, every Grace CPU, every NVSwitch tray, and the power conversion modules. Return water exits each cold plate at ~35°C maximum, and the CDU includes a redundant heat exchanger for facility isolation.
Power distribution uses a 415V 3-phase input, fed to four industrial-grade PSUs (each ~30 kW, N+1 redundancy). From there, a 48 VDC bus bar runs along the frame, and each compute tray receives 12 VDC rails via isolated DC-DC modules. Every single 12 V supply is monitored and fused. This is not consumer-grade — it’s mission-critical data center infrastructure.
Facility implications are substantial: you need sub-15°C facility chilled water supply, 120 kW of 3-phase 415V power, redundant power and cooling loops, and a drain system rated for 50 L/min flow. Most colocation facilities in APAC, EMEA, and North America can handle this, but it rules out smaller edge or branch offices.
Blackwell Ultra Silicon — What Changed vs Blackwell
The Blackwell Ultra (B300) package is a meaningful step up from Blackwell (B200), even though both are manufactured on TSMC 4NP.

Dual-die, reticle-limited design: Each B300 is a two-die package, with each die close to the reticle limit on 4NP (roughly 800 mm²). The dies are connected by a 10 TB/s NV-HBI link (NVIDIA's proprietary coherent die-to-die interconnect), allowing them to act as a single GPU with unified memory. B200 uses the same dual-die construction; Blackwell Ultra pushes memory capacity, FP4 throughput, and thermal density further within that package. The dual-die approach trades yield for raw capacity, and NVIDIA is absorbing the 15–20% yield loss in pricing.
HBM3e Capacity: Each B300 is bonded to 8 stacks of HBM3e memory (36 GB per stack, for a total of 288 GB per GPU). That's 50% more than GB200's 192 GB. Bandwidth is substantial too: the 8192-bit interface (1024 bits per stack) delivers roughly 8 TB/s per GPU, or about 32 TB/s of aggregate HBM bandwidth across a four-GPU tray, and NVIDIA layers aggressive prefetching on top to hide latency.
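A back-of-envelope check of that figure, assuming eight 1024-bit stacks and an ~8 Gbps per-pin data rate (an assumption, not an official spec):

```python
# Rough HBM3e bandwidth estimate from bus width and an assumed pin rate.
stacks = 8
bits_per_stack = 1024            # HBM3e interface width per stack
pin_rate_gbps = 8.0              # assumed Gbit/s per pin
bus_bits = stacks * bits_per_stack            # 8192-bit total interface
bandwidth_gbs = bus_bits * pin_rate_gbps / 8  # bytes per second, in GB/s
print(f"~{bandwidth_gbs / 1000:.1f} TB/s per GPU")    # ~8.2 TB/s
```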
Fifth-generation Tensor Cores: The 5th-gen Tensor Cores in B300 run native FP4 (4-bit floating point) matrix operations at markedly higher rates than B200: roughly 15 PFLOPS of dense FP4 per GPU versus roughly 10 PFLOPS on GB200. This is a major win for speculative inference and quantized weight storage. For LLMs using 4-bit quantized weights and 8-bit activations, this unlocks up to 2x throughput.
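To make the storage side concrete, here is an illustrative blockwise 4-bit weight quantizer in PyTorch. It uses symmetric integer levels rather than the FP4 (E2M1) format the Tensor Cores actually consume, so treat it as a software analogue of 4-bit weight storage, not the hardware path:

```python
import torch

def quantize_4bit(w: torch.Tensor, block: int = 128):
    """Quantize a weight matrix to 4-bit codes (-8..7) with per-block scales."""
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    codes = torch.clamp(torch.round(flat / scale), -8, 7).to(torch.int8)
    return codes, scale

def dequantize_4bit(codes: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    """Expand 4-bit codes back to a full-precision tensor of the original shape."""
    return (codes.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
codes, scale = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scale, w.shape)
print(f"mean abs quantization error: {(w - w_hat).abs().mean():.4f}")
```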
Third-generation Transformer Engine: The third iteration of NVIDIA’s Transformer Engine handles dynamic sparsity automatically — if 30% of activations are zero, the engine skips those computations. For sparse expert models (Mixture of Experts) with activation sparsity >40%, this yields measurable wall-clock speedup. The engine also handles the common-case 16-bit → 8-bit reduction used in LoRA fine-tuning.
Decompression Engine: Brand new in B300, this dedicated silicon decompresses model weights on-the-fly from 4-bit or 8-bit storage into working precision (16-bit) without stalling the compute pipeline. Combined with Transformer Engine sparsity, this allows a 70B parameter model to fit entirely within a single B300’s 288 GB memory with room for KV cache and gradients.
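A rough memory-budget calculation shows why that claim is plausible; the layer count, attention geometry, and context length below are illustrative assumptions, not GB300 measurements:

```python
# Back-of-envelope memory budget for a 70B-parameter model on one 288 GB GPU.
params = 70e9
weights_4bit_gb = params * 0.5 / 1e9     # 4-bit storage: 0.5 bytes per parameter
weights_16bit_gb = params * 2.0 / 1e9    # 16-bit working precision

# KV cache for one request: 2 (K and V) * layers * kv_heads * head_dim * seq_len * 2 bytes (FP16)
layers, kv_heads, head_dim, seq_len = 80, 8, 128, 128_000
kv_cache_gb = 2 * layers * kv_heads * head_dim * seq_len * 2 / 1e9

print(f"weights at 4-bit:  {weights_4bit_gb:6.1f} GB")            # ~35 GB
print(f"weights at 16-bit: {weights_16bit_gb:6.1f} GB")           # ~140 GB
print(f"KV cache (128k-token context): {kv_cache_gb:6.1f} GB")    # ~42 GB
print("HBM per B300: 288 GB")
```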
Second-generation Tensor Memory Accelerator (TMA): The TMA is NVIDIA’s asynchronous tensor-memory-to-register engine, used to pipeline data movement and hide latency in tensor operations. B300’s TMA v2 doubles the prefetch bandwidth and supports more prefetch patterns, making it easier to hide HBM latency on long-sequence inference.
NVLink 5 Fabric & Memory Coherence
NVLink 5 is not just a faster bus — it’s a coherent memory protocol.

NVLink 5 Bandwidth: Each B300 exposes 18 external NVLink 5 ports at 100 GB/s each (50 GB/s per direction), for 1.8 TB/s of bidirectional bandwidth per GPU and 129.6 TB/s aggregate across all 72 GPUs. (The intra-package NV-HBI link adds another 10 TB/s per GPU, roughly 40 TB/s per tray, but is not counted in the all-reduce budget.)
Fully-Connected Fabric: The 9 switch trays form a non-blocking fabric. Every GPU can talk to every other GPU at its full 1.8 TB/s line rate with no oversubscription inside the rack. This is the single largest advantage over multi-node HGX clusters, where traffic beyond a single node has to cross an InfiniBand or Ethernet Clos fabric, typically with oversubscription. Collective operations (all-reduce, all-gather, reduce-scatter) complete in 1–3 microseconds for small payloads, depending on the pattern.
Coherence Model: NVIDIA uses a directory-based cache-coherence protocol over NVLink. When GPU-0 writes to a memory address backed by GPU-1’s HBM, GPU-1’s local L2 cache is invalidated, and GPU-0’s write is persisted to the HBM directly. This is write-through, not write-back, so latency is deterministic. For training, this means gradient writes from one GPU to a model-shard GPU are immediately visible — no software synchronization needed beyond the usual PyTorch .backward() and .step() calls.
All-Reduce Latency: Using the NVSwitch fabric, an all-reduce across all 72 GPUs with 1 GB tensor payload takes ~150–200 microseconds, dominated by the computation time for reduction, not the communication time. This is 100x faster than all-reduce over 100 Gbps Ethernet, enabling synchronous gradient averaging without batching delays.
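If you want to measure collective latency yourself, a minimal NCCL probe looks like the sketch below; it is a crude timing loop launched with torchrun (one process per GPU), not NVIDIA's benchmark methodology, and the 1 GB payload matches the example above.

```python
"""All-reduce latency probe. Launch with: torchrun --nproc_per_node=<gpus> allreduce_probe.py"""
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    payload = torch.ones(256 * 1024 * 1024, dtype=torch.float32, device="cuda")  # 1 GB

    for _ in range(5):                      # warm up NCCL communicators and kernels
        dist.all_reduce(payload)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    mean_s = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        print(f"mean all-reduce time, 1 GB payload: {mean_s * 1e6:.1f} us")

if __name__ == "__main__":
    main()
```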
Power, Thermals, and Liquid Cooling
A full NVL72 rack dissipates ~120 kW sustained. During peak training (matrix multiply-heavy), some thermal modeling shows transient peaks at 130 kW, but continuous operation is designed for 120 kW.

Cooling Capacity: Direct liquid cooling is mandatory. The CDU must supply at least 40 L/min of 15°C water, and the facility heat exchanger must reject 120 kW to ambient (or to a secondary loop). A typical facility will run a closed-loop CDU fed by chilled water at 10–12°C, accepting return water at 35°C. The thermal gradient is aggressive — we’re relying on good contact between GPU die and cold plate, and on high-conductivity thermal interface material (likely indium or silver-loaded epoxy).
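For rough sizing, the steady-state heat balance Q = ṁ · c · ΔT ties heat load, coolant flow, and temperature rise together. The sketch below assumes pure water and that the full rack load reaches the secondary loop; real CDU sizing adds margin on top of whatever this returns.

```python
# Coolant flow needed to carry a given heat load at a given temperature rise.
RHO_KG_PER_L = 0.997     # density of water near 25 degC
CP_J_PER_KG_K = 4186.0   # specific heat of water

def flow_lpm(heat_kw: float, delta_t_k: float) -> float:
    """Volumetric flow in L/min required to carry heat_kw at a rise of delta_t_k."""
    mass_flow_kg_s = heat_kw * 1000.0 / (CP_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_s / RHO_KG_PER_L * 60.0

for dt in (10, 20, 30):
    print(f"120 kW at dT = {dt:>2} K -> {flow_lpm(120, dt):6.1f} L/min")
```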
Power Distribution: The 415V 3-phase supply feeds four PSUs, each rated for 30 kW at 415V input. No single PSU is sized to power the entire rack; the design assumes N+1 redundancy at the PSU level. The 48 VDC bus is then stepped down to 12 VDC at each tray, with per-tray monitoring and fusing. This architecture prevents a single component failure from cascading into a full-rack outage.
Facility Readiness: Deploying GB300 requires:
– Chilled water supply at 10–15°C, flow rate ≥50 L/min.
– 415V 3-phase power, ≥150 kW available capacity (including overheads).
– Floor-level drain rated for ≥50 L/min continuous flow (during coolant changes or emergencies).
– Ambient operating temperature ≤32°C (to maintain facility chilled-water equipment efficiency).
If your data center was built for 5–8 kW per-rack average, GB300 is a capacity upgrade trigger.
Trade-offs and Where GB300 Falls Short
Despite its strengths, GB300 and the NVL72 have real limitations.
Facility Readiness: Most colocation providers can provision the power and cooling, but not all. If you’re in a smaller metro or a branch office with shared infrastructure, GB300 may not fit. Liquid cooling adds operational overhead — CDU maintenance, coolant replacement every 3–5 years, and leak detection are non-trivial.
Cost: A fully-configured NVL72 rack runs $3–4 million USD, not including installation, integration labor, or facility upgrades. This is 2–3x the cost of an equivalent HGX B200 node cluster, amortized over 3-year deployment lifetimes. The ROI only works if you’re running workloads that saturate the fabric and memory (100B+ parameter models, trillion-token training runs, or multi-tenant inference pods with tight SLA coupling).
HBM Supply: NVIDIA’s HBM3e supply is constrained until late 2026. SK Hynix (NVIDIA’s primary HBM supplier) is ramping, but not fast enough to meet demand. Expect lead times of 6–12 months for new orders, and prioritization going to hyperscalers with large upfront commitments.
Vendor Lock-in: The NVLink 5 fabric is proprietary to NVIDIA — if you’ve sharded a model across NVLink, you can’t easily port it to an AMD MI300X cluster or Intel Gaudi 3 pod. This is a strategic risk if you need multi-vendor optionality for supply resilience.
Software Maturity: CUDA 13.1 (released March 2026) is required for full B300 support. Older codebases targeting CUDA 12.x need porting work. PyTorch and JAX support is solid, but some HPC libraries (OpenFOAM, GROMACS, etc.) lag behind. If you’re running legacy HPC workloads, expect 1–2 quarters of porting before GB300 is productive.
Practical Recommendations
GB300 NVL72 is the right choice if:
- 100B+ parameter training: You’re training models larger than 100 billion parameters and need the full 288 GB per GPU for the model shard, optimizer state, and gradients. Smaller HGX B200 nodes will run out of memory or require aggressive model parallelism.
- Coherent fabric is critical: Your workload benefits from microsecond-scale all-reduce (synchronous data-parallel training across 64+ GPUs). Asynchronous or gradient-accumulation patterns work fine on Ethernet-connected clusters; coherent memory doesn’t help there.
- Facility is ready: Your data center already has sub-15°C chilled water, 3-phase 415V power, and liquid-cooled infrastructure from prior deployments.
- Capital is available: You can justify $3–4M per rack over a 36-month payback period.
If you’re running inference-only workloads (LLM serving, embedding models) and don’t have extreme latency SLAs (<50 ms for 1,000-token decoding), a smaller HGX B200 pod is roughly 50% cheaper and 80% as performant for batch inference. If you’re doing fine-tuning or LoRA on <70B parameter models, GB300 is overkill: a single or dual DGX H200/B100 node, or even a pair of H100 nodes on InfiniBand, is cost-competitive.
For decision-making: use the workload decision tree to match model size, batch latency, and facility constraints to the right platform.
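A toy version of that decision tree, encoding the rough thresholds from this post (the function name and cut-offs are illustrative, not official sizing guidance):

```python
def recommend_platform(model_params_b: float,
                       needs_fast_allreduce: bool,
                       facility_liquid_ready: bool,
                       budget_musd_per_rack: float) -> str:
    """Map workload and facility constraints to a platform, using this post's rough thresholds."""
    if model_params_b >= 100 and needs_fast_allreduce:
        if not facility_liquid_ready:
            return "GB300 fits the workload, but upgrade facility power/cooling first"
        if budget_musd_per_rack < 3:
            return "Workload fits GB300, but budget points to HGX B200 clusters"
        return "GB300 NVL72"
    if model_params_b >= 70:
        return "HGX B200 pod (cheaper, ~80% of batch-inference throughput)"
    return "DGX H200/B100 node or paired H100 nodes (fine-tuning / LoRA scale)"

print(recommend_platform(180, True, True, 4))    # -> GB300 NVL72
print(recommend_platform(30, False, True, 4))    # -> smaller platform
```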

FAQ
Q: How is GB300 different from GB200?
A: GB300 adds 50% more HBM3e (288 GB vs. 192 GB), 1.5x native FP4 throughput, a decompression engine for on-the-fly weight unpacking, and a third-generation Transformer Engine with automatic sparsity. It’s a silicon-and-software refresh for models that hit memory or compute ceilings on GB200.
Q: What is NVLink 5 bandwidth per GPU?
A: Each GPU has 18 external NVLink 5 links at 100 GB/s each, for 1.8 TB/s of bidirectional bandwidth per GPU. The intra-package NV-HBI link adds another 10 TB/s between a GPU’s two dies, and across all 72 GPUs the external NVLink fabric totals 129.6 TB/s.
Q: How many GPUs and CPUs in an NVL72?
A: 72 Blackwell Ultra (B300) GPUs, 36 Grace CPUs (one per two GPUs), organized into 18 compute trays and 9 switch trays. The full rack is 42 RU.
Q: What is the memory per Blackwell Ultra GPU?
A: 288 GB of HBM3e, built from 8 stacks of 36 GB each, delivering roughly 8 TB/s of peak bandwidth per GPU. Intra-GPU die-to-die bandwidth (NV-HBI) adds another 10 TB/s per dual-die package.
Q: Can existing data centers host GB300?
A: Only if they have sub-15°C chilled water supply, 3-phase 415V power (≥150 kW capacity), and floor-level drainage. Most hyperscaler facilities (AWS, Google, Azure regions) qualify. Smaller colocation providers or branch offices may not. Liquid cooling is mandatory.
Further Reading
For deeper context on AI infrastructure and related platforms, see:
- NVIDIA Spectrum-X AI Ethernet Fabric for 100K GPU Clusters — how GB300 racks connect at scale via Spectrum-X for multi-rack clusters.
- Edge LLM Benchmark: Jetson Orin, Llama, Phi, Gemma (Q2 2026) — for inference workloads that fit on edge, a contrast to rack-scale training.
- Karpenter Node Autoscaling in Kubernetes: Production Deep Dive — orchestration patterns for managing GPU cluster fleets.
For authoritative technical specifications:
– NVIDIA GB300 Product Brief — official product documentation.
– SemiAnalysis and ServeTheHome GPU analysis — independent teardowns of Blackwell thermal, power, and rack integration (available via their latest published 2026 hardware reviews).
Author: Riju — see /about.
