AMD MI400 Instinct Architecture for AI Training (2026)

The AMD MI400 Instinct architecture is the first AMD accelerator designed from the ground up for rack-scale AI training, not just inference. Announced at AMD Advancing AI 2025, MI400X moves from a 2.5D chiplet design to a rack-scale “Helios” platform with 432 GB of HBM4 per package, ~19.6 TB/s of memory bandwidth, and a 1.8 TB/s UALink fabric that finally gives AMD a credible answer to Nvidia’s NVLink scale-up domain. The bet is bigger than a single chip: AMD is wagering that open standards — UALink for scale-up, Ultra Ethernet for scale-out, ROCm for software — will erode Nvidia’s NVLink/InfiniBand/CUDA moat by the time Rubin ships in volume.

Architecture at a glance

Architecture diagram: AMD MI400 Instinct Architecture for AI Training (2026)

This post breaks down what AMD actually disclosed for MI400, what is still estimated, how CDNA 4 (and the implied CDNA-Next on MI400) differs from MI300X, and where MI400 fits versus Nvidia Rubin and Google TPU v6 Trillium-2. It is written for architects sizing 2026 training clusters, not for buy-side analysts.

Context: how AMD got here

AMD’s Instinct line went from “viable second source” to “actually competitive” in 18 months, but only for memory-bound inference. MI300X (Dec 2023) shipped 192 GB of HBM3; MI325X (Oct 2024) refreshed to 256 GB HBM3E; MI355X (mid-2025) introduced CDNA 4 with FP4/FP6 and 288 GB HBM3E; MI400X (2026) jumps to HBM4 and a rack-scale fabric. Training share remains tiny — the gap was never raw FLOPS, it was scale-up bandwidth and ROCm maturity.

MI300X already won meaningful inference deployments. Microsoft Azure runs GPT-4-class workloads on ND MI300X v5 instances; Meta uses MI300X for Llama 3 inference; Oracle Cloud Infrastructure deploys MI300X bare metal shapes (BM.GPU.MI300X.8) for inference and fine-tuning. The thesis was always the same: roughly 1.4x to 2.4x the HBM capacity of H200 (141 GB) and H100 (80 GB) at a lower $/GB, which matters enormously for KV-cache-bound workloads like Mixture-of-Experts and long-context inference. A 70B-parameter model in BF16 needs about 140 GB for weights alone, so it fits on a single 192 GB MI300X without tensor parallelism, something H100 80GB cannot do. (A 175B-parameter model in BF16 needs about 350 GB and still has to be sharded or quantized.)

What MI300X did not win was multi-node training. NVLink 4 on H100 provides 900 GB/s of bidirectional bandwidth per GPU through NVSwitch, so any GPU can drive close to that full rate at any peer in the 8-GPU node. The Infinity Fabric on an MI300X 8-GPU OAM platform delivers ~896 GB/s aggregate per GPU, but as seven direct point-to-point links (128 GB/s each) in a mesh rather than a switched all-to-all, so a single pair of GPUs sees only one link’s worth of bandwidth. Worse, AMD had no scale-up answer to NVSwitch beyond 8 GPUs. For training a 400B-parameter dense model, that meant ROCm clusters needed more tensor parallelism, hit more communication overhead, and lost to NVLink-Switch GH200 NVL32 / Blackwell NVL72 systems on iteration time. MI400 is AMD’s response.
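To make the mesh-versus-switch difference concrete, here is a small sketch of the bandwidth available between one pair of GPUs. It assumes the ~896 GB/s aggregate is divided evenly across direct links to the other seven GPUs and that the NVSwitch fabric is non-blocking; both are simplifying assumptions, and the function names are illustrative only.

```python
def mesh_pair_bandwidth_gbs(aggregate_gbs: float, n_gpus: int) -> float:
    """Bandwidth between one GPU pair when the aggregate is split over direct links."""
    if n_gpus < 2:
        raise ValueError("need at least two GPUs")
    return aggregate_gbs / (n_gpus - 1)


def switched_pair_bandwidth_gbs(per_gpu_gbs: float) -> float:
    """With a non-blocking switch, one GPU can aim its full port bandwidth at one peer."""
    return per_gpu_gbs


# Figures quoted in the text above (per-GPU, bidirectional).
mi300x_pair = mesh_pair_bandwidth_gbs(896.0, 8)
h100_pair = switched_pair_bandwidth_gbs(900.0)

print(f"MI300X direct links, one pair: {mi300x_pair:.0f} GB/s")
print(f"H100 NVSwitch, one pair:       {h100_pair:.0f} GB/s")
print(f"ratio: {h100_pair / mi300x_pair:.1f}x")
```

The aggregate numbers are nearly identical; the gap only shows up when traffic concentrates on a few peers, which is what happens in small tensor-parallel groups.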

The shift in 2026 is structural. Training compute is moving from 8-GPU servers to 72-to-128-GPU coherent domains because frontier models are MoE designs with 1T+ total parameters. If your scale-up domain is 8 GPUs, you pay a 5–10x latency penalty crossing into scale-out for every gradient sync. That tax compounds across millions of training steps. AMD’s UALink answer, together with the rack-scale Helios system that wraps it, is the actual product, not the silicon.

MI400X reference architecture

AMD MI400 Instinct architecture package diagram with XCDs, AIDs, HBM4 stacks, and Infinity Fabric

Answer-first: MI400X is a 3.5D advanced-packaged accelerator built on TSMC N3P-class compute dies and N6-class active interposer dies, with eight HBM4 stacks delivering 432 GB of capacity and ~19.6 TB/s of bandwidth per package. AMD has disclosed FP4 peak of 40 PFLOPs and FP8 peak of 20 PFLOPs per MI400X, with 300 GB/s UALink ports per GPU aggregating to 1.8 TB/s of scale-up bandwidth, organized into a 72-GPU “Helios” rack.

The package follows MI300’s disaggregated topology but at higher integration. MI300X used 8 Accelerator Complex Dies (XCDs) on TSMC N5 stacked on 4 IO Dies (AIDs) on N6, with 8 stacks of HBM3 around the perimeter. MI400X retains the XCD-over-AID split but on newer nodes — AMD has not confirmed final geometry, though SemiAnalysis and AnandTech coverage of AMD Advancing AI 2025 indicate a higher XCD count per package and a redesigned Infinity Fabric AP (Advanced Package) link between dies. The active interposer continues to host shared L3 cache, memory controllers, and the fabric switch.

What changed at the silicon level for CDNA-Next on MI400 is the matrix engine. CDNA 4 on MI355X added native FP4 and FP6 matrix instructions and doubled the matrix throughput per CU at lower precisions, hitting 20.1 PFLOPs FP4 sparse on MI355X according to AMD’s MI355X spec sheet. MI400X roughly doubles that again to ~40 PFLOPs FP4, achieved through more CUs, higher clocks, and architectural improvements in the matrix pipeline — though AMD has not published the per-CU breakdown. The CU itself remains a wave64-based GCN-derived design with separate vector ALUs and dedicated matrix cores; CDNA does not borrow Nvidia’s tensor-core-as-mandatory model.

The HBM4 jump is the headline number. HBM4 doubles the per-stack channel count from 16 to 32, widening the interface to 2048 bits; per JEDEC’s JESD270-4 HBM4 standard published in April 2025, peak per-stack bandwidth reaches ~2.0 TB/s at 8 Gbps/pin. AMD’s 19.6 TB/s package figure implies ~2.45 TB/s per stack across 8 stacks, suggesting AMD is using HBM4 at the higher 9.6+ Gbps/pin grades that Samsung and SK hynix sampled in 2025. Capacity per stack moves from 36 GB on HBM3E 12-Hi to 48 GB or 64 GB on HBM4 16-Hi; 432 GB / 8 stacks = 54 GB/stack, which matches neither, meaning that with eight stacks AMD would have to use mixed 12-Hi and 16-Hi configurations or 16-Hi parts at non-maximum die count. Twelve 36 GB stacks would also total 432 GB (and would put per-stack bandwidth at ~1.63 TB/s), so the stack count is worth verifying against AMD’s final disclosure.
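The per-stack arithmetic above can be checked with a short sketch. It assumes the 2048-bit HBM4 stack interface and takes the package figures (19.6 TB/s, 432 GB, 8 stacks) at face value; the function names are made up for illustration.

```python
HBM4_BUS_WIDTH_BITS = 2048  # per-stack interface width in HBM4


def stack_bandwidth_tbs(pin_rate_gbps: float, bus_width_bits: int = HBM4_BUS_WIDTH_BITS) -> float:
    """Peak per-stack bandwidth in TB/s for a given per-pin rate in Gb/s."""
    gigabytes_per_s = pin_rate_gbps * bus_width_bits / 8  # bits -> bytes
    return gigabytes_per_s / 1000


def implied_pin_rate_gbps(package_tbs: float, stacks: int,
                          bus_width_bits: int = HBM4_BUS_WIDTH_BITS) -> float:
    """Per-pin rate needed to reach a package bandwidth with a given stack count."""
    per_stack_gbs = package_tbs * 1000 / stacks
    return per_stack_gbs * 8 / bus_width_bits


baseline = stack_bandwidth_tbs(8.0)          # JEDEC 8 Gbps/pin grade
pin_rate = implied_pin_rate_gbps(19.6, 8)    # what 19.6 TB/s over 8 stacks requires
capacity_per_stack = 432 / 8                 # GB per stack under the 8-stack assumption

print(f"8 Gbps/pin stack: {baseline:.3f} TB/s")
print(f"implied pin rate: {pin_rate:.2f} Gbps")
print(f"capacity/stack:   {capacity_per_stack:.0f} GB")
```

The implied pin rate comes out a little under 9.6 Gbps, which is why the text points at the faster HBM4 speed grades rather than the 8 Gbps baseline.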

Package thermal design power is where MI400X gets uncomfortable. MI300X was 750W; MI325X was 1000W; MI355X is 1400W with direct liquid cooling mandatory. MI400X is widely estimated at 1500–1800W per package, which forces rack-level liquid cooling: AMD’s Helios reference design uses cold-plate DLC on every GPU and every switch. The thermal envelope is in the same range as estimates for Nvidia’s Rubin-generation packages, and effectively means air-cooled deployments are dead for frontier training silicon in 2026.

The fabric story is the actual differentiator. Each MI400X exposes UALink ports totalling 1.8 TB/s bidirectional, terminating in Broadcom-designed UALink switches inside the Helios rack. This gives a 72-GPU coherent domain with cache coherency for atomics and load/store semantics — the same primitives NVLink offers, on an open spec. We will dig into the protocol in the next section.

UALink scale-out training step sequence diagram showing gradient all-reduce across 8 AMD MI400X accelerators

Answer-first: UALink is a load/store memory-semantic interconnect for scale-up GPU domains of up to 1,024 endpoints. It uses Ethernet PHYs (200G/lane) with a slimmed-down transaction layer borrowed from Infinity Fabric, providing cache-coherent memory operations at single-digit microsecond latencies. Ultra Ethernet handles scale-out across racks. Together they replace NVLink+InfiniBand with open multi-vendor standards backed by AMD, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, and Google.

The UALink Consortium’s 1.0 spec (April 2025) defines the protocol. The PHY is reused from Ethernet (200 Gb/s-per-lane PAM4 signaling, the rate covered by IEEE 802.3dj) so the SerDes economics and reach characteristics are well understood. The link layer drops Ethernet’s framing in favor of a tight transaction format optimized for memory ops (read, write, atomic, and DMA descriptors) with deterministic ordering and credit-based flow control. The transport layer guarantees in-order delivery without TCP overhead. The target round-trip latency is 1–2 microseconds for a 64B load across a single switch hop, roughly an order of magnitude better than RoCE v2 over the same physical media.

The 72-GPU domain matters because of how modern training pipelines partition. With tensor parallelism set to 8 (typical for dense layers) and pipeline parallelism set to 8 (typical for long pipelines), you need 64 GPUs in one coherent domain to avoid pipeline parallelism crossing the scale-out boundary on every forward/backward pass. The remaining 8 GPUs in a Helios rack go to expert parallelism for the MoE feed-forward path. Nvidia’s NVL72 (72 Blackwell GPUs in one NVLink domain) and now Rubin’s NVL144 follow the same logic. AMD’s 72-GPU first deployment is competitive; the spec headroom to 1,024 is a future-proofing argument.
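The partitioning argument can be written down as a small sizing check. This is a sketch under the TP=8, PP=8 example from the paragraph above; `ParallelPlan` and its fields are hypothetical names, not part of any training framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPlan:
    tensor_parallel: int
    pipeline_parallel: int
    domain_size: int  # GPUs reachable over the scale-up fabric

    @property
    def model_parallel_gpus(self) -> int:
        """GPUs that must talk on every forward/backward pass."""
        return self.tensor_parallel * self.pipeline_parallel

    @property
    def fits_in_domain(self) -> bool:
        """True when TP x PP traffic never has to cross into scale-out."""
        return self.model_parallel_gpus <= self.domain_size

    @property
    def spare_gpus(self) -> int:
        """GPUs left in the domain, e.g. for expert parallelism."""
        return max(self.domain_size - self.model_parallel_gpus, 0)


helios = ParallelPlan(tensor_parallel=8, pipeline_parallel=8, domain_size=72)
eight_gpu_server = ParallelPlan(tensor_parallel=8, pipeline_parallel=8, domain_size=8)

for name, plan in [("72-GPU rack", helios), ("8-GPU server", eight_gpu_server)]:
    print(name, plan.model_parallel_gpus, plan.fits_in_domain, plan.spare_gpus)
```

With an 8-GPU domain the same plan still runs, but every pipeline stage boundary lands on the scale-out network, which is the latency tax the rack-scale domain is meant to remove.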

Ultra Ethernet — under Ultra Ethernet Consortium governance — handles the scale-out tier. UEC’s UET (Ultra Ethernet Transport) replaces RoCE v2’s lossless-Ethernet requirement with a modernized transport that supports packet spraying, multi-path adaptive routing, and end-to-end congestion control over standard Ethernet switches. AMD’s Pollara 400 NIC implements UET in hardware; for MI400 deployments, two Pollara 400 NICs per GPU provide 800 Gbps of scale-out bandwidth per accelerator. That is symmetric with Nvidia’s ConnectX-8 and Spectrum-X positioning.

Here is what the all-reduce dataflow looks like in practice for a 400B dense model with tensor parallelism 8 across 8 MI400X within one Helios chassis:

# Sketch of a gradient all-reduce over UALink within an 8-GPU TP group
# RCCL is the ROCm counterpart to NCCL; same primitives, ROCm-native build
# Launch with: torchrun --nproc_per_node=8 this_script.py
import os

import torch
import torch.distributed as dist

# On ROCm builds of PyTorch the backend is still named "nccl" (it dispatches to RCCL)
# and HIP devices are addressed as "cuda". RCCL is expected to pick UALink when present.
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # LOCAL_RANK is set by torchrun
tp_group = dist.new_group(ranks=list(range(8)))  # tensor-parallel group

# Gradient buffer for an MLP weight matrix shard
# 4096x4096 BF16 = 32 MB shard; full = 256 MB across TP=8
grad_shard = torch.zeros(4096, 4096, dtype=torch.bfloat16, device="cuda")

# Ring all-reduce: 2*(N-1)/N * payload = ~56 MB transferred per GPU per step
# At 300 GB/s per UALink port = ~187 microseconds wire time
dist.all_reduce(grad_shard, op=dist.ReduceOp.SUM, group=tp_group)

dist.destroy_process_group()

The 187 microseconds is the wire-time floor; real RCCL all-reduce on a tuned UALink stack lands at 250–400 microseconds depending on chunking, which matches published NVLink 5 numbers on Blackwell B200 within 20%. RCCL 2.20+ contains UALink-aware transport selection in the ROCm 6.4 timeframe; ROCm 7.0 (expected H2 2026 alongside MI400 GA) is the version AMD has positioned as production for MI400.
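The wire-time floor can be reproduced without any GPUs. The sketch below assumes a bandwidth-optimal ring all-reduce and a single 300 GB/s UALink port running at full rate in one direction; latency, chunking, and protocol overhead are deliberately ignored, and the helper names are illustrative.

```python
def ring_allreduce_bytes(payload_bytes: float, n_ranks: int) -> float:
    """Bytes each rank sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (n_ranks - 1) / n_ranks * payload_bytes


def wire_time_us(bytes_moved: float, link_bytes_per_s: float) -> float:
    """Pure serialization time in microseconds, ignoring latency and overhead."""
    return bytes_moved / link_bytes_per_s * 1e6


payload = 4096 * 4096 * 2                 # BF16 shard, 2 bytes per element
moved = ring_allreduce_bytes(payload, 8)  # per-rank traffic for TP=8
floor_binary = wire_time_us(moved, 300e9)
floor_decimal = wire_time_us(56e6, 300e9)  # same traffic rounded to 56 decimal MB

print(f"payload: {payload / 2**20:.0f} MiB, moved per rank: {moved / 2**20:.0f} MiB")
print(f"wire time: {floor_binary:.0f} us (binary MiB) / {floor_decimal:.0f} us (decimal MB)")
```

The two results differ only in whether "56 MB" is read as binary or decimal megabytes; either way the floor sits just under 200 microseconds, well below the 250–400 microsecond range quoted for a tuned stack.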

The Helios rack itself is a reference design AMD has shared with ODMs (Supermicro, Quanta, Wiwynn, Foxconn, Gigabyte) — it is not an AMD-branded product. A Helios rack contains 18 1U trays with 4 MI400X each, 9 UALink switch trays (Broadcom Tomahawk Ultra UAL or Astera Labs equivalents), 144 Pollara 400 NICs in 36 NIC trays, two 415V three-phase 60A power feeds, and rear-door heat exchangers or rack-level CDU for DLC. Estimated rack power: 180–220 kW. That is roughly 2.5x the power density of a typical H100 SuperPOD rack and demands datacenter retrofit for almost every existing site.
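A quick sanity check of the rack numbers is worth doing before any facilities conversation. The sketch below uses only figures from this post (tray counts, NIC count, the 1500–1800W package estimate, and the 415V/60A feed rating) plus the standard three-phase apparent-power formula; the helper name is hypothetical and power factor is ignored.

```python
import math


def three_phase_kva(line_to_line_volts: float, amps: float) -> float:
    """Apparent power of one three-phase feed: sqrt(3) * V * I."""
    return math.sqrt(3) * line_to_line_volts * amps / 1000


gpus = 18 * 4                       # 18 trays x 4 MI400X
nics_per_gpu = 144 / gpus           # Pollara 400 NICs per accelerator
gpu_power_kw = (gpus * 1.5, gpus * 1.8)   # packages only, at 1500-1800 W each
feed_kva = three_phase_kva(415, 60)
two_feeds_kva = 2 * feed_kva

print(f"GPUs per rack:      {gpus}")
print(f"NICs per GPU:       {nics_per_gpu:.0f}")
print(f"GPU power alone:    {gpu_power_kw[0]:.0f}-{gpu_power_kw[1]:.0f} kW")
print(f"one 415V/60A feed:  {feed_kva:.1f} kVA, two feeds: {two_feeds_kva:.1f} kVA")
```

Taken literally, two 60A feeds supply well under the 180–220 kW rack estimate, and less than the GPU packages alone would draw. Treat the feed count and amperage as something to confirm with the ODM rather than a figure to design against.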

CDNA 4 to CDNA-Next: what changed in the compute engine

CDNA 4 on MI355X introduced four changes that carry forward to MI400X. First, FP4 and FP6 native matrix instructions with structured 2:4 sparsity support, lifting throughput at sub-FP8 precisions by ~2x and ~1.5x respectively compared to FP8 on the same hardware. Second, a redesigned LDS (Local Data Share) with higher per-bank bandwidth that reduces stalls in matrix-vector ops common in attention. Third, transcendental function units in the SIMD pipeline that speed up softmax and layer norm. Fourth, larger L2 per CU and a global L3 (Infinity Cache) of 256 MB per package, the same capacity MI300X already shipped.

The CDNA-Next architecture on MI400X (AMD has not formally named it CDNA 5 in disclosures, so we will call it CDNA-Next) adds matrix cores with 2x the matrix throughput per CU at FP8 and FP4. It also restructures the wavefront scheduler to better interleave matrix and vector work, addressing a pain point in ROCm 6.x where memory-bound attention kernels left matrix units idle. AMD claims a ~2.5x training throughput uplift per package over MI355X on Llama 3 70B BF16 training; this is the most-cited number from Advancing AI 2025 and should be treated as an AMD-published benchmark, not an independent measurement.

The other architectural shift is asymmetric die binning. MI400X package SKUs may ship with disabled CUs to improve yield, similar to how Nvidia ships full and cut-down GPU SKUs (B200 vs B100). Specific bin counts have not been disclosed.

Software: ROCm 7 and the workload story

ROCm software stack layers for AMD MI400 Instinct architecture from PyTorch through HIP and kernels to ISA

Answer-first: ROCm 7 is the production target for MI400X, shipping H2 2026. It adds UALink-aware RCCL collectives, hipBLASLt kernels tuned for CDNA-Next matrix units, native FP4/FP6 paths in MIOpen, and a maturing PyTorch upstream story — torch.compile inductor backends produce competitive code on ROCm for transformer blocks, though TorchTitan-class training recipes still need AMD-specific tuning.

The ROCm 6.x to 7.x transition is more than a version bump. ROCm 6.0 (Dec 2023) made HIP the unambiguous frontline API and shipped first-class PyTorch support. ROCm 6.2 (Aug 2024) added vLLM merges upstream so AMD GPUs are first-class in vLLM main without a fork. ROCm 6.3 brought MI325X enablement and SGLang upstream support. ROCm 7 lifts the stack to MI400 with new collectives, new precisions, and a redesigned compiler back-end (LLVM-derived) that closes the kernel quality gap with NVCC on key transformer kernels.

For training, the practical question is whether the PyTorch + Megatron-LM / DeepSpeed / TorchTitan / Nemotron-recipe stack runs cleanly on ROCm 7 with comparable step time to a CUDA stack on Hopper or Blackwell. Public reports through mid-2025 from Lamini, TogetherAI, and Hugging Face indicate ROCm 6.2+ trains Llama 3 8B/70B on MI300X 8-GPU servers at 85–105% of H100 tokens-per-second throughput when normalized by peak FLOPS. The variability is real: kernel coverage gaps still exist for some attention variants (FlashAttention-3, Ring Attention, certain MoE routing kernels) and require AMD engineering support to close. ROCm 7 is positioned to close most of these.
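A first step in any port is confirming which PyTorch build a job is actually running on. The sketch below relies on `torch.version.hip` being set on ROCm builds and `torch.version.cuda` on CUDA builds; the `describe_torch_build` helper and its returned keys are hypothetical, and the fallback uses a made-up version string so the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def describe_torch_build(version_info) -> dict:
    """Classify a PyTorch build from its `torch.version` attributes."""
    hip = getattr(version_info, "hip", None)
    cuda = getattr(version_info, "cuda", None)
    if hip:
        platform, runtime, library = "rocm", hip, "RCCL"
    elif cuda:
        platform, runtime, library = "cuda", cuda, "NCCL"
    else:
        platform, runtime, library = "cpu", None, None
    return {
        "platform": platform,
        "runtime_version": runtime,
        # ROCm builds keep the CUDA-style names, so training scripts rarely need edits.
        "device_string": "cuda" if platform != "cpu" else "cpu",
        "dist_backend": "nccl" if platform != "cpu" else "gloo",
        "collective_library": library,
    }


try:
    import torch
    print(describe_torch_build(torch.version))
except ImportError:
    # Stand-in object with a placeholder version string, for illustration only.
    print(describe_torch_build(SimpleNamespace(hip="0.0.0-example", cuda=None)))
```

Because the device string and backend name stay the same across vendors, most of the porting effort goes into kernel coverage and tuning rather than into the training script itself.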

The vLLM story for inference is essentially solved. As of vLLM 0.6+, MI300X runs Llama 3.1 405B in FP8 on a single 8-GPU node with comparable throughput to an 8-GPU H200 node for long-context serving, and lower throughput but similar latency on short prompts. The DeepSeek V3 / R1 family is heavily MoE and memory-capacity-bound, which is exactly where MI300X’s 192 GB shines: single-node inference of DeepSeek V3 (671B total, 37B activated) is feasible on an 8-GPU MI300X node (~1.5 TB total HBM) but requires multi-node or significant offloading on H100 SXM5 8-GPU servers (640 GB total HBM).
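The capacity argument reduces to a simple fit check. The sketch below counts weights only, at one byte per parameter for FP8, and ignores KV cache, activations, and runtime overhead, so it is an optimistic lower bound; the function names are illustrative.

```python
def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def fits_on_node(params_billion: float, bytes_per_param: float,
                 gpus: int, hbm_gb_per_gpu: float) -> bool:
    """True if the weights alone fit in the node's total HBM."""
    return weight_footprint_gb(params_billion, bytes_per_param) <= gpus * hbm_gb_per_gpu


deepseek_v3_fp8 = weight_footprint_gb(671, 1.0)        # 671B total parameters in FP8
on_mi300x_node = fits_on_node(671, 1.0, 8, 192)        # 8 x 192 GB
on_h100_node = fits_on_node(671, 1.0, 8, 80)           # 8 x 80 GB

print(f"DeepSeek V3 FP8 weights: {deepseek_v3_fp8:.0f} GB")
print(f"fits on 8x MI300X (1536 GB): {on_mi300x_node}")
print(f"fits on 8x H100 (640 GB):    {on_h100_node}")
```

The MI300X node has more than twice the required capacity, which is the headroom that goes to KV cache for long-context serving.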

The persistent ROCm risk is what I will call the “second-day-of-debugging” gap. On Nvidia, when something breaks, the answer is usually a known StackOverflow thread or a CUDA forum post from 2017. On ROCm, debugging often requires reading HIP source, opening GitHub issues, and waiting for an AMD engineer to respond. AMD has invested heavily in this — the ROCm GitHub org has visible response cadence — but the gap is real for teams without direct AMD support contracts. For hyperscaler-scale deployments with embedded AMD engineering this is a non-issue; for mid-size shops it is a budget line item.

MI400X versus Nvidia Rubin and Google TPU v6

AMD MI400X versus Nvidia Rubin versus Google TPU v6 Trillium-2 architecture comparison matrix

Answer-first: On a per-package basis MI400X leads Nvidia Rubin R100 on HBM capacity (432 GB vs ~288 GB estimated) and is broadly comparable on FP4 dense throughput. At rack scale, NVL144 Rubin still has a software ecosystem and operator-experience edge, while Helios MI400 has the open-fabric and TCO story. Google TPU v6e Trillium-2 plays a different game: integrated Pathways software, JAX-first, internal Google workloads, and OCP-style pod scaling.

Specific apples-to-apples comparison is hard because Nvidia has disclosed Rubin numbers across NVL144 racks rather than per-GPU exclusively, and Google does not break out per-chip specs cleanly. The most useful published reference points:

Metric | AMD MI400X | Nvidia Rubin R100 (est.) | Google TPU v6e Trillium-2 (est.)
HBM capacity per package | 432 GB HBM4 | ~288 GB HBM4 | ~128 GB HBM4 per chip
HBM bandwidth per package | ~19.6 TB/s | ~13 TB/s | ~7 TB/s
FP4 peak (dense) | 40 PFLOPs | ~50 PFLOPs | n/a (BF16/INT8 focus)
FP8 peak (dense) | 20 PFLOPs | ~25 PFLOPs | ~4.6 PFLOPs BF16
Scale-up domain | 72 GPUs UALink | 144 GPUs NVLink 6 | 256 chips ICI
Scale-up BW per GPU | 1.8 TB/s | ~3.6 TB/s | ~3.6 TB/s ICI
Scale-out NIC | 800G Pollara (UET) | 800G ConnectX-8 (RoCE/Spectrum-X) | OCS-coupled ICI mesh
Software stack | ROCm 7 + PyTorch/vLLM | CUDA 13 + cuDNN + TensorRT-LLM | XLA + JAX + Pathways

Treat all Rubin and TPU v6 numbers as estimates from public disclosures and credible analyst coverage (SemiAnalysis, AnandTech); AMD’s MI400X numbers are from AMD’s Advancing AI 2025 keynote materials. Public benchmarks from MLPerf Training v5 / v5.1 will be the first real-world data points; expect those late 2026.

The TCO argument for MI400X rests on two claims. First, larger HBM per package reduces the GPU count required for memory-capacity-bound workloads (frontier MoE, long-context inference, KV cache offload for agentic systems). Second, open fabric standards reduce switch and NIC supplier lock-in, which AMD argues drives down rack-level capex. Both arguments are real but unproven at scale; the next hyperscaler AI-compute capex cycle will tell us whether MI400 actually shifts hyperscaler allocation share or remains a 15–20% second-source play.

The fabric-economics piece is also where the datacenter shift to silicon photonics and co-packaged optics becomes load-bearing for MI400 deployments. UALink at 200G/lane in copper has a ~2 meter reach limit, which keeps the 72-GPU domain physically tight. To go beyond a single rack into multi-rack coherent domains, AMD and Broadcom will need CPO at the switch level, and Broadcom’s Tomahawk roadmap is explicitly heading there.

The foundry story matters too. MI400X is on TSMC N3P; Rubin is on TSMC N3P; TPU v6 is on N5/N4 with N3 derivatives. The next-gen foundry node race between TSMC A14 and Intel 14A is what enables the MI500 generation in 2027–2028, but for MI400 the relevant supply constraint is HBM4, not logic wafers. SK hynix and Samsung have committed HBM4 capacity, but yield ramp on 16-Hi stacks is the gating factor for MI400X volume.

Trade-offs and failure modes

AMD MI300X to MI325X to MI355X to MI400X generational comparison roadmap

Answer-first: MI400X fails for workloads that are kernel-coverage-bound rather than memory-bound, deployments without rack-scale DLC, sites without UALink switch supply, and teams without direct AMD engineering support. Software risk is the dominant risk; silicon risk is modest; supply chain risk is real but tractable.

When NOT to choose MI400X:

  • Tight delivery windows with novel models. If you are training a new architecture that uses kernels not in MIOpen / hipBLASLt and your ship date falls within a quarter or two of MI400 availability, the time-to-tune on ROCm 7 is a schedule risk. Nvidia’s mature CUDA ecosystem has fewer of these surprises.
  • Air-cooled datacenters. 1500–1800W per package mandates DLC. If your site has air cooling, you are choosing between a major retrofit (CDUs, cold plates, manifolds) or skipping MI400 entirely. The MI350 generation is the last with an air-cooled option (the 1000W MI350X); the 1400W MI355X already requires DLC.
  • Sub-rack purchases. Helios is a 72-GPU minimum buy to get the UALink coherent domain benefit. If your workload fits on 8 GPUs, an MI355X 8-GPU OAM server is the right SKU, not a half-empty Helios rack.
  • MLPerf-driven RFPs. As of mid-2026 there will be limited published MLPerf Training v5/v5.1 numbers on MI400X. If your procurement requires MLPerf results, you may be forced to wait for v6 submissions.
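  • Multi-vendor switch agnosticism today. UALink is open spec, but in 2026 only Broadcom (Tomahawk Ultra UAL) and Astera Labs have shipping switches. Cisco and Marvell parts are sampling. If you need true multi-vendor switch supply in 2026, neither NVLink-Switch nor InfiniBand is better, since both are effectively single-vendor; standard Ethernet is the only fabric with broad multi-vendor switch supply, and it is a scale-out fabric rather than a scale-up one.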

Software gotchas:

  • FlashAttention-3 ports lag the CUDA reference by ~1 quarter; FA-2 is solid on ROCm.
  • Custom CUDA and CUTLASS kernels require hipify or a rewrite; Triton kernels port more directly because Triton 3.x supports ROCm, but kernel quality is workload-dependent.
  • DeepSpeed ZeRO-3 with parameter offload to host memory is mature; CPU-GPU PCIe path on MI400X is PCIe 6.0 x16 which helps.
  • Quantization toolchains (LLM Compressor, AutoAWQ, GPTQ) have ROCm support but model coverage trails CUDA by 1–2 quarters.

Hardware risks:

  • HBM4 stack supply is the binding constraint through 2026. SK hynix and Samsung yields on 16-Hi are the dominant supply risk for MI400X volume.
  • UALink switch silicon (Tomahawk Ultra UAL) is first-gen — expect firmware churn for 6–12 months.
  • Helios power density (180–220 kW/rack) is incompatible with most colocation contracts. New colo or self-build is often required.

Practical recommendations

For frontier-model training shops with hyperscaler-scale budgets: pilot a Helios rack in Q4 2026, plan production by Q2 2027 alongside ROCm 7.1+. Negotiate direct AMD engineering support contracts upfront; the value is in software co-engineering, not silicon. Match Pollara 400 NICs 2:1 with MI400X and budget for Tomahawk Ultra UAL switch supply chain risk.

For mid-size training shops (10–50 GPUs): MI355X 8-GPU servers are the better 2026 buy. MI400X minimum-viable cluster is 72 GPUs; that is too large a step. Revisit MI400 in 2027 when MI400 8-GPU OAM SKUs ship for sub-rack deployments.

For inference-only shops serving large MoE and long-context models: MI300X / MI325X 8-GPU servers remain the price-performance leader in 2026 and will run DeepSeek, Llama-4, Qwen-3, and Mistral large-MoE workloads natively in vLLM. Wait for MI400X inference SKUs in 2027 when HBM4 capacity is no longer the supply bottleneck.

For research labs and academic groups: ROCm 6.4+ on MI300X is the realistic 2026 platform — strong PyTorch upstream, vLLM upstream, no kernel rewrites needed for standard transformer work. MI400X is a 2027–2028 conversation for labs without dedicated MLOps staff.

Quick procurement checklist:
  • Confirm site has 415V three-phase 60A feeds and DLC plumbing budget before quoting Helios.
  • Require AMD engineering staffing commitment in the contract for any ROCm 7.0 production deployment.
  • Plan for two ROCm versions in flight: 6.4 for MI300/MI355 inference, 7.x for MI400 training.
  • Budget for Tomahawk Ultra UAL second-source by Q3 2027; do not single-source on first-gen UALink switches in 2026.
  • Reserve HBM4 supply via Tier-1 OEM contracts early; spot HBM4 in 2026 is not a thing.

FAQ

How does AMD MI400 Instinct architecture compare to Nvidia Rubin?
MI400X leads on HBM capacity (432 GB vs ~288 GB) and matches Rubin on per-package FP4 throughput within ~20%. Nvidia’s NVL144 has a larger coherent domain (144 vs 72) and a more mature software ecosystem. AMD’s UALink is open standard with multi-vendor switches; NVLink is single-vendor. Most deployment decisions hinge on software maturity for the target workload and on rack-level capex.

What is HBM4 memory and how does MI400 use it?
HBM4 is the JEDEC JESD270-4 standard published in April 2025, doubling channel count per stack to 32 and supporting 8+ Gbps per pin. MI400X integrates 8 HBM4 stacks totaling 432 GB at ~19.6 TB/s of aggregate bandwidth per package. That is 50% more capacity than MI355X’s 288 GB of HBM3E and roughly 2.4x its 8 TB/s of bandwidth, enabling single-package storage of larger MoE expert sets without sharding.

Is ROCm production-ready for AI training in 2026?
ROCm 6.4 is production-ready for MI300/MI325/MI355 inference and competitive for training Llama-scale models with AMD engineering support. ROCm 7 (H2 2026) targets MI400 production training. The remaining gaps are kernel coverage for FlashAttention-3 variants and certain MoE routing kernels. For standard transformer pretraining on PyTorch, ROCm matches CUDA throughput within 10–15% with proper tuning.

What is UALink and how does it differ from NVLink?
UALink is an open scale-up GPU interconnect standard from the UALink Consortium (AMD, Broadcom, Cisco, HPE, Intel, Meta, Microsoft, Google). It uses Ethernet PHYs with a memory-semantic transaction layer, supporting cache-coherent load/store across up to 1024 endpoints. NVLink is Nvidia-proprietary and single-vendor. UALink switches ship from Broadcom and Astera Labs in 2026; the spec is at version 1.0 (April 2025).

Which hyperscalers will deploy MI400?
Microsoft Azure, Meta, and Oracle Cloud Infrastructure have publicly committed to AMD Instinct deployments across MI300X and MI325X and have indicated MI400 evaluation programs. Google primarily uses TPUs internally but has deployed MI300X in GCP for select customers. AWS has not announced AMD Instinct deployment beyond Trainium-focused investments. Hyperscaler MI400 deployment in volume is expected H2 2026 into 2027.

What workloads is MI400 best suited for?
Memory-bandwidth-bound training of dense large language models (200B+ parameters), MoE training with large expert counts, long-context training with multi-MB activation tensors, and inference serving where KV cache size dominates throughput. MI400 is less optimal for compute-bound smaller models that are already well served by CUDA-optimized kernels, where Nvidia’s per-FLOP optimization still leads.
