Mixture-of-Experts (MoE) LLM Architecture Explained (2026)

Almost every frontier model you used this year was sparse. When a provider quotes a parameter count in the hundreds of billions but bills you as if the model were a fraction of that size, you are looking at a Mixture-of-Experts LLM architecture doing exactly what it was designed to do: store a lot of knowledge in weights, but touch only a small slice of those weights per token. That single decoupling — total capacity from per-token compute — is the most important architectural shift in large language models since the transformer block itself.

The trouble is that MoE is usually explained as a diagram with arrows and the phrase “the router picks the best experts,” which tells you nothing about why these models are hard to train, awkward to serve, and occasionally catastrophic when the routing goes wrong. The clever-trick framing also hides the central truth practitioners care about: MoE does not make a model cheaper, it makes it differently expensive. It pays for capacity with memory and network bandwidth instead of arithmetic, and whether that bargain helps you depends entirely on the hardware you serve it on. This post treats MoE as an engineering system with real failure modes and real trade-offs, not a free lunch.

What this post covers: the gating network and top-k routing, why total parameters dwarf active parameters, the auxiliary load-balancing loss, expert and tensor parallelism at serving time, the memory-versus-FLOPs trade-off, and the specific ways MoE breaks in production.

Context and background

Mixture-of-Experts is an old idea: conditional computation, where different inputs activate different sub-networks, dates back decades. What changed is that the transformer’s feed-forward layer turned out to be the perfect place to put it. The GShard scaling effort and the 2021 Switch Transformer work showed you could replace one dense FFN with many smaller “expert” FFNs and a lightweight router, then train the result past a trillion parameters without a proportional increase in compute. Google’s Switch Transformer paper is still the cleanest primary source on the core mechanics.

By 2026, sparse expert models are no longer exotic. The architecture underpins a large share of open-weight and proprietary frontier systems, with Mixtral’s well-known 8-experts-per-layer structure serving as the reference design most engineers learn first. The reason is economic, not academic. Dense scaling hits a wall where every new parameter costs you both training FLOPs and inference FLOPs. MoE breaks that link: you can grow capacity by adding experts while keeping per-token compute roughly flat, because each token still only visits a fixed small number of them.

That economic story is why MoE matters right now. Serving cost is the dominant line item for most LLM products, and the same pressure that drives teams toward speculative decoding for faster LLM inference drives them toward sparsity. But MoE relocates the cost rather than removing it — it trades FLOPs for memory and network bandwidth, and that trade is the whole subject of this article. Understanding where the cost moves is the difference between a model that serves cheaply and one that idles half your GPUs.

It also helps to be precise about what an “expert” is, because the word oversells it. An expert is not a specialist module that knows about, say, chemistry or French. It is just a feed-forward network — the same matrix-multiply-then-nonlinearity block a dense transformer uses — that happens to be one of several copies the router can choose between. Any specialization is emergent and statistical: over training, certain experts drift toward handling certain token patterns, but the assignment is learned, opaque, and rarely maps to anything a human would label. Treating experts as interpretable domain modules is the single most common conceptual mistake newcomers make, and it leads to bad intuitions about everything downstream, from routing behavior to fine-tuning risk.

The MoE reference architecture: router, experts, and sparse activation

A Mixture-of-Experts LLM architecture replaces the dense feed-forward sub-layer of a transformer block with a set of parallel expert FFNs plus a gating network that routes each token to a small number of them. Attention stays shared and dense. Only the FFN becomes conditional. This is the entire structural change, and everything else — load balancing, parallelism, failure modes — follows from it.

Figure 1: A single MoE layer. Attention is shared; the router sends each token’s hidden state to a top-k subset of experts, whose outputs are then weighted and summed.

The gating network and top-k routing

The gating network is deliberately tiny. For each token’s hidden state, it computes a score per expert, typically a single linear projection followed by a softmax, and then selects the top-k highest-scoring experts. Classic designs use a k of 1 (Switch Transformer) or 2 (Mixtral); fine-grained designs with many small experts usually route each token to more of them. The token’s hidden state is sent only to those experts; the rest of the experts do no work for that token at all.

The router’s output is not just a selection but a set of weights. Each chosen expert produces an output, and those outputs are combined as a weighted sum, where the weights come from the (renormalized) gating scores. This matters: the router does not merely pick experts, it decides how much each one contributes. A token routed to two experts with gate weights of 0.7 and 0.3 gets a blended transformation, not a binary handoff.

Crucially, routing is per-token and per-layer. The same sequence can send token 1 to experts 3 and 5, token 2 to experts 1 and 8, and so on — and a different layer routes the same tokens completely differently. There is no notion of an expert “specializing” in a topic in any human-interpretable way; specialization, where it exists, is statistical and emerges from training, not assigned.
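The gating step above is small enough to write out for a single token. The following is a minimal sketch in plain Python, assuming a softmax router and renormalized top-k weights as described; the toy experts, the logits, and the function names are all hypothetical and chosen only to make the arithmetic visible.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k=2):
    """Return [(expert_id, weight), ...] for one token; weights renormalized to sum to 1."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_forward(x, logits, experts, k=2):
    """Gate-weighted sum of the chosen experts' outputs for one token vector x."""
    out = [0.0] * len(x)
    for expert_id, weight in top_k_gate(logits, k):
        y = experts[expert_id](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts (an assumption for illustration): expert i scales its input by i + 1.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
logits = [0.1, 2.0, -1.0, 1.2]      # hypothetical router scores for one token
gates = top_k_gate(logits, k=2)     # experts 1 and 3, weights roughly 0.69 and 0.31
output = moe_forward([1.0, -2.0], logits, experts, k=2)
print(gates, output)
```

Experts 0 and 2 are never called for this token, which is the whole point: the blend comes only from the two selected experts, in proportion to their renormalized gate scores.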

There is also a subtle design choice in where routing happens. Token-choice routing — the default described above — lets each token pick its top-k experts, which is intuitive but allows popular experts to be over-subscribed. An alternative, expert-choice routing, inverts the relationship: each expert picks the top tokens it wants up to its capacity, which guarantees perfect balance by construction but means some tokens may be selected by more experts than others, and some by none. Most well-known open models use token-choice with a balancing loss, but knowing both schemes exist explains why two MoE models with identical expert counts can behave very differently under load. The routing policy is an architectural decision, not an implementation detail.
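The difference between the two policies is easiest to see on a tiny affinity matrix. This is a rough sketch, not any particular model's router: the scores are made up, and `token_choice` and `expert_choice` are hypothetical helper names.

```python
def token_choice(scores, k):
    """Each token picks its k best experts. scores[t][e] = affinity of token t for expert e.
    Returns the number of tokens assigned to each expert."""
    load = [0] * len(scores[0])
    for row in scores:
        for e in sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]:
            load[e] += 1
    return load

def expert_choice(scores, capacity):
    """Each expert picks its `capacity` best tokens.
    Returns how many experts selected each token."""
    num_tokens, num_experts = len(scores), len(scores[0])
    picks = [0] * num_tokens
    for e in range(num_experts):
        best = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)[:capacity]
        for t in best:
            picks[t] += 1
    return picks

# Hypothetical affinities: 4 tokens x 2 experts, with expert 0 preferred by every token.
scores = [
    [0.9, 0.8],
    [0.8, 0.1],
    [0.7, 0.6],
    [0.2, 0.1],
]
tc_load = token_choice(scores, k=1)            # [4, 0]: expert 0 is over-subscribed
ec_picks = expert_choice(scores, capacity=2)   # [2, 1, 1, 0]: token 3 gets no expert
print(tc_load, ec_picks)
```

Token choice leaves expert 1 idle; expert choice fills both experts exactly to capacity but covers token 0 twice and token 3 not at all. Same experts, same scores, different failure shape.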

Why total parameters dwarf active parameters

This is the headline property and the source of most confusion. Suppose a layer has 8 experts and routes each token to 2 of them. The layer stores 8 experts’ worth of weights, but any single token activates only 2. So total parameters are roughly four times the active parameters at that layer.

Active parameters determine FLOPs per token, and therefore latency and compute cost. Total parameters determine how much the model can know — its capacity — and therefore how much GPU memory you must reserve. A sparse expert model can advertise a very large total parameter count while doing the per-token math of a far smaller dense model. That is the trick, stated plainly: capacity is cheap, compute is metered.

The practical implication is that you cannot reason about an MoE model from one number. A figure like “total parameters” tells you the memory footprint; “active parameters” tells you the speed. The two are different questions with different answers, and vendors quote whichever flatters the comparison. Always ask for both.

It is worth being concrete about the ratio. With 8 experts and top-2 routing, roughly a quarter of the FFN weights are active per token, but you still store all of them. Push to 16 or 32 or 64 experts at the same top-k and the gap widens: active compute stays flat while total capacity — and therefore memory — keeps climbing. This is the lever that lets recent designs use many small, fine-grained experts rather than a few large ones. Fine-grained experts give the router more combinations to compose from, which tends to improve quality per active FLOP, at the cost of more routing overhead and more all-to-all traffic. There is no free lunch; you are moving cost around the system, and the sparsity ratio is the dial that decides where it lands.
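To put numbers on that dial, here is a back-of-the-envelope counter for a single MoE layer. It assumes a plain two-matrix FFN per expert and ignores biases and the router's own weights; the dimensions are hypothetical and the function name is mine.

```python
def ffn_param_counts(d_model, d_ff, num_experts, top_k, num_shared=0):
    """Total vs active FFN weights for one MoE layer.
    Assumes each expert is two matrices (d_model x d_ff and d_ff x d_model)."""
    per_expert = 2 * d_model * d_ff
    total = (num_experts + num_shared) * per_expert    # must be held in memory
    active = (top_k + num_shared) * per_expert         # touched per token
    return total, active

# Hypothetical sizes: grow the expert count while holding top-k at 2.
for num_experts in (8, 16, 32, 64):
    total, active = ffn_param_counts(d_model=1024, d_ff=4096,
                                     num_experts=num_experts, top_k=2)
    print(f"{num_experts:>2} experts: total={total:,} active={active:,} "
          f"ratio={total / active:.0f}x")
```

Active weights stay fixed across the loop while the total grows linearly with the expert count, so the ratio moves from 4x at 8 experts to 32x at 64.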

Sparse activation and the residual path

Even when a token is routed, the MoE layer is wrapped in a residual connection, so the token’s prior representation is preserved and the expert output is added on top. This is what makes token dropping (covered later) survivable: if a token is dropped because an expert is full, it simply passes through via the residual with no FFN transformation at that layer, rather than producing garbage. Sparse activation is not all-or-nothing; the residual gives every token a floor.

A growing number of 2026 designs add a twist: one or more shared experts that every token passes through unconditionally, alongside the routed experts. The intuition is a division of labor. The shared expert absorbs the common, always-useful transformation that every token needs, so the routed experts are freed to specialize on the residual differences. This reduces redundancy — without it, the router tends to make several experts learn nearly the same general-purpose function — and it gives every token a guaranteed FFN path even if its routed expert is dropped. The shared expert is always-active, so it counts toward active parameters, but in practice it earns its keep by making the routed sparsity more effective.
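The three paths a token can take (residual, shared expert, routed experts) can be sketched for one token as follows. This is an illustrative layout under the assumptions in this section, not a specific model's block; the toy experts and the `dropped` flag are hypothetical.

```python
def moe_block(x, shared_expert, routed_experts, gates, dropped=False):
    """One token through residual + shared expert + gate-weighted routed experts.
    gates: list of (expert_id, weight). dropped=True models a capacity overflow."""
    out = list(x)                                          # residual path, always present
    out = [o + s for o, s in zip(out, shared_expert(x))]   # shared expert, unconditional
    if not dropped:
        for expert_id, w in gates:
            y = routed_experts[expert_id](x)
            out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts, purely illustrative.
shared = lambda x: [0.5 * v for v in x]
routed = [lambda x: [v + 1.0 for v in x], lambda x: [2.0 * v for v in x]]
gates = [(0, 0.75), (1, 0.25)]

kept = moe_block([2.0], shared, routed, gates)                      # [6.25]
overflow = moe_block([2.0], shared, routed, gates, dropped=True)    # [3.0]
print(kept, overflow)
```

The dropped token still receives the residual plus the shared expert's contribution, which is the "guaranteed FFN path" described above; only the routed term disappears.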

Deeper analysis: load balancing, capacity, and parallel serving

The router has a self-destructive tendency: nothing in the basic loss function stops it from sending every token to the same one or two experts. Left alone, training collapses onto a few favorites while the rest of the experts never learn. The fixes — auxiliary loss, capacity limits, and parallelism-aware serving — are what make MoE actually work, and they are where the real engineering lives.

The auxiliary load-balancing loss

The standard remedy is an auxiliary load-balancing loss added to the main training objective. Conceptually, it penalizes imbalance: it measures, per batch, both the fraction of tokens routed to each expert and the average gate probability assigned to each expert, then pushes those distributions toward uniform. When one expert hogs traffic, the auxiliary term rises and gradient pressure spreads tokens out.

The weighting of this loss is a genuine hyperparameter with teeth. Too weak and the router collapses. Too strong and you force tokens to experts that handle them poorly, hurting quality to satisfy a balance constraint. Some 2026 designs reduce reliance on an explicit auxiliary loss by using bias terms that are adjusted to nudge under-loaded experts toward selection — an “auxiliary-loss-free” style of balancing — but the goal is identical: keep every expert busy without distorting what the router actually wants to do.
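One way a bias-based scheme of this kind could look is sketched below. The update rule (a fixed step up for under-loaded experts, down for over-loaded ones) and the step size are assumptions made for illustration, not a description of any specific model; the bias here influences which expert is selected and nothing else.

```python
def select_with_bias(scores, bias, k=1):
    """Pick top-k experts by score + bias. The bias affects selection only;
    gate weights would still come from the unbiased scores."""
    adjusted = [s + b for s, b in zip(scores, bias)]
    return sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)[:k]

def update_bias(bias, load, step):
    """Nudge under-loaded experts up and over-loaded experts down by a fixed step."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n < mean_load:
            new_bias.append(b + step)
        elif n > mean_load:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias

# Hypothetical batch: 8 tokens, 2 experts, every token prefers expert 0 by a different margin.
batch = [[1.0, 1.0 - (2 * t + 1) / 32] for t in range(8)]
bias = [0.0, 0.0]
history = []
for _ in range(6):
    load = [0, 0]
    for scores in batch:
        for e in select_with_bias(scores, bias):
            load[e] += 1
    history.append(load)
    bias = update_bias(bias, load, step=1 / 32)

print(history)   # load drifts from [8, 0] toward [4, 4], then stays there
```

No term is added to the loss; the router's scores are untouched and only the selection threshold moves. The tokens with the weakest preference for the popular expert are the first to be redirected, which is the sense in which the scheme tries not to distort what the router wants.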

The reason a uniform target works is mostly mechanical, not semantic. Balanced load is what keeps the fixed-size expert buffers full but not overflowing, and it is what keeps every GPU in an expert-parallel deployment doing equal work. You are not balancing because uniform routing is inherently “correct” — it usually is not, since real token distributions are skewed — but because the hardware is built for fixed shapes and even partitioning. The auxiliary loss is, in effect, a regularizer that bends the router toward a distribution the hardware can serve efficiently. That framing matters when you tune it: you are negotiating between what the model wants to do and what your accelerators can afford, and the right coefficient is the one that loses the least quality for an acceptable balance.

A second stabilizer that often appears alongside it is small additive noise on the router logits during training, sometimes called noisy or jittered routing. By perturbing the gating scores, it prevents the router from locking onto a winner too early and encourages exploration of under-used experts before their weights atrophy. It is cheap, it is only active during training, and it pairs naturally with the auxiliary loss — one nudges the long-run distribution, the other keeps early training from foreclosing options.

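# Sketch: top-k gating with an auxiliary load-balance term (NumPy).
# Illustrative only: not optimized, no capacity limit, not a drop-in layer.
import numpy as np

def moe_layer(x, experts, W_gate, k=2, aux_coeff=0.01):
    # x: [tokens, d_model] floats; experts: list of callables [n, d_model] -> [n, d_model]
    num_tokens, num_experts = x.shape[0], W_gate.shape[1]
    logits = x @ W_gate                                    # [tokens, num_experts]
    logits = logits - logits.max(-1, keepdims=True)        # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(-1, keepdims=True)
    topk_idx = np.argsort(-probs, axis=-1)[:, :k]          # expert ids per token
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)  # gate weights per token
    topk_w = topk_w / topk_w.sum(-1, keepdims=True)        # renormalize over the top-k

    # Dispatch each token to its chosen experts, then weight + sum outputs.
    y = np.zeros_like(x)
    for e in range(num_experts):
        tok, slot = np.nonzero(topk_idx == e)              # tokens that chose expert e
        if tok.size:
            y[tok] += topk_w[tok, slot][:, None] * experts[e](x[tok])

    # Auxiliary loss: fraction of tokens per expert times mean prob per expert.
    counts = np.bincount(topk_idx.ravel(), minlength=num_experts)
    frac = counts / (num_tokens * k)                       # load fraction per expert
    mean_p = probs.mean(0)                                 # mean routing prob per expert
    aux = aux_coeff * num_experts * (frac * mean_p).sum()
    return y, aux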
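The routing noise mentioned earlier can be sketched separately. This assumes zero-mean Gaussian noise added to the logits, which is one common choice; the noise scale, the logits, and the helper names are hypothetical.

```python
import random

def jitter_logits(logits, noise_std, training, rng):
    """Add zero-mean Gaussian noise to router logits during training only."""
    if not training:
        return list(logits)
    return [l + rng.gauss(0.0, noise_std) for l in logits]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

rng = random.Random(0)            # seeded so the run is repeatable
logits = [1.00, 0.98, 0.20]       # two near-tied experts and one clear loser

# Count how often the runner-up (expert 1) wins top-1 selection under noise.
runner_up_wins = sum(
    argmax(jitter_logits(logits, noise_std=0.1, training=True, rng=rng)) == 1
    for _ in range(1000)
)
inference_choice = argmax(jitter_logits(logits, noise_std=0.1, training=False, rng=rng))
print(runner_up_wins, inference_choice)
```

Without noise, expert 0 would win every one of those 1000 selections on a margin of 0.02 and expert 1 would see no gradient. With noise, the near-tie is shared out during training, while inference stays deterministic.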

Capacity factor and token dropping

Hardware wants fixed-size tensors, but routing is dynamic: you do not know in advance how many tokens will pick a given expert. So implementations impose a capacity factor: each expert gets a fixed buffer sized as (tokens per batch / num experts) × capacity_factor, usually scaled by k as well when each token is routed to k experts. A factor of 1.25 gives 25% headroom over a perfectly even split. Tokens beyond an expert’s capacity are dropped for that layer: they skip the FFN and continue via the residual path.

Figure 2: Token-level routing through a capacity-limited buffer. Tokens exceeding an expert’s capacity are dropped and rely on the residual connection rather than producing an expert output.

This is a direct speed-versus-quality dial. A low capacity factor saves memory and keeps tensors small but drops more tokens; a high one drops fewer tokens but wastes compute on padding when experts are under-filled. During inference, where batches are small and bursty, capacity behavior is far less predictable than during training, and naive capacity settings can silently degrade output quality on certain prompts.
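A small worked example makes the buffer arithmetic and the drop behavior concrete. This is a sketch under stated assumptions: top-1 routing, first-come-first-served buffer filling, rounding the capacity up, and scaling by `top_k`; real implementations differ on each of these, and the routing assignments are invented.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor, top_k=1):
    """Buffer slots per expert. Rounding up and scaling by top_k are assumptions here."""
    return math.ceil(top_k * num_tokens / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """assignments[t] = expert chosen by token t (top-1).
    Fills buffers in token order and returns (kept, dropped) token ids."""
    used = [0] * num_experts
    kept, dropped = [], []
    for t, e in enumerate(assignments):
        if used[e] < capacity:
            used[e] += 1
            kept.append(t)
        else:
            dropped.append(t)     # would skip the FFN and ride the residual path
    return kept, dropped

assignments = [0, 0, 0, 0, 1, 0, 2, 3]    # hypothetical routing, skewed to expert 0
cap = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.25)   # 3 slots
kept, dropped = dispatch(assignments, num_experts=4, capacity=cap)
print(cap, kept, dropped)                 # 3 [0, 1, 2, 4, 6, 7] [3, 5]
```

Expert 0 receives five requests for three slots and two tokens are dropped, while experts 1 to 3 each run with two of their three slots empty. That is both halves of the dial in one batch: dropped tokens on one side, padded compute on the other.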

The asymmetry between training and serving is worth dwelling on. During training you process huge, well-mixed batches, so the law of large numbers smooths routing and a modest capacity factor rarely drops much. At serving time, especially for interactive single-stream or low-batch requests, a single skewed prompt can send a disproportionate share of its tokens to one expert and blow past capacity, while other experts sit idle. The model did nothing wrong; the batch is simply too small to average out. This is why some inference engines relax or effectively remove capacity limits and instead pad dynamically, accepting variable compute per layer in exchange for never silently dropping tokens. Which behavior your stack uses is a real correctness question, not a tuning detail, and it is one of the first things to confirm when an MoE model behaves worse in production than it did in evaluation.
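The small-batch effect can be reproduced deterministically. In this sketch the same skewed prompt is routed once on its own and once alongside other traffic; the routing assignments are invented, routing is top-1, and capacity is rounded up, all of which are assumptions for illustration.

```python
import math
from collections import Counter

def dropped_tokens(assignments, num_experts, capacity_factor=1.25):
    """Count tokens that overflow per-expert capacity for a top-1 routed batch."""
    capacity = math.ceil(len(assignments) / num_experts * capacity_factor)
    load = Counter(assignments)
    return sum(max(0, n - capacity) for n in load.values())

skewed_prompt = [0, 0, 0, 0, 0, 0, 1, 2]     # 6 of 8 tokens want expert 0
other_prompts = [1, 2, 3] * 8                # 24 tokens spread over experts 1..3

alone = dropped_tokens(skewed_prompt, num_experts=4)                   # capacity 3
mixed = dropped_tokens(skewed_prompt + other_prompts, num_experts=4)   # capacity 10
print(alone, mixed)   # 3 0
```

The prompt and its routing are identical in both runs. Served alone it loses three tokens at this layer; batched with other traffic it loses none, because capacity scales with the batch while this prompt's demand on expert 0 does not.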

MoE vs dense models: the FLOPs picture

The clearest way to see the value of sparse expert models is to put them next to dense ones. A dense model spends FLOPs proportional to its full parameter count on every token. A sparse MoE model spends FLOPs proportional only to its active parameters, while carrying the memory cost of all experts.

Figure 3: MoE vs dense models. Dense FLOPs scale with total parameters; sparse MoE FLOPs scale with active parameters only, at the cost of holding every expert in memory.

This is why MoE vs dense models is rarely an apples-to-apples comparison. For a fixed training or inference compute budget, MoE buys you more capacity. For a fixed memory budget, dense often wins because it wastes nothing on idle experts. The right choice depends on whether your binding constraint is FLOPs or VRAM — and in 2026, for large-scale serving, it is increasingly memory and interconnect bandwidth, not raw compute.

A useful way to make the comparison fair is to fix one axis and read off the other. Match active parameters, and the MoE model has far more total capacity than its dense twin while costing about the same per-token FLOPs, so it should be smarter for the same speed, if you can afford the memory. Match total parameters, and the two need about the same weight memory; the dense model is simpler to deploy because none of its weights sit idle and it needs no all-to-all, but it is slower per token because all of them are active. There is no configuration where MoE wins on every axis at once; it is a capacity-for-memory-and-bandwidth swap, and the only honest comparison states which axis you held fixed. Treat any benchmark that does not say so with suspicion.
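The two fixed-axis comparisons can be tabulated with a rough cost model. It uses the common approximation of about 2 forward FLOPs per active weight per token, counts weight memory only (no KV cache or activations), and assumes 2 bytes per parameter; the three model sizes are hypothetical, not real products.

```python
def profile(total_params, active_params, bytes_per_param=2):
    """Rough per-token forward FLOPs (~2 per active weight) and weight memory in GB."""
    return {
        "flops_per_token": 2 * active_params,
        "weight_gb": total_params * bytes_per_param / 1e9,
    }

# Hypothetical models.
dense_small = profile(total_params=13e9, active_params=13e9)
moe         = profile(total_params=47e9, active_params=13e9)
dense_big   = profile(total_params=47e9, active_params=47e9)

for name, p in [("dense 13B", dense_small), ("MoE 47B/13B", moe), ("dense 47B", dense_big)]:
    print(f"{name:<12} {p['flops_per_token']:.2e} FLOPs/token  {p['weight_gb']:.0f} GB weights")
```

Read across the rows: the MoE model matches the small dense model on FLOPs and the large dense model on memory. Which neighbor you compare it to decides whether it looks like a bargain or a burden.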

Expert parallelism for inference

At serving time you cannot fit a large sparse model on one GPU, and you would not want to — the experts are the natural unit to split. Expert parallelism places different experts on different GPUs. Tensor parallelism (splitting individual matrices across GPUs) and pipeline parallelism still apply to the shared attention layers, so production MoE serving typically combines all three.

Figure 4: Expert parallelism inference. Each GPU holds a slice of the experts; an all-to-all collective dispatches tokens to the right device and gathers expert outputs back.

The cost of expert parallelism inference is the all-to-all communication it forces. Because any token can route to any expert, and experts live on different GPUs, every MoE layer requires two all-to-all exchanges: one to dispatch tokens to the GPUs holding their chosen experts, and one to gather the outputs back. All-to-all is bandwidth-hungry and latency-sensitive, and it scales poorly across nodes connected by slower links. This is why MoE serving is so sensitive to interconnect — NVLink-class intra-node bandwidth versus slower inter-node fabric can dominate end-to-end latency. If you are benchmarking serving stacks, the vLLM, SGLang, and TensorRT-LLM comparison on H100 is the place to see how different engines handle this dispatch overhead.

The painful part is that all-to-all happens twice per MoE layer, and modern models have many such layers, so the communication is not a one-time tax but a per-layer one that accumulates with depth. It also interacts badly with imbalance: if routing is skewed, the GPUs holding the popular experts become stragglers, and because all-to-all is a synchronizing collective, the whole layer waits for the slowest device. Balanced load is therefore not just a quality concern but a throughput one — an unbalanced MoE wastes both compute and communication. Engineering effort in 2026 MoE serving goes heavily into overlapping this communication with computation, fusing the dispatch and combine steps, and keeping the busiest expert groups inside the fastest interconnect domain. None of these tricks remove the cost; they hide it behind work that would happen anyway.
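A rough estimate shows how quickly the per-layer tax adds up. The sketch below assumes routing is uniform across GPUs, so a fraction (G - 1) / G of token-to-expert assignments leaves the local device, and that each assignment moves one hidden-state vector in each direction; the model dimensions are hypothetical and real engines will differ.

```python
def all_to_all_bytes(tokens, d_model, top_k, num_gpus, num_moe_layers, bytes_per_elem=2):
    """Estimated bytes crossing GPU boundaries in one forward pass, assuming uniform routing."""
    remote_fraction = (num_gpus - 1) / num_gpus
    per_exchange = tokens * top_k * d_model * bytes_per_elem * remote_fraction
    return 2 * per_exchange * num_moe_layers      # dispatch + combine, at every MoE layer

# Hypothetical deployment: 4096 tokens in flight, 8 GPUs, 32 MoE layers, 16-bit activations.
traffic = all_to_all_bytes(tokens=4096, d_model=4096, top_k=2,
                           num_gpus=8, num_moe_layers=32)
print(f"{traffic / 1e9:.2f} GB moved per forward pass")   # about 3.76 GB
```

The estimate is linear in depth, top-k, and hidden size, and none of it is arithmetic. On a fast intra-node fabric it can be hidden behind compute; on slower inter-node links the same bytes become the latency floor.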

The memory-versus-FLOPs trade-off, stated cleanly

If you remember one thing about the Mixture-of-Experts LLM architecture, make it this: MoE converts a FLOPs problem into a memory-and-bandwidth problem. A dense model of equal quality would demand more arithmetic per token; the MoE version demands less arithmetic but insists you hold every expert resident and shuffle tokens between them. You have not eliminated cost, you have changed its currency — and which currency is cheaper depends entirely on your hardware. On accelerators that are compute-bound, MoE is a gift. On systems where memory capacity or interconnect bandwidth is the scarce resource — which describes most large-scale 2026 serving fleets — MoE can be the harder model to run economically, even though it does fewer FLOPs. The architecture does not promise lower cost in the abstract; it promises a specific reallocation of cost, and your job is to know whether that reallocation helps or hurts on the boxes you actually own.

Trade-offs, gotchas, and what goes wrong

MoE’s failure modes are not edge cases — they are the default behavior the architecture fights against. The biggest is router collapse: early in training the gating network finds a few experts that work and pours everything into them, starving the rest. The auxiliary loss exists specifically to prevent this, but it is a balancing act that can fail in either direction. Collapse is insidious because it can look fine on aggregate metrics for a while — loss keeps dropping — even as most of your expensive capacity quietly goes to waste, never receiving enough gradient to become useful. By the time it shows up as a quality ceiling, the model has spent its training budget unevenly and the under-trained experts are hard to rehabilitate.

Figure 5: How MoE breaks. Router collapse cascades into load imbalance, token dropping, and untrained idle experts; the mitigations are the auxiliary loss, routing noise, and capacity-factor tuning.

The everyday gotchas, each of which has bitten real teams:

  • Load imbalance at inference. A model balanced during training can still hot-spot at serving time, because real traffic is not your training distribution. A burst of similar prompts — many requests in one language, one domain, one format — can concentrate routing on a handful of experts. One GPU’s experts saturate while others idle, and because all-to-all synchronizes the layer, the idle GPUs wait for the busy one, capping throughput well below the spec sheet.
  • Token dropping in production. With small, bursty inference batches, capacity overflows are unpredictable. Dropped tokens silently lose an FFN transformation, degrading quality on exactly the prompts that triggered the imbalance. The failure is silent — there is no error, just a marginally worse answer — which makes it one of the hardest MoE issues to diagnose from the outside.
  • Memory you cannot escape. You pay VRAM for all experts even though most are idle per token. MoE saves FLOPs, never memory. A sparse model with a large total parameter count can need as many GPUs to hold it as a dense model of the same total size, even though it runs much faster once loaded.
  • All-to-all as the bottleneck. Across nodes, communication — not compute — often sets your latency floor. A model that looks cheap on FLOPs can be expensive to actually serve, and the gap only appears once you cross the boundary from a single fast-interconnect node to a multi-node deployment.
  • Fine-tuning fragility. Sparse expert models can be more sensitive to fine-tuning than dense ones; small or narrow datasets can re-collapse routing or unbalance experts learned during pretraining. A naive fine-tune that ignores the auxiliary loss can quietly undo the balance the base model spent its whole pretraining budget achieving.
  • Reproducibility and batching effects. Routing scores are computed per token, but capacity is enforced per batch, so which tokens an expert actually accepts, and whether any are dropped, can shift with batch composition. The same prompt can get subtly different treatment depending on what it was batched with, which complicates debugging and strict reproducibility.

A blunt corollary: MoE is not always the right tool. If your binding constraint is GPU memory rather than compute, if you serve mostly small or single-stream requests where batching cannot amortize the routing overhead, or if you lack fast multi-GPU interconnect, a well-tuned dense model of equivalent quality may be cheaper and far simpler to operate. Sparsity is a scaling technique; below a certain scale, its overheads outweigh its savings.

Practical recommendations

If you are evaluating or deploying a Mixture-of-Experts LLM architecture, work from the binding constraint backward, not from the parameter count forward.

  • Always get two numbers. Total parameters (memory) and active parameters (speed). Never reason from one.
  • Size for memory first. All experts must be resident. Budget VRAM for the full model, plan throughput around active params.
  • Treat interconnect as a first-class requirement. Keep experts within a high-bandwidth domain (intra-node NVLink-class links) where possible; inter-node all-to-all is where MoE serving degrades.
  • Tune capacity factor against your real traffic, not training defaults. Watch drop rates per layer in production, not just in eval.
  • Pick a serving engine that handles expert parallelism well and measure dispatch overhead on your own workload — benchmark, do not trust marketing FLOPs.
  • Monitor expert utilization continuously. A flat, balanced load histogram is a health signal; a spiky one predicts quality regressions.
  • For agentic, long-context systems, pair MoE serving decisions with your LLM agent memory architecture, since context length and routing both pull on the same memory budget.
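The utilization check in the list above can start as something very small. This sketch summarizes one layer's routing decisions into a load histogram, a max-over-mean ratio, and an idle-expert count; the metric choice and the function name are assumptions, and any alert threshold would need to be set against your own traffic.

```python
from collections import Counter

def utilization_report(expert_ids, num_experts):
    """Summarize routed-token counts per expert for one layer."""
    counts = Counter(expert_ids)
    load = [counts.get(e, 0) for e in range(num_experts)]
    mean = sum(load) / num_experts
    return {
        "load": load,
        "max_over_mean": max(load) / mean if mean else 0.0,   # 1.0 means perfectly flat
        "idle_experts": sum(1 for n in load if n == 0),
    }

# Hypothetical routing logs for one layer with 4 experts.
balanced = utilization_report([0, 1, 2, 3] * 4, num_experts=4)
spiky = utilization_report([0] * 12 + [1] * 4, num_experts=4)
print(balanced)   # load [4, 4, 4, 4], max_over_mean 1.0, no idle experts
print(spiky)      # load [12, 4, 0, 0], max_over_mean 3.0, two idle experts
```

Tracked per layer over time, a rising max-over-mean ratio is an early sign of the hot-spotting and token dropping described earlier, before it shows up as a latency or quality regression.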

Frequently asked questions

What is the difference between total and active parameters in an MoE model?

Total parameters are every weight the model stores, including all experts; they set the memory footprint. Active parameters are the subset actually used per token — shared layers plus the top-k experts that token is routed to — and they set the FLOPs, and therefore latency and compute cost. A sparse model can have very high total parameters but low active parameters, which is the entire point of the architecture.

How does MoE routing and load balancing actually work?

A small gating network scores every expert for each token, applies softmax, and selects the top-k. The token goes only to those experts, whose outputs are combined as a gate-weighted sum. To stop the router favoring a few experts, training adds an auxiliary load-balancing loss that pushes token distribution toward uniform across experts, and a capacity factor caps how many tokens each expert accepts per batch.

Are MoE models always better than dense models?

No. MoE vs dense models is a trade-off, not an upgrade. MoE gives more capacity per FLOP, so it wins when compute is the constraint. But it must hold every expert in memory and pays all-to-all communication cost at serving time, so dense models can be simpler and cheaper when memory or interconnect is the binding constraint. The right answer depends on your hardware.

What is router collapse and why does it matter?

Router collapse is when the gating network learns to route almost all tokens to a few experts, leaving the rest untrained and effectively wasted. It matters because it destroys the capacity advantage MoE was built for — you pay for many experts but train only a handful. The auxiliary load-balancing loss, routing noise during training, and bias-based balancing schemes are the standard defenses.

Why is expert parallelism inference so sensitive to network speed?

Because any token can route to any expert, and experts are spread across GPUs, every MoE layer triggers two all-to-all exchanges — one to dispatch tokens, one to gather outputs. All-to-all is bandwidth-heavy and latency-sensitive. Within a node, fast NVLink-class links absorb it; across nodes on slower fabric, this communication often becomes the latency bottleneck rather than the GPU compute itself.

Do MoE models drop tokens at inference, and does it hurt quality?

They can. A fixed capacity factor sets how many tokens each expert accepts per batch; overflow tokens skip the expert FFN and pass through the residual connection instead. During inference, with small bursty batches, capacity overflows are less predictable than in training, so quality can degrade on prompts that happen to overload an expert. Monitoring per-layer drop rates in production is the practical safeguard.

What is a shared expert and why do modern MoE designs use one?

A shared expert is a feed-forward network every token passes through unconditionally, in addition to the routed top-k experts. It absorbs the common transformation all tokens need, so the routed experts can specialize on the differences instead of each re-learning the same general-purpose function. This reduces redundancy across experts and guarantees every token a feed-forward path even if its routed expert is full. The shared expert is always active, so it counts toward active parameters and per-token FLOPs.

How many experts should an MoE model have?

There is no universal answer; it is a trade-off you tune. More experts at the same top-k raise total capacity and memory while keeping active compute flat, and the recent trend favors many small fine-grained experts over a few large ones, because more combinations give the router richer composition. But more experts mean heavier all-to-all traffic, more routing overhead, and a harder balancing problem. The right count depends on your quality target, memory budget, and interconnect — benchmark candidates on your own workload rather than copying another model’s layout.

By Riju — about.
