Llama 4 Explained: Scout, Maverick, and the Behemoth Behind Them
Llama 4 explained in one line: it is Meta’s open-weight flagship family, and the first Llama generation that is both built on a mixture-of-experts (MoE) architecture and natively multimodal. Meta’s GenAI team shipped the first two members, Scout and Maverick, in April 2025, with the much larger Behemoth still in training as a teacher model. Why it matters is simple: Llama 4 moved the open-weight frontier from dense transformers to sparse MoE, so you get the capacity of a very large model while paying to activate only 17 billion parameters per token.
That shift changes the economics of self-hosting a strong model. It also brings a 10-million-token context claim and early-fusion vision into open weights for the first time.
What this covers: lineage versus Llama 3, the MoE and iRoPE architecture, training and codistillation, real benchmark scores, the Community License, deployment and VRAM, honest limitations, and how the family compares to peers.
Lineage and Context (what changed vs Llama 3)
Llama 3 was a dense family. Every parameter fired on every token, which made the 70B and 405B variants accurate but expensive to serve. Llama 4 breaks that pattern entirely. It is the first Llama generation to use mixture-of-experts, and the first to be natively multimodal rather than bolting vision on afterward.
Two structural changes define the jump. First, MoE: only a small fraction of the network activates per token, so a 400B-class model runs at roughly the per-token compute of a 17B one, although all of its weights must still sit in memory. Second, early fusion: text and image tokens enter the same backbone from the start, instead of passing through a separate vision adapter.
Meta also widened the data. Scout and Maverick were trained on up to 40 trillion tokens spanning roughly 200 languages, a large step up from Llama 3’s corpus. The result is a family designed for long context and multimodal inputs from day one.
There is also a multimodal shift worth naming explicitly. In Llama 3, vision support arrived through separate adapter models trained after the language model. Llama 4 folds image understanding into the core training run via early fusion, so the same weights that answer a text question can reason about a chart, a screenshot, or a document scan. For teams building document-AI or visual-search products, that removes an entire integration layer.
If you have read our GPT-5.6 SOL deep dive, treat this as the open-weight sibling in the same series. GPT-5.6 represents the closed frontier; Llama 4 represents what you can download and run yourself. Comparing the two is the practical question most teams now face.
The naming also tells a story. Llama 4 ships as a “herd” of three: Scout, the long-context workhorse; Maverick, the high-quality generalist; and Behemoth, the enormous teacher. Only the first two are released as weights today.
There is a deeper reason the MoE move matters for open weights specifically. Closed labs can hide inference cost behind an API and amortize it across millions of users. When you self-host, every parameter you activate is your electricity bill and your GPU lease. A dense 405B model is brutal to serve on a budget; a sparse 400B model that activates 17B per token is suddenly tractable for a single team.
So Llama 4 is not just a quality release; it is a serving-economics release. The headline is not “bigger.” The headline is “the same active footprint as a mid-size model, with the knowledge capacity of a giant one.” That reframing is the thread that runs through every section below, from training to deployment.
One more lineage note. Llama 3 topped out around 405B dense parameters and a 128K context window. Llama 4 keeps active compute far smaller while pushing total capacity higher and context dramatically longer. The generation did not iterate on the same recipe; it changed the recipe.
Architecture (fig 1)
Llama 4 pairs a mixture-of-experts feed-forward design with interleaved attention. Each layer routes a token to a small set of experts plus a shared expert, so total capacity is large but active compute stays near 17B parameters. Vision and text share one backbone through early fusion.

Start with the MoE block. In a dense transformer, the feed-forward layer is one big matrix that every token passes through. In Llama 4, that layer becomes many smaller “expert” networks plus a router. The router scores experts per token and sends each token to a small subset, typically combined with one shared expert that always runs.
This is where the active-versus-total split comes from. Scout has 16 experts, about 109B total parameters, and 17B active per token. Maverick has 128 experts, roughly 400B total, and the same 17B active. In Maverick, MoE and dense layers alternate, so experts apply in only half the layers.
The payoff is leverage. You store a 400B-parameter model in memory, but each token only triggers a 17B-parameter forward pass. That is why Maverick can rival much larger dense models while staying affordable to serve.
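The active-versus-total split is easy to sanity-check with arithmetic. The sketch below uses the parameter counts quoted above plus a common rule of thumb (about 2 FLOPs per active parameter per token for a forward pass, ignoring attention cost). The `MoEModel` class and the rule of thumb are illustrative assumptions, not figures Meta publishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # billions of parameters that must sit in memory
    active_params_b: float  # billions of parameters used for each token

    def active_fraction(self) -> float:
        """Share of the stored weights that a single token actually touches."""
        return self.active_params_b / self.total_params_b

    def flops_per_token(self) -> float:
        """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
        return 2 * self.active_params_b * 1e9


SCOUT = MoEModel("Scout", total_params_b=109, active_params_b=17)
MAVERICK = MoEModel("Maverick", total_params_b=400, active_params_b=17)
DENSE_405B = MoEModel("Llama 3 405B (dense)", total_params_b=405, active_params_b=405)

for model in (SCOUT, MAVERICK, DENSE_405B):
    print(
        f"{model.name}: {model.active_fraction():.1%} of weights active, "
        f"{model.flops_per_token():.2e} FLOPs per token"
    )
```

Under these assumptions Scout and Maverick cost the same per token to compute, while memory requirements differ by the full total-parameter count.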
Attention is the second pillar. Llama 4 uses iRoPE, where “i” stands for interleaved attention layers and “RoPE” is rotary position embedding. Some layers use rotary embeddings; others use no explicit positional encoding at all. Meta credits this interleaving, combined with inference-time temperature scaling on attention, for the model’s extreme length generalization.
Why does interleaving help length? Standard rotary embeddings bake position into every layer, which can make a model brittle when sequences run far past what it saw in training. By interleaving layers that carry no positional encoding, Llama 4 leaves part of the network free to attend based on content alone, which Meta argues generalizes better to lengths never seen during training. Inference-time temperature scaling on attention then keeps the distribution stable as context grows.
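To make the interleaving idea concrete, here is a minimal sketch of a layer schedule and a log-shaped temperature factor. Everything numeric in it is an assumption for illustration: the interval of one position-free layer in every four, the `floor_scale` and `scale` constants, and the exact shape of the temperature function are not confirmed by the text above.

```python
import math


def layer_uses_rope(layer_idx: int, nope_interval: int = 4) -> bool:
    """True if the layer applies rotary embeddings.

    Assumption: every `nope_interval`-th layer carries no positional encoding.
    """
    return (layer_idx + 1) % nope_interval != 0


def attention_temperature(position: int, floor_scale: int = 8192, scale: float = 0.1) -> float:
    """Hypothetical scaling factor: 1.0 for short contexts, growing with log(position)."""
    return 1.0 + scale * math.log((position + 1) // floor_scale + 1)


schedule = ["rope" if layer_uses_rope(i) else "nope" for i in range(8)]
print(schedule)

for pos in (0, 100_000, 10_000_000):
    print(pos, round(attention_temperature(pos), 3))
```

The point of the sketch is the shape, not the constants: most layers see position, some attend on content alone, and the attention scale drifts slowly rather than linearly as the context grows.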
That brings us to context. Llama 4 Scout advertises an industry-leading 10-million-token context window, the headline number of the launch. Maverick ships with a smaller but still large window, listed at 1 million tokens in Meta’s model card. Treat the 10M figure as an architectural capability validated on retrieval and negative-log-likelihood tests, not a guarantee of strong reasoning at that length.
To put 10 million tokens in perspective, that is roughly the scale of a large codebase or a shelf of books held in working memory at once. The appeal for engineering teams is obvious: you could, in principle, drop an entire repository into context and ask questions across files without building a retrieval pipeline. The caveat from independent testing, covered below, is that holding text in context is not the same as reasoning over all of it reliably.
Multimodality is built in. Through early fusion, text tokens and image tokens flow through the same backbone, so the model reasons across modalities without a separate vision tower. Meta reports training with multiple images per example and strong image-grounding behavior.
The tokenizer and vocabulary were also refreshed for the multilingual, multimodal corpus. The net effect is one architecture that handles long documents, code, and images inside a single sparse network, rather than three specialized systems stitched together.
It helps to think about what the router actually decides. For every token at every MoE layer, a small gating network produces a score for each expert and picks the top candidates. Tokens about code might consistently route to one cluster of experts; tokens about a particular language might route to another. This specialization is learned, not hand-designed, and it is why a sparse model can hold so much knowledge without paying for all of it on each pass.
The shared expert is the safety net. Because one expert always runs regardless of routing, the model retains a common backbone of general capability even when the gating network makes an odd choice. This design reduces the risk that a poorly routed token falls through the cracks, and it tends to stabilize training compared to pure top-k routing with no shared component.
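The routing and shared-expert behavior described above can be sketched in a few lines of plain Python. This is a toy, not Meta's implementation: the softmax gate, the linear router, and the `moe_forward` name are assumptions chosen for readability, and real experts are feed-forward networks rather than the scaling functions used here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    router_weights: Sequence[Vector],  # one weight row per routed expert
    experts: Sequence[Expert],
    shared_expert: Expert,
    top_k: int = 1,
) -> Tuple[Vector, List[int]]:
    """Route one token: the shared expert always runs, plus the top_k routed experts."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda e: gates[e], reverse=True)[:top_k]

    out = shared_expert(x)  # always-on path, independent of the router's choice
    for e in chosen:
        expert_out = experts[e](x)
        out = [o + gates[e] * y for o, y in zip(out, expert_out)]
    return out, chosen


# Toy usage: two routed experts that scale the input, and an identity shared expert.
experts = [lambda v: [2.0 * a for a in v], lambda v: [3.0 * a for a in v]]
router = [[5.0, 0.0], [0.0, 5.0]]
output, picked = moe_forward([1.0, 0.0], router, experts, shared_expert=lambda v: list(v))
print(picked, output)
```

Even if the router picked the "wrong" expert here, the shared expert's contribution would still be in the output, which is the safety-net property in miniature.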
A practical consequence for engineers: latency on MoE models is not uniform. Throughput depends on how evenly tokens spread across experts in a batch. A batch that happens to hammer a few experts can bottleneck on those weights while others sit idle. Serving frameworks address this with expert-parallel layouts, but it is a real consideration when you size hardware and set batch policies.
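One way to see the batching problem is to count how many tokens each expert receives. The sketch below assumes you can log the router's expert choice per token (the `assignments` list is hypothetical data) and reports how far the busiest expert is from a perfectly even split.

```python
from collections import Counter
from typing import List, Sequence


def expert_load(assignments: Sequence[int], num_experts: int) -> List[int]:
    """Tokens received by each expert, in expert-index order."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance_ratio(assignments: Sequence[int], num_experts: int) -> float:
    """Busiest expert's load divided by the ideal even load (1.0 means balanced)."""
    if not assignments:
        raise ValueError("need at least one routed token")
    ideal = len(assignments) / num_experts
    return max(expert_load(assignments, num_experts)) / ideal


balanced = [0, 1, 2, 3]
skewed = [0, 0, 0, 1]
print(imbalance_ratio(balanced, num_experts=4))
print(imbalance_ratio(skewed, num_experts=4))
```

A ratio well above 1.0 on real traffic is a hint that a few experts are the bottleneck and that batch composition or expert placement deserves attention.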
Training (fig 2)
Llama 4’s training pipeline is unusual because the smaller released models learn from a larger unreleased one. Behemoth acts as the teacher; Scout and Maverick are codistilled from it during pre-training, then refined with a light post-training recipe focused on hard prompts.

Pre-training scale is the foundation. Meta reports training Scout and Maverick on up to 40 trillion tokens of text and image data covering roughly 200 languages. Early fusion means the model sees text and images together throughout pre-training, not in a separate stage.
The compute and exact data mix are not fully disclosed. Meta describes using FP8 precision and large GPU clusters, but precise GPU-hour counts should be treated as reported or estimated rather than confirmed. We will not invent a number Meta did not publish.
Behemoth is central to the story even though you cannot download it. It is a roughly two-trillion-parameter MoE model with about 288B active parameters and 16 experts, still in training at launch as a teacher. Meta uses it to codistill knowledge into the smaller models so they punch above their active-parameter weight.
Post-training is deliberately lightweight. Meta describes a sequence of lightweight supervised fine-tuning, online reinforcement learning, and direct preference optimization (DPO). The notable choice was pruning easy examples from SFT and concentrating RL on harder prompts, which Meta says preserved exploration and improved reasoning.
Codistillation deserves a closer look because it is the mechanism behind the family’s efficiency. In ordinary distillation, a smaller “student” learns to mimic a larger “teacher” after the teacher is finished. Codistillation runs the process during pre-training, so the students absorb the teacher’s signal while it is still learning. Meta describes using a loss that blends soft targets from Behemoth with hard targets, weighted dynamically.
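A blended soft-and-hard loss of the kind described can be sketched as follows. Meta has not published the exact formulation, so treat this as a generic distillation loss under stated assumptions: the soft term is cross-entropy against the teacher's distribution, the hard term is ordinary cross-entropy against the true token, and the linear `soft_weight_schedule` is a hypothetical stand-in for "weighted dynamically".

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blended_distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    target_index: int,
    soft_weight: float,
) -> float:
    """soft_weight * CE(teacher, student) + (1 - soft_weight) * CE(label, student)."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    hard = -math.log(student[target_index])
    soft = -sum(t * math.log(s) for t, s in zip(teacher, student))
    return soft_weight * soft + (1.0 - soft_weight) * hard


def soft_weight_schedule(step: int, total_steps: int, start: float = 0.9, end: float = 0.1) -> float:
    """Hypothetical linear schedule that shifts weight from teacher to labels over training."""
    return start + (end - start) * step / total_steps


loss = blended_distillation_loss(
    student_logits=[2.0, 0.5, -1.0],
    teacher_logits=[3.0, 0.0, -2.0],
    target_index=0,
    soft_weight=soft_weight_schedule(step=0, total_steps=100),
)
print(round(loss, 4))
```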
The result is leverage you cannot get by training a small model alone. Scout and Maverick inherit reasoning patterns from a two-trillion-parameter teacher without ever needing to be that large at inference. This is the clearest reason Maverick can post GPT-4o-class scores with only 17B active parameters, and it is why Behemoth matters even though you cannot download it.
The post-training curriculum choice is worth dwelling on too. Meta reports deliberately removing easy prompts from the supervised set so the model was not over-trained on trivial examples, then concentrating reinforcement learning on genuinely hard prompts. The stated goal was to preserve the model’s ability to explore difficult problems rather than collapsing onto safe, generic answers.
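The curriculum idea can be illustrated with a simple filter. The difficulty signal below (`pass_rate`, the share of sampled answers a judge marked correct) and every threshold are hypothetical; Meta's materials describe the intent, not a recipe like this one.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ScoredPrompt:
    text: str
    pass_rate: float  # hypothetical difficulty signal: 1.0 = always solved, 0.0 = never


def split_curriculum(
    prompts: Sequence[ScoredPrompt],
    easy_above: float = 0.8,                     # assumed cutoff for "easy"
    rl_band: Tuple[float, float] = (0.05, 0.5),  # assumed band of hard-but-solvable prompts
) -> Tuple[List[ScoredPrompt], List[ScoredPrompt]]:
    """Drop easy prompts from the SFT set; keep hard-but-solvable prompts for RL."""
    sft_set = [p for p in prompts if p.pass_rate <= easy_above]
    low, high = rl_band
    rl_set = [p for p in prompts if low <= p.pass_rate <= high]
    return sft_set, rl_set


pool = [
    ScoredPrompt("add two small numbers", 0.95),
    ScoredPrompt("summarize a contract clause", 0.60),
    ScoredPrompt("prove a combinatorial identity", 0.30),
    ScoredPrompt("open research question", 0.00),
]
sft, rl = split_curriculum(pool)
print([p.text for p in sft])
print([p.text for p in rl])
```

The lower bound on the RL band reflects a design choice in this sketch: a prompt the model never solves yields no reward signal to learn from.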
The honest caveat: codistillation, FP8 details, and the RL curriculum are described at a high level in Meta’s materials, but reproducible recipes are not fully open. Label these as reported. What is verifiable is the outcome, which we turn to next.
Capabilities and Benchmarks (fig 3)
On standard reasoning and coding benchmarks, Llama 4 Maverick is competitive with strong closed models from the prior generation while activating only 17B parameters. The figures below come from Meta’s published evaluations and independent coverage; treat all benchmark numbers as point-in-time and subject to contamination caveats.

Maverick is the headline performer. Meta reports Maverick at 80.5 on MMLU-Pro versus GPT-4o’s 72.2, and 69.8 on GPQA Diamond versus GPT-4o’s 53.6. On LiveCodeBench, Maverick reached 43.4 against GPT-4o’s 32.3. Those are meaningful margins for an open-weight model with 17B active parameters.
Scout is the efficiency story. Reported scores put Scout at 74.3 on MMLU-Pro, 57.2 on GPQA Diamond, and 32.8 on LiveCodeBench. Those are lower than Maverick’s, but strong for a model that fits on a single GPU with Int4 quantization.
Here is a consolidated view of the reported numbers:
| Benchmark | Scout | Maverick | GPT-4o (ref) |
|---|---|---|---|
| MMLU-Pro | 74.3 | 80.5 | 72.2 |
| GPQA Diamond | 57.2 | 69.8 | 53.6 |
| LiveCodeBench | 32.8 | 43.4 | 32.3 |
| Active params | 17B | 17B | undisclosed |
| Total params | ~109B | ~400B | undisclosed |
LMArena needs a careful caveat. An experimental, chat-tuned version of Maverick scored an Elo of about 1417 on LMArena, near the top of the board at launch. But that experimental build differed from the released weights, and the standard Maverick release ranked far lower. This gap drew criticism and is worth remembering when you read leaderboard headlines.
Long context is the other caveat. Independent testing found that while Llama 4 performs well on standard tests, it struggled on some long-context tasks relative to the 10M-token headline. A large context window is not the same as strong reasoning across that whole window.
Contamination is the caveat that applies to every number in that table. Popular benchmarks leak into training corpora over time, so a model can score well partly because it has seen similar questions. This is not a Llama-specific problem, but it means you should weight benchmark deltas less than your own evaluation on private data. A two-point gap on MMLU-Pro is within the noise of contamination and prompt formatting.
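Plain sampling noise is a separate reason to distrust small gaps, and it is easy to estimate. The sketch below computes a binomial standard error for two of the reported GPQA Diamond scores, assuming the 198-question Diamond set and treating the two models' results as independent (a paired comparison on per-question outcomes would be tighter, but those are not available here).

```python
import math


def accuracy_standard_error(accuracy: float, num_questions: int) -> float:
    """Binomial standard error of an accuracy measured on num_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


def gap_in_standard_errors(acc_a: float, acc_b: float, num_questions: int) -> float:
    """How many standard errors separate two accuracies (independent-sample assumption)."""
    se_gap = math.sqrt(
        accuracy_standard_error(acc_a, num_questions) ** 2
        + accuracy_standard_error(acc_b, num_questions) ** 2
    )
    return (acc_a - acc_b) / se_gap


GPQA_DIAMOND_QUESTIONS = 198  # assumed size of the Diamond subset

z = gap_in_standard_errors(0.572, 0.536, GPQA_DIAMOND_QUESTIONS)
print(f"Scout vs GPT-4o on GPQA Diamond: {z:.2f} standard errors apart")
```

Under these assumptions the Scout versus GPT-4o gap on GPQA Diamond is less than one standard error, before contamination or prompt formatting is even considered.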
The practical read: Maverick is a credible GPT-4o-class open model on knowledge and coding, Scout trades some quality for single-GPU deployability, and the 10M-token claim is an architectural capability you should validate on your own retrieval workloads before trusting it.
For most teams, the most useful evaluation is not a public leaderboard at all. Build a small golden set of 50 to 200 examples that mirror your real workload, run both Scout and Maverick against it at the quantization you plan to ship, and measure accuracy, latency, and cost together. Leaderboards tell you the model is in the right class; your golden set tells you whether it is right for you.
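A golden-set harness does not need to be elaborate. The sketch below measures accuracy, latency, and input cost together for any callable that maps a prompt to an answer. The exact-match scoring, the whitespace token counter, and all names are simplifying assumptions; swap in your own grader, your provider's tokenizer, and real pricing.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    prompt: str
    expected: str


@dataclass(frozen=True)
class EvalResult:
    accuracy: float
    mean_latency_s: float
    input_cost_usd: float


def evaluate(
    model: Callable[[str], str],
    golden_set: Sequence[Example],
    usd_per_million_input_tokens: float,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),  # crude stand-in
) -> EvalResult:
    """Run every example once and report accuracy, mean latency, and input cost."""
    if not golden_set:
        raise ValueError("golden set is empty")
    correct, latency, tokens = 0, 0.0, 0
    for example in golden_set:
        start = time.perf_counter()
        answer = model(example.prompt)
        latency += time.perf_counter() - start
        correct += int(answer.strip() == example.expected.strip())  # exact-match grading
        tokens += count_tokens(example.prompt)
    n = len(golden_set)
    return EvalResult(correct / n, latency / n, tokens / 1_000_000 * usd_per_million_input_tokens)


# Toy usage with a fake model; replace with a real client call.
golden = [Example("What is 2+2?", "4"), Example("Capital of France?", "Paris")]
fake_model = lambda prompt: "4" if "2+2" in prompt else "unknown"
result = evaluate(fake_model, golden, usd_per_million_input_tokens=0.15)
print(result)
```

Running the same harness against Scout and Maverick at the quantization you plan to ship gives you the three numbers side by side on your own data.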
Access and Deployment (fig 4)
Llama 4 ships as open weights under the Llama 4 Community License. You can download both Scout and Maverick, quantize them, and serve them yourself, or rent them through cloud and managed APIs. The license is permissive for most teams but carries real conditions.

Start with where to get it. The weights are published on Hugging Face and on llama.com, and the models are also offered through major cloud providers and inference platforms. For most builders, Hugging Face plus a serving stack like vLLM or TGI is the fastest path.
The Community License is open weights with conditions, not pure open source. Three terms matter most. Companies with more than 700 million monthly active users must request a separate license from Meta, granted at Meta’s sole discretion. You must display “Built with Llama” prominently. And derivative models must include “Llama” in their name.
Hardware is the most practical question. Scout is engineered to fit on a single NVIDIA H100 GPU with Int4 quantization, which is the standout deployment claim of the family. Meta provides on-the-fly Int4 quantization code to minimize quality loss. In BF16, Scout needs more memory and typically a multi-GPU setup.
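Maverick is heavier. With roughly 400B total parameters, it needs a multi-GPU node even though only 17B activate per token, because all experts must be resident in memory. Plan for an H100 node, not a single card. At BF16 the weights alone come to roughly 800 GB (400B parameters at 2 bytes each), more than the 640 GB on an 8-GPU H100 host, so fitting on one node in practice means a reduced-precision build such as FP8.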
Quantization is your main lever. Int4 makes Scout single-GPU friendly; community GGUF builds push it onto smaller hardware with further quality trade-offs. Always benchmark a quantized build on your own tasks before committing.
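The memory claims above follow from simple arithmetic on weight storage. The sketch below estimates weight memory only, assuming 80 GB per H100 and an arbitrary 90% usable fraction; it ignores KV cache, activations, and framework overhead, all of which grow with context length and batch size.

```python
import math

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_b: float, precision: str) -> float:
    """Weight storage in GB: billions of parameters times bytes per parameter."""
    return total_params_b * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(
    total_params_b: float,
    precision: str,
    gpu_memory_gb: float = 80.0,   # H100 80 GB
    usable_fraction: float = 0.9,  # assumed headroom for runtime overhead
) -> int:
    """Smallest GPU count whose combined usable memory holds the weights alone."""
    needed = weight_memory_gb(total_params_b, precision)
    return math.ceil(needed / (gpu_memory_gb * usable_fraction))


for name, total_b in (("Scout", 109), ("Maverick", 400)):
    for precision in ("bf16", "fp8", "int4"):
        print(
            f"{name} {precision}: {weight_memory_gb(total_b, precision):.1f} GB weights, "
            f">= {min_gpus_for_weights(total_b, precision)} GPU(s)"
        )
```

Scout's Int4 weights come to about 54.5 GB under this estimate, which is why the single-H100 claim is plausible; how much context you can actually serve alongside them depends on the memory left over.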
On cost, managed APIs are inexpensive. Reported Maverick pricing runs roughly $0.15 to $0.63 per million input tokens depending on provider, with some hosted options near $0.08 per million input tokens. For low-to-moderate volume, a managed API is often cheaper than running your own H100 node.
Latency behaves differently than on dense models. Time-to-first-token depends on prefill over your prompt, which for long Llama 4 contexts can dominate. Per-token generation speed, by contrast, tracks the 17B active footprint and is fast for the model’s apparent size. If your prompts are long, optimize prefill and consider prompt caching; if they are short, you will mostly feel the cheap decode step.
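A two-term latency model captures the prefill-versus-decode split. The throughput figures in the usage lines are hypothetical placeholders, not measurements of any Llama 4 deployment; measure your own stack and substitute real numbers.

```python
from typing import Tuple


def estimate_latency_s(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,
    decode_tokens_per_s: float,
) -> Tuple[float, float]:
    """Return (time_to_first_token, total_latency) in seconds."""
    ttft = prompt_tokens / prefill_tokens_per_s
    decode = output_tokens / decode_tokens_per_s
    return ttft, ttft + decode


# Hypothetical throughputs: 10,000 prefill tokens/s and 50 decode tokens/s.
long_prompt = estimate_latency_s(1_000_000, 500, 10_000.0, 50.0)
short_prompt = estimate_latency_s(1_000, 500, 10_000.0, 50.0)
print("long prompt (ttft, total):", long_prompt)
print("short prompt (ttft, total):", short_prompt)
```

With a very long prompt, prefill dominates the total; with a short one, nearly all the time is decode. That is the reason prompt caching pays off mainly for long, repeated contexts.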
A simple decision rule helps here. If your monthly token volume is low or spiky, a managed API almost always wins on total cost because you avoid idle GPU time. If your volume is high and steady, and especially if data residency matters, a self-hosted Scout or Maverick node amortizes well and gives you full control. Many teams run a hybrid: managed API for burst traffic, self-hosted for the steady base load.
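The decision rule reduces to a break-even calculation. The sketch below compares a flat monthly GPU bill with per-token API pricing. The $2.00 per GPU-hour rate is a hypothetical placeholder, and the model ignores output-token pricing, engineering time, and whether the node can actually sustain the break-even throughput.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def self_host_monthly_usd(gpu_count: int, usd_per_gpu_hour: float) -> float:
    """Cost of keeping a node running all month, regardless of utilization."""
    return gpu_count * usd_per_gpu_hour * HOURS_PER_MONTH


def api_monthly_usd(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cost of sending the same token volume to a managed API."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens_per_month(
    gpu_count: int, usd_per_gpu_hour: float, usd_per_million_tokens: float
) -> float:
    """Monthly token volume at which self-hosting and the API cost the same."""
    monthly = self_host_monthly_usd(gpu_count, usd_per_gpu_hour)
    return monthly / usd_per_million_tokens * 1_000_000


# Hypothetical 8-GPU node at $2.00 per GPU-hour versus $0.15 per million input tokens.
node_cost = self_host_monthly_usd(gpu_count=8, usd_per_gpu_hour=2.0)
threshold = break_even_tokens_per_month(8, 2.0, 0.15)
print(f"node: ${node_cost:,.0f}/month, break-even at {threshold:.3e} tokens/month")
```

Below the threshold the API is cheaper on this simple model; above it, self-hosting starts to amortize, provided the hardware can serve that volume.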
If you are wiring Llama 4 into a product, route it through a gateway rather than calling providers directly. Our LLM gateway architecture guide covers fallback, caching, and cost controls that apply directly to open-weight deployments.
Limitations, Safety, and Failure Modes
Llama 4 is strong, but it has specific weak spots worth naming. The most discussed is long-context reasoning. The 10M-token window is a real architectural achievement, yet independent tests showed degraded performance on demanding long-context tasks. Retrieval near the limit is not the same as reasoning across it.
The LMArena episode is a credibility lesson. The high-scoring leaderboard entry was an experimental chat-tuned build, not the released weights.
Licensing Nuances and Practical Fine-Tuning
The Llama 4 Community License is open weights with conditions, and the conditions matter for production planning. Organizations above the 700-million monthly-active-user threshold must request a separate license from Meta, derivative models must carry the Llama name, and a “Built with Llama” attribution is required. At launch, access was also restricted for entities domiciled in the European Union, a constraint that shaped early adoption and that teams must verify against current terms before deploying.
For practitioners, the mixture-of-experts design changes the fine-tuning calculus. Although only a fraction of parameters are active per token, full fine-tuning still requires holding the entire expert set in memory, so most teams reach for parameter-efficient methods such as LoRA or QLoRA on the routed and shared experts. Quantization to Int4 makes Scout tractable on a single high-memory GPU, while Maverick remains a multi-GPU proposition. Evaluate any fine-tune on long-context and routing-sensitive tasks, since adapter training can perturb expert routing in ways that a short benchmark will not reveal.
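The appeal of LoRA is visible in the parameter counts. A rank-r adapter on a weight matrix of shape d_in by d_out adds r × (d_in + d_out) trainable parameters instead of d_in × d_out. The dimensions in the usage lines are hypothetical round numbers, not Llama 4's actual layer sizes.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: (d_in x rank) plus (rank x d_out)."""
    return rank * (d_in + d_out)


def full_trainable_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole matrix is fine-tuned."""
    return d_in * d_out


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the matrix's parameter count that LoRA actually trains."""
    return lora_trainable_params(d_in, d_out, rank) / full_trainable_params(d_in, d_out)


# Hypothetical 4096 x 4096 projection with a rank-8 adapter.
print(lora_trainable_params(4096, 4096, rank=8))
print(f"{trainable_fraction(4096, 4096, rank=8):.4%}")
```

The base weights still have to be loaded, which is why quantizing them (the QLoRA approach) matters as much for MoE models as the adapter size itself.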
