DeepSeek V4 Explained: Architecture, Sparse Attention, Benchmarks, and Deployment

The DeepSeek V4 architecture is the clearest signal yet that the frontier of open-weights language models is no longer about raw parameter counts; it is about how few of those parameters you actually have to touch per token. DeepSeek V4 is a Mixture-of-Experts (MoE) family released by the Chinese lab DeepSeek on April 24, 2026, in two variants: a flagship V4-Pro reported at roughly 1.6 trillion total parameters with about 49 billion activated per token, and a leaner V4-Flash at about 284 billion total with roughly 13 billion activated. Both ship with a one-million-token context window and up to 384K output tokens, and both are released as downloadable open weights on Hugging Face. What makes V4 matter is not the trillion-parameter headline but the engineering underneath: a hybrid attention scheme that reportedly cuts the memory and compute cost of long-context inference to a fraction of the previous generation’s, while landing top-tier agentic-coding scores.

What this covers: the DeepSeek lineage from V2 to R1 to V4; the MoE and hybrid-attention architecture in detail; the training pipeline; capabilities and benchmarks with their caveats; access, hardware, and deployment realities; limitations and failure modes; and a head-to-head decision matrix against its 2026 peers.

Lineage and Context

DeepSeek V4 did not appear from nowhere. It is the fifth major architectural iteration in a family that has spent two years methodically attacking the two most expensive parts of transformer inference — attention memory and dense feed-forward compute — and V4 is the point where those threads converge.

The story starts with DeepSeek V2 (2024), which introduced two ideas the lab has carried forward ever since. The first was Multi-head Latent Attention (MLA), a technique that compresses the key-value (KV) cache into a low-rank latent space so that long sequences consume far less memory than standard multi-head attention. The second was DeepSeekMoE, a Mixture-of-Experts feed-forward design that splits the network into many fine-grained experts plus a set of always-on “shared” experts, routing each token to only a small subset. Together they let DeepSeek train a large model but pay for only a slice of it at inference time.

DeepSeek V3 (late 2024 / early 2025) scaled that recipe to 671 billion total parameters with roughly 37 billion activated per token, refined the MoE load-balancing so experts were used evenly without an auxiliary-loss penalty, and adopted FP8 mixed-precision training to cut compute cost. The incremental V3.1 and V3.2 releases through 2025 pushed context length, tool-use reliability, and inference efficiency further — V3.2 in particular became the efficiency baseline DeepSeek later measured V4 against.

Running in parallel was the R-series. DeepSeek R1 (2025) was a reasoning-specialised model trained heavily with reinforcement learning — notably Group Relative Policy Optimization (GRPO) and reinforcement learning with verifiable rewards (RLVR) — to produce long, self-checking chains of thought. R1 mattered because it showed a relatively small lab could match closed reasoning models, and because its reasoning traces became a distillation source for later general models. If you want the broader competitive framing of how these reasoning models stack up in practice, see our industrial reasoning benchmark comparison of Llama 4, DeepSeek V3, and Claude Sonnet.

DeepSeek V4, then, is the merger: it takes the MLA/MoE efficiency lineage of the V-series, folds in the reasoning gains distilled from the R-series, and adds a new hybrid attention mechanism built for the 1M-token era. For context on how the field frames MoE routing and open-weights economics more broadly, the Mixture-of-Experts overview on Hugging Face’s documentation is a useful primer on why activated-parameter counts, not totals, drive cost.

Architecture

At a high level, the DeepSeek V4 architecture is a decoder-only transformer whose two most expensive subsystems, attention and the feed-forward network, have both been made sparse. Attention is handled by a hybrid of two compression schemes that shrink the KV cache; the feed-forward layer is a fine-grained Mixture-of-Experts that activates only a fraction of its parameters per token. The net effect, per DeepSeek, is a model that behaves like a trillion-parameter network in quality but costs closer to a mid-sized dense model at inference.

Mixture-of-Experts: total versus activated parameters

The single most important number to internalise about V4 is the gap between total and activated parameters. V4-Pro is reported at roughly 1.6 trillion total parameters but activates only about 49 billion per token; V4-Flash carries about 284 billion total and activates around 13 billion. That ratio — activating on the order of 3–5% of the network per token — is the whole point of MoE. You store an enormous, knowledge-dense model on disk and in GPU memory, but each forward pass only lights up a handful of experts.
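The ratio is easy to check by hand. The short sketch below recomputes it from the reported figures quoted above; the numbers are vendor-reported, and the function and dictionary names are ours, chosen only for illustration.

```python
# Reported figures from the text above (billions of parameters).
# Treat them as vendor-reported, not independently verified.
VARIANTS = {
    "V4-Pro": {"total_b": 1600.0, "active_b": 49.0},
    "V4-Flash": {"total_b": 284.0, "active_b": 13.0},
}

def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters touched per token, as a value between 0 and 1."""
    return active_b / total_b

for name, p in VARIANTS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
```

With these inputs the fractions come out at about 3.1% for V4-Pro and 4.6% for V4-Flash, which is where the "3–5%" range comes from.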

Mechanically, each MoE layer replaces the standard dense feed-forward block with a large pool of smaller expert networks plus a lightweight router. For every token, the router computes a score over the experts and selects the top-k (a small number, typically single digits). Only those selected experts run; the rest sit idle for that token. DeepSeek’s design, inherited from DeepSeekMoE, also keeps a set of shared experts that process every token regardless of routing. The shared experts capture common, general-purpose transformations, which frees the routed experts to specialise on narrower patterns — a division of labour that improves both quality and routing stability.
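The routing step described above can be sketched in a few lines. This is a generic top-k MoE illustration, not DeepSeek's implementation: the softmax gate, the scalar "experts", the expert count of 8, and top_k=2 are all assumptions made to keep the example small and runnable.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Return (expert_index, gate_weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise gates over the chosen experts
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, router_scores, routed_experts, shared_experts, top_k=2):
    """Shared experts always run; only the top_k routed experts run for this token."""
    out = sum(expert(x) for expert in shared_experts)
    for idx, gate in route_token(router_scores, top_k):
        out += gate * routed_experts[idx](x)
    return out

# Toy setup (hypothetical): 8 routed experts and 1 shared expert acting on a scalar.
routed = [lambda x, k=k: x * (k + 1) for k in range(8)]
shared = [lambda x: x + 1.0]
scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

print(route_token(scores, top_k=2))
print(moe_layer(1.0, scores, routed, shared, top_k=2))
```

Only two of the eight routed experts execute for this token, while the shared expert contributes regardless of the router's choice. In a real model the experts are feed-forward networks over vectors, but the control flow is the same.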

The router is also where MoE models are hardest to train. If routing is unbalanced, a few experts get overused (and become bottlenecks) while others are starved and never learn. DeepSeek’s V3-era work on auxiliary-loss-free load balancing carries into V4: the model nudges tokens toward under-used experts through bias terms rather than a penalty that fights the main objective. DeepSeek has not published a full per-layer config, so the exact expert count, top-k value, and shared-expert ratio for V4 should be treated as not officially disclosed; the architectural family itself is well established.
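To make the bias-based balancing idea concrete, here is a minimal simulation. It is a sketch under stated assumptions, not DeepSeek's training code: the fixed step size, the four experts, the skewed random scores, and the sign-based update rule are illustrative choices. The bias is applied only when selecting experts, which is how we understand the V3-era description.

```python
import random

def select_experts(scores, bias, top_k):
    """Pick top_k experts by (score + bias); the bias affects selection only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    return ranked[:top_k]

def update_bias(bias, load, step=0.01):
    """Lower the bias of overloaded experts and raise it for under-used ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - step)
        elif n < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

random.seed(0)
num_experts, top_k = 4, 1
skew = [0.5, 0.0, 0.0, 0.0]  # hypothetical: raw scores favour expert 0
bias = [0.0] * num_experts

for _ in range(200):          # 200 simulated batches
    load = [0] * num_experts
    for _ in range(256):      # 256 tokens per batch
        scores = [random.random() + s for s in skew]
        for i in select_experts(scores, bias, top_k):
            load[i] += 1
    bias = update_bias(bias, load)

print("last-batch load:", load)
print("learned bias:", [round(b, 2) for b in bias])
```

Because no extra loss term is added, the balancing pressure never competes with the language-modelling gradient; it only shifts which experts win the top-k selection.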

Hybrid attention: CSA plus HCA

The genuinely new piece in V4 is its hybrid attention, which combines two mechanisms: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Both are descendants of the MLA idea — compress what you store — but they attack the long-context problem from complementary angles.

Standard attention has two costs that explode with sequence length. The first is compute: attention is quadratic, so a 1M-token prompt naively implies an astronomical number of pairwise interactions. The second is memory: every token’s keys and values must be cached so later tokens can attend to them, and that KV cache grows linearly with context and can dwarf the model weights themselves at long context. Compressed Sparse Attention addresses the compute side by making attention sparse — each token attends to a compressed, selected subset of prior positions rather than every single one — while Heavily Compressed Attention addresses the memory side by storing a much more aggressively compressed representation of the KV state, trading a little fidelity for a large reduction in cache footprint. Blending the two lets the model keep fine-grained local attention where precision matters and heavily compressed global attention where a coarse summary suffices.
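The two scaling laws in the paragraph above can be put into numbers. The sketch below assumes plain multi-head attention with no compression, and the model shape is hypothetical (it is not V4's configuration, which is not assumed here); it only shows how linear cache growth and quadratic pair counts behave as context grows.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed KV cache: one key and one value vector per token, layer, and head."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

def causal_attention_pairs(seq_len):
    """Query-key pairs under causal masking: n * (n + 1) / 2."""
    return seq_len * (seq_len + 1) // 2

# Hypothetical dense-model shape, chosen only for illustration (NOT V4's config).
shape = dict(n_layers=60, n_kv_heads=64, head_dim=128)

for n in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(n, **shape) / 2**30
    pairs = causal_attention_pairs(n)
    print(f"{n:>9} tokens: KV cache ~{gib:,.0f} GiB, {pairs:.2e} attention pairs per layer")
```

Under this assumed shape a single 1M-token request needs well over a terabyte of uncompressed cache, which is why both sparsity (fewer pairs) and compression (fewer cached bytes) are needed at this context length.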

The headline efficiency claim, which should be read as reported by DeepSeek rather than independently verified, is striking: in the 1M-token setting, DeepSeek-V4-Pro reportedly requires only about 27% of the single-token inference FLOPs and roughly 10% of the KV cache compared to DeepSeek-V3.2. If that holds up under third-party measurement, it is the difference between long-context inference being a research curiosity and a production-affordable feature. KV-cache size is frequently the binding constraint on how many concurrent long-context requests a GPU can serve, so a 10x reduction there directly multiplies throughput. We go deeper on why the KV cache dominates long-context serving economics in our guide to KV-cache optimization for LLM inference.
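A back-of-the-envelope calculation shows why the KV-cache ratio matters for serving. The two ratios below are the figures reported by DeepSeek; the memory budget and the baseline per-request cache size are hypothetical placeholders, and the model assumes the KV cache is the only constraint on concurrency.

```python
def max_concurrent_requests(kv_budget_gib: float, kv_per_request_gib: float) -> int:
    """How many requests fit if KV-cache memory is the only constraint."""
    return int(kv_budget_gib // kv_per_request_gib)

REPORTED_KV_RATIO = 0.10     # reported: ~10% of the V3.2 KV cache at 1M tokens
REPORTED_FLOPS_RATIO = 0.27  # reported: ~27% of the V3.2 per-token FLOPs

# Hypothetical serving node: memory left for KV cache after loading weights.
kv_budget_gib = 400.0
# Hypothetical baseline cost of one 1M-token request.
baseline_kv_per_request_gib = 40.0

before = max_concurrent_requests(kv_budget_gib, baseline_kv_per_request_gib)
after = max_concurrent_requests(
    kv_budget_gib, baseline_kv_per_request_gib * REPORTED_KV_RATIO
)
print(f"concurrent 1M-token requests: {before} -> {after}")
print(f"per-token compute vs baseline: {REPORTED_FLOPS_RATIO:.0%}")
```

Real deployments are also bounded by compute, bandwidth, and scheduler behaviour, so the measured gain will usually be smaller than this memory-only estimate.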

Context window, tokenizer, and modalities

Both V4 variants support a 1M-token context window and can emit up to 384K output tokens in a single generation — enough to ingest an entire codebase, a book-length document set, or a long multi-turn agent trajectory, and to produce very large structured outputs like full file rewrites. It is worth being precise here: a 1M window is a capacity ceiling, not a promise that the model reasons with equal sharpness across the whole span. (More on that in the limitations section.)

The tokenizer and precise vocabulary details for V4 follow the DeepSeek lineage but are best read from the official model card rather than assumed; where the config is not published, treat specifics as not officially disclosed. DeepSeek’s public positioning of V4 centres on text and code: it is a language and reasoning model first. Multimodal capabilities should not be stated as fact unless the model card confirms them, so this deep-dive treats V4 as a text/code model and flags richer modality support as undisclosed unless DeepSeek documents it.

Training

DeepSeek has historically published unusually candid technical reports, but at the time of writing the full V4 report should be treated as the authoritative source for anything below the architectural level. What follows describes the pipeline shape the DeepSeek family uses, with vendor-specific numbers labelled as reported or undisclosed.

Pretraining. V4 is pretrained on a very large corpus spanning web text, code, and mathematics, with the MoE network trained on a standard next-token prediction objective. The exact token count and data mixture for V4 are not officially disclosed in a single headline figure the way parameter counts are; DeepSeek’s prior generation trained on the order of trillions of tokens, and V4 is understood to extend that. A defining feature of DeepSeek’s pretraining engineering has been FP8 mixed-precision training, which cuts memory and compute per step and was a big part of how the lab trained frontier-scale models on a comparatively modest cluster. Training a 1.6T-parameter MoE stably requires careful attention to router balancing, precision, and gradient stability across experts; the fact that DeepSeek can do this at all on constrained hardware is a large part of the model’s strategic significance.

Compute. DeepSeek’s earlier releases were notable precisely because their reported training cost was far lower than Western frontier labs’, achieved through the MoE-plus-FP8 efficiency stack and heavy systems optimisation. For V4 specifically, treat any exact GPU-hour or dollar figure as reported/estimated unless the technical report states it — the sparse-activation design means the effective compute per token is much lower than the total parameter count implies, but the absolute training budget for a 1.6T model is still substantial.

Post-training. This is where V4 inherits the R-series gains. The pipeline layers several stages on top of the pretrained base:

Supervised fine-tuning (SFT) on curated instruction-following and problem-solving data to make the base model usable as an assistant.
Reinforcement learning, using DeepSeek’s GRPO (Group Relative Policy Optimization) and RLVR (reinforcement learning with verifiable rewards). RLVR is particularly effective for domains like math and coding where correctness can be checked automatically — the reward comes from whether the answer actually passes tests or evaluates correctly, not from a learned preference model that can be gamed.
Reasoning distillation from the R-series. The long, self-correcting reasoning traces that R1 and its successors learned through RL can be distilled into a general model, giving V4 stronger step-by-step reasoning without forcing every response into a verbose chain of thought.
Safety and alignment tuning via preference optimisation to shape refusals, tone, and behaviour.
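The reinforcement-learning stage is easier to picture with a toy example. The sketch below shows the group-relative advantage at the heart of GRPO, paired with a trivially verifiable reward. The exact-match verifier, the sample answers, and the epsilon value are illustrative assumptions; a real pipeline would run test suites or math checkers and feed the advantages into a policy-gradient update.

```python
import statistics

def verifiable_reward(candidate_answer: str, expected: str) -> float:
    """Toy verifier: 1.0 if the answer matches after stripping whitespace, else 0.0."""
    return 1.0 if candidate_answer.strip() == expected.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sample against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to the same prompt (hypothetical), expected answer "42".
samples = ["42", "41", "42 ", "forty-two"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

Because the baseline is the group's own mean reward, no separate value network is needed: correct samples get positive advantages, incorrect ones negative, and a group where every sample scores the same contributes no signal.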

The relative weighting of these stages, the datasets used, and the precise RL recipe for V4 are the kinds of details that are either in the technical report or not officially disclosed — this deep-dive deliberately avoids inventing specifics. What is safe to say architecturally is that V4’s quality is a product of both the efficient base and this multi-stage reasoning-focused post-training, not of scale alone.

Capabilities and Benchmarks

V4’s headline capability claim is in agentic coding. DeepSeek-V4-Pro (in its strongest configuration, reported as V4-Pro-Max) reportedly scores about 80.6% on SWE-bench Verified, a result described at release as the highest for any open-weights model and roughly tied with Gemini 3.1 Pro at the time. That is a genuinely notable claim, because SWE-bench Verified measures whether a model can resolve real GitHub issues end to end (understand the bug, edit the right files, produce a patch that passes the repository’s tests), which is far closer to real engineering work than a multiple-choice benchmark.

A few disciplined caveats are essential here, and they apply to every number in this section:

These are reported figures, not independently verified fact. An ~80.6% SWE-bench Verified score is a claim attached to a specific evaluation harness and configuration. Small changes in the agent scaffold, the retry budget, or the test-time compute can move SWE-bench numbers by several points, so cross-model comparisons are only meaningful when the harness is held constant. Read “80.6%” as “reported under DeepSeek’s evaluation setup,” not as a physical constant.
Benchmark contamination is a real risk. SWE-bench Verified issues come from public GitHub repositories, and any model trained on large web crawls may have seen related code. The “Verified” subset was curated partly to reduce ambiguity, but no public coding benchmark is fully immune to leakage. A high score is evidence of capability, not proof of it.
A single benchmark is a narrow lens. SWE-bench rewards a particular style of contained, test-backed bug-fixing. It says less about greenfield architecture, ambiguous requirements, or multi-service debugging where there is no tidy test to pass.

Beyond coding, V4’s design points to two other strengths. Its reasoning should benefit directly from the R-series distillation — math, logic, and multi-step problem-solving are exactly what GRPO/RLVR post-training targets, and the family’s R-lineage is its clearest differentiator versus pure pretraining-scaled peers. Its long-context capability is the marquee architectural feature: a 1M-token window plus the CSA/HCA efficiency means V4 can plausibly serve use cases — whole-repository code understanding, long document synthesis, extended agent runs — that shorter-context models simply cannot attempt in one shot.

The honest framing for a technical reader is this: V4 is, by the reported numbers, a top-tier open-weights model for coding and reasoning, and its long-context efficiency is a real architectural advance. But treat the leaderboard position as a snapshot from a fast-moving field measured under the vendor’s own harness, and validate on your workload before betting on it. The gap between a benchmark score and production reliability is where most model migrations go wrong.

Access and Deployment

V4 is released as open weights on Hugging Face, which is the strategic core of DeepSeek’s whole approach — the model is downloadable, inspectable, and self-hostable rather than locked behind an API. That gives teams two deployment paths, and the right choice depends almost entirely on scale, latency needs, and how much infrastructure pain you are willing to absorb.

The API path. DeepSeek offers a hosted API, and it is the fastest way to production: no GPUs to procure, no serving stack to tune, and pricing that has consistently undercut Western frontier APIs. Pricing for V4 has been widely reported around ~$0.30 per million input tokens and sub-$1 per million output tokens, though exact figures vary by source and tier, so treat those as reported rather than a fixed rate card. For most teams that just want strong coding and reasoning at low cost, the API is the pragmatic answer.
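For budgeting, the reported prices translate into a simple estimator. The input price below is the widely reported figure; the output price of $1.00 is an assumed upper bound taken from "sub-$1", and the workload numbers are hypothetical. Confirm both prices against the official rate card before relying on the result.

```python
def api_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD given token counts and per-million-token prices."""
    return (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )

INPUT_PRICE_PER_M = 0.30   # reported; varies by source and tier
OUTPUT_PRICE_PER_M = 1.00  # ASSUMPTION: upper bound implied by "sub-$1"

# Hypothetical workload: 50,000 requests/day, 20k input and 1k output tokens each.
requests_per_day = 50_000
daily_cost = api_cost_usd(
    input_tokens=requests_per_day * 20_000,
    output_tokens=requests_per_day * 1_000,
    input_price_per_m=INPUT_PRICE_PER_M,
    output_price_per_m=OUTPUT_PRICE_PER_M,
)
print(f"estimated daily cost: ${daily_cost:,.2f}")
```

Under these assumed numbers the workload costs roughly $350 a day, with input tokens dominating the bill; long-context workloads shift the balance even further toward the input price.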

The self-host reality. Running a 1.6T-parameter MoE yourself is a serious infrastructure exercise, and the total-versus-activated distinction cuts both ways. On one hand, only ~49B parameters activate per token, so the compute per forward pass is modest for a model this capable. On the other hand, you still have to store all 1.6T parameters somewhere fast enough to route to, because you never know in advance which experts a token will need. That storage is the binding constraint:

Quantization is mandatory at this scale. At full BF16 (2 bytes per parameter), 1.6T parameters come to roughly 3.2 TB for the weights alone, far beyond any single node. FP8 roughly halves that, and aggressive INT4 quantization of the expert weights can bring it down further, at some quality cost. V4-Flash (284B total) is dramatically more approachable and is the variant most self-hosters will actually run.
Expert offload lets you keep hot/shared experts in GPU memory while parking cold expert weights in CPU RAM or on NVMe, streaming them in as routing demands. This trades latency for capacity and is how people fit very large MoE models on limited GPU counts.
Multi-GPU sharding combines tensor parallelism (splitting individual layers across GPUs) with expert parallelism (placing different experts on different GPUs). Serving stacks like vLLM and SGLang have added first-class support for large MoE models, including the KV-cache and attention optimisations that a CSA/HCA model needs to hit its efficiency potential.
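The storage arithmetic behind these three points can be sketched as follows. The parameter counts are the reported totals; the 80 GB per-GPU figure is an assumption for illustration, and the estimate covers weights only, ignoring KV cache, activations, and quantization metadata.

```python
import math

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_footprint_tb(total_params: float, precision: str) -> float:
    """Weights-only footprint in decimal terabytes."""
    return total_params * BYTES_PER_PARAM[precision] / 1e12

def min_gpus_for_weights(total_params: float, precision: str, gpu_mem_gb: float = 80.0) -> int:
    """Lower bound on GPU count needed to hold the weights alone."""
    return math.ceil(total_params * BYTES_PER_PARAM[precision] / (gpu_mem_gb * 1e9))

MODELS = {"V4-Pro": 1.6e12, "V4-Flash": 284e9}  # reported total parameters

for name, params in MODELS.items():
    for precision in BYTES_PER_PARAM:
        tb = weight_footprint_tb(params, precision)
        gpus = min_gpus_for_weights(params, precision)
        print(f"{name:9s} {precision:5s} ~{tb:.2f} TB, at least {gpus} x 80 GB GPUs")
```

These are floors, not deployment plans: a real cluster needs headroom for the KV cache and activations, and expert offload changes the picture by moving cold experts out of GPU memory entirely.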

Latency and throughput. The CSA/HCA compression is what makes long-context serving economically viable — a reportedly ~10x smaller KV cache means far more concurrent long-context requests per GPU, which is throughput you can actually bill for. For short prompts the efficiency story is less dramatic; the architecture’s advantage grows with context length. If you are weighing self-host against API purely on cost, remember that the API price already reflects DeepSeek’s own hyper-optimised serving, so self-hosting typically pays off only at high, steady volume or when data residency and control are non-negotiable. For the broader decision of when long context beats retrieval or fine-tuning, see our comparison of fine-tuning vs RAG vs long-context.

Limitations, Safety, and Failure Modes

No amount of architectural cleverness makes V4 a finished, drop-in oracle, and a responsible deep-dive has to be specific about where it breaks.

Long-context degradation. The 1M-token window is a capacity ceiling, not uniform competence. Like every long-context model, V4 tends to reason most reliably about information near the start and end of the context and can lose sharpness in the vast middle — the classic “lost in the middle” effect. The CSA/HCA compression that makes 1M tokens affordable is, by construction, throwing away detail; heavily compressed attention over distant tokens is a coarse summary, not perfect recall. In practice this means you should still chunk, rank, and place the most important material deliberately rather than dumping a million tokens and trusting the model to find the needle.
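One simple way to "place the most important material deliberately" is to order retrieved chunks so the highest-ranked ones sit at the edges of the prompt and the weakest in the middle. The sketch below assumes each chunk already carries a relevance score from some retriever or reranker (hypothetical here); the alternating placement is a common heuristic, not something DeepSeek prescribes.

```python
def place_for_long_context(chunks_with_scores):
    """Order chunks so the highest-scoring ones land at the start and end of the prompt."""
    ranked = sorted(chunks_with_scores, key=lambda cs: cs[1], reverse=True)
    front, back = [], []
    for i, (chunk, _score) in enumerate(ranked):
        # Alternate: best chunk to the front, second best to the back, and so on.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Hypothetical chunks with relevance scores.
scored = [("A", 0.9), ("B", 0.8), ("C", 0.5), ("D", 0.2), ("E", 0.1)]
ordered = place_for_long_context(scored)
print(ordered)  # ['A', 'C', 'E', 'D', 'B']
```

The two most relevant chunks end up first and last, where long-context models tend to be most reliable, and the least relevant ones absorb the middle positions. Dropping low-scoring chunks altogether is often better still than including them.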

Hallucination. V4 is a probabilistic language model and will state false things fluently, especially for obscure facts, precise figures, and anything outside its training window. The reasoning post-training helps it catch some of its own errors, but it does not eliminate confident fabrication. Anything factual that matters must be grounded (retrieval, tools, verification) rather than trusted on the model’s word.

Censorship and alignment. As a model from a Chinese lab, V4’s alignment reflects that regulatory and cultural context, and it is well documented across the DeepSeek family that politically sensitive topics — particularly those sensitive to the Chinese state — are refused or steered. Because the weights are open, self-hosters can fine-tune behaviour, but the base alignment ships with these characteristics, and teams building user-facing products should test refusal behaviour against their own content policy rather than assume neutrality.

Self-host difficulty. As covered above, the flagship V4-Pro is genuinely hard to operate: multi-node, quantization-dependent, and sensitive to serving-stack maturity. The gap between “downloadable” and “reliably serving production traffic at low latency” is large for a 1.6T MoE, and underestimating it is the most common way these deployments fail.

How It Compares

There is no single “best” model in 2026 — only best-fit for a given constraint. The matrix below is a directional guide, not a scorecard; treat all standings as reported and validate on your own workload.

Use case	DeepSeek V4	Gemini 3.x Pro	Llama 4	Qwen 3.x
Agentic coding	Top-tier open-weights; ~80.6% SWE-bench Verified reported	Very strong; roughly tied at the top	Solid, generally a step behind the leaders	Strong and improving fast
Long-context (1M)	Native 1M with CSA/HCA efficiency; strong cost profile	Long context, closed weights	Shorter effective context in practice	Competitive context, varies by size
Self-host cost/control	Open weights; huge but self-hostable (Flash is practical)	Closed — API only	Open weights; easier to run at smaller sizes	Open weights; wide size range, easy to self-host
Reasoning	Strong via R-series distillation (GRPO/RLVR)	Strong, closed reasoning stack	Good, improving	Strong on math/reasoning tracks

The quick read: if you need open weights plus frontier coding plus genuine 1M-token efficiency, V4 is arguably the standout, with V4-Flash as the practical self-host choice. If you want a turnkey closed API and don’t need the weights, Gemini 3.x Pro is the natural comparison. If your priority is easy, cheap self-hosting at smaller scale, Llama 4 or a mid-sized Qwen 3.x model will be far less painful to operate than a 1.6T MoE. Match the model to the binding constraint, not to the leaderboard.

Frequently Asked Questions

What is DeepSeek V4 and who made it?

DeepSeek V4 is an open-weights large language model family released by the Chinese AI lab DeepSeek on April 24, 2026. It ships in two variants: V4-Pro (reported ~1.6T total parameters, ~49B activated) and V4-Flash (~284B total, ~13B activated). Both use a Mixture-of-Experts design with hybrid sparse attention, support a 1M-token context window and up to 384K output tokens, and are downloadable from Hugging Face for self-hosting or usable through DeepSeek’s hosted API.

What makes the DeepSeek V4 architecture efficient?

Two things. First, its Mixture-of-Experts feed-forward network activates only a small fraction of the total parameters per token — roughly 49B of 1.6T for V4-Pro — so quality tracks the large total while cost tracks the small active set. Second, its hybrid attention (Compressed Sparse Attention plus Heavily Compressed Attention) shrinks the compute and KV-cache cost of long context. DeepSeek reports that at 1M tokens, V4-Pro needs only ~27% of the per-token FLOPs and ~10% of the KV cache versus V3.2.

How good is DeepSeek V4 at coding?

By reported figures, very good. DeepSeek-V4-Pro (Max configuration) reportedly scores about 80.6% on SWE-bench Verified, described at release as the highest among open-weights models and roughly tied with Gemini 3.1 Pro. SWE-bench Verified tests real GitHub issue resolution, so it is closer to genuine engineering than multiple-choice benchmarks. Treat the number as a vendor-reported result under a specific harness, and validate on your own repositories before relying on it, since scaffold and contamination effects can shift results.

How much does DeepSeek V4 cost?

DeepSeek’s API pricing for V4 has been widely reported at around $0.30 per million input tokens and under $1 per million output tokens, though exact figures vary by source and tier and should be confirmed against the official rate card. That is substantially cheaper than most Western frontier APIs. Self-hosting has no per-token fee but requires significant GPU infrastructure, especially for the 1.6T-parameter V4-Pro.

Can I run DeepSeek V4 on my own hardware?

Yes, because the weights are open, but the flagship V4-Pro is demanding. At 1.6T parameters you must quantize (FP8 or INT4), likely use expert offload to CPU/NVMe, and shard across multiple GPUs with a serving stack like vLLM or SGLang. The smaller V4-Flash (284B total, ~13B activated) is far more approachable and is what most self-hosters will actually run. For turnkey use with no infrastructure, the hosted API is the pragmatic path.

What are DeepSeek V4’s main limitations?

Despite the 1M-token window, V4 loses sharpness on information buried in the middle of very long contexts, since its compressed attention discards detail by design. Like all LLMs it can hallucinate confidently, so factual outputs need grounding. As a model from a Chinese lab, its alignment refuses or steers politically sensitive topics, which product teams should test against their own policy. And self-hosting the full model is genuinely hard.
