GLM-5.2 Benchmark: The New Open-Weight Leader (2026)

On 16 June 2026, an open-weight model under a no-strings MIT license started beating a closed frontier model from OpenAI on long-horizon coding tasks — at roughly one-sixth the price per token. That single sentence reorders how a lot of engineering teams should think about their model spend for the rest of the year. The GLM-5.2 benchmark results from Z.ai, paired with downloadable weights on Hugging Face, mean the question is no longer “can open weights compete?” but “for which workloads should you stop renting a closed API?” This piece is a positioning analysis, not a press release: we separate what is genuinely strong, what is contamination-sensitive, and what the 753B mixture-of-experts design actually costs to serve.

What this covers: the MoE architecture and why 40B active parameters drive inference economics, where GLM-5.2 sits versus GPT-5.5, Claude Opus 4.8, MiniMax M3 and DeepSeek V4.1, a methodology-caveats section, self-host economics, and a concrete adopt-or-not decision framework.

Context and Background

The first two weeks of June 2026 produced something close to a dozen frontier or near-frontier open-weight releases. MiniMax M3 took the top open-weight slot on SWE-Bench Pro at roughly 59%. DeepSeek V4.1 and Qwen 3.7 anchored the cost-performance frontier. Kimi K2.6 kept pushing reasoning quality. Into that crowded fortnight, Z.ai shipped GLM-5.2 and, per Artificial Analysis, it became the new leading open-weight model on the Intelligence Index — scoring 51, ahead of MiniMax M3 and DeepSeek V4 Pro at 44 and Kimi K2.6 at 43.

That ranking matters because the gap between open and closed has been the central story of the year. For most of 2024 and 2025, the best open weights trailed the closed frontier by a generation. The flood of June 2026 releases compressed that gap, and GLM-5.2 is the sharpest data point yet that it is nearly closed for specific task families — particularly agentic coding and terminal execution. The strategic implication is large: when an open model under a permissive license matches a closed one on your highest-value workload, the pricing power of closed-API vendors erodes for that workload, and teams gain a credible fallback that did not exist a year ago.

GLM-5.2 is text-only, ships a one-million-token context window (up from 200K in GLM-5.1), and is released under MIT. The license is not a footnote. A permissive license lets enterprises download, fine-tune, and self-host with only compute and electricity as the marginal cost — no usage caps, no data leaving the perimeter, no per-call metering. For a deeper read on this trend, see our analysis of the June 2026 open-weight model flood, which frames GLM-5.2 as one node in a much larger shift.

The GLM-5.2 Architecture and What 40B Active Parameters Mean

GLM-5.2 is a mixture-of-experts (MoE) transformer: approximately 753 billion total parameters, but only about 40 billion of them activate for any single token. That ratio — total to active — is the most important number for understanding both the GLM-5.2 benchmark scores and the bill you will pay to run it.

In short: a mixture-of-experts model stores hundreds of specialized feed-forward “experts” but routes each token through only a small subset. GLM-5.2 keeps 753B parameters resident in memory while computing as if it were a roughly 40B dense model. You pay a large memory cost but a small compute cost per token, which is why MoE models can be both very capable and relatively cheap to serve.

Figure 1: Simplified GLM-5.2 MoE transformer block. Attention is shared and dense; a router selects the top-k experts per token, and only those experts contribute to the output.

In each transformer block, attention runs densely across all tokens, but the feed-forward layer is replaced by a router plus a bank of experts. The router scores the experts for the current token and dispatches it to the top few. The combine step weights only those active experts. Multiply that by depth, and a token touches only its share of the 753B — the 40B figure — even though every parameter must be loaded somewhere to be eligible for routing.
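The route-then-combine step described above can be sketched in a few lines of plain Python. This is a toy illustration, not Z.ai's implementation: the four-expert bank, the `route_token` and `moe_layer` names, the choice of k=2, and the renormalization of gate weights over the selected experts are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    gate_sum = sum(probs[i] for i in top)
    return {i: probs[i] / gate_sum for i in top}

def moe_layer(token, experts, router_scores, k=2):
    """Combine the outputs of only the selected experts, weighted by their gates."""
    gates = route_token(router_scores, k)
    output = [0.0] * len(token)
    for idx, gate in gates.items():
        expert_out = experts[idx](token)  # experts that were not selected never run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, sorted(gates)

# Toy bank of 4 "experts": each one just scales the token by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
output, active = moe_layer([1.0, -1.0], experts, router_scores=[0.1, 2.0, 0.3, 2.0], k=2)
print(active)  # indices of the experts that actually ran
print(output)
```

All four experts have to exist in the `experts` list for the router to choose among them, but only two are ever called for this token. That is the same memory-versus-compute split the rest of this section turns on.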

Why the ratio drives inference economics

Throughput and latency track the active parameter count, not the total. A single-token forward pass through GLM-5.2 does arithmetic closer to a 40B dense model than a 753B one. That is the mechanism behind the headline pricing: Z.ai’s hosted API lists around $1.40 per million input tokens and $4.40 per million output, a combined $5.80, versus GPT-5.5’s $5.00 input and $30.00 output, a combined $35.00. That is roughly one-sixth the per-token cost for comparable or better coding output, as multiple write-ups have noted.

Why memory still bills like a giant

The catch: all 753B parameters must be resident to be routable. You cannot stream only 40B from disk per token at any reasonable latency. So memory cost scales with the total parameter count while compute scales with the active count — a genuinely asymmetric profile.

Figure 2: The asymmetry. Total parameters set the memory footprint and serving hardware; active parameters set throughput, latency, and per-token cost.
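To make the asymmetry concrete, here is a small sketch that compares three model shapes. The parameter counts for GLM-5.2 are the approximate figures quoted above; the two dense models are hypothetical reference points, and the "2 FLOPs per active weight" figure is a common rule of thumb that ignores attention cost, not a measured number.

```python
from dataclasses import dataclass

@dataclass
class ModelShape:
    name: str
    total_params: float   # every one of these must be resident in memory
    active_params: float  # only these do arithmetic for a given token

    def forward_flops_per_token(self):
        # Rule of thumb: about 2 FLOPs (one multiply, one add) per active weight.
        return 2 * self.active_params

shapes = [
    ModelShape("dense 40B (hypothetical)", 40e9, 40e9),
    ModelShape("GLM-5.2 MoE (approx.)", 753e9, 40e9),
    ModelShape("dense 753B (hypothetical)", 753e9, 753e9),
]

for s in shapes:
    print(f"{s.name:28s} resident={s.total_params / 1e9:6.0f}B  "
          f"GFLOPs/token={s.forward_flops_per_token() / 1e9:7.0f}")

moe = shapes[1]
memory_multiple = moe.total_params / moe.active_params
print(f"resident parameters per active parameter: {memory_multiple:.1f}x")
```

Under these assumptions the MoE row shares its memory column with the dense 753B model and its compute column with the dense 40B model.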

For self-hosters, that asymmetry decides everything. The economics flip from “rent intelligence by the call” to “buy and amortize a serving cluster,” and the break-even depends almost entirely on sustained token volume. If your monthly token throughput is low, the hosted API wins; if it is high and steady, owning the weights starts to dominate. For the general MoE design pattern behind this, our mixture-of-experts LLM architecture explainer walks through routing, load balancing, and the failure modes in detail.
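A rough way to locate that break-even is to divide a fixed monthly cluster bill by the API price you would otherwise pay. The sketch below uses the list prices quoted earlier; the $60,000 monthly cluster cost and the 30% output share are placeholders invented for illustration, and the model assumes the cluster bill is flat and the cluster can actually serve the volume.

```python
def blended_price(input_price, output_price, output_share):
    """Average price per million tokens for a given input/output mix."""
    return (1 - output_share) * input_price + output_share * output_price

def break_even_millions(monthly_cluster_cost, api_price_per_million):
    """Monthly volume (millions of tokens) where a fixed cluster bill equals the API bill."""
    return monthly_cluster_cost / api_price_per_million

# List prices quoted in the article (USD per million tokens).
# output_share=0.3 is an ASSUMPTION about the traffic mix, not a measured value.
GLM_API = blended_price(1.40, 4.40, output_share=0.3)
GPT_API = blended_price(5.00, 30.00, output_share=0.3)

CLUSTER_COST = 60_000  # HYPOTHETICAL monthly cost of a self-hosted cluster, in USD

vs_glm_api = break_even_millions(CLUSTER_COST, GLM_API)
vs_gpt_api = break_even_millions(CLUSTER_COST, GPT_API)
print(f"break-even vs Z.ai hosted API: {vs_glm_api:,.0f}M tokens/month")
print(f"break-even vs GPT-5.5 API:     {vs_gpt_api:,.0f}M tokens/month")
```

Swap in your own cluster quote and traffic mix. The useful output is the order of magnitude of monthly volume at which ownership starts to pay, not the exact figure.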

Sizing the VRAM bill in concrete terms

It helps to put rough numbers on what “memory bills like a giant” means. At BF16, 753B parameters need two bytes each, on the order of 1.5 terabytes of weights alone, before the KV cache. Quantizing to FP8 halves that to roughly 750GB, and INT4 schemes can push it lower again at some quality cost. Either way you are firmly in multi-GPU territory: a single 80GB accelerator does not come close, and a single node of eight 80GB GPUs (640GB in total) cannot hold even the FP8 weights, let alone the KV cache for a one-million-token context. That long context is a second, often-overlooked memory consumer: attention state grows with sequence length, and a model advertised at 1M tokens can spend as much memory on cache as a smaller model spends on weights.
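The sizing arithmetic above is easy to reproduce. The sketch below is a back-of-envelope estimate only: the 90% usable-memory fraction is an assumption, the INT4 figure ignores quantization scale overhead, and the layer count, KV-head count, and head dimension passed to `kv_cache_gb` are hypothetical placeholders, since the article does not give GLM-5.2's attention configuration (and the model may use a more compact attention scheme than this generic formula assumes).

```python
import math

TOTAL_PARAMS = 753e9  # approximate total parameter count quoted in the article

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_gb(params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9

def min_gpus(weights_gb, gpu_gb=80, usable_fraction=0.9):
    """GPUs needed to hold the weights; usable_fraction is an assumed reserve."""
    return math.ceil(weights_gb / (gpu_gb * usable_fraction))

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Generic per-sequence KV cache: keys plus values for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 1e9

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_gb(TOTAL_PARAMS, nbytes)
    print(f"{precision}: {gb:7.1f} GB of weights, at least {min_gpus(gb)} x 80GB GPUs")

# HYPOTHETICAL attention shape, used only to show how the cache scales with context.
cache = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"KV cache for one 1M-token sequence (hypothetical shape): {cache:.1f} GB")
```

The GPU counts cover weights only. Concurrent requests and long contexts add cache on top, which is why the headroom has to be planned rather than assumed.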

The practical upshot is that GLM-5.2 self-hosting is a systems project, not a download. You need tensor and expert parallelism across GPUs, an MoE-aware scheduler that keeps experts balanced so no single GPU becomes a hot spot, and enough headroom for concurrent requests. The reward for that effort is throughput: because only ~40B parameters do arithmetic per token, a well-tuned cluster can sustain high tokens-per-second across many parallel sessions, which is exactly the profile autonomous coding agents generate.

Why the MIT license is the real story

It is easy to focus on the scores and miss that the license is what makes them consequential. “Open weight” spans a wide spectrum, from research-only and non-commercial clauses to bespoke vendor licenses with revenue thresholds and acceptable-use restrictions. MIT sits at the permissive extreme: no field-of-use limits, no commercial gate, no obligation to publish derivatives. You may fine-tune GLM-5.2 on proprietary data, ship the result inside a closed product, and never expose either the data or the adapted weights.

For regulated industries — finance, healthcare, defense, anything with data-residency or sovereignty requirements — that property frequently outranks raw capability. A closed API that is two points better on a benchmark is still a non-starter if customer data cannot leave the building. GLM-5.2 lets such teams keep a frontier-adjacent model entirely inside their perimeter, audit it, pin a version forever, and avoid the operational risk of a vendor deprecating or silently changing a model under their workloads. The MIT terms also enable a derivatives ecosystem: distilled variants, domain fine-tunes, and quantized community builds can appear without legal friction, which historically accelerates a model’s real-world footprint far beyond its launch-day benchmark.

Benchmark Positioning: GLM-5.2 vs GPT-5.5, Opus 4.8, and Open Peers

Where GLM-5.2 actually lands depends entirely on the axis you measure. On raw aggregate intelligence it trails the closed frontier; on open-weight rankings it leads; on long-horizon coding it pulls ahead of GPT-5.5; on cost it is not close. Treat every number below as reported and attributed — most originate from Z.ai’s own model card or third-party index runs, and benchmark figures should be read as approximate.

Figure 3: The same model lands in different places depending on the axis. GLM-5.2 is frontier-adjacent overall and strongest at agentic coding.

The single clearest framing: GLM-5.2 is the leading open-weight model on the Artificial Analysis Intelligence Index, while still sitting below GPT-5.5 and Claude Opus 4.8 on the overall index. The interesting reversal happens on coding-specific and long-horizon agentic benchmarks, where it reportedly edges past GPT-5.5.

Dimension	GLM-5.2 (Z.ai)	GPT-5.5 (closed)	Claude Opus 4.8 (closed)	MiniMax M3 (open)	DeepSeek V4.1 / Qwen 3.7 (open)
License	MIT, weights public	Proprietary API	Proprietary API	Open weight	Open weight
Architecture	MoE, approx 753B total / approx 40B active	Undisclosed	Undisclosed	MoE, open	MoE, open
Intelligence Index (AA, reported)	51 — top open weight	Above GLM-5.2	Above GLM-5.2	approx 44	approx 44 (V4 Pro)
SWE-Bench Pro (reported)	approx 62.1 (Z.ai run)	approx 58.6	Near top	approx 59.0 — prior open lead	On the frontier
FrontierSWE (reported)	approx 74.4 (Z.ai run)	approx 72.6	approx 75.1	Not stated	Not stated
Terminal-Bench (reported)	approx 81.0 — first open model past 80	Not stated	Not stated	Below	Below
Approx cost per 1M tokens	approx $5.80 blended	approx $35 blended	Premium	Low	Lowest tier
Best fit	Agentic coding, self-host	General frontier	Hardest reasoning	SWE tasks, cost	Cost-performance

A few reads on this GLM-5.2 benchmark table. First, the coding lead over GPT-5.5 is real but narrow: on FrontierSWE (multi-hour autonomous engineering runs) GLM-5.2 is reported at approximately 74.4 against Opus 4.8’s 75.1 and GPT-5.5’s 72.6. That is effectively a tie with the best closed model and ahead of GPT-5.5, an outcome that would have read as implausible six months ago. Second, on Terminal-Bench it reportedly became the first open model to cross 80%, scoring around 81.0, a meaningful signal for agents that drive a shell.

Third, the cost column is the one most teams will act on. A blended ~$5.80 per million tokens against ~$35 for GPT-5.5 is not a tweak; it is a different budget category. If your workload is dominated by autonomous coding agents that burn large output volumes, the cost-per-completed-task delta compounds fast. For the broader cost-versus-latency tradeoff across model sizes, see our small versus large LLM agentic benchmark.

One honest asterisk on the wins: GLM-5.2 spends a lot of tokens to get there. Artificial Analysis noted it uses roughly 43K output tokens per Intelligence Index task, of which ~37K is reasoning — up from ~26K in GLM-5.1 and well above MiniMax M3 (~24K) and Kimi K2.6 (~35K). Cheaper per token does not automatically mean cheaper per task if the model thinks longer. We return to this in the trade-offs section.

Reading the open-weight field, not just the leader

It is tempting to collapse the June 2026 releases into a single ranking, but the open-weight frontier is multi-dimensional and each model occupies a slightly different niche. MiniMax M3 built its reputation on SWE-Bench Pro and remains a strong, token-efficient coding choice — relevant because efficiency, not just ceiling, drives production cost. DeepSeek V4.1 and Qwen 3.7 sit on the cost-performance frontier: they may not top any single leaderboard, but they deliver a large fraction of frontier capability at the lowest serving cost, which is the right pick when volume dwarfs difficulty. Kimi K2.6 pushes reasoning quality with a more moderate token budget than GLM-5.2.

GLM-5.2’s distinctive claim is the combination: it leads the open-weight Intelligence Index and posts the strongest agentic-coding numbers of the group, under the most permissive license. That bundle — top open intelligence, best-in-class coding, MIT terms — is what makes it the headline release rather than just another strong model. But “headline” is not “default.” A team running millions of low-difficulty completions might rationally choose a cheaper, more token-frugal peer and never touch GLM-5.2. The right framing is portfolio, not podium: GLM-5.2 is the open model to reach for when the task is hard, agentic, and worth a long reasoning budget.

Translating positioning into a cost-per-task model

The benchmark that should actually move budget is not on any leaderboard: it is the cost-per-completed-task on your own work, and it combines three numbers the public scores rarely surface together. The first is the per-token price, where GLM-5.2’s ~$5.80 blended rate gives it a roughly 6x edge over GPT-5.5. The second is tokens consumed per task, where GLM-5.2’s ~43K (mostly reasoning) works against it. The third is the success rate, which determines how many retries each completed task really costs.

Work an illustrative example. Suppose a coding task takes GPT-5.5 about 15K output tokens at its ~$35 blended rate and GLM-5.2 about 43K at ~$5.80. The GPT-5.5 task costs on the order of half a dollar (about $0.53); the GLM-5.2 task, despite nearly tripling the token count, lands at about $0.25, a little under half of that, because the per-token gap is wider than the token-volume gap. Now flip the success rate: if GLM-5.2 needed two attempts where GPT-5.5 needed one, its effective cost doubles to about $0.50 and the two land roughly level. This is illustrative arithmetic, not a measured result; the point is the structure. The per-token advantage is large enough that GLM-5.2 can still win on cost-per-task after paying the long-reasoning tax, but the margin is far thinner than the 6x list-price gap suggests and a single extra retry can erase it, so the only way to know for your workload is to instrument both models and measure. Teams that skip that step and reason from list price alone routinely guess wrong in both directions.
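The three-number structure (price, tokens per task, success rate) fits in one small function. This is a minimal sketch using the illustrative figures from the example above; the function names are ours, and dividing by the success rate assumes independent retries that continue until one attempt passes, which real agent runs only approximate.

```python
def cost_per_attempt(tokens, price_per_million):
    """Dollar cost of a single attempt at the given per-million-token rate."""
    return tokens / 1e6 * price_per_million

def cost_per_completed_task(tokens, price_per_million, success_rate):
    """Expected spend per success, assuming independent retries until one passes."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt(tokens, price_per_million) / success_rate

# Illustrative inputs from the worked example, not measured results.
gpt = cost_per_completed_task(15_000, 35.00, success_rate=1.0)
glm = cost_per_completed_task(43_000, 5.80, success_rate=1.0)
glm_retry = cost_per_completed_task(43_000, 5.80, success_rate=0.5)

print(f"GPT-5.5, one attempt:   ${gpt:.3f}")
print(f"GLM-5.2, one attempt:   ${glm:.3f}")
print(f"GLM-5.2, 50% pass rate: ${glm_retry:.3f}")
```

Replace the token counts and success rates with numbers logged from your own harness; the list prices are the only inputs you can take from a pricing page.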

Methodology Caveats and the Benchmark-Contamination Problem

Before anyone reorganizes a model budget around a single launch, the methodology behind every GLM-5.2 benchmark figure deserves scrutiny — and a benchmark post that skips this is marketing, not analysis.

Start with provenance. Several of the strongest GLM-5.2 numbers — SWE-Bench Pro at ~62.1, FrontierSWE at ~74.4 — come from Z.ai’s own evaluation runs reported on the model card. Vendor-run scores are not fraudulent, but they are not independent. Self-reported coding benchmarks have a long history of optimistic harness configuration, generous retry budgets, and scaffolding that a neutral runner would not grant. The independent SWE-Bench leaderboard and third-party reproductions are the figures to trust most; treat vendor runs as upper bounds until reproduced.

Then there is contamination. SWE-Bench and its variants draw from public GitHub issues. Any model trained on a large web crawl through 2026 has plausibly seen some of those issues, their discussion threads, or the merged fixes. The risk is not always overt memorization — it can be subtle distributional leakage that inflates scores without improving genuine capability. This is why “leads on benchmark X” and “is better at your actual codebase” are different claims that should never be conflated.

A third caveat is the eval-vs-reality gap. Benchmarks reward terminating with a passing test on a curated task. Production agentic work involves ambiguous requirements, missing context, flaky CI, and the need to know when to stop and ask. A model that tops Terminal-Bench can still flail on your monorepo’s undocumented build system. The honest position: benchmarks rank capability ceilings under ideal conditions; they do not predict your throughput on messy real work. For methodology rigor in industrial settings, our reasoning model industrial benchmark details how to construct evaluations that resist these failure modes.

A fourth, subtler issue is harness sensitivity. The same model can swing many points on the same benchmark depending on the agent scaffold around it — how many retries it gets, whether it can run tests iteratively, how the prompt frames the task, and what tools the harness exposes. A reported FrontierSWE score is really a “model plus harness” score, and two labs rarely use identical harnesses. When you see GLM-5.2 at ~74.4 and Opus 4.8 at ~75.1, part of that near-tie may be harness configuration rather than pure model quality. The only way to control for it is to evaluate competing models inside your harness, on your tasks, with the budgets you would actually grant in production.
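A minimal sketch of what "same harness, same budget" means in practice follows. Everything in it is a stand-in: the two stub models, the toy tasks, and the `pass_rate` and `compare` helpers are hypothetical, and a real harness would call actual model endpoints and run actual test suites. The point is only that the retry budget is a property of the harness, and changing it can change the ranking.

```python
from typing import Any, Callable, Dict, List

def pass_rate(model: Callable[[str, int], str],
              tasks: List[Dict[str, Any]],
              max_attempts: int) -> float:
    """Fraction of tasks solved within the attempt budget the harness grants."""
    solved = 0
    for task in tasks:
        for attempt in range(max_attempts):
            if task["check"](model(task["prompt"], attempt)):
                solved += 1
                break
    return solved / len(tasks)

def compare(models, tasks, max_attempts):
    """Score every model on identical tasks with an identical budget."""
    return {name: pass_rate(m, tasks, max_attempts) for name, m in models.items()}

# Toy tasks and stub models (HYPOTHETICAL stand-ins for real endpoints).
ANSWERS = {"2+2": "4", "3*3": "9"}
tasks = [{"prompt": p, "check": (lambda out, want=a: out == want)}
         for p, a in ANSWERS.items()]

def one_shot_model(prompt, attempt):
    return ANSWERS[prompt]  # always right on the first try

def needs_retry_model(prompt, attempt):
    return ANSWERS[prompt] if attempt >= 1 else "?"  # right only from the second try

models = {"one_shot": one_shot_model, "needs_retry": needs_retry_model}
strict = compare(models, tasks, max_attempts=1)
generous = compare(models, tasks, max_attempts=2)
print("budget of 1 attempt: ", strict)
print("budget of 2 attempts:", generous)
```

With one attempt the stubs look far apart; with two they tie. A gap or a tie between two published scores can come from the same effect when the labs granted different budgets.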

Finally, weigh the token-efficiency axis explicitly when interpreting any cost-flavored benchmark. GLM-5.2’s long reasoning chains help it score, but a benchmark that measures only pass-rate hides the fact that a more frugal model might reach a slightly lower score for a fraction of the tokens. Capability-per-dollar and capability-per-second are different rankings than capability alone, and the leader can change depending on which one your business actually optimizes.
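The figures already quoted in this piece are enough to show the leader changing. The sketch below ranks three models by reported Intelligence Index score and then by index points per thousand output tokens. That second ratio is a crude illustrative metric of our own, not something Artificial Analysis publishes, and it is per token rather than per dollar because per-token prices for the peer models are not given here.

```python
# Reported Intelligence Index scores and approximate output tokens per task,
# as quoted earlier in this article.
models = {
    "GLM-5.2":    {"index": 51, "tokens_per_task": 43_000},
    "MiniMax M3": {"index": 44, "tokens_per_task": 24_000},
    "Kimi K2.6":  {"index": 43, "tokens_per_task": 35_000},
}

def points_per_1k_tokens(entry):
    """Illustrative efficiency ratio: index points per thousand output tokens."""
    return entry["index"] / (entry["tokens_per_task"] / 1000)

by_score = sorted(models, key=lambda n: models[n]["index"], reverse=True)
by_efficiency = sorted(models, key=lambda n: points_per_1k_tokens(models[n]), reverse=True)

print("ranked by score:     ", by_score)
print("ranked by efficiency:", by_efficiency)
for name in by_efficiency:
    print(f"  {name:11s} {points_per_1k_tokens(models[name]):.2f} points per 1K tokens")
```

On raw score GLM-5.2 is first; on this efficiency ratio it drops to last of the three. Which ordering matters depends on whether your bottleneck is task difficulty or token volume.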

Trade-offs, Gotchas, and What Goes Wrong

Every GLM-5.2 benchmark headli
