Qwen3.6 Explained: Hybrid MoE Architecture, 1M Context, and Benchmarks

Qwen3.6 Explained: Hybrid MoE Architecture, 1M Context, and Benchmarks

Qwen3.6 Explained: Hybrid MoE Architecture, 1M Context, and Benchmarks

Qwen3.6 explained in one sentence: it is Alibaba’s April 2026 model family that pairs a hybrid Gated DeltaNet plus Gated Attention backbone with sparse mixture-of-experts routing, ships a closed-weights flagship alongside genuinely permissive open-weight variants, and pushes native context toward a million tokens. Built by the Qwen team at Alibaba Group, Qwen3.6 is not a single checkpoint but a tier: a hosted flagship (marketed as Qwen3.6-Max-Preview and, in its long-context configuration, Qwen3.6-Plus) plus open-weight releases including the Qwen3.6-27B dense model and the Qwen3.6-35B-A3B mixture-of-experts model. Why it matters: the open variants reportedly match or beat models an order of magnitude larger on agentic coding, while the architecture quietly retires quadratic attention as the default operator. That combination — frontier-adjacent quality, small active parameter counts, and an Apache-style license on the open models — is what makes this release worth a careful read rather than a headline skim.

A caveat before we start: several details below come from model cards, vendor blogs, and third-party benchmark aggregators rather than a single peer-reviewed source. Where a number is a vendor claim or a community measurement, it is labelled reported. Where it is an order-of-magnitude inference rather than a published figure, it is labelled estimated. Treat any unlabelled precise number as one I could corroborate across at least two independent sources.

What this covers: the Qwen lineage and what changed versus Qwen3.5; the hybrid MoE architecture in detail; training and post-training; sourced benchmarks with contamination caveats; access, licensing, and self-hosting economics; honest limitations and safety notes; a decision matrix against Llama 4, DeepSeek, and the GPT-5.x / Claude tier; and an FAQ.

Lineage and Context

To understand Qwen3.6 you have to understand the slope it sits on. The Qwen series has iterated faster than almost any other open model family. Qwen2.5 trained on roughly 18 trillion tokens. Qwen3 roughly doubled that to a reported ~36 trillion tokens spanning 119 languages and dialects, and introduced a hybrid “thinking / non-thinking” mode controllable per request. Qwen3.5 was the inflection point architecturally: it unified the previously separate text (Qwen3) and vision (Qwen3-VL) backbones into a single vision-language model trained with early fusion of text and multimodal tokens, and — critically — it promoted Gated DeltaNet, a linear-attention variant, to the main attention operator for long contexts, replacing full quadratic attention in the majority of layers.

Qwen3.6 is best read as a consolidation release on top of that Qwen3.5 foundation rather than a from-scratch redesign. The Qwen team’s framing, per their blog and the model cards, is “stability and real-world utility”: a more responsive coding experience shaped by direct community feedback, stronger repository-level reasoning, better front-end workflows, and a new “thinking preservation” mechanism that retains reasoning context across turns of a conversation instead of discarding it after each answer. In other words, where Qwen3.5 proved the hybrid-attention-plus-MoE thesis could work, Qwen3.6 is the version aimed at people who actually ship.

Two structural shifts distinguish this generation from its predecessors. First, the bifurcation of the top tier: the Qwen3.6-Max line is closed-weights, marking Alibaba’s first proprietary flagship, while the rest of the family stays open. That is a strategic departure from the all-open posture that built Qwen’s developer mindshare, and it mirrors the playbook several labs have run — give away the strong mid-size models, monetize the frontier through an API. Second, the small-active-parameter MoE story matured: the 35B-A3B model activates only ~3B parameters per token, which is what lets a “35B” model run on hardware that would choke on a dense model of similar quality.

If you are coming to this cold, two sibling deep-dives on this site set useful context: our Llama 4 vs DeepSeek-V3 vs Claude Sonnet reasoning benchmark covers the competitive field Qwen3.6 lands in, and our Claude Opus 4.6 architecture deep-dive gives a comparison point for how a frontier closed model is built.

Architecture

Qwen3.6 is a vision-language transformer whose layers alternate, at a 4:1 ratio, between Gated DeltaNet (a linear-attention operator with a fixed-size recurrent state) and Gated Attention (full softmax attention), with every layer followed by a 256-expert mixture-of-experts feed-forward network. The open MoE variant carries 35B total parameters but activates only ~3B per token, and serves 262,144 tokens of context natively, extensible toward ~1M.

Qwen3.6 architecture
Figure 1: Qwen3.6 hybrid block. Three Gated DeltaNet layers and one Gated Attention layer form a repeating block; each layer feeds a 256-expert MoE FFN that routes every token to 8 experts plus 1 always-on shared expert.

Let me unpack the pieces, because the design choices here are deliberate and they interact.

The hybrid attention stack. The core idea is that full softmax attention is expensive — its compute and KV-cache memory grow quadratically with sequence length — but it is also unmatched at precise, long-range token-to-token recall. Linear attention variants like Gated DeltaNet are the opposite: they compress history into a fixed-size recurrent state, so cost grows linearly with sequence length and memory stays flat, but they lose some of the sharp associative recall that full attention provides. Qwen3.6’s answer is not to pick one but to interleave them. Per the 35B-A3B model card and community write-ups, the model stacks ten four-layer blocks; within each block there are three Gated DeltaNet layers and one Gated Attention layer, working out to 30 DeltaNet layers and 10 Gated Attention layers overall. The 4:1 ratio is the tuning knob: enough linear layers to keep long-context inference cheap, enough full-attention layers to preserve the precise recall that long-context tasks actually depend on.

The “Gated” in both names refers to learned gating that modulates how information flows into and out of each operator’s state — in DeltaNet’s case, controlling how the recurrent state is updated and decayed (the “delta rule”), which is what makes a linear-attention model competitive with softmax attention on associative-recall tasks where naive linear attention historically failed.

The mixture-of-experts feed-forward. Each attention layer is followed by an MoE FFN with a pool of 256 experts. A router scores the experts for each token and dispatches the token to its top 8, plus one shared expert that every token always passes through. The shared expert is a small but important detail: it gives the model a common pool of always-on knowledge so the routed experts can specialize without each one having to relearn the basics. With 8 routed experts active out of 256 (plus the shared one), the model touches roughly 3B of its 35B parameters per token — that is the “A3B” (active 3B) in the name. The economic consequence is large: you pay dense-quality results at sparse-activation compute cost.

Dense versus MoE per variant. Not every Qwen3.6 model is an MoE. The lineup, per the Hugging Face cards, breaks down roughly like this:

  • Qwen3.6-27B — a dense open-weight model, vision-capable, ~262K native context, reported to be released under a permissive Apache-style license. This is the “predictable, no-router-surprises” workhorse.
  • Qwen3.6-35B-A3B — the MoE open-weight model, 35B total / ~3B active, vision-capable, 256K native context extensible toward ~1M. This is the efficiency play.
  • Qwen3.6-Max-Preview / Qwen3.6-Plus — the hosted flagship tier. Sources describe the flagship MoE as a much larger sparse model; the Plus configuration is the one quoted with a 1M-token native context window and up to 65,536 output tokens, with always-on chain-of-thought. Note a real naming inconsistency in the wild: some sources also describe a “Max-Preview” with a 256K context window. I read “Max-Preview” and “Plus” as configurations or marketing labels of the same closed flagship line rather than two unrelated models — but treat that mapping as reported, not confirmed.

Attention variant, context, modalities, tokenizer. The attention variant is the hybrid stack described above; there is no single “attention type” answer for this model, which is the point. Context is 262,144 tokens native on the open variants, with RoPE-scaling-style extension validated toward ~1,010,000 tokens; the flagship Plus configuration advertises a 1M-token native window. Modalities are text and vision on the open variants — the model inherits the early-fusion vision-language backbone from Qwen3.5, with a SigLIP2-family vision encoder feeding image patches into the same embedding space as text tokens. The tokenizer is the Qwen BPE tokenizer with a vocabulary in the ~151K range (consistent across recent Qwen generations); if you want the mechanics of why vocabulary size and tokenizer choice matter for cost and multilingual coverage, see our LLM tokenization deep-dive.

Training

Qwen3.6’s training pipeline, as reported across the model cards and Qwen blog, runs from large-scale multimodal pretraining through long-context extension and a multi-stage post-training stack (SFT, reinforcement learning with verifiable rewards, preference optimization, and distillation into the smaller open variants). Exact token counts, compute budgets, and data mixtures for 3.6 specifically are not fully disclosed; the figures below are labelled accordingly.

Qwen3.6 training pipeline
Figure 2: Reported Qwen3.6 training pipeline — multimodal pretraining with early fusion, long-context extension, supervised fine-tuning, RL with verifiable rewards, preference alignment, then distillation into the open variants.

Pretraining data and scale. Qwen3 was trained on a reported ~36 trillion tokens; Qwen3.5 and 3.6 moved to early-fusion multimodal pretraining, meaning text and image tokens are mixed into the same pretraining stream rather than bolted on afterward via a separate vision adapter. The Qwen team describes “trillions of multimodal tokens” for the fused stage. A precise token count for Qwen3.6 specifically is undisclosed; assuming it is at least in the same tens-of-trillions order as Qwen3 is a reasonable estimate but not a published number. The data is heavily multilingual (the family supports well over 100 languages) and, given the agentic-coding focus, almost certainly oversampled on code and tool-use trajectories — though the exact mixture is undisclosed.

Compute. Alibaba has not published a FLOP budget or cluster size for the Qwen3.6 run. Any specific number you see is estimated. What we can say with confidence is qualitative: training a 1M-context, vision-capable MoE family of this quality implies a large-scale cluster run measured in many thousands of accelerator-days, with the long-context extension stage being a meaningful additional cost on top of base pretraining because attention-context curricula and sequence-packing at 256K–1M lengths are expensive.

Pretraining objective. The base objective is standard next-token prediction (autoregressive language modelling), extended to the multimodal setting so the model also learns to predict text conditioned on interleaved image patches. The long-context capability is built in a dedicated extension phase — the model is trained or annealed on progressively longer sequences so the recurrent DeltaNet state and the periodic full-attention layers learn to use the extended window rather than just tolerate it.

Post-training. This is where the “real-world utility” framing earns its keep. The reported post-training stack includes:

  • Supervised fine-tuning (SFT) on instruction-following, coding, and tool-use data to give the base model its conversational and agentic behaviour.
  • Reinforcement learning with verifiable rewards (RLVR) — increasingly the workhorse for coding and math models, because correctness can be checked automatically (does the test suite pass? is the math answer right?) rather than relying on a learned reward model. This is almost certainly a large contributor to the strong SWE-bench and AIME numbers.
  • RLHF / DPO-style preference alignment to shape tone, helpfulness, refusal behaviour, and safety.
  • Distillation of the flagship’s capabilities into the smaller open variants — a standard move that helps explain how a 27B dense or 35B-A3B model can punch so far above its weight class on agentic coding.

The “thinking preservation” mechanism — retaining reasoning context across conversation turns rather than discarding the chain-of-thought after each answer — is a post-training and inference-time design rather than a pure architecture feature; it is what lets the model carry forward intermediate reasoning in long agentic sessions. Treat the precise implementation as undisclosed.

Capabilities and Benchmarks

On the benchmarks that practitioners actually weight in 2026 — agentic coding and competition math — Qwen3.6 is reported to land in or near the frontier, with the open variants notably overperforming their active-parameter class. The headline sourced figures: Qwen3.6-Plus reportedly scores 78.8% on SWE-bench Verified and 61.6% on Terminal-Bench 2.0, while the open 35B-A3B reportedly scores 73.4% on SWE-bench Verified, 86.0% on GPQA, and 92.7% on AIME 2026, and the dense Qwen3.6-27B reportedly hits 77.2% on SWE-bench Verified and 59.3% on Terminal-Bench 2.0.

Qwen3.6 benchmark comparison
Figure 3: Reported benchmark groupings across the Qwen3.6 tier. All numbers are vendor- or aggregator-reported; see the contamination caveat below.

Here is the consolidated table. Every cell is a reported or aggregator figure — read the notes.

Benchmark Qwen3.6-Plus (flagship) Qwen3.6-35B-A3B (MoE, open) Qwen3.6-27B (dense, open) Notes
SWE-bench Verified 78.8% (reported) 73.4% (reported) 77.2% (reported) Agentic coding; harness-sensitive
Terminal-Bench 2.0 61.6% (reported) 59.3% (reported) 27B reportedly matches Claude-tier here
GPQA (Diamond) 86.0% (reported) Graduate-level science QA
AIME 2026 92.7% (reported) Competition math; high contamination risk
MMLU-Pro ~92% tier (reported, saturated) Saturated across frontier models
Artificial Analysis Intelligence Index ~40 (estimated; sources vary 40–52) Composite; version-dependent

A few things to read carefully here.

The overperformance is real but harness-dependent. The genuinely interesting result is that an open 27B dense model reportedly matches or beats a ~397B-parameter MoE predecessor (Qwen3.5-397B-A17B) on several agentic-coding tasks, and that the 35B-A3B — with ~3B active — competes with models many times its active size. That is the practical headline. But SWE-bench Verified scores are extremely sensitive to the agent harness wrapped around the model (retrieval, retry logic, test-running scaffolding), so a “78.8%” from one vendor’s harness is not directly comparable to another vendor’s number on the same benchmark. When comparing, hold the harness constant or treat cross-vendor deltas under ~3–4 points as noise.

Contamination caveats apply, especially to math. AIME and similar competition-math benchmarks are a known contamination risk: problems and solutions circulate online and can leak into pretraining corpora, inflating scores in ways that do not reflect true reasoning generalization. A 92.7% AIME number is impressive but should be read as “very strong on this distribution,” not “solved competition math.” GPQA Diamond is somewhat more contamination-resistant by design, but no public benchmark is immune. The honest reading is directional: Qwen3.6 is a strong reasoning-and-coding model, clearly ahead of its own predecessors, and competitive with the frontier — but exact rank-ordering against GPT-5.x and the latest Gemini/Claude depends on which leaderboard, which date, and which harness you trust. On GPQA Diamond specifically, reporting still puts a Gemini 3.1 Pro ahead of the Qwen tier, and MMLU is effectively saturated (~92%) across all frontier models, so it no longer discriminates.

Where it shines vs where it is ordinary. The clear strengths are agentic coding, repository-level reasoning, front-end work, tool use, and long-context retrieval — exactly what the architecture and RLVR post-training are tuned for. On open-ended creative writing, nuanced instruction-following in low-resource languages, and tasks requiring up-to-the-minute world knowledge, expect competent-but-not-class-leading results, consistent with most models in its tier. For a side-by-side of how reasoning models in this class behave on industrial tasks, our Llama 4 / DeepSeek / Claude reasoning benchmark is the companion read.

Access and Deployment

You can consume Qwen3.6 two ways: through a hosted API (Alibaba Cloud Model Studio for the flagship, plus aggregators like OpenRouter and Together) or by downloading the open weights and self-hosting the 27B dense or 35B-A3B MoE. The license split matters: the open variants are reportedly released under a permissive Apache-style license, while the Max/Plus flagship is closed-weights and API-only.

Qwen3.6 deployment options
Figure 4: Decision path — hosted API for the closed flagship, or download the open 27B/35B-A3B weights, quantize, and serve with vLLM or llama.cpp.

Hosted API and pricing. For the flagship, reported pricing on Artificial Analysis is around $1.30 per million input tokens and $7.80 per million output tokens for Qwen3.6-Max-Preview. The Plus long-context configuration is quoted on aggregators at roughly $0.50 in / $3.00 out per million via Together and roughly $0.325 in / $1.95 out per million via OpenRouter — note that aggregator pricing varies by provider and changes frequently, so verify at purchase time rather than trusting any single quoted figure. The takeaway: even the flagship sits in the “moderately priced” band for its intelligence tier, and the long-context Plus configuration is aggressively priced for a 1M-context model.

Open weights and license. Both open variants are downloadable from Hugging Face. The reported license is permissive (Apache-style) for the open models, which is the practically important fact — it permits commercial use, fine-tuning, redistribution, and quantization without a restrictive acceptable-use or revenue-gate clause. Confirm the exact license text on each model card before you build a product on it; “reportedly Apache-2.0” is what the community write-ups say, but the canonical answer is whatever the LICENSE file in the repo states on the day you pull it.

Hardware and VRAM to self-host. This is where the small-active-parameter design pays off. Reported community figures:

  • Qwen3.6-27B (dense), Q4_K_M GGUF — about 16.8 GB of weights, runnable on a single 24 GB card (RTX 4090) or a 24 GB unified-memory Mac, or any system with ~18 GB combined RAM+VRAM using CPU offload.
  • Qwen3.6-35B-A3B (MoE), NVFP4 — about 14 GB for the weights, plus ~2 GB for the vision encoder, plus KV cache. Because only ~3B parameters activate per token, the MoE is unusually friendly to CPU/GPU offload — community reports note it can run on dual 16 GB cards without the aggressive expert-offload flags that a comparable dense model would require.

Quantization. The official releases reportedly focus on FP8 (fine-grained, 128-block, with near-original quality) and GGUF (for llama.cpp / Ollama). The community has added AWQ and NVFP4 (the latter targeting Blackwell-class SM120 hardware, halving the weight footprint). For most self-hosters, Q4_K_M GGUF is the path of least resistance on a single consumer GPU; FP8 is the choice for vLLM/SGLang production serving on data-center cards.

Serving stack, latency, throughput. Reported guidance: use vLLM ≥ 0.19.0 for the 3.6 family; SGLang and KTransformers are also called out for high-throughput production. A practical flag to know: in vLLM you can pass a language-model-only / skip-vision flag to drop the vision encoder and reclaim that memory for KV cache when you only need text. Concrete latency and tokens-per-second numbers depend entirely on your hardware, batch size, quantization, and context length, so I will not quote a single throughput figure — but the architectural expectation holds: because long-context inference runs mostly through linear DeltaNet layers and only ~3B parameters activate per token, the MoE variant delivers materially better long-context throughput per dollar than a dense model of comparable quality. That is the whole point of the design.

Limitations, Safety, and Failure Modes

No honest deep-dive ends at the benchmark table. Here are the failure modes and caveats you should plan around.

Hallucination and confident error. Like every model in its class, Qwen3.6 will produce fluent, confident, wrong answers — particularly on niche factual lookups, recent events past its training cutoff, and precise figures (dates, version numbers, API signatures). The agentic-coding strength does not immunize it against fabricating a plausible-looking function from a library that does not exist. Use retrieval and tool-grounding for anything fact-sensitive, and keep a human or a test suite in the loop for generated code.

Long-context degradation. A 256K–1M token window is an advertised capacity, not a guarantee of uniform recall across that window. Hybrid linear/full attention helps, but “lost in the middle” effects — where information buried in the center of a very long context is recalled less reliably than information at the start or end — are a general property of long-context transformers and there is no reason to assume Qwen3.6 is fully exempt. For long-context RAG, still rank and place your most important evidence deliberately rather than dumping a million tokens and hoping.

Evaluation and language caveats. The benchmark numbers are reported figures subject to harness sensitivity and contamination as discussed above. The model is strongly multilingual, but quality is not uniform across all 100-plus supported languages; expect the best results in the high-resource languages (English, Chinese, major European languages) and more variability in low-resource ones.

Censorship and policy notes (China-origin model). Qwen is developed by Alibaba and trained under Chinese content regulations. Expect the hosted flagship in particular to refuse or deflect on politically sensitive topics relative to a baseline Western model, and expect some alignment behaviour to reflect those constraints. The open weights are more malleable — fine-tuning and system-prompting can shift refusal behaviour — but the base model’s priors still carry the imprint of its training and alignment data. For enterprise use, evaluate refusal and content behaviour against your own policy rather than assuming parity with whatever model you are migrating from.

Jailbreak and red-team posture. As with all current LLMs, safety alignment is not robust against determined adversarial prompting; published red-team patterns that work on peer models generally transfer to some degree. If you are deploying in a context where adversarial users will probe the model, you need your own guardrails (input/output filtering, policy classifiers, rate limiting) rather than relying on the model’s built-in alignment alone.

The closed-flagship caveat. Because Qwen3.6-Max/Plus is closed-weights, you cannot audit it, you are subject to API availability and pricing changes, and your data handling is governed by Alibaba Cloud’s terms. If those are dealbreakers, the open 27B/35B-A3B variants are your path — at some cost in peak capability and the loss of the 1M-context Plus configuration.

How Qwen3.6 Compares

Here is how I would route a decision across Qwen3.6 and its 2026 peers — Llama 4 (open), the DeepSeek-V/R line (open), and the GPT-5.x / Claude frontier (closed) — for the four use cases that come up most. Ratings are practitioner judgement informed by the reported benchmarks, not a single leaderboard.

Use case Qwen3.6 (27B / 35B-A3B / Plus) Llama 4 (open) DeepSeek-V/R (open) GPT-5.x / Claude (closed frontier)
Agentic coding Excellent — open variants reportedly match far larger models; strong RLVR tuning Good, broad ecosystem Very good, especially the reasoning line Class-leading, but closed and pricier
Long-context RAG Excellent — 256K open, ~1M on Plus, linear-attention throughput edge Good, large windows Good Excellent, but cost scales with context
Agentic / tool use Excellent — thinking-preservation + tool-use post-training Good Very good Class-leading
Self-hosting Excellent — Apache-style license, ~3B active runs on consumer GPUs Excellent — permissive, mature tooling Excellent — strong open posture Not available (API-only)

The short version: if you need to self-host a model that does agentic coding and long-context work without renting frontier API time, Qwen3.6’s open variants are among the strongest options on the board in mid-2026, and the linear-attention throughput advantage is a real differentiator at long context. If you need the absolute peak of capability and are comfortable with a closed API, the GPT-5.x / Claude tier still leads on the hardest reasoning — and Qwen’s own closed Plus flagship is the in-family option there. Llama 4 and DeepSeek remain the natural cross-shop for an open-weights coding/RAG stack; choose on license fit, ecosystem tooling, and your own eval on your tasks rather than on a headline benchmark.

Frequently Asked Questions

Is Qwen3.6 open source?

Partly. The open-weight variants — Qwen3.6-27B (dense) and Qwen3.6-35B-A3B (MoE) — are reportedly released under a permissive Apache-style license that allows commercial use, fine-tuning, and redistribution. The flagship Qwen3.6-Max / Qwen3.6-Plus is closed-weights and API-only. Always confirm the exact license in each model’s repository before building on it.

What is the Qwen3.6 context window?

The open variants serve 262,144 tokens natively, extensible toward roughly 1,010,000 tokens. The hosted flagship’s Plus configuration advertises a 1 million-token native context window with up to 65,536 output tokens. Note that “native window” is a capacity figure — recall quality across the full window still degrades for information buried mid-context, as with all long-context models.

What hardware do I need to run Qwen3.6 locally?

For the dense Qwen3.6-27B at Q4_K_M GGUF (~16.8 GB), a single 24 GB GPU (e.g., RTX 4090) or a 24 GB-unified-memory Mac is enough, or ~18 GB combined RAM+VRAM with CPU offload. The 35B-A3B MoE at NVFP4 needs ~14 GB for weights plus the vision encoder and KV cache, and because only ~3B parameters activate per token it is unusually offload-friendly — reportedly runnable on dual 16 GB cards. Use vLLM ≥ 0.19.0 (or llama.cpp / Ollama for GGUF).

How much does the Qwen3.6 API cost?

Reported figures: the Max-Preview flagship is around $1.30 / 1M input and $7.80 / 1M output. The long-context Plus configuration is quoted at roughly $0.50 in / $3.00 out via Together and $0.325 in / $1.95 out via OpenRouter. Aggregator pricing changes often and varies by provider — verify at purchase time.

How does Qwen3.6 differ from Qwen3.5?

Qwen3.6 is a consolidation release on the Qwen3.5 hybrid-attention-plus-MoE foundation, prioritizing real-world utility: better agentic coding and repository-level reasoning, improved front-end workflows, a new “thinking preservation” mechanism that carries reasoning context across turns, and the strategic introduction of a closed-weights Max/Plus flagship alongside the open variants. The core architecture — Gated DeltaNet + Gated Attention in a 4:1 ratio with 256-expert MoE FFNs — carries forward from Qwen3.5.

What is Gated DeltaNet and why does it matter?

Gated DeltaNet is a linear-attention operator: instead of attending over every previous token (quadratic cost), it compresses history into a fixed-size recurrent state updated via a gated “delta rule.” That makes long-context compute and memory grow linearly rather than quadratically. Qwen3.6 interleaves three DeltaNet layers with one full Gated Attention layer per block, getting most of the cost savings while keeping enough full-attention layers to preserve precise long-range recall.

Can Qwen3.6 process images?

Yes — the open variants are vision-capable, inheriting the early-fusion vision-language backbone from Qwen3.5 with a SigLIP2-family vision encoder. Image patches are projected into the same embedding space as text tokens, so the model reasons over interleaved text and images. In some serving setups you can disable the vision path (e.g., a language-model-only flag in vLLM) to reclaim memory for KV cache when you only need text.

Further Reading

By Riju — about

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *