FitMyGPU

Qwen 3.6 35B A3B

Qwen3.6 MoE release tuned for real-world coding agents, pairing a 35B multimodal checkpoint with a 3B active-parameter path for lower per-token compute.

Overview and architecture

What it is

Company

Qwen

Family

Qwen

Release date

Apr 15, 2026

Architecture

Hybrid multimodal MoE transformer

License

Apache 2.0

Modality

Multimodal (text-only estimate)

Context window

262,144

Total params

35B

Active params

3B

Layers

40

Hidden size

2,048

Attention heads

16

KV heads

2

KV-bearing layers

10
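
For readers who want a single reference object, the figures above can be collected into a small Python dictionary. This is purely a restatement of this page's table; the head dimension of 128 is inferred from the 2,048 hidden size and 16 attention heads and is an assumption, not a published value.

    # Architecture figures as listed on this page; head_dim is inferred, not published.
    QWEN36_35B_A3B_SPEC = {
        "total_params": 35_000_000_000,   # full MoE checkpoint, always resident
        "active_params": 3_000_000_000,   # parameters routed per token
        "num_layers": 40,
        "hidden_size": 2048,
        "num_attention_heads": 16,
        "num_kv_heads": 2,
        "kv_bearing_layers": 10,          # layers that keep a standard KV cache
        "head_dim": 2048 // 16,           # = 128, inferred assumption
        "context_window": 262_144,
    }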

Training scope

Built as a unified vision-language foundation, with pre-training and post-training on multimodal tokens rather than as a separate late-fusion stack.

Hybrid layout

10 of 40 layers use gated attention while the rest use Gated DeltaNet blocks, so the stack is not a full-attention transformer end to end.

Context design

Published with a native 262,144-token context window and an architecture intended to extend beyond that length in longer-context deployments.
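
To make the hybrid layout concrete, here is a minimal Python sketch of one plausible layer schedule. Only the 10-of-40 split is stated on this page; the even every-fourth-layer interleave is an illustrative assumption.

    # One plausible schedule for the 40-layer hybrid stack described above.
    # Only the 10/40 split comes from this page; the even interleave is assumed.
    NUM_LAYERS = 40
    ATTENTION_EVERY = 4  # 40 layers / 10 gated-attention layers

    layer_types = [
        "gated_attention" if (i + 1) % ATTENTION_EVERY == 0 else "gated_deltanet"
        for i in range(NUM_LAYERS)
    ]

    assert layer_types.count("gated_attention") == 10
    assert layer_types.count("gated_deltanet") == 30

    # Only the gated-attention layers grow a KV cache with context length;
    # the Gated DeltaNet layers keep a fixed-size per-sequence state instead.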

Research highlight

What improved

Agentic coding upgrade

Qwen3.6 is framed around better coding-agent behavior, especially frontend workflows and repository-level reasoning, rather than around a broad architecture reset.

Thinking preservation

The release adds an option to preserve reasoning context across prior messages, which matters for iterative development workflows and multi-turn tool use.

Stability over novelty

Qwen presents 3.6 as the first open-weight follow-up to Qwen3.5 built from community feedback, with more emphasis on dependable real-world utility than on introducing a new model family.

Training and release context

How it was released

Release lineage

Qwen3.6 is a direct successor to the February Qwen3.5 series rather than a separate architecture branch, and it keeps the same unified multimodal release format.

Architecture continuity

The line still uses the hybrid DeltaNet-plus-attention recipe, so the serving geometry stays governed by partial KV layers plus static sequence state rather than by full attention on every layer.

Deployment target

Qwen explicitly packages the release for Transformers, vLLM, SGLang, and related serving stacks, which signals an operationally mature release rather than a research-only drop.
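
As a rough illustration of the Transformers path, the snippet below is a minimal text-only loading sketch. The repository id is hypothetical, the AutoModelForCausalLM entry point, bfloat16 precision, and device_map="auto" placement are assumptions about a typical setup, and a multimodal checkpoint may expose a different Auto class.

    # Minimal Transformers loading sketch. The repo id is hypothetical and the
    # dtype / device placement are assumptions, not an official recipe.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "Qwen/Qwen3.6-35B-A3B"  # hypothetical repository id

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # assumed precision
        device_map="auto",           # spread the 35B resident checkpoint across GPUs
    )

    prompt = "Refactor this function to avoid the nested loop."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0], skip_special_tokens=True))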

Where it is strong

Coding agents

The line is tuned most visibly for repository work, frontend changes, tool use, and multi-step coding-agent flows.

Iterative reasoning

Thinking preservation makes the release better suited to long back-and-forth development sessions where reasoning context should not be rebuilt from scratch every turn.

Long-context hybrid serving

It keeps the hybrid long-context advantage of Qwen3.5 while shifting the capability story toward developer productivity and stability.

Memory behavior

What dominates VRAM

This text-only estimate still keeps the full multimodal checkpoint resident, so VRAM tracks the whole 35B parameter pool rather than only the active 3B routing path.

Only 10 of 40 layers carry a standard KV cache. The remaining layers contribute a fixed sequence-state term instead, which keeps long-context growth lower than a full-attention MoE stack.

MoE routing lowers active compute per token, not the resident memory floor, so long-context serving is governed by total checkpoint size plus hybrid cache and state growth.

FitMyGPU currently treats this as a text-only estimate. Resident multimodal weights remain counted, but media-token overhead is excluded.
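
As a back-of-the-envelope illustration of how these terms combine, here is a minimal Python sketch using the numbers from this page. The 2-byte weight and cache precision, the 128 head dimension (hidden size 2,048 over 16 heads), and the per-sequence DeltaNet state size are assumptions rather than published figures, and activation or framework overhead is ignored.

    # Rough text-only VRAM estimate for the hybrid layout described above.
    # Precision, head_dim, and the DeltaNet state size are assumptions.
    GIB = 1024 ** 3

    def estimate_vram_gib(
        total_params: int = 35_000_000_000,  # full checkpoint stays resident
        bytes_per_param: int = 2,            # assumed bf16/fp16 weights
        seq_len: int = 262_144,              # native context window
        kv_layers: int = 10,                 # only gated-attention layers cache KV
        kv_heads: int = 2,
        head_dim: int = 128,                 # assumed: 2048 hidden / 16 heads
        kv_bytes: int = 2,                   # assumed bf16 KV cache
        deltanet_state_mib: float = 64.0,    # placeholder fixed per-sequence state
    ) -> dict:
        weights = total_params * bytes_per_param
        # Keys and values for each KV-bearing layer grow linearly with seq_len.
        kv_cache = 2 * kv_layers * kv_heads * head_dim * seq_len * kv_bytes
        # DeltaNet layers contribute a constant-size state, independent of seq_len.
        state = deltanet_state_mib * 1024 ** 2
        return {
            "weights_gib": weights / GIB,
            "kv_cache_gib": kv_cache / GIB,
            "fixed_state_gib": state / GIB,
            "total_gib": (weights + kv_cache + state) / GIB,
        }

    print(estimate_vram_gib())  # weights dominate; the KV term stays small even at 262K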

Sources

Where this page is grounded