Qwen
Qwen 3.6 35B A3B
Qwen3.6 MoE release tuned for real-world coding agents, with a 35B multimodal checkpoint and a smaller 3B active path for lower per-token compute.
Overview and architecture
What it is
Company
Family
Release date
Architecture
License
Modality
Context window
Total params
Active params
Layers
Hidden size
Attention heads
KV heads
KV-bearing layers
Training scope
Built as a unified vision-language foundation with pre-training and post-training on multimodal tokens rather than a separate late-fusion stack.
Hybrid layout
10 of 40 layers use gated attention while the rest use Gated DeltaNet blocks, so the stack is not a full-attention transformer end to end.
Context design
Published with a native 262K context window and an architecture intended to stretch beyond that range in longer-context settings.
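The hybrid layout above has a direct memory consequence: only the KV-bearing layers accumulate cache with sequence length. A minimal sketch of that arithmetic, assuming illustrative head counts and head dimension (the card does not publish these values, so `kv_heads=8` and `head_dim=128` are placeholders, not Qwen3.6 specs):

```python
def kv_cache_bytes(seq_len, kv_layers, kv_heads, head_dim, dtype_bytes=2):
    # Each KV-bearing layer stores K and V per token:
    # 2 tensors * kv_heads * head_dim values, dtype_bytes each.
    return seq_len * kv_layers * 2 * kv_heads * head_dim * dtype_bytes

# 10 of 40 layers carry a standard KV cache; the other 30 hold a
# fixed-size sequence state that does not grow with context length.
full = kv_cache_bytes(seq_len=262_144, kv_layers=40, kv_heads=8, head_dim=128)
hybrid = kv_cache_bytes(seq_len=262_144, kv_layers=10, kv_heads=8, head_dim=128)

print(f"full-attention cache at 262K: {full / 2**30:.0f} GiB")   # → 40 GiB
print(f"hybrid cache at 262K:         {hybrid / 2**30:.0f} GiB")  # → 10 GiB
```

Under these placeholder dimensions, cache growth is a quarter of an equivalent full-attention stack, which is the point of the hybrid design.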
Research highlight
What improved
Agentic coding upgrade
Qwen3.6 is framed around better coding-agent behavior, especially frontend workflows and repository-level reasoning, rather than a broad architecture reset.
Thinking preservation
The release adds an option to preserve reasoning context across prior messages, which matters for iterative development workflows and multi-turn tool use.
Stability over novelty
Qwen presents 3.6 as the first open-weight follow-up to Qwen3.5 built from community feedback, with more emphasis on dependable real-world utility than on introducing a new model family.
Training and release context
How it was released
Release lineage
Qwen3.6 is a direct successor to the February Qwen3.5 series rather than a separate architecture branch, and it keeps the same unified multimodal release format.
Architecture continuity
The line still uses the hybrid DeltaNet-plus-attention recipe, so the serving geometry stays governed by partial KV layers plus static sequence state rather than by full-attention on every layer.
Deployment target
Qwen explicitly packages the release for Transformers, vLLM, SGLang, and related serving stacks, which signals an operationally mature release rather than a research-only drop.
Where it is strong
Coding agents
The line is tuned most visibly for repository work, frontend changes, tool use, and multi-step coding-agent flows.
Iterative reasoning
Thinking preservation makes the release better suited to long back-and-forth development sessions where reasoning context should not be rebuilt from scratch every turn.
Long-context hybrid serving
It keeps the hybrid long-context advantage of Qwen3.5 while shifting the capability story toward developer productivity and stability.
Memory behavior
What dominates VRAM
This text-only estimate still keeps the full multimodal checkpoint resident, so VRAM tracks the whole 35B parameter pool rather than only the active 3B routing path.
Only 10 of 40 layers carry a standard KV cache. The remaining layers contribute a fixed sequence-state term instead, which keeps long-context growth lower than a full-attention MoE stack.
MoE routing lowers active compute per token much more than it lowers the resident memory floor, so long-context serving is governed by total checkpoint size plus hybrid cache and state growth.
FitMyGPU currently treats this as a text-only estimate. Resident multimodal weights remain counted, but media-token overhead is excluded.
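The resident-weight point above can be made concrete. A rough sketch of the weight-memory floor at a few precisions, assuming the full 35B checkpoint stays loaded regardless of the 3B routed path (precisions and the GiB conversion are illustrative, not FitMyGPU's exact methodology):

```python
def weights_gib(total_params_billions, bytes_per_param):
    # Resident weight memory tracks total checkpoint size, not the
    # active (routed) parameter count, because all experts stay loaded.
    return total_params_billions * 1e9 * bytes_per_param / 2**30

TOTAL_B = 35   # full MoE checkpoint, resident in VRAM
ACTIVE_B = 3   # routed path per token: lowers compute, not the floor

for label, nbytes in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label}: ~{weights_gib(TOTAL_B, nbytes):.0f} GiB weights resident")
```

KV cache and the fixed sequence state of the non-KV layers add on top of this floor, but they scale with context, not with expert count.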
Sources