Qwen 3.5 122B A10B
High-capacity Qwen3.5 MoE release for users who want the family’s hybrid multimodal architecture at a much larger scale without paying dense 122B compute per token.
Overview and architecture
What it is
Company: Qwen
Family: Qwen3.5
Release date
Architecture: Hybrid MoE (Gated DeltaNet + gated attention)
License
Modality: Text + vision
Context window: 262K (native)
Total params: 122B
Active params: 10B
Layers: 48
Hidden size
Attention heads
KV heads
KV-bearing layers: 12 of 48
Training scope
Built as a unified vision-language foundation, with pre-training and post-training on multimodal tokens, rather than as a separate late-fusion stack.
Hybrid layout
12 of 48 layers use gated attention while the rest use Gated DeltaNet blocks, so the stack is not a full-attention transformer end to end (see the layout sketch below).
Context design
Published with a native 262K context window and an architecture intended to stretch beyond that range in longer-context settings.
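A minimal sketch of that layout in Python, assuming the 12 gated-attention layers are spaced evenly through the stack; the actual interleave pattern is not stated here.

```python
# Toy layout for the hybrid stack: 48 layers, 12 of them gated attention.
# The even every-4th-layer spacing is an assumption, not the published config.
N_LAYERS = 48
N_ATTENTION = 12
SPACING = N_LAYERS // N_ATTENTION  # assumed: one gated-attention layer every 4 layers

layout = [
    "gated_attention" if (i + 1) % SPACING == 0 else "gated_deltanet"
    for i in range(N_LAYERS)
]

# Only the gated-attention layers contribute a growing KV cache during serving;
# the Gated DeltaNet layers carry fixed-size sequence state instead.
assert layout.count("gated_attention") == N_ATTENTION
print(" ".join("A" if kind == "gated_attention" else "D" for kind in layout))
```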
Research highlight
What improved
Unified vision-language foundation
The family is trained as one multimodal base rather than as separate text and vision branches bolted together late, which is why text-only serving still keeps the vision-side weights resident on the card.
Efficient hybrid architecture
Gated DeltaNet layers carry sequence state while periodic gated-attention layers handle KV-heavy reasoning, so the stack aims for long-context throughput without paying dense-attention KV cost on every layer; the cache-growth sketch at the end of this section makes the difference concrete.
Scalable RL generalization
Qwen frames reinforcement learning and large agent-environment scaling as core to the family, with training aimed at more robust adaptation across reasoning, coding, and agent workflows.
Global coverage
The release emphasizes support for 201 languages and dialects, which matters for deployment quality and reinforces that the family is meant as a broad general-purpose foundation.
Training infrastructure
The release emphasizes multimodal training at near text-only cost and asynchronous RL infrastructure, signaling that the stack was built to scale rather than as a small multimodal add-on.
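To make the cache argument concrete, here is a rough sketch comparing per-context KV-cache size for the 12 KV-bearing layers against a hypothetical 48-layer full-attention stack. The KV head count, head dimension, and 16-bit cache precision are illustrative assumptions, not published Qwen3.5 values.

```python
# Rough KV-cache comparison: hybrid stack (12 KV-bearing layers) versus a
# hypothetical full-attention stack of the same 48-layer depth.
# kv_heads, head_dim, and dtype_bytes are illustrative assumptions, not
# published Qwen3.5 values.
def kv_cache_gib(context_len, kv_layers, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 covers the separate key and value tensors per layer.
    return 2 * kv_layers * kv_heads * head_dim * dtype_bytes * context_len / 2**30


for tokens in (32_768, 131_072, 262_144):
    hybrid = kv_cache_gib(tokens, kv_layers=12)  # only the gated-attention layers
    dense = kv_cache_gib(tokens, kv_layers=48)   # if every layer kept a KV cache
    print(f"{tokens:>7} tokens: hybrid ~{hybrid:.1f} GiB vs full attention ~{dense:.1f} GiB")
```

Under these assumptions the hybrid cache lands at roughly a quarter of the full-attention figure at every context length, which is where the long-context throughput claim comes from.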
Training and release context
How it was released
Unified release format
Qwen3.5 is released as a single multimodal foundation rather than as separate text and vision checkpoints stitched together later.
Architecture shift
The family changes the serving geometry by mixing DeltaNet-style state layers with periodic attention layers instead of remaining a plain dense-attention stack like Qwen2.5.
Training stack
Qwen emphasizes multimodal training efficiency and large-scale RL infrastructure as part of the release process, not just as a benchmark claim.
Where it is strong
Multimodal reasoning
Designed for a unified text-plus-vision capability profile rather than separate specialist variants.
Long-context serving
The hybrid layout is explicitly aimed at making long-context serving cheaper than a dense full-attention stack.
Agents and coding
Qwen positions the family as competitive across coding, reasoning, and agent-style workflows.
Memory behavior
What dominates VRAM
This text-only estimate still keeps the full multimodal checkpoint resident, so VRAM tracks the whole 122B-parameter pool rather than only the 10B active routing path.
Only 12 of 48 layers carry a standard KV cache. The remaining layers contribute a fixed sequence-state term instead, which keeps long-context growth lower than a full-attention MoE stack.
MoE routing lowers active compute per token much more than it lowers the resident memory floor, so long-context serving is governed by total checkpoint size plus hybrid cache and state growth.
FitMyGPU currently treats this as a text-only estimate. Resident multimodal weights remain counted, but media-token overhead is excluded.
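A back-of-the-envelope sketch of that floor follows, reusing the assumed head geometry from the earlier cache sketch; the weight precision and per-layer DeltaNet state size are likewise assumptions, not FitMyGPU's exact accounting.

```python
# Rough text-only VRAM floor: the whole 122B-parameter checkpoint stays resident,
# while the hybrid cache adds a comparatively small, partly fixed term.
# Precision, head geometry, and per-layer state size are illustrative assumptions.
TOTAL_PARAMS = 122e9      # full MoE checkpoint, resident even though only 10B params are active per token
BYTES_PER_PARAM = 2       # assumed 16-bit weights; quantized deployments land lower

CONTEXT_LEN = 262_144     # native context window
KV_LAYERS = 12            # gated-attention layers with a standard KV cache
KV_HEADS, HEAD_DIM, CACHE_BYTES = 8, 128, 2       # assumed head geometry and cache dtype
STATE_PER_DELTANET_LAYER_GIB = 0.01               # assumed fixed state per DeltaNet layer

weights_gib = TOTAL_PARAMS * BYTES_PER_PARAM / 2**30
kv_gib = 2 * KV_LAYERS * KV_HEADS * HEAD_DIM * CACHE_BYTES * CONTEXT_LEN / 2**30
state_gib = (48 - KV_LAYERS) * STATE_PER_DELTANET_LAYER_GIB

print(f"weights ~{weights_gib:.0f} GiB | KV cache ~{kv_gib:.0f} GiB "
      f"| DeltaNet state ~{state_gib:.1f} GiB at {CONTEXT_LEN:,} tokens")
```

Under these assumptions the resident weights come out around 230 GiB, an order of magnitude above the cache and state terms, which is the point made above: routing lowers active compute per token, not the memory floor.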
Sources