Qwen 3.5 122B A10B
High-capacity Qwen3.5 MoE release for users who want the family’s hybrid multimodal architecture at a much larger scale without paying dense 122B compute per token.
Overview and architecture
What it is
Company: Qwen
Family: Qwen3.5
Release date
Architecture: Hybrid MoE (Gated DeltaNet + gated attention)
License
Modality: Text + vision
Context window: 262K (native)
Total params: 122B
Active params: 10B
Layers: 48
Hidden size
Attention heads
KV heads
KV-bearing layers: 12 of 48
Training scope
Built as a unified vision-language foundation, with pre-training and post-training on multimodal tokens, rather than as a separate late-fusion stack.
Hybrid layout
12 of 48 layers use gated attention while the rest use Gated DeltaNet blocks, so the stack is not a full-attention transformer end to end (see the layout sketch below).
Context design
Published with a native 262K context window and an architecture intended to stretch beyond that range in longer-context settings.
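A minimal sketch of that layout in Python, assuming the 12 gated-attention layers are spaced evenly through the stack; the actual interleave pattern is not stated here.

```python
# Toy layout for the hybrid stack: 48 layers, 12 of them gated attention.
# The even every-4th-layer spacing is an assumption, not the published config.
N_LAYERS = 48
N_ATTENTION = 12
SPACING = N_LAYERS // N_ATTENTION  # assumed: one gated-attention layer every 4 layers

layout = [
    "gated_attention" if (i + 1) % SPACING == 0 else "gated_deltanet"
    for i in range(N_LAYERS)
]

# Only the gated-attention layers contribute a growing KV cache during serving;
# the Gated DeltaNet layers carry fixed-size sequence state instead.
assert layout.count("gated_attention") == N_ATTENTION
print(" ".join("A" if kind == "gated_attention" else "D" for kind in layout))
```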
Research highlight
What improved
Unified vision-language foundation
The family is trained as one multimodal base rather than as separate text and vision branches bolted together late, which is why text-only serving still keeps the vision-side weights resident on the card.
Efficient hybrid architecture
Gated DeltaNet layers carry sequence state while periodic gated-attention layers handle KV-heavy reasoning, so the stack aims for long-context throughput without paying dense-attention KV cost on every layer; the cache-growth sketch at the end of this section makes the difference concrete.
Scalable RL generalization
Qwen frames reinforcement learning and large agent-environment scaling as core to the family, with training aimed at more robust adaptation across reasoning, coding, and agent workflows.
Global coverage
The release emphasizes support for 201 languages and dialects, which matters for deployment quality and reinforces that the family is meant as a broad general-purpose foundation.
Training infrastructure
The release emphasizes multimodal training at near text-only cost and asynchronous RL infrastructure, signaling that the stack was built to scale rather than as a small multimodal add-on.
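To make the cache argument concrete, here is a rough sketch comparing per-context KV-cache size for the 12 KV-bearing layers against a hypothetical 48-layer full-attention stack. The KV head count, head dimension, and 16-bit cache precision are illustrative assumptions, not published Qwen3.5 values.

```python
# Rough KV-cache comparison: hybrid stack (12 KV-bearing layers) versus a
# hypothetical full-attention stack of the same 48-layer depth.
# kv_heads, head_dim, and dtype_bytes are illustrative assumptions, not
# published Qwen3.5 values.
def kv_cache_gib(context_len, kv_layers, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 covers the separate key and value tensors per layer.
    return 2 * kv_layers * kv_heads * head_dim * dtype_bytes * context_len / 2**30


for tokens in (32_768, 131_072, 262_144):
    hybrid = kv_cache_gib(tokens, kv_layers=12)  # only the gated-attention layers
    dense = kv_cache_gib(tokens, kv_layers=48)   # if every layer kept a KV cache
    print(f"{tokens:>7} tokens: hybrid ~{hybrid:.1f} GiB vs full attention ~{dense:.1f} GiB")
```

Under these assumptions the hybrid cache lands at roughly a quarter of the full-attention figure at every context length, which is where the long-context throughput claim comes from.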
Training and release context
How it was released
Unified release format
Qwen3.5 is released as a single multimodal foundation rather than as separate text and vision checkpoints stitched together later.
Architecture shift
The family changes the serving geometry by mixing DeltaNet-style state layers with periodic attention layers instead of remaining a plain dense-attention stack like Qwen2.5.
Training stack
Qwen emphasizes multimodal training efficiency and large-scale RL infrastructure as part of the release process, not just as a benchmark claim.
Where it is strong
Multimodal reasoning
Designed for a unified text-plus-vision capability profile rather than separate specialist variants.
Long-context serving
The hybrid layout is explicitly aimed at making long-context serving cheaper than a dense full-attention stack.
Agents and coding
Qwen positions the family as competitive across coding, reasoning, and agent-style workflows.
Memory behavior
What dominates VRAM
This text-only estimate still keeps the full multimodal checkpoint resident, so VRAM tracks the whole 122B-parameter pool rather than only the 10B active routing path.
Only 12 of 48 layers carry a standard KV cache. The remaining layers contribute a fixed sequence-state term instead, which keeps long-context growth lower than a full-attention MoE stack.
MoE routing lowers active compute per token much more than it lowers the resident memory floor, so long-context serving is governed by total checkpoint size plus hybrid cache and state growth.
FitMyGPU currently treats this as a text-only estimate. Resident multimodal weights remain counted, but media-token overhead is excluded.
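A back-of-the-envelope sketch of that floor follows, reusing the assumed head geometry from the earlier cache sketch; the weight precision and per-layer DeltaNet state size are likewise assumptions, not FitMyGPU's exact accounting.

```python
# Rough text-only VRAM floor: the whole 122B-parameter checkpoint stays resident,
# while the hybrid cache adds a comparatively small, partly fixed term.
# Precision, head geometry, and per-layer state size are illustrative assumptions.
TOTAL_PARAMS = 122e9      # full MoE checkpoint, resident even though only 10B params are active per token
BYTES_PER_PARAM = 2       # assumed 16-bit weights; quantized deployments land lower

CONTEXT_LEN = 262_144     # native context window
KV_LAYERS = 12            # gated-attention layers with a standard KV cache
KV_HEADS, HEAD_DIM, CACHE_BYTES = 8, 128, 2       # assumed head geometry and cache dtype
STATE_PER_DELTANET_LAYER_GIB = 0.01               # assumed fixed state per DeltaNet layer

weights_gib = TOTAL_PARAMS * BYTES_PER_PARAM / 2**30
kv_gib = 2 * KV_LAYERS * KV_HEADS * HEAD_DIM * CACHE_BYTES * CONTEXT_LEN / 2**30
state_gib = (48 - KV_LAYERS) * STATE_PER_DELTANET_LAYER_GIB

print(f"weights ~{weights_gib:.0f} GiB | KV cache ~{kv_gib:.0f} GiB "
      f"| DeltaNet state ~{state_gib:.1f} GiB at {CONTEXT_LEN:,} tokens")
```

Under these assumptions the resident weights come out around 230 GiB, an order of magnitude above the cache and state terms, which is the point made above: routing lowers active compute per token, not the memory floor.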
Sources