Qwen 3.5 27B
A large dense Qwen3.5 release that keeps the family's hybrid multimodal stack but moves into a much heavier single-model serving class than the 9B tier.
Overview and architecture
What it is
Company
Family
Release date
Architecture
License
Modality
Context window
Total params
Active params
Layers
Hidden size
Attention heads
KV heads
KV-bearing layers
Training scope
Built as a unified vision-language foundation, with pre-training and post-training on multimodal tokens rather than as a separate late-fusion stack.
Hybrid layout
16 of 64 layers use gated attention while the rest use Gated DeltaNet blocks, so the stack is not a full-attention transformer end to end.
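To make the 16-of-64 split concrete, here is a minimal sketch of one plausible layer schedule, assuming an even interleave; the card does not state where the attention layers actually sit, so the placement below is purely illustrative:

```python
N_LAYERS = 64       # total layers, from the published layout
N_ATTENTION = 16    # gated-attention layers; the rest are Gated DeltaNet

# Hypothetical even interleave: one gated-attention layer every fourth slot.
# The real placement is not specified here; this only illustrates the ratio.
stride = N_LAYERS // N_ATTENTION  # 4
layer_types = [
    "gated_attention" if (i + 1) % stride == 0 else "gated_deltanet"
    for i in range(N_LAYERS)
]

assert layer_types.count("gated_attention") == N_ATTENTION
assert layer_types.count("gated_deltanet") == N_LAYERS - N_ATTENTION
```

Under this layout, only a quarter of the layers ever allocate a growing KV cache, which is the property the memory section below leans on.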
Context design
Published with a native 262K-token context window and an architecture intended to extend beyond that range in longer-context settings.
Research highlight
What improved
Unified vision-language foundation
The family is trained as one multimodal base rather than as separate text and vision branches bolted together late, which is why text-only serving still keeps the vision-side weights resident on card.
Efficient hybrid architecture
Gated DeltaNet layers carry sequence state while periodic gated-attention layers handle KV-heavy reasoning, so the stack aims for long-context throughput without paying dense-attention KV cost on every layer.
Scalable RL generalization
Qwen frames reinforcement learning and large agent-environment scaling as core to the family, with training aimed at more robust adaptation across reasoning, coding, and agent workflows.
Global coverage
The release emphasizes support for 201 languages and dialects, which matters for deployment quality and reinforces that the family is meant as a broad general-purpose foundation.
Training infrastructure
The release emphasizes multimodal training at near text-only efficiency and asynchronous RL infrastructure, signaling that the stack was built to scale rather than as a small multimodal add-on.
Training and release context
How it was released
Unified release format
Qwen3.5 is released as a single multimodal foundation rather than as separate text and vision checkpoints stitched together later.
Architecture shift
The family changes the serving geometry by mixing DeltaNet-style state layers with periodic attention layers instead of remaining a plain dense-attention stack like Qwen2.5.
Training stack
Qwen emphasizes multimodal training efficiency and large-scale RL infrastructure as part of the release process, not just as a benchmark claim.
Where it is strong
Multimodal reasoning
Designed for a unified text-plus-vision capability profile rather than separate specialist variants.
Long-context serving
The hybrid layout is explicitly aimed at making long-context serving cheaper than a dense full-attention stack.
Agents and coding
Qwen positions the family as competitive across coding, reasoning, and agent-style workflows.
Memory behavior
What dominates VRAM
This text-only estimate still keeps the full multimodal checkpoint weights resident on card, so the memory floor is higher than that of a pure language-only model of similar active size.
Only 16 of 64 layers carry a standard KV cache. The remaining layers contribute a fixed sequence-state term instead, which makes long-context growth less aggressive than a dense full-attention stack.
Longer context and higher concurrency still increase memory monotonically, but more of the footprint shifts into mixed KV-plus-state behavior instead of pure transformer cache expansion.
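The mixed KV-plus-state growth described above can be sketched as a rough calculator. Only the 16/64 layer split comes from this card; the KV head count, head dimension, and per-layer state size below are hypothetical placeholders:

```python
def decode_cache_bytes(context_len: int,
                       batch: int = 1,
                       kv_layers: int = 16,     # layers with a standard KV cache
                       state_layers: int = 48,  # Gated DeltaNet layers
                       kv_heads: int = 8,       # hypothetical
                       head_dim: int = 128,     # hypothetical
                       state_elems_per_layer: int = 8 * 128 * 128,  # hypothetical fixed state
                       dtype_bytes: int = 2) -> int:
    """Rough per-batch cache estimate for a hybrid 16/64 stack (sketch only)."""
    # KV cache: grows linearly with context, but only on the KV-bearing layers.
    kv = batch * kv_layers * 2 * kv_heads * head_dim * context_len * dtype_bytes
    # DeltaNet state: constant per sequence, independent of context length.
    state = batch * state_layers * state_elems_per_layer * dtype_bytes
    return kv + state

def dense_cache_bytes(context_len: int, batch: int = 1, layers: int = 64,
                      kv_heads: int = 8, head_dim: int = 128,
                      dtype_bytes: int = 2) -> int:
    """Same context, but every layer carries a KV cache (dense baseline)."""
    return batch * layers * 2 * kv_heads * head_dim * context_len * dtype_bytes

# At long context the hybrid cache approaches 16/64 = 25% of the dense baseline,
# since the constant state term stops mattering as the KV term grows.
ratio = decode_cache_bytes(262_144) / dense_cache_bytes(262_144)
```

Note that the estimate still grows monotonically with context and batch, exactly as the text above says; the hybrid layout changes the slope, not the direction.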
FitMyGPU currently treats this as a text-only estimate. Resident multimodal weights remain counted, but media-token overhead is excluded.
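As a back-of-envelope on the weights floor that estimate sits on, a 27B-parameter dense checkpoint costs roughly the following just to hold resident; the parameter count is the round figure from the model name and the precision options are illustrative:

```python
TOTAL_PARAMS = 27e9  # "27B" from the model name, treated as a round figure

def weights_gib(bytes_per_param: float, params: float = TOTAL_PARAMS) -> float:
    """Resident-weights floor in GiB, before any KV cache or activations."""
    return params * bytes_per_param / 1024**3

floor = {
    "bf16": weights_gib(2),    # ~50 GiB
    "int8": weights_gib(1),    # ~25 GiB
    "int4": weights_gib(0.5),  # ~13 GiB
}
```

Because the vision-side weights stay resident even in text-only serving, this floor applies to the whole checkpoint, not just the layers a text request exercises.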
Sources