Model notes

Qwen 3.5 4B

Mid-sized Qwen3.5 checkpoint whose multimodal components enlarge the resident footprint, but it remains practical for careful single-GPU text-only serving.

5B dense • 262,144 context • 4 KV heads

Architecture

Model spec

Architecture: Hybrid multimodal transformer
Total params: 5B
Active params: Dense (no MoE routing; all parameters active)
Layers: 32
Hidden size: 2,560
Attention heads: 16
KV heads: 4
KV-bearing layers: 8
Context length: 262,144
Modality: Multimodal, text-only estimate
License: Apache 2.0

Why it matters

Why memory behaves this way

Research highlight

The 4B language model sits inside a roughly 5B resident multimodal artifact, and only 8 of its 32 layers are KV-bearing gated-attention layers during generation.

Memory note

The hybrid layout keeps cache growth lower than dense 32-layer models, but the extra multimodal resident weights raise the single-card floor.
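The cache-growth claim above can be checked with a back-of-envelope calculation from the spec table. This is a minimal sketch, not the calculator's actual formula: it assumes a BF16 KV cache (2 bytes per element), a head dimension of hidden_size / attention_heads = 2,560 / 16 = 160, and counts 2× for the separate K and V tensors per KV-bearing layer.

```python
# Back-of-envelope KV-cache sizing from the spec above.
# Assumptions (not taken from the checkpoint itself): BF16 cache entries
# and head_dim = hidden_size / attention_heads.

HIDDEN, HEADS, KV_HEADS = 2560, 16, 4
KV_LAYERS, DENSE_LAYERS = 8, 32          # hybrid layout vs. a dense baseline
CONTEXT = 262_144
DTYPE_BYTES = 2                          # BF16

head_dim = HIDDEN // HEADS               # 160

def kv_cache_bytes(tokens: int, layers: int) -> int:
    # 2x covers the K and V tensors for each KV-bearing layer.
    return 2 * layers * KV_HEADS * head_dim * tokens * DTYPE_BYTES

hybrid = kv_cache_bytes(CONTEXT, KV_LAYERS)     # 8 KV-bearing layers
dense = kv_cache_bytes(CONTEXT, DENSE_LAYERS)   # hypothetical all-32 baseline

print(f"hybrid: {hybrid / 2**30:.1f} GiB")  # 5.0 GiB at full 262k context
print(f"dense:  {dense / 2**30:.1f} GiB")   # 20.0 GiB if all 32 layers cached
```

Under these assumptions the hybrid layout caps the full-context cache at about 5 GiB, a quarter of what a dense 32-layer cache would need, which is why the resident weights, not the cache, set the single-card floor here.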

Checkpoints

Official profiles

Official BF16 checkpoint (current)

Qwen publishes Qwen3.5-4B in Hugging Face Transformers format with explicit Transformers and vLLM guidance, including a text-only serving mode in vLLM.

Serving stacks: vLLM, Transformers
