

Qwen 3 30B A3B

Qwen3 MoE release with 30.5B total parameters and 3.3B active parameters, built for lower active compute than a comparable dense model.

Overview and architecture

What it is

Company: Qwen
Family: Qwen
Release date: Apr 27, 2025
Architecture: Mixture-of-experts transformer
License: Apache 2.0
Modality: Text
Context window: 131,072 tokens
Total params: 30.5B
Active params: 3.3B
Layers: 48
Hidden size: 2,048
Attention heads: 32
KV heads: 4
KV-bearing layers: 48
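From the table above you can bound the KV cache directly: head dim = hidden size / attention heads = 2,048 / 32 = 64, and all 48 layers carry 4 KV heads each. A minimal sizing sketch, assuming an FP16/BF16 cache (2 bytes per element) and no cache quantization:

```python
# KV-cache floor for Qwen 3 30B A3B, from the specs listed above.
LAYERS = 48             # KV-bearing layers
KV_HEADS = 4            # grouped-query attention KV heads
HEAD_DIM = 2048 // 32   # hidden size / attention heads = 64
BYTES = 2               # FP16/BF16 cache (assumption)

def kv_cache_bytes(tokens: int) -> int:
    # 2x for the separate K and V tensors
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * tokens

print(f"{kv_cache_bytes(1) / 1024:.0f} KiB per token")               # 96 KiB
print(f"{kv_cache_bytes(131_072) / 2**30:.1f} GiB at full context")  # 12.0 GiB
```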

Research highlight

What improved

MoE branch of Qwen3

This model moves Qwen3 into a sparse MoE serving geometry while keeping the same user-facing thinking/non-thinking framing.

Low active path

Only 3.3B parameters are activated per token out of 30.5B total, which is the central deployment distinction versus the dense Qwen3 line.
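A rough rule of thumb (our assumption, not a figure from this page) is ~2 FLOPs per active parameter per decoded token, which makes the sparse saving concrete:

```python
# Per-token decode compute under the ~2 FLOPs/parameter heuristic (assumption).
ACTIVE, TOTAL = 3.3e9, 30.5e9

moe = 2 * ACTIVE    # ~6.6 GFLOPs/token on the sparse path
dense = 2 * TOTAL   # ~61 GFLOPs/token if the full 30.5B were dense

print(f"MoE:   {moe / 1e9:.1f} GFLOPs/token")
print(f"Dense: {dense / 1e9:.1f} GFLOPs/token ({dense / moe:.1f}x more)")
```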

Agent and reasoning focus

Qwen still positions the model for reasoning, instruction following, and complex agent workflows rather than only general chat.

Training and release context

How it was released

MoE family branch

Qwen3 ships dedicated MoE models alongside the dense line, so the user-facing thinking/non-thinking controls carry over unchanged while the serving profile shifts to sparse activation.

Sparse activation

The MoE releases state total and activated parameter counts separately, so capacity planning has to track both numbers rather than a single model size; that split is what separates them from the dense Qwen3 models at deployment time.

Long-context packaging

The base MoE releases are published with 32K native context and 131K support with YaRN, while the 2507 update is packaged at 256K native context.
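The 32K-to-131K stretch implies a 4x YaRN factor (131,072 / 32,768 = 4). A hedged sketch of the kind of rope_scaling entry that corresponds to, using the common Hugging Face config keys; verify the exact keys and factor against Qwen's own model card:

```python
# Illustrative YaRN scaling entry for the base MoE release (key names
# follow the usual Hugging Face config.json convention; confirm against
# the official Qwen3 model card before use).
NATIVE_CTX = 32_768
TARGET_CTX = 131_072

rope_scaling = {
    "rope_type": "yarn",
    "factor": TARGET_CTX / NATIVE_CTX,  # 4.0
    "original_max_position_embeddings": NATIVE_CTX,
}
print(rope_scaling)
```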

Where it is strong

Reasoning with lower active compute

The MoE line targets users who want large total capacity without paying a dense model's active compute per token.

Agent and tool use

Qwen still positions the MoE branch around agent workflows, tool calling, and mixed reasoning/general dialogue use.

Large multilingual serving

Useful when you want high-capacity multilingual serving without stepping up to a purely dense 70B-class model.

Memory behavior

What dominates VRAM

Resident VRAM tracks the full 30.5B parameter pool even though token compute is closer to the 3.3B activated path, so MoE changes compute pressure more than the weight floor.
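A minimal sketch of that weight floor at common quantization widths (framework overhead and KV cache deliberately excluded):

```python
# Weight-memory floor: all 30.5B parameters stay resident even though
# only ~3.3B are active per token.
TOTAL_PARAMS = 30.5e9

for label, bytes_per_param in [("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gib = TOTAL_PARAMS * bytes_per_param / 2**30
    print(f"{label:10s} ~{gib:5.1f} GiB of weights")
```

The KV cache sketched earlier sits on top of this floor, along with activations and framework overhead.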

Sources

Where this page is grounded