FitMyGPU


Qwen3 8B

Dense Qwen3 release for stronger general-purpose reasoning, agent, and multilingual assistant use with switchable thinking modes.

Overview and architecture

What it is

Company

Qwen

Family

Qwen

Release date

Apr 27, 2025

Architecture

Dense decoder-only transformer

License

Apache 2.0

Modality

Text

Context window

131,072 tokens

Total params

8.2B

Active params

8.2B (dense; all parameters active)

Layers

36

Hidden size

4,096

Attention heads

32

KV heads

8

KV-bearing layers

36
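The spec table above is enough to estimate per-token KV-cache size. A minimal sketch, assuming a conventional head dimension of hidden_size / attention_heads (128, not stated in the table) and 2-byte fp16/bf16 cache entries:

```python
# Per-token KV-cache size implied by the spec table above.
# Assumption: head_dim = hidden_size / attention_heads = 128.
hidden_size = 4096
attention_heads = 32
kv_heads = 8
kv_layers = 36
bytes_per_elem = 2  # fp16/bf16

head_dim = hidden_size // attention_heads  # 128

# K and V per KV-bearing layer; with GQA only the 8 KV heads are cached.
kv_bytes_per_token = 2 * kv_layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)  # 147456 bytes = 144 KiB per token

# Without grouped-query attention (32 KV heads) the cache would be 4x larger.
full_bytes = 2 * kv_layers * attention_heads * head_dim * bytes_per_elem
print(full_bytes // kv_bytes_per_token)  # 4
```

The 8-of-32 KV-head ratio is what keeps long-context serving of this model tractable: the cache grows at roughly 144 KiB per token instead of 576 KiB.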

Research highlight

What improved

Thinking-mode switch

The 8B model preserves Qwen3’s ability to move between deeper reasoning mode and faster non-thinking dialogue.
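Qwen3 documents a per-turn "soft switch" for this: appending /think or /no_think to a user message toggles the mode without swapping models. A minimal sketch; `with_mode` is a hypothetical helper, not part of any Qwen SDK:

```python
# Sketch of Qwen3's documented soft switch between thinking and
# non-thinking modes. with_mode is a hypothetical helper (assumption),
# not an official API.

def with_mode(text: str, thinking: bool) -> str:
    """Append Qwen3's per-turn mode tag to a user message."""
    return f"{text} {'/think' if thinking else '/no_think'}"

messages = [
    {"role": "user",
     "content": with_mode("Prove that sqrt(2) is irrational.", thinking=True)},
    {"role": "user",
     "content": with_mode("Summarize the proof in one line.", thinking=False)},
]
print(messages[0]["content"])
```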

Reasoning and instruction uplift

Qwen positions the line as stronger than earlier Qwen2.5 instruct releases across mathematics, code, commonsense reasoning, and instruction following.

Extended context with YaRN

The checkpoint keeps 32K native context and extends to 131K with YaRN, which matters for long-prompt deployment planning.
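The 32K-to-131K jump implies a YaRN scaling factor of 4.0. A sketch of the arithmetic, plus a rope_scaling block in the shape used by Hugging Face transformers configs (assumption: exact key names follow that convention):

```python
# YaRN extension factor implied by the context figures above.
native_ctx = 32_768
extended_ctx = 131_072
factor = extended_ctx / native_ctx
print(factor)  # 4.0

# rope_scaling fragment in the transformers-style shape (assumed keys).
rope_scaling = {
    "rope_type": "yarn",
    "factor": factor,
    "original_max_position_embeddings": native_ctx,
}
```

For deployment planning, the practical point is that 131K is a scaled mode, not the native window, so providers may expose either limit.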

Training and release context

How it was released

Family release

Qwen3 is released as a dense and MoE model family centered on switching between thinking and non-thinking modes within the same model.

Training stage

Qwen describes the release as a pretraining plus post-training model rather than a small instruction-only adaptation.

Context packaging

The 8B model is published with 32K native context and, like the larger dense variants, extends to 131K with YaRN.

Where it is strong


Thinking and non-thinking use

The 8B release is built to switch between deeper reasoning mode and faster general dialogue mode without changing models.

Agent workflows

Qwen positions the family for tool use and agent-style tasks in both thinking and non-thinking modes.

Multilingual assistant work

The family is published with support for 100+ languages and dialects, making it a broad multilingual assistant line rather than a narrow specialist release.

Memory behavior

What dominates VRAM

Weights dominate the dense 8B footprint, but the model is small enough that runtime choice and context length still have a visible effect on total VRAM.
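A back-of-envelope split between the two terms, using the spec-sheet numbers. Assumptions: bf16 weights and cache, head_dim = 128, and no allowance for activations or runtime overhead, so real usage runs higher:

```python
# Rough VRAM estimate for Qwen3 8B: weights + KV cache only.
# Assumptions: bf16 (2 bytes/element), head_dim = 128, and no
# activation or runtime overhead, so real usage will be higher.
GIB = 1024**3

params = 8.2e9
bytes_per_elem = 2  # bf16

weights_gib = params * bytes_per_elem / GIB  # ~15.3 GiB

kv_layers, kv_heads, head_dim = 36, 8, 128
kv_per_token = 2 * kv_layers * kv_heads * head_dim * bytes_per_elem

for ctx in (32_768, 131_072):
    kv_gib = ctx * kv_per_token / GIB
    print(f"{ctx:>7} tokens: weights {weights_gib:.1f} GiB "
          f"+ KV {kv_gib:.1f} GiB = {weights_gib + kv_gib:.1f} GiB")
```

At 32K the cache adds about 4.5 GiB on top of roughly 15.3 GiB of weights; at the full 131K window it adds about 18 GiB, which is why context length shows up so clearly in the calculator.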

Sources

Where this page is grounded