FitMyGPU

DeepSeek

DeepSeek R1 Distill Qwen 1.5B

Smallest DeepSeek R1 distill, carrying R1-style reasoning into a compact Qwen dense backbone that is easy to run locally.

Overview and architecture

What it is

Company

DeepSeek

Family

DeepSeek R1 Distill

Release date

Jan 20, 2025

Architecture

Dense decoder-only transformer

License

MIT + Apache 2.0

Modality

Text

Context window

131,072

Total params

1.5B

Active params

1.5B (dense; all parameters active every token)

Layers

28

Hidden size

1,536

Attention heads

12

KV heads

2

KV-bearing layers

28
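The geometry above pins down the KV-cache cost directly. A minimal sketch, assuming an fp16/bf16 cache and the standard GQA formula (2 tensors, K and V, per KV-bearing layer):

```python
# KV-cache cost per token, derived from the spec table above (fp16 assumed).
HIDDEN = 1536
HEADS = 12
KV_HEADS = 2         # grouped-query attention: only 2 KV heads per layer
LAYERS = 28          # all 28 layers carry KV
DTYPE_BYTES = 2      # fp16 / bf16 cache

head_dim = HIDDEN // HEADS  # 128
# K and V each store kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * head_dim * DTYPE_BYTES
print(kv_bytes_per_token)   # 28672 bytes, i.e. 28 KiB per token

full_context = 131_072
print(kv_bytes_per_token * full_context / 2**30)  # 3.5 GiB at max context
```

So a fully packed 131,072-token context costs about 3.5 GiB of cache on top of the weights, and shorter contexts scale down linearly.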

Research highlight

What improved

R1 reasoning distillation

DeepSeek frames the 1.5B distill models around one central result: reasoning behavior learned in a very large RL-trained model can be transferred into smaller dense backbones effectively.

Dense models can stay competitive

The release argues that smaller dense checkpoints distilled from R1 can outperform many comparably sized reasoning baselines without needing frontier-scale MoE deployment.

Post-training over architecture change

These models matter because of the R1 distillation pipeline and reasoning data, not because they introduce a new attention or cache architecture.

Training and release context

How it was released

Release lineage

The DeepSeek-R1 distill line is derived from samples generated by DeepSeek-R1 rather than trained as a separate base-model family from scratch.

Backbone choice

This checkpoint keeps the serving geometry of Qwen2.5-Math-1.5B, so its VRAM behavior follows that underlying dense architecture rather than the giant DeepSeek-R1 MoE backbone.

Usage guidance

DeepSeek recommends a sampling temperature around 0.6 (in the 0.5-0.7 range), avoiding system prompts entirely (put all instructions in the user message), and explicitly steering the model to begin its response with a think block for best results.
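That guidance translates into a simple request shape. A sketch, assuming an OpenAI-style chat payload (the helper name and the exact instruction wording are illustrative, not part of DeepSeek's release):

```python
# Sketch of a request following DeepSeek's usage guidance:
# no system message, temperature ~0.6, all instructions in the user turn.
def build_request(question: str, temperature: float = 0.6) -> dict:
    return {
        "messages": [
            # No system prompt: everything goes into the user message.
            {
                "role": "user",
                "content": f"{question}\nPlease reason step by step.",
            }
        ],
        "temperature": temperature,  # DeepSeek suggests 0.5-0.7
    }

req = build_request("What is 17 * 24?")
assert all(m["role"] != "system" for m in req["messages"])
```

For math prompts, DeepSeek additionally suggests asking for the final answer inside \boxed{}.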

Where it is strong


Reasoning per GPU

The 1.5B distill line is for users who want much of the R1 reasoning style without deploying the full 671B-class DeepSeek model.

Math and code

DeepSeek emphasizes strong gains on math, code, and structured reasoning benchmarks across the distilled checkpoints.

Normal serving stack

Because these stay on familiar Qwen or Llama dense backbones, they fit standard local inference tooling far more easily than the full R1 architecture.

Memory behavior

What dominates VRAM

Because this is a dense Qwen-derived checkpoint, VRAM scales in the familiar way: resident weights stay modest, while long-context KV growth dominates once the model is already loaded.
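Putting weights and cache together gives a rough single-GPU budget. A sketch under stated assumptions: 1.5B parameters at fp16 (2 bytes each), the fp16 GQA cache derived from the spec table, and no allowance for activations, framework overhead, or the CUDA context:

```python
# Rough VRAM budget: fp16 weights plus fp16 KV cache (a sketch, not exact).
PARAMS = 1.5e9
# 2 tensors (K, V) * 28 layers * 2 KV heads * head_dim 128 * 2 bytes = 28672 B
KV_BYTES_PER_TOKEN = 2 * 28 * 2 * (1536 // 12) * 2

def vram_gib(context_tokens: int, weight_bytes_per_param: float = 2.0) -> float:
    """Estimated resident GiB for a given live context length."""
    weights = PARAMS * weight_bytes_per_param
    kv = KV_BYTES_PER_TOKEN * context_tokens
    return (weights + kv) / 2**30

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {vram_gib(ctx):.2f} GiB")
```

The weights stay fixed near 2.8 GiB while the cache term grows from negligible at short contexts to roughly 3.5 GiB at the full window, which is exactly the "KV growth dominates once loaded" behavior described above.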

Sources

Where this page is grounded