FitMyGPU

Will It Fit?

Estimate text inference VRAM across Transformers and vLLM with a compact, explainable breakdown.

Model pages and release notes are linked here as well.

235B total • 22B active • 131,072 context • 4 KV heads

vLLM estimates use the selected GPU memory utilization. Transformers estimates use a fixed 4K-token, single-request baseline.

Does not fit on selected GPU

Qwen 3 235B A22B · vLLM · Official BF16 checkpoint

Required GPU VRAM (0.9 budget)

575.2 GB

Core estimate: 517.6 GB. Against an RTX 4090 24GB, this leaves a 549.4 GB deficit.

At 4,096 tokens of context, the estimated maximum concurrency is 0 requests.
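The arithmetic behind the budget line is simple enough to sketch. Assuming the tool divides the core estimate by the vLLM GPU memory utilization (an assumption, but one that reproduces the displayed figure within rounding):

```python
# Hypothetical reconstruction of the "Required GPU VRAM (0.9 budget)" figure.
core_estimate_gb = 517.6        # weights + KV cache + overhead, from the breakdown below
gpu_memory_utilization = 0.9    # vLLM --gpu-memory-utilization

# vLLM claims only this fraction of the card, so the card must hold
# core / utilization for the core estimate to fit inside its budget.
required_vram_gb = core_estimate_gb / gpu_memory_utilization
print(f"{required_vram_gb:.1f} GB")  # ~575.1 GB, matching the 575.2 GB shown
```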

Quick read

Selected GPU: RTX 4090 24GB
Runtime: vLLM
Load dtype: FP16
GPU utilization: 0.9
GPU count: 1
KV cache dtype: BF16
Context length: 4,096
Current concurrency: 1
Max concurrency @ context: 0
Class: Consumer
Bandwidth: 1,008 GB/s
Nominal VRAM: 24 GB
Core estimate: 517.6 GB
Required GPU VRAM (0.9 budget): 575.2 GB
Deficit: 549.4 GB

GPU compare

What fits this setup

GPU                   | Status | Headroom  | Max conc.
GB200 NVL72 GPU 186GB | OOM    | -375.4 GB | 0
B200 180GB            | OOM    | -381.9 GB | 0
H200 141GB            | OOM    | -423.8 GB | 0
RTX PRO 6000 96GB     | OOM    | -472.1 GB | 0
A100 80GB             | OOM    | -489.3 GB | 0
H100 80GB             | OOM    | -489.3 GB | 0
L40 48GB              | OOM    | -523.6 GB | 0
A100 40GB             | OOM    | -532.2 GB | 0
RTX 5090 32GB         | OOM    | -540.8 GB | 0
RTX 3090 24GB         | OOM    | -549.4 GB | 0
RTX 4090 24GB         | OOM    | -549.4 GB | 0

Breakdown

Where the memory goes

Weights

235B resident parameters at 2.00 bytes each. Calibrated from the official checkpoint profile.

470.2 GB

KV cache

Concurrency 1, context 4,096, 94 KV-bearing layers, 4 KV heads, BF16 cache storage.

0.4 GB

Runtime / safety overhead

Conservative buffer for allocator fragmentation, kernels, and runtime scratch space.

47.1 GB

Weights = parameter count × bytes per parameter.

KV cache grows with context length, KV-bearing layers, concurrent requests, and the selected KV cache dtype.

Substituted formulas

Weights

parameter count × bytes per weight

235B × 2.00 = 470.2 GB

The official Qwen3-235B-A22B safetensor weights total about 470.19 GB on Hugging Face.
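A minimal sketch of the weights term, assuming decimal gigabytes (1 GB = 10⁹ bytes) and slightly more than 235.0B parameters, which the 470.19 GB checkpoint size implies:

```python
# Weights = parameter count × bytes per weight.
params = 235.1e9          # assumption: "235B" is rounded; 470.19 GB / 2 B implies ~235.1B
bytes_per_weight = 2.0    # 16-bit weights (BF16/FP16)

weights_gb = params * bytes_per_weight / 1e9
print(f"{weights_gb:.1f} GB")  # 470.2 GB
```

Note that for a mixture-of-experts model all 235B parameters stay resident; the 22B active count governs compute per token, not memory.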

KV cache

batch × effective KV tokens across attention layers × 2 × KV heads × head dim × bytes per KV element

1 × 385,024 × 2 × 4 × 64 × 2 = 0.4 GB

Longer context, deeper models, and larger batches all expand the cache linearly. BF16 controls the bytes per stored KV element.
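The same substitution as code, expanding the 385,024 effective KV tokens as context × KV-bearing layers (4,096 × 94, consistent with the figures above):

```python
# KV cache = batch × (context × KV-bearing layers) × 2 (K and V)
#            × KV heads × head dim × bytes per KV element.
batch = 1              # concurrent requests
context = 4096         # tokens per request
kv_layers = 94         # KV-bearing layers
kv_heads = 4           # grouped-query KV heads
head_dim = 64          # hidden size 4,096 / 64 attention heads
kv_bytes_per_elem = 2  # BF16 cache

kv_tokens = context * kv_layers  # 385,024 effective KV tokens
kv_cache_gb = batch * kv_tokens * 2 * kv_heads * head_dim * kv_bytes_per_elem / 1e9
print(f"{kv_cache_gb:.2f} GB")   # ~0.39 GB, rounded to 0.4 GB above
```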

Overhead

max(1.5 GB, 10% × (weights + KV cache + linear state))

max(1.5 GB, 10% of 470.6 GB) = 47.1 GB

This leaves room for runtime buffers instead of claiming an unrealistically exact fit.
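As a sketch, with the linear-state term assumed zero since no linear-state line appears in this breakdown:

```python
# Overhead = max(1.5 GB, 10% × (weights + KV cache + linear state)).
weights_gb = 470.2
kv_cache_gb = 0.4
linear_state_gb = 0.0   # assumption: this model reports no linear/recurrent state

overhead_gb = max(1.5, 0.10 * (weights_gb + kv_cache_gb + linear_state_gb))
print(f"{overhead_gb:.1f} GB")  # 47.1 GB
```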

Model

Selected model

Qwen 3 235B A22B

235B total • 22B active • 131,072 context • 4 KV heads

About model

Total params: 235B
Active params: 22B
Layers: 94
Hidden size: 4,096
Attention heads: 64
KV heads: 4
KV-bearing layers: 94
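These spec fields are exactly what the KV formula consumes. As a sketch of the per-request KV footprint at the full 131,072 context (head dim assumed to be hidden size ÷ attention heads):

```python
# Per-request BF16 KV cache at the model's maximum context.
kv_layers, kv_heads = 94, 4
head_dim = 4096 // 64      # hidden size / attention heads (assumed)
context = 131_072
kv_bytes_per_elem = 2      # BF16

per_request_gb = context * kv_layers * 2 * kv_heads * head_dim * kv_bytes_per_elem / 1e9
print(f"{per_request_gb:.1f} GB")  # ~12.6 GB per request at max context
```

The small 4-head GQA cache is why even the full 131K context costs only about 12.6 GB per request.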