Estimate single-GPU text inference VRAM across Transformers and vLLM with a compact, explainable breakdown.
21B total • 3.6B active • 128K context • 8 KV heads
Official profile: mixed MXFP4 + BF16 checkpoint. OpenAI's GPT-OSS model card lists a 12.8 GiB checkpoint for gpt-oss-20b, and the estimator uses that published resident checkpoint size directly as the weights term.
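As a minimal sketch of that weights term (not the tool's exact code): the resident checkpoint size is taken as-is, and the runtime overhead constant below is an illustrative assumption, not a published number.

```python
# Hypothetical sketch: the weights term is the published resident checkpoint
# size, as described above. The overhead constant is an assumption.
CHECKPOINT_GIB = 12.8  # gpt-oss-20b mixed MXFP4 + BF16 checkpoint (model card)

def weights_vram_gib(checkpoint_gib: float = CHECKPOINT_GIB,
                     runtime_overhead_gib: float = 1.0) -> float:
    """Weights term: resident checkpoint plus an assumed runtime overhead."""
    return checkpoint_gib + runtime_overhead_gib

print(f"weights term ~ {weights_vram_gib():.1f} GiB")
```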
Serving options
Set the context length and the number of concurrent requests; extra fields appear only when they are relevant to the selected runtime.
Quick mental model: Transformers is modeled as a fixed single-request baseline, while vLLM exposes serving context and concurrency. Runtime presets still change the required card VRAM, and, as the sketch below shows, an FP8 KV cache cuts the KV term roughly in half versus BF16.
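A minimal KV-term sketch under stated assumptions: the 8 KV heads and 128K context come from the profile above, while the layer count (24) and head dimension (64) are assumptions used here for illustration only.

```python
# Minimal KV-cache sizing sketch. 8 KV heads and 128K context come from the
# profile above; 24 layers and head dim 64 are illustrative assumptions.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64

def kv_cache_gib(context_len: int, concurrent: int = 1,
                 bytes_per_elem: int = 2) -> float:
    """K and V tensors: 2 x layers x kv_heads x head_dim per token, per request."""
    elems = 2 * LAYERS * KV_HEADS * HEAD_DIM * context_len * concurrent
    return elems * bytes_per_elem / 2**30

full = 128_000
print(f"BF16 KV @ {full} tokens: {kv_cache_gib(full):.1f} GiB")                    # ~5.9 GiB
print(f"FP8  KV @ {full} tokens: {kv_cache_gib(full, bytes_per_elem=1):.1f} GiB")  # ~2.9 GiB
```

The formula also makes the concurrency effect visible: the KV term scales linearly with concurrent requests, which is why the vLLM fields dominate the estimate at long contexts.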