Estimate single-GPU text inference VRAM across Transformers and vLLM with a compact, explainable breakdown.
21B total • 3.6B active • 128K context • 8 KV heads
Official profile: mixed MXFP4 + BF16 checkpoint. OpenAI's GPT-OSS model card lists a 12.8 GiB checkpoint for gpt-oss-20b, and the estimator uses that published resident checkpoint size directly as the weights term.
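As a minimal sketch of that weights term (not the tool's exact code): the resident checkpoint size is taken as-is, and the runtime overhead constant below is an illustrative assumption, not a published number.

```python
# Hypothetical sketch: the weights term is the published resident checkpoint
# size, as described above. The overhead constant is an assumption.
CHECKPOINT_GIB = 12.8  # gpt-oss-20b mixed MXFP4 + BF16 checkpoint (model card)

def weights_vram_gib(checkpoint_gib: float = CHECKPOINT_GIB,
                     runtime_overhead_gib: float = 1.0) -> float:
    """Weights term: resident checkpoint plus an assumed runtime overhead."""
    return checkpoint_gib + runtime_overhead_gib

print(f"weights term ~ {weights_vram_gib():.1f} GiB")
```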
Serving options
Set the context length and the number of concurrent requests; extra fields appear only when they are relevant to the selected runtime.
Quick mental model: Transformers is modeled as a fixed single-request baseline, while vLLM exposes serving context and concurrency. Runtime presets still change the required card VRAM, and, as the sketch below shows, an FP8 KV cache cuts the KV term roughly in half versus BF16.
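A minimal KV-term sketch under stated assumptions: the 8 KV heads and 128K context come from the profile above, while the layer count (24) and head dimension (64) are assumptions used here for illustration only.

```python
# Minimal KV-cache sizing sketch. 8 KV heads and 128K context come from the
# profile above; 24 layers and head dim 64 are illustrative assumptions.
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64

def kv_cache_gib(context_len: int, concurrent: int = 1,
                 bytes_per_elem: int = 2) -> float:
    """K and V tensors: 2 x layers x kv_heads x head_dim per token, per request."""
    elems = 2 * LAYERS * KV_HEADS * HEAD_DIM * context_len * concurrent
    return elems * bytes_per_elem / 2**30

full = 128_000
print(f"BF16 KV @ {full} tokens: {kv_cache_gib(full):.1f} GiB")                    # ~5.9 GiB
print(f"FP8  KV @ {full} tokens: {kv_cache_gib(full, bytes_per_elem=1):.1f} GiB")  # ~2.9 GiB
```

The formula also makes the concurrency effect visible: the KV term scales linearly with concurrent requests, which is why the vLLM fields dominate the estimate at long contexts.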