OpenAI
GPT-OSS 120B
Production GPT-OSS release for general-purpose and higher-reasoning workloads that can fit on a single 80 GB class GPU.
Overview and architecture
What it is
Company
Family
Release date
Architecture
License
Modality
Context window
Total params
Active params
Layers
Hidden size
Attention heads
KV heads
KV-bearing layers
Research highlight
What improved
Single-80GB target
The release is explicitly positioned around fitting production-grade reasoning into one 80 GB class GPU such as an H100 or MI300X, which is the main operational change versus a typical 100B+ open model.
Configurable reasoning effort
OpenAI exposes low, medium, and high reasoning effort settings so latency and reasoning depth can be traded off at inference time instead of using one fixed behavior.
Native agent features
The model is released with native support for function calling, web browsing, Python execution, and structured outputs rather than treating those as wrapper-level add-ons.
Full chain-of-thought access
The release provides access to the model's reasoning trace for debugging and auditability, though OpenAI notes it is not intended for direct end-user display.
Training and release context
How it was released
Harmony-only format
Both GPT-OSS models were trained on OpenAI's Harmony response format and are expected to be used with that format rather than a generic chat template.
Model geometry
gpt-oss-120b uses 36 layers, 117B total parameters, 5.1B active parameters per token, 128 total experts, 4 active experts per token, and a 128K context window.
Quantized MoE release
The MoE weights were post-trained in MXFP4, which is the packaging decision that makes the 120B checkpoint practical on a single 80 GB GPU.
Training data and tokenizer
OpenAI describes the training mix as mostly English, text-only data with emphasis on STEM, coding, and general knowledge, tokenized with the open-sourced o200k_harmony tokenizer.
Where it is strong
Where it is strong
Production general-purpose serving
Best fit when you want one open model that can cover broad assistant, coding, and reasoning workloads without moving to multi-GPU serving first.
High-reasoning workloads
Strong match for use cases that benefit from controllable deeper reasoning rather than the fastest possible low-latency answers.
Fine-tuning and customization
OpenAI explicitly positions the model as fine-tunable, which matters if you want to adapt one large reasoning-capable checkpoint to a narrower production task.
Commercial deployment
The Apache 2.0 license makes it unusually straightforward to experiment, customize, and deploy commercially without copyleft friction.
Memory behavior
What dominates VRAM
More than 90% of GPT-OSS 120B's parameters sit in MXFP4-quantized MoE weights, while the remaining shared weights stay in BF16.
Sources