Mistral

Mistral Nemo 12B

Long-context dense Mistral checkpoint that remains practical on a single 24 GB card with quantization.

Overview and architecture

What it is

Company

Mistral

Family

Mistral

Release date

Jul 17, 2024

Architecture

Dense decoder-only transformer

License

Apache 2.0

Modality

Text

Context window

128,000

Total params

12.2B

Active params

Dense model

Layers

Hidden size

5,120

Attention heads

KV heads

KV-bearing layers

Research highlight

What improved

Mistral-NVIDIA joint release

Mistral Nemo is presented as a joint Mistral-NVIDIA model rather than as a routine checkpoint refresh, which is part of why the launch emphasized deployment practicality.

128K context and new tokenizer

The release highlights a native 128K context window and the Tekken tokenizer, both of which materially affect how the model is positioned for long-form and multilingual use.

Data mix upgrade

Mistral describes the model as trained with more multilingual and code-oriented data, so the upgrade story is as much about training mix as about architecture.

Deployment-friendly packaging

The family is explicitly pitched as a strong dense long-context model that still fits realistic single-node inference workflows, especially once lower-precision checkpoints are used.

Training and release context

How it was released

Joint release

Mistral Nemo is a joint Mistral-NVIDIA release, which is part of why the launch emphasized production deployment characteristics.

Tokenizer and context packaging

The family introduces the Tekken tokenizer and a 128K context window as concrete release-level changes rather than as later add-ons.

Backbone continuity

The release keeps a normal dense-transformer deployment story, with the upgrade driven more by tokenizer, data mix, and context than by architecture novelty.

Where it is strong

Long context

Strong option for long prompts and document-heavy workflows in a dense 12B class.

Multilingual use

The release explicitly leans into broader multilingual coverage than older smaller Mistral lines.

Code-heavy applications

The training mix and positioning make it a common choice for code-oriented serving.

Memory behavior

What dominates VRAM

Dense weights set the baseline footprint; long-context use makes KV cache the next thing to watch after quantization.

Sources

Where this page is grounded

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407open https://huggingface.co/mistralai/Mistral-Nemo-Instruct-FP8-2407/tree/mainopen