Mistral
Mixtral 8x7B
Sparse MoE model: per-token compute is close to that of two experts (top-2 routing), but VRAM still has to hold the weights of every expert.
Overview and architecture
What it is
Company
Family
Release date
Architecture
License
Modality
Context window
Total params
Active params
Layers
Hidden size
Attention heads
KV heads
KV-bearing layers
Research highlight
What improved
Sparse top-2 expert routing
Mixtral's core research change is sparse MoE routing: in each layer a router selects two experts per token, so only a small subset of experts is active even though a much larger parameter pool stays resident.
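The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Mistral's implementation: the function names `top2_route` and `moe_forward`, the toy experts, and the logit values are all hypothetical, and the sketch assumes the gate weights are a softmax taken over only the two selected logits.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and normalize their weights.

    Assumption: weights are a softmax over the two selected logits only.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy stand-ins for expert feed-forward blocks: expert k scales by (k + 1).
experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]  # hypothetical router output

routes = top2_route(logits)
output = moe_forward([1.0, 2.0], logits, experts)
print(routes)
print(output)
```

All eight experts exist in the `experts` list, but only two are called per token, which is the compute saving the section describes.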
Dense-quality alternative path
The release matters because it offered a practical open-weight sparse model at a time when most comparable open checkpoints were still fully dense.
Compute and capacity decoupling
Mixtral is important less for any new attention design than for separating total parameter capacity from per-token compute. Users notice the split directly: per-token compute tracks the active experts, while memory tracks the full parameter count.
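The gap between total and active parameters can be made concrete with rough arithmetic. The sketch below assumes configuration values taken from the publicly released Mixtral 8x7B checkpoint (hidden size 4096, expert FFN size 14336, 32 layers, 8 experts with top-2 routing, 32 query heads and 8 KV heads of dimension 128, vocabulary 32000). These numbers are not stated on this page, so treat them as assumptions. Normalization weights are ignored because they are negligible at this scale.

```python
# Assumed configuration (not stated in this page; see lead-in).
HIDDEN = 4096
FFN = 14336
LAYERS = 32
EXPERTS = 8
TOP_K = 2
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32000

# One expert is a gated FFN with three weight matrices.
expert_per_layer = 3 * HIDDEN * FFN

# Attention: Q and O projections are full width, K and V use fewer KV heads.
attn_per_layer = 2 * HIDDEN * (HEADS * HEAD_DIM) + 2 * HIDDEN * (KV_HEADS * HEAD_DIM)

# Router: one logit per expert.
router_per_layer = HIDDEN * EXPERTS

# Input embedding plus output projection.
embeddings = 2 * VOCAB * HIDDEN

# Parameters every token uses regardless of routing.
shared = LAYERS * (attn_per_layer + router_per_layer) + embeddings

total_params = shared + LAYERS * EXPERTS * expert_per_layer
active_params = shared + LAYERS * TOP_K * expert_per_layer

print(f"total:  {total_params / 1e9:.1f}B")
print(f"active: {active_params / 1e9:.1f}B")
print(f"active fraction: {active_params / total_params:.2f}")
```

Under these assumptions the estimate lands near 46.7B total and 12.9B active parameters, so a token touches a bit more than a quarter of the weights that must be stored.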
Training and release context
How it was released
Sparse release milestone
Mixtral mattered as one of the first widely adopted open sparse MoE checkpoints to feel practical outside research demos.
Architecture-first release
The release is fundamentally about sparse routing, not about a new tokenizer, longer context, or multimodal packaging.
Open deployment angle
Mistral shipped it as an openly deployable sparse alternative for users who wanted more total parameter capacity without the per-token compute cost of a dense model of the same size.
Where it is strong
Capability per token
Per-token compute tracks the active parameters, which are far fewer than the total resident parameters. This is the main reason people reach for Mixtral.
Instruction use
The instruct release is a strong open general assistant baseline when there is enough VRAM to hold all expert weights.
MoE experimentation
Useful for teams exploring sparse routing behavior without moving to frontier closed models.
Memory behavior
What dominates VRAM
Even though only a subset of experts is active per token, a single-GPU deployment must keep every expert's weights resident in VRAM, so weight memory scales with total parameters rather than active parameters.
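A back-of-the-envelope estimate shows why the resident weights dominate. The sketch below is a rough calculation, not a measurement: the helper `weight_memory_gib` is hypothetical, the total parameter count of about 46.7B is an assumed figure, and the estimate covers weights only, leaving out KV cache, activations, and framework overhead.

```python
def weight_memory_gib(total_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return total_params * bytes_per_param / 2**30

TOTAL_PARAMS = 46.7e9  # assumption: approximate total parameter count

# Bytes per parameter for a few common storage precisions.
PRECISIONS = [("fp16/bf16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]

estimates = {name: weight_memory_gib(TOTAL_PARAMS, size)
             for name, size in PRECISIONS}

for name, gib in estimates.items():
    print(f"{name:>10}: {gib:6.1f} GiB for weights")
```

The estimate uses the total parameter count, not the active count, because routing can select any expert for any token, so all of them have to be loaded.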
Sources