FitMyGPU

Mistral

Mixtral 8x7B

Sparse MoE model: per-token compute is close to that of two experts (about 12.9B active parameters), but VRAM still has to hold all 46.7B resident weights.

Overview and architecture

What it is

Company

Mistral AI

Family

Mixtral

Release date

Dec 10, 2023

Architecture

Mixture-of-experts transformer

License

Apache 2.0

Modality

Text

Context window

32,768 tokens

Total params

46.7B

Active params

12.9B

Layers

32

Hidden size

4,096

Attention heads

32

KV heads

8

KV-bearing layers

32
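The figures above are enough for a rough KV-cache estimate. The sketch below is an illustration under stated assumptions, not the calculator's actual formula: it assumes a head dimension of hidden size divided by attention heads (4,096 / 32 = 128), a 16-bit cache, a single sequence, and no sliding-window or paged-cache savings. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=4096 // 32,
                   bytes_per_value=2):
    """Estimate KV-cache size for one sequence.

    Assumptions (not stated on this page): head_dim = hidden / heads,
    16-bit cache values, and every cached token kept for every layer.
    """
    # Factor of 2 covers one key vector and one value vector per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token_kib = kv_cache_bytes(1) / 1024
full_context_gib = kv_cache_bytes(32_768) / 2**30

print(f"{per_token_kib:.0f} KiB per token")
print(f"{full_context_gib:.0f} GiB at the full 32,768-token context")
```

Because there are 8 KV heads rather than 32, this estimate is a quarter of what the same model would need if every attention head kept its own keys and values.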

Research highlight

What improved

Sparse top-2 expert routing

Mixtral's core research change is sparse MoE routing: in each layer a router sends every token to only 2 of 8 experts, even though the full expert pool stays resident in memory.
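A minimal sketch of top-2 routing for a single token follows, in plain Python with toy experts. It assumes the gate weights are a softmax over the two selected router logits; the names `top2_route` and `moe_layer`, the toy experts, and the logit values are all hypothetical and chosen only for illustration.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token."""
    # Pick the two experts with the highest router scores.
    top2 = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:2]
    # Softmax over the two selected logits only (assumed gating form).
    peak = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]


def moe_layer(x, router_logits, experts):
    """Weighted sum of the two selected experts; the other experts never run."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert k multiplies its input by (k + 1).
experts = [lambda x, k=k: [(k + 1) * v for v in x] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 0.2]

route = top2_route(logits)
output = moe_layer([1.0, 2.0], logits, experts)
print(route)
print(output)
```

Only two of the eight expert functions are called per token, which is the compute saving; all eight still have to exist in the `experts` list, which is the memory cost.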

Dense-quality alternative path

The release matters because it offered a practical, openly available sparse model at a time when most comparable open checkpoints were still fully dense.

Compute and capacity decoupling

Mixtral matters less for any new attention design than for separating total parameter capacity (46.7B) from per-token compute (12.9B active). Users see that split directly: per-token compute tracks the active count, while memory requirements track the total.
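The two headline parameter counts can be split into a shared part and a per-expert part with simple arithmetic. This is a rough sketch under assumptions that are not stated on this page: 8 equal-sized experts (from the model name), 2 active per token (from the top-2 routing above), and every non-expert weight counted as always active. The helper name `split_params` is hypothetical.

```python
TOTAL_B = 46.7    # total parameters, in billions (from this page)
ACTIVE_B = 12.9   # active parameters per token, in billions (from this page)
NUM_EXPERTS = 8   # assumption taken from the "8x7B" name
TOP_K = 2         # assumption taken from top-2 routing


def split_params(total_b, active_b, num_experts, top_k):
    """Solve total = shared + n * e and active = shared + k * e for shared, e."""
    per_expert = (total_b - active_b) / (num_experts - top_k)
    shared = active_b - top_k * per_expert
    return shared, per_expert


shared_b, per_expert_b = split_params(TOTAL_B, ACTIVE_B, NUM_EXPERTS, TOP_K)
active_fraction = ACTIVE_B / TOTAL_B

print(f"shared (always active): ~{shared_b:.2f}B")
print(f"per expert, summed over layers: ~{per_expert_b:.2f}B")
print(f"active share of resident weights: {active_fraction:.1%}")
```

Under these assumptions, a little over a quarter of the resident weights take part in any single token's forward pass.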

Training and release context

How it was released

Sparse release milestone

Mixtral mattered as one of the first widely adopted open sparse MoE checkpoints to feel practical outside research demos.

Architecture-first release

The release is fundamentally about sparse routing, not about a new tokenizer, longer context, or multimodal packaging.

Open deployment angle

Mistral shipped it as an openly deployable sparse alternative for users who wanted higher resident capacity without dense-model compute scaling.

Where it is strong

Capability per token

Each token uses only about 12.9B of the 46.7B resident parameters, so per-token compute stays far below what the total capacity suggests. This is the main reason people reach for Mixtral.

Instruction use

The instruct release is a strong open general assistant baseline when VRAM is available.

MoE experimentation

Useful for teams exploring sparse routing behavior without moving to frontier closed models.

Memory behavior

What dominates VRAM

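Even though only a subset of experts is active per token, the router can pick any expert for any token, so on a single GPU every expert's weights must stay resident in VRAM. Weight memory therefore scales with the 46.7B total, not the 12.9B active count.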
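As a back-of-the-envelope check, weight memory can be estimated from the total parameter count and the storage width per weight. The sketch below is an approximation, not the calculator's method: it ignores quantization metadata, activations, the KV cache, and framework overhead, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(total_params_b, bits_per_weight):
    """Weights-only footprint in decimal GB for a parameter count in billions."""
    return total_params_b * bits_per_weight / 8


TOTAL_B = 46.7   # resident parameters, in billions (from this page)
ACTIVE_B = 12.9  # active parameters per token, in billions (from this page)

for bits in (16, 8, 4):
    resident = weight_memory_gb(TOTAL_B, bits)
    active_only = weight_memory_gb(ACTIVE_B, bits)
    print(f"{bits:>2}-bit: ~{resident:.1f} GB resident "
          f"(vs ~{active_only:.1f} GB if only active weights counted)")
```

The second figure in each row is shown only for contrast; it is not a usable budget, because the inactive experts cannot be dropped from memory.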

Sources

Where this page is grounded