Model notes

Llama 3.1 8B

Compact dense Llama model with grouped-query attention and a 128K context window.

8B dense • 131,072 context • 8 KV heads

Architecture

Model spec

Architecture: Dense decoder-only transformer
Total params: 8B
Active params: 8B (dense model; all parameters are active per token)
Layers: 32
Hidden size: 4,096
Attention heads: 32
KV heads: 8
KV-bearing layers: 32
Context length: 131,072
Modality: Text
License: Llama 3.1 Community License
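As a minimal sketch (assuming BF16 cache entries and the standard one key/value pair per attention layer), the spec numbers above translate into KV-cache memory like this:

```python
# KV-cache estimate for Llama 3.1 8B, using the values from the spec table.
# A rough sketch, not an exact profile: it ignores allocator overhead,
# paging granularity, and any framework-specific bookkeeping.
layers = 32
kv_heads = 8
attn_heads = 32
hidden = 4096
head_dim = hidden // attn_heads  # 128
bytes_per_el = 2                 # BF16

def kv_cache_bytes(tokens: int, heads: int) -> int:
    # 2x for keys and values; every one of the 32 layers carries KV state.
    return 2 * layers * heads * head_dim * bytes_per_el * tokens

full_ctx = 131_072
gqa = kv_cache_bytes(full_ctx, kv_heads)    # grouped-query: 8 KV heads
mha = kv_cache_bytes(full_ctx, attn_heads)  # hypothetical full multi-head

print(f"GQA KV cache @ 128K: {gqa / 2**30:.0f} GiB")   # 16 GiB
print(f"MHA would need:      {mha / 2**30:.0f} GiB")   # 64 GiB
```

With 8 KV heads instead of 32, the cache at full context is a quarter of what full multi-head attention would require.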

Why it matters

Why memory behaves this way

Research highlight

Grouped-query attention keeps KV state lighter than full multi-head attention while retaining a long native context window.

Memory note

Dense weights dominate the footprint; grouped KV heads keep cache growth in check at long context.

Checkpoints

Official profiles

Official BF16 checkpoint

Meta's official Llama 3.1 8B Instruct release is a BF16 checkpoint with grouped-query attention.

