Training Compute-Optimal Large Language Models

Paper: Training Compute-Optimal Large Language Models
Authors: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al.
Common name: The Chinchilla scaling paper
arXiv: 2203.15556

One-sentence summary

For a fixed language-model pretraining compute budget, the paper finds that model parameters and training tokens should scale in approximately equal proportion; consistent with this, its 70B-parameter Chinchilla model, trained on 1.4T tokens, outperforms the 280B-parameter Gopher model trained with approximately the same compute.

Research question

The paper asks:

Given a fixed number of training FLOPs, how should compute be allocated between model size and the number of training tokens?

Earlier scaling work, particularly Kaplan et al. (2020), placed most additional compute into parameters. Under that prescription, a tenfold compute increase would increase model size by approximately 5.5 times but training data by only 1.8 times. This encouraged a generation of models with rapidly increasing parameter counts but similar training lengths of roughly 300B tokens.

Hoffmann et al. argue that these models were undertrained. Their alternative result is approximately symmetric:

double the model size;
double the number of training tokens;
total training compute therefore increases by approximately four times.

This result is often expressed as the “Chinchilla rule” that a compute-optimal dense Transformer should train on approximately 20 tokens per parameter. That ratio is a convenient approximation derived from the paper’s fitted frontier, not a universal constant.
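
As a rough illustration of how the 20-tokens-per-parameter heuristic turns a FLOP budget into a model size, the sketch below combines it with the approximation \(C \approx 6ND\) that this summary uses for training compute. The function name `chinchilla_allocation` and the `tokens_per_param` argument are illustrative and do not come from the paper.

```python
import math


def chinchilla_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a training budget into (parameters, tokens).

    Assumes C ~= 6 * N * D and a fixed ratio D = tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)). The default ratio of 20 is
    the rule-of-thumb value, not a universal constant.
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Budget listed for the 67B row of the paper's Approach 1 projections.
params, tokens = chinchilla_allocation(5.76e23)
print(f"{params / 1e9:.1f}B parameters, {tokens / 1e12:.2f}T tokens")
```

For this budget the heuristic gives roughly 69B parameters and 1.4T tokens, close to Chinchilla's actual configuration.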

Experimental basis

The authors train more than 400 autoregressive Transformer models:

Parameter counts range from roughly 44M to 16B.
Training lengths range from 5B to more than 400B tokens.
Models are trained on the MassiveText distribution, with additional scaling checks on C4 and GitHub code.
The analysis treats final pretraining loss as the quantity to minimize.
Compute is measured with a detailed FLOP calculation closely approximated by:

\[ C \approx 6ND, \]

where \(C\) is training compute, \(N\) is the parameter count, and \(D\) is the number of training tokens.

The study estimates the compute-optimal frontier using three approaches.

Three scaling analyses

Approach 1: Envelope over training curves

For each model size, the authors train models with four different cosine learning-rate cycle lengths. They interpolate the loss curves and, for each FLOP budget, identify the model and training horizon producing the lowest loss.

They then fit:

\[ N_{\mathrm{opt}} \propto C^a,\qquad D_{\mathrm{opt}} \propto C^b. \]

The fitted exponents are:

\(a = 0.50\)
\(b = 0.50\)
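
The envelope procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the training curves are synthetic (generated from the parametric loss quoted in this summary), there is one curve per model size rather than four schedules, and the grids `sizes`, `token_grid`, and `budgets` are hypothetical. Because the curves are synthetic, the fitted exponent reflects the stand-in loss function, not the paper's Approach 1 result.

```python
import numpy as np


def synthetic_loss(n_params, n_tokens):
    # Stand-in for logged training curves (parametric fit quoted in this summary).
    return 1.69 + 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28


def envelope_optimum(sizes, token_grid, budgets):
    """For each FLOP budget, return the model size with the lowest interpolated loss."""
    best_sizes = []
    for budget in budgets:
        losses = []
        for n in sizes:
            curve_flops = 6.0 * n * token_grid   # C ~= 6ND along the curve
            curve_loss = synthetic_loss(n, token_grid)
            # Interpolate in log-FLOPs; budgets outside a curve are excluded.
            losses.append(np.interp(np.log(budget), np.log(curve_flops),
                                    curve_loss, left=np.inf, right=np.inf))
        best_sizes.append(sizes[int(np.argmin(losses))])
    return np.array(best_sizes)


sizes = np.logspace(7.7, 10.3, 40)      # hypothetical model sizes
token_grid = np.logspace(7, 13, 400)    # hypothetical token counts per curve
budgets = np.logspace(18, 21, 13)       # FLOP budgets to probe

n_opt = envelope_optimum(sizes, token_grid, budgets)
# The slope of log N_opt against log C estimates the exponent a.
a, _ = np.polyfit(np.log(budgets), np.log(n_opt), 1)
print(f"fitted a = {a:.2f}")
```

The design choice worth noting is the interpolation in log-FLOPs: it lets runs of different sizes be compared at exactly the same budget even though none of them logged a loss at that precise point.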

Approach 2: IsoFLOP profiles

For nine fixed compute budgets, the authors train models of different sizes. The number of tokens changes inversely with model size so that all models on one profile consume the same compute. A parabola is fitted to loss versus parameter count to locate the minimum for each budget.

The fitted exponents are:

\(a = 0.49\)
\(b = 0.51\)
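
A single IsoFLOP profile can be sketched as follows. The losses here are synthetic, generated from the parametric fit quoted in this summary, and the model-size grid is hypothetical. Fitting the parabola in log-parameter space is an assumption of this sketch.

```python
import numpy as np


def synthetic_loss(n_params, n_tokens):
    # Stand-in for measured final losses (parametric fit quoted in this summary).
    return 1.69 + 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28


def isoflop_minimum(budget, sizes):
    """Fit a parabola to loss vs. log(parameters) and return the size at its vertex."""
    tokens = budget / (6.0 * sizes)            # hold C ~= 6ND fixed across sizes
    losses = synthetic_loss(sizes, tokens)
    log_n = np.log(sizes)
    c2, c1, _ = np.polyfit(log_n, losses, 2)   # loss ~= c2*x^2 + c1*x + c0
    return float(np.exp(-c1 / (2.0 * c2)))     # vertex of the parabola


sizes = np.logspace(np.log10(3e8), np.log10(1e10), 9)  # hypothetical model sizes
n_opt = isoflop_minimum(1e21, sizes)
print(f"estimated optimum: {n_opt / 1e9:.2f}B parameters, "
      f"{1e21 / (6.0 * n_opt) / 1e9:.0f}B tokens")
```

Repeating this for each budget and regressing the log of the estimated optimum on the log of the budget would give the exponent \(a\).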

Approach 3: Parametric loss model

The authors fit:

\[ L(N,D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}. \]

The fitted values are:

\[ L(N,D) = 1.69 + \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}}. \]
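
To make the fitted formula concrete, the sketch below evaluates it at the Gopher and Chinchilla configurations discussed in this summary. The function name `parametric_loss` is illustrative. The outputs are predictions of the rounded fit, not measured losses.

```python
def parametric_loss(n_params: float, n_tokens: float) -> float:
    """Fitted loss L(N, D) = E + A / N^alpha + B / D^beta with the rounded constants."""
    return 1.69 + 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28


gopher = parametric_loss(280e9, 300e9)       # 280B parameters, 300B tokens
chinchilla = parametric_loss(70e9, 1.4e12)   # 70B parameters, 1.4T tokens
print(f"Gopher: {gopher:.3f}, Chinchilla: {chinchilla:.3f}")
```

The formula predicts about 1.99 for the Gopher configuration and about 1.94 for the Chinchilla configuration: the larger data term for Gopher outweighs its smaller parameter term.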

Minimizing this loss subject to \(C \approx 6ND\) produces:

\(a = 0.46\)
\(b = 0.54\)
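
The exponents follow from a short constrained minimization. As a sketch, substituting \(D = C/(6N)\) into the parametric loss and setting the derivative with respect to \(N\) to zero gives the following, where \(G\) is shorthand for the prefactor:

\[
\frac{\partial}{\partial N}\left[E + \frac{A}{N^{\alpha}} + B\left(\frac{6N}{C}\right)^{\beta}\right] = 0
\;\Longrightarrow\;
N_{\mathrm{opt}} = G\left(\frac{C}{6}\right)^{a},\qquad
D_{\mathrm{opt}} = G^{-1}\left(\frac{C}{6}\right)^{b},
\]

\[
G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\qquad
a = \frac{\beta}{\alpha+\beta},\qquad
b = \frac{\alpha}{\alpha+\beta}.
\]

With the rounded values \(\alpha = 0.34\) and \(\beta = 0.28\), these expressions give \(a \approx 0.45\) and \(b \approx 0.55\). The reported 0.46 and 0.54 presumably reflect the unrounded fitted constants.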

All three methods therefore place roughly equal scaling weight on parameters and data. By contrast, Kaplan et al. estimated \(a=0.73\) and \(b=0.27\).
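
The practical meaning of these exponents is easiest to see for a tenfold compute increase. The sketch below applies each pair of exponents quoted in this summary; the names `EXPONENTS` and `growth_factors` are illustrative.

```python
# Exponents (a, b) as quoted in this summary; N_opt ~ C^a and D_opt ~ C^b.
EXPONENTS = {
    "Kaplan et al.": (0.73, 0.27),
    "Approach 1": (0.50, 0.50),
    "Approach 2": (0.49, 0.51),
    "Approach 3": (0.46, 0.54),
}


def growth_factors(compute_multiplier: float, a: float, b: float):
    """Return how much parameters and tokens grow when compute grows by the given factor."""
    return compute_multiplier**a, compute_multiplier**b


for name, (a, b) in EXPONENTS.items():
    n_factor, d_factor = growth_factors(10.0, a, b)
    print(f"{name:14s} params x{n_factor:.2f}, tokens x{d_factor:.2f}")
```

The Kaplan row comes out near 5.4 and 1.9, close to the rounded 5.5 and 1.8 figures cited for that prescription, while the three Chinchilla approaches all grow parameters and tokens by roughly a factor of three each.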

Why the result differs from Kaplan et al.

The paper identifies two main methodological differences.

First, Kaplan et al. used a common learning-rate schedule across training horizons and analyzed intermediate points along that schedule. Hoffmann et al. show that a cosine schedule should be matched to the intended training length. If the schedule is much longer than the actual run, the learning rate has not decayed sufficiently at the measurement point, causing short runs to appear worse than they would under a properly calibrated schedule. Overshooting the target training length by more than approximately 25% noticeably degrades final loss.

Second, this study includes many larger proxy models, up to 16B parameters. The authors observe curvature in the compute-loss frontier, so estimates based primarily on very small models may extrapolate incorrectly.
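
The schedule mismatch described in the first point can be illustrated numerically. This is a minimal sketch: `cosine_lr` is a generic cosine decay, and the step count, peak learning rate, and the tenfold decay floor (`final_ratio=0.1`) are assumptions of the example.

```python
import math


def cosine_lr(step: int, cycle_steps: int, peak_lr: float, final_ratio: float = 0.1) -> float:
    """Cosine decay from peak_lr to final_ratio * peak_lr over cycle_steps."""
    progress = min(step, cycle_steps) / cycle_steps
    floor = final_ratio * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


train_steps = 10_000   # hypothetical intended training length
peak_lr = 2e-4         # hypothetical peak learning rate

matched = cosine_lr(train_steps, cycle_steps=train_steps, peak_lr=peak_lr)
overshoot = cosine_lr(train_steps, cycle_steps=2 * train_steps, peak_lr=peak_lr)
print(f"matched schedule ends at {matched / peak_lr:.2f} x peak")
print(f"2x-long schedule is still at {overshoot / peak_lr:.2f} x peak")
```

When the cycle is twice the run length, the learning rate at the end of the run is still 0.55 of its peak instead of 0.10, so a loss measured at that point comes from a model that has not finished annealing.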

At \(10^{21}\) FLOPs, they directly compare:

a 2.80B model predicted by their method;
a 4.74B model predicted by the Kaplan scaling rule.

The smaller, longer-trained model achieves lower final loss.

Compute-optimal projections

Approach 1 produces the following approximate frontier:

Parameters	Training tokens	Training FLOPs
400M	8.0B	\(1.92\times10^{19}\)
1B	20.2B	\(1.21\times10^{20}\)
10B	205B	\(1.23\times10^{22}\)
67B	1.5T	\(5.76\times10^{23}\)
175B	3.7T	\(3.85\times10^{24}\)
280B	5.9T	\(9.90\times10^{24}\)
1T	21.2T	\(1.27\times10^{26}\)

These estimates imply approximately 20–22 training tokens per parameter. The other two methods predict similar or larger token requirements, especially at high compute.
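
Both the token-per-parameter ratio and the near-equal exponents can be checked directly from the table. The sketch below copies the table rows and fits straight lines in log-log space; the variable names are illustrative.

```python
import numpy as np

# (parameters, tokens, FLOPs) rows copied from the projection table above.
frontier = np.array([
    (400e6, 8.0e9, 1.92e19),
    (1e9, 20.2e9, 1.21e20),
    (10e9, 205e9, 1.23e22),
    (67e9, 1.5e12, 5.76e23),
    (175e9, 3.7e12, 3.85e24),
    (280e9, 5.9e12, 9.90e24),
    (1e12, 21.2e12, 1.27e26),
])
params, tokens, flops = frontier.T

tokens_per_param = tokens / params
# Slopes in log-log space recover the exponents in N_opt ~ C^a and D_opt ~ C^b.
a = np.polyfit(np.log(flops), np.log(params), 1)[0]
b = np.polyfit(np.log(flops), np.log(tokens), 1)[0]

print("tokens per parameter:", np.round(tokens_per_param, 1))
print(f"a = {a:.2f}, b = {b:.2f}")
```

Both slopes come out close to 0.5, and every row sits between 20 and about 22.4 tokens per parameter.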

The authors emphasize that long-range extrapolations are uncertain. Their observed frontier curvature may imply that even smaller, more heavily trained models become optimal at very large budgets.

Chinchilla experiment

The decisive large-scale test compares:

Model	Parameters	Tokens	Relative size	Training compute
Gopher	280B	300B	4×	Approximately equal
Chinchilla	70B	1.4T	1×	Approximately equal

Chinchilla spends approximately the same compute on a model four times smaller, trained on nearly five times as many tokens (1.4T versus 300B, a factor of about 4.7).
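
The "approximately equal" compute claim can be sanity-checked with the \(C \approx 6ND\) approximation used in this summary. This is a back-of-the-envelope sketch; the paper's detailed FLOP accounting is not reproduced here.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


gopher = training_flops(280e9, 300e9)        # 280B parameters, 300B tokens
chinchilla = training_flops(70e9, 1.4e12)    # 70B parameters, 1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"ratio:      {chinchilla / gopher:.2f}")
```

Under this approximation the two budgets are about 5.0e23 and 5.9e23 FLOPs, a difference of under 20%.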

Main results

Chinchilla outperforms Gopher on nearly every reported benchmark:

MMLU 5-shot: 67.6% versus 60.0%.
BIG-bench average: 65.1% versus 54.4%.
WikiText-103 perplexity: 7.16 versus 7.75.
RACE-middle: 86.8% versus 75.1%.
RACE-high: 82.3% versus 71.6%.
Natural Questions, 5-shot: 31.5% versus 24.5%.
TriviaQA unfiltered, 5-shot: 73.2% versus 63.6%.
TruthfulQA, 0-shot: 43.6% versus 29.5%.

On MMLU, Chinchilla improves on 51 of 57 tasks, ties on two, and loses on four. It also outperforms Gopher on all evaluated Pile subsets.

Because the model is four times smaller, Chinchilla also requires substantially less memory and compute at inference time. This matters for deployment: judging an allocation by training loss alone understates the benefit of shifting compute from parameters to data.
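
A rough sense of the inference saving follows from two common rules of thumb, neither of which comes from the paper: about 2 FLOPs per parameter per generated token for a dense Transformer forward pass, and 2 bytes per parameter for 16-bit weights. The sketch below applies both and ignores attention and key-value cache costs.

```python
def inference_cost(n_params: float, bytes_per_param: int = 2):
    """Rough per-token forward FLOPs and weight memory for a dense Transformer.

    Assumptions (not from the paper): ~2 FLOPs per parameter per generated
    token and 2 bytes per parameter (16-bit weights). Attention and
    KV-cache costs are ignored.
    """
    flops_per_token = 2.0 * n_params
    weight_bytes = n_params * bytes_per_param
    return flops_per_token, weight_bytes


for name, n in [("Gopher", 280e9), ("Chinchilla", 70e9)]:
    flops, mem = inference_cost(n)
    print(f"{name:10s} {flops:.1e} FLOPs/token, {mem / 1e9:.0f} GB of weights")
```

Under these assumptions both quantities scale linearly with parameter count, so the fourfold size reduction carries over directly to per-token compute and weight memory.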

Dataset and training details

Chinchilla is trained on MassiveText:

Source	Sampling share	Approximate epochs in 1.4T tokens
MassiveWeb	45%	1.24
Books	30%	0.75
C4	10%	0.77
News	10%	0.21
GitHub	4%	0.13
Wikipedia	1%	3.40

The scaling experiments generally operate below one data epoch, but the final Chinchilla run repeats MassiveWeb and Wikipedia data. The paper does not fully investigate the multiple-epoch regime.
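
The epoch column can be read as tokens drawn from a source divided by that source's size. The sketch below inverts the relationship to back out the implied source sizes from the table. These are back-of-the-envelope estimates that inherit the rounding of the epoch figures, and the names `MIXTURE` and `implied_source_tokens` are illustrative.

```python
TOTAL_TOKENS = 1.4e12

# (sampling share, approximate epochs) copied from the table above.
MIXTURE = {
    "MassiveWeb": (0.45, 1.24),
    "Books": (0.30, 0.75),
    "C4": (0.10, 0.77),
    "News": (0.10, 0.21),
    "GitHub": (0.04, 0.13),
    "Wikipedia": (0.01, 3.40),
}


def implied_source_tokens(share: float, epochs: float, total: float = TOTAL_TOKENS) -> float:
    """Tokens drawn from a source divided by epochs gives the implied source size."""
    return share * total / epochs


# The sampling shares should account for the whole token budget.
assert abs(sum(share for share, _ in MIXTURE.values()) - 1.0) < 1e-9

for name, (share, epochs) in MIXTURE.items():
    size = implied_source_tokens(share, epochs)
    status = "repeated" if epochs > 1.0 else "under one epoch"
    print(f"{name:10s} ~{size / 1e9:6.0f}B tokens in source ({status})")
```

Only MassiveWeb and Wikipedia exceed one epoch, which matches the statement that the final run repeats data from those two sources.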

The scaling result is also reproduced on C4 and GitHub code:

C4: parameter exponent 0.50, token exponent 0.50.
GitHub: parameter exponent 0.53, token exponent 0.47.

This suggests that near-equal parameter/data scaling is not unique to MassiveText, within the tested settings.

Important confounds in the Chinchilla–Gopher comparison

Chinchilla and Gopher do not differ only in parameter count and token count:

Chinchilla uses AdamW; Gopher uses Adam.
Chinchilla keeps a higher-precision optimizer copy of model weights.
Chinchilla uses a modified tokenizer without NFKC normalization.
The MassiveText source mixture differs slightly.
Architecture widths, head counts, batch sizes, and peak learning rates differ.

The appendix shows at smaller scale that AdamW and the higher-precision optimizer state improve training relative to the Gopher setup. These changes therefore explain some unknown portion of Chinchilla’s advantage.

This does not invalidate the scaling-law evidence, because the three frontier analyses are based on hundreds of controlled smaller runs. It does mean that the 70B-versus-280B comparison is not a clean single-variable ablation of the scaling prescription.

Strengths

A consequential resource-allocation question: The paper changes how fixed training budgets should be spent.
Three convergent estimators: Independent methods produce similar parameter/data exponents.
Direct comparison with the earlier scaling rule: The paper identifies a concrete learning-rate-schedule issue and runs a controlled head-to-head test.
Large experimental sweep: More than 400 models support the fitted relationships.
Large-scale validation: Chinchilla demonstrates that the proposed allocation works at a practically significant scale.
Cross-dataset check: Similar scaling appears on natural language and source code.
Deployment relevance: The result reduces inference cost, not only pretraining loss.

Limitations

Only one large-scale pair: The core large-scale validation consists of Chinchilla and Gopher, without replicas or intermediate-scale confirmation.
Large-run confounds: Optimizer, numerical precision, tokenizer, source mixture, and some hyperparameters change alongside the compute allocation.
Training loss is the fitted objective: Downstream capability is evaluated for Chinchilla, but the frontier itself is selected using next-token loss. Some capabilities may scale differently.
Fixed architecture family: The conclusions apply to the tested dense autoregressive Transformers. Different architectures, retrieval, mixture-of-experts systems, context lengths, or tokenizers can change the frontier.
Single-pass assumption: Most scaling runs use less than one epoch. Modern training frequently repeats high-quality datasets, and repeated-token behavior is not modeled.
Data quality is implicit: The law treats tokens as a quantity even though token utility varies substantially. The authors explicitly expect the frontier to depend on access to sufficiently high-quality data.
Power-law extrapolation: The fitted law is projected far beyond observed compute despite measured curvature in the frontier.
No full cost model: Training FLOPs are optimized, but data acquisition, filtering, optimizer state, communication, wall-clock efficiency, and inference demand are not integrated into one objective.
Limited safety analysis: Better language modeling does not remove bias or toxicity. Chinchilla’s unconditional toxicity is roughly comparable to Gopher’s.
Closed artifact: Chinchilla was not publicly released, limiting independent replication.

Bottom line

The paper’s durable contribution is not that 70B parameters or 1.4T tokens are inherently optimal. It establishes that parameter count alone is a poor proxy for how effectively training compute has been used.

Its practical conclusions are:

many early large language models were too large and trained for too few tokens;
parameters and tokens should scale at approximately equal rates under the studied regime;
learning-rate schedules must match the intended training horizon;
training a smaller model longer can improve both quality and inference economics;
the useful token budget grows rapidly, making dataset quality and governance central constraints; and
scaling laws are empirical planning tools, not timeless constants.
