Pretraining

DataComp-LM: In Search of the Next Generation of Training Sets for Language Models

DataComp-LM establishes a controlled benchmark for dataset research and finds that aggressive model-based quality filtering is more effective than conventional source mixing.
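
As a rough illustration of what model-based quality filtering means in practice, the sketch below scores every document and keeps only the highest-scoring fraction. It is a minimal sketch, not the paper's pipeline: `filter_top_fraction`, `toy_quality_score`, and the `keep_fraction` value are hypothetical names and settings chosen for this example, and the scorer is a crude heuristic standing in for a learned classifier.

```python
from typing import Callable, List


def filter_top_fraction(
    docs: List[str],
    score: Callable[[str], float],
    keep_fraction: float,
) -> List[str]:
    """Keep the highest-scoring fraction of docs, preserving input order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not docs:
        return []
    n_keep = max(1, int(len(docs) * keep_fraction))
    scores = [score(d) for d in docs]
    # Rank document indices by score, best first (ties keep input order).
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [d for i, d in enumerate(docs) if i in keep]


def toy_quality_score(doc: str) -> float:
    """Hypothetical stand-in for a learned quality classifier.

    Returns the fraction of whitespace-separated tokens that are purely
    alphabetic. A real pipeline would call a trained model here.
    """
    words = doc.split()
    if not words:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)


docs = [
    "click here >>> $$$ free !!!",
    "The model is trained on filtered web text",
    "1234 5678 buy now 999",
    "Filtering removes low quality pages before training",
]
kept = filter_top_fraction(docs, toy_quality_score, keep_fraction=0.5)
for d in kept:
    print(d)
```

Swapping `toy_quality_score` for a trained classifier leaves the filtering logic unchanged, which is the point of the design: the quality signal is learned, while the keep-the-top-fraction rule stays simple.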

Training Compute-Optimal Large Language Models

The Chinchilla paper shows that, for compute-optimal training, model parameters and training tokens should scale in roughly equal proportion as the compute budget grows, which favors smaller models trained on more data.
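
The equal-proportion scaling can be made concrete with a small calculation. The sketch below is an illustration under two stated assumptions that do not appear in the summary above: training compute is approximated as C = 6 * N * D FLOPs (N parameters, D tokens), and the token-to-parameter ratio is fixed at 20, a commonly quoted rule of thumb rather than an exact constant. The function name `compute_optimal_allocation` is hypothetical.

```python
import math


def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget between parameters N and training tokens D.

    Assumptions (not taken from the summary above):
      * C ~= 6 * N * D
      * D = tokens_per_param * N, with 20 as a rule-of-thumb default
    Solving both for N gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n1, d1 = compute_optimal_allocation(1e21)
n2, d2 = compute_optimal_allocation(4e21)

print(f"budget 1e21 FLOPs: {n1:.3e} params, {d1:.3e} tokens")
print(f"budget 4e21 FLOPs: {n2:.3e} params, {d2:.3e} tokens")
# Under these assumptions, 4x the compute doubles both quantities.
print(f"growth: params x{n2 / n1:.2f}, tokens x{d2 / d1:.2f}")
```

Because both N and D grow with the square root of compute in this simplified model, neither is favored as the budget increases, which is the sense in which they scale in equal proportion.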