DataComp-LM: In Search of the Next Generation of Training Sets for Language Models

Paper: DataComp-LM: In Search of the Next Generation of Training Sets for Language Models
Authors: Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, et al.
arXiv: 2406.11794
Project: datacomp.ai/dclm

One-sentence summary

DataComp-LM establishes a controlled benchmark for language-model dataset research and shows, through 416 baseline experiments, that aggressive model-based filtering of Common Crawl is substantially more important than conventional source mixing for general language-model performance.

Problem and contribution

Dataset curation results are difficult to compare because published models vary simultaneously in training data, architecture, optimization, tokenizer, compute, and evaluation. DataComp-LM (DCLM) addresses this confounding by fixing the model-training and evaluation recipes while allowing the dataset to vary.

The paper contributes:

DCLM-Pool, a 240-trillion-token corpus extracted from Common Crawl data from 2013–2022.
A controlled benchmark with filtering and mixing tracks across five compute scales, from 412M to 6.9B parameters.
A 53-task evaluation suite, including a 22-task low-variance Core aggregate and MMLU.
416 controlled experiments covering extraction, heuristic filtering, deduplication, learned quality filtering, source mixing, and decontamination.
DCLM-Baseline, a 3.8T-token filtered web corpus.
Models trained on the released data, including a 7B model trained on 2.6T tokens that reaches 63.7% MMLU 5-shot accuracy.

The benchmark’s central methodological idea is straightforward: hold the training system constant, vary only the data intervention, and judge the intervention through downstream model performance.

Benchmark design

Scales

Scale	Parameters	Training tokens	Approx. H100 hours	Candidate pool
400M-1x	412M	8.2B	26	469B tokens
1B-1x	1.4B	28.8B	240	1.64T
3B-1x	2.8B	55.9B	740	3.18T
7B-1x	6.9B	138B	3,700	7.85T
7B-2x	6.9B	276B	7,300	15.7T
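
The token budgets in the table can be sanity-checked against the parameter counts. The sketch below only recomputes ratios from the rows above. Because the parameter counts are rounded, the ratios are approximate: the 1x rows come out near 20 tokens per parameter, and 7B-2x doubles that budget.

```python
# Rows copied from the scales table: (scale, parameters, training tokens).
SCALES = [
    ("400M-1x", 412e6, 8.2e9),
    ("1B-1x", 1.4e9, 28.8e9),
    ("3B-1x", 2.8e9, 55.9e9),
    ("7B-1x", 6.9e9, 138e9),
    ("7B-2x", 6.9e9, 276e9),
]


def tokens_per_parameter(params, tokens):
    """Return the training-token budget divided by the parameter count."""
    return tokens / params


ratios = {name: tokens_per_parameter(p, t) for name, p, t in SCALES}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")
```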

The authors report strong correlations between dataset rankings at smaller and larger scales: Pearson correlations with 7B results are 0.838 for 412M, 0.956 for 1.4B, and 0.982 for 2.8B. This supports using smaller proxy models to iterate on data interventions.
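
As an illustration of how such a cross-scale check can be computed, the sketch below calculates a Pearson correlation between scores for the same datasets at two scales. The score lists are hypothetical placeholders, not values from the paper, and the function name is an assumption made for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical Core scores for the same five datasets at a small and a large
# scale. These numbers are made up for illustration only.
small_scale_core = [20.0, 22.5, 24.0, 27.0, 30.0]
large_scale_core = [30.5, 33.0, 36.0, 38.5, 42.0]

r = pearson_r(small_scale_core, large_scale_core)
print(f"Pearson r between scales: {r:.3f}")
```

A value close to 1 means the small-scale ordering of datasets is a good predictor of the large-scale ordering, which is the property that makes proxy models useful.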

Tracks

Filtering track: Select training data from a fixed subset of DCLM-Pool without adding external text.
Mixing track: Combine DCLM-Pool documents with freely available external sources.

Participants cannot change the training or evaluation code in either track. This isolates dataset quality more effectively than comparisons among independently trained public models.

DCLM-Baseline pipeline

The final pipeline is:

Extract text directly from Common Crawl HTML with resiliparse.
Apply RefinedWeb-style heuristic filters.
Deduplicate using a modified Bloom-filter method at document and paragraph resolution.
Score documents with a fastText binary classifier.
Retain the top 10% of documents by classifier score.

The fastText classifier is trained on approximately 400,000 examples:

Positive class: 100,000 OpenHermes 2.5 examples plus 100,000 curated ELI5 question-answer examples.
Negative class: 200,000 documents sampled from an earlier RefinedWeb reproduction.
Features: unigrams and bigrams.
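
A minimal sketch of how such training data might be prepared is shown below. It uses fastText's supervised text format, in which each line starts with a `__label__` prefix, and it lists the unigram and bigram features of a short text. The label names `hq` and `lq` and both helper functions are assumptions for illustration; the paper's actual preprocessing is not reproduced here.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    single_line = " ".join(text.split())  # collapse newlines and extra spaces
    return f"__label__{label} {single_line}"


def unigrams_and_bigrams(text):
    """List whitespace-token unigrams followed by adjacent-token bigrams."""
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


# Hypothetical examples: "hq" stands in for the positive reference class,
# "lq" for documents sampled from the unfiltered web pool.
train_lines = [
    to_fasttext_line("hq", "Why is\nthe sky blue?"),
    to_fasttext_line("lq", "click here   to win a prize"),
]

# With the `fasttext` Python package, a file containing such lines could be
# passed to fasttext.train_supervised(input=path, wordNgrams=2).
# That call is not executed in this sketch.
print(train_lines)
print(unigrams_and_bigrams("the sky is blue"))
```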

This pipeline yields the 3.8T-token DCLM-Baseline corpus.

Main empirical findings

1. Model-based quality filtering is the dominant intervention

At the 1B scale, the RefinedWeb reproduction achieves 27.5 Core and 14.6 Extended. Alternative filters produce:

Filter	Core	Extended
RefinedWeb reproduction	27.5	14.6
PageRank, top 20%	26.1	12.9
Semantic deduplication	27.1	13.8
BGE-feature classifier	27.2	14.0
AskLLM	28.6	14.3
Perplexity	29.0	15.0
Top-k average logits	29.2	14.7
fastText, OpenHermes + ELI5	30.2	15.4

A cheap bigram classifier outperforms much more computationally expensive LLM-based scoring. The composition of the classifier’s positive examples matters considerably: at the 7B scale, OpenHermes + ELI5 produces Core 41.0 and MMLU 29.2, versus Core 35.7 and MMLU 27.0 when Wikipedia is the positive class.

The result should not be read as “fastText is intrinsically the best quality model.” The classifier operationalizes a particular target distribution. Its success indicates that defining the positive reference distribution well can matter more than classifier sophistication.

2. Stricter filtering improves quality

Keeping the top 10% of documents performs better than retaining the top 15% or 20% on Core and MMLU. This is evidence for a quality-over-quantity regime when a sufficiently large raw pool is available.
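
The thresholding step itself is simple. The sketch below ranks documents by a classifier score and keeps a fixed percentage of them. The document IDs and scores are hypothetical, and `keep_top_percent` is a name invented for this example rather than a function from the DCLM code.

```python
def keep_top_percent(scored_docs, percent):
    """Keep the highest-scoring `percent` of (doc_id, score) pairs.

    `percent` is an integer from 1 to 100; the kept count is rounded down.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    keep_count = len(ranked) * percent // 100
    return ranked[:keep_count]


# Hypothetical pool of 20 documents with made-up quality scores.
pool = [(f"d{i}", i / 20) for i in range(20)]

top_10 = keep_top_percent(pool, 10)
top_20 = keep_top_percent(pool, 20)
print([doc_id for doc_id, _ in top_10])
print(len(top_20))
```

Lowering the percentage shrinks the training set but raises the minimum score of retained documents, which is the trade-off the threshold comparison explores.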

3. Better HTML extraction matters

At the 1B scale:

Extraction	Core	Extended
resiliparse	24.1	13.4
trafilatura	24.5	12.5
Common Crawl WET	20.7	12.2

Direct extraction from HTML beats Common Crawl’s WET text by at least 2.5 Core points; in the table above, the gap is 3.4 points for resiliparse and 3.8 points for trafilatura. resiliparse is selected because it is approximately eight times faster than trafilatura while producing similar model quality.

4. Deduplication helps, but its exact implementation is secondary

At the 1B scale, the modified Bloom-filter deduplicator and an exact + MinHash + suffix-array pipeline both improve Core by 2.1 points over no deduplication. At the 7B-2x scale, the selected Bloom-filter setup and MinHash + suffix array are nearly identical:

Bloom filter: MMLU 44.3, Core 45.3.
MinHash + suffix array: MMLU 44.4, Core 45.5.

The Bloom-filter implementation is favored for scalability. An important caveat is that the two methods define duplication differently. A global MinHash pass would still remove 85% of DCLM-Baseline documents, yet the dataset performs strongly. The paper therefore questions the assumption that all detectable near-duplication is harmful.

Deduplication hyperparameters can change the data distribution in task-specific ways. Using five-token minimum n-grams preserves Core performance but sharply damages MMLU, apparently because short structured spans such as list items and multiple-choice content are removed.
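
To make the Bloom-filter idea concrete, here is a deliberately simplified sketch of paragraph-level deduplication. It is not the paper's modified Bloom-filter implementation, which operates on n-grams: this version treats each normalized paragraph as one item, and the class, function names, and sizes are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions per item."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            salt = seed.to_bytes(8, "little")  # a different salt per hash
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def dedup_paragraphs(documents, seen=None):
    """Drop paragraphs already seen; drop documents left with no paragraphs."""
    seen = seen if seen is not None else BloomFilter()
    deduped = []
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n\n"):
            key = " ".join(paragraph.lower().split())
            if not key or key in seen:
                continue  # empty, or probably a duplicate (false positives possible)
            seen.add(key)
            kept.append(paragraph)
        if kept:
            deduped.append("\n\n".join(kept))
    return deduped


docs = [
    "Intro text.\n\nShared footer.",
    "Other text.\n\nShared footer.",
    "Intro text.\n\nShared footer.",
]
result = dedup_paragraphs(docs)
print(result)
```

The filter never stores the text itself, only bits, so memory stays fixed regardless of corpus size. The cost is a small false-positive rate, meaning a few unique paragraphs can be dropped by mistake.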

5. Mixing conventional “high-quality” sources can hurt

Adding the RedPajama non-web mixture improves weaker Common Crawl datasets but degrades DCLM-Baseline:

Base web dataset	Core change after mixing	Extended change
C4	+2.2	+0.8
RedPajama Common Crawl	+1.7	+1.4
RefinedWeb	+1.4	+0.2
DCLM-Baseline	−1.2	−1.0

Individually adding filtered Wikipedia, books, arXiv, or GitHub also fails to improve the DCLM-only model at the tested scale. Source labels such as “Wikipedia” or “books” are therefore not reliable proxies for marginal training value once the web corpus has already been filtered aggressively.

6. Human judgments do not predict training utility well

Sixteen AI graduate students and professors labeled 499 documents, with three labels per document and 71% average agreement. AskLLM best matches these labels at about 82% ROC-AUC, but produces substantially worse downstream models than several fastText filters that achieve only about 73% ROC-AUC.

The evidence is limited by the small, specialized annotator pool and sample size. Within that experiment, however, “looks useful to a human reviewer” is not a dependable surrogate for “improves a pretrained model.”
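
The agreement metric in this experiment can be sketched as follows: collapse the three annotator votes per document into a majority label, then compute ROC-AUC as the probability that a randomly chosen positive document receives a higher filter score than a randomly chosen negative one. The votes and scores below are hypothetical, and the function names are assumptions for this example.

```python
def majority_label(votes):
    """Return 1 if more than half of the binary votes are 1, else 0."""
    return 1 if sum(votes) * 2 > len(votes) else 0


def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical annotator votes (three per document) and filter scores.
votes_per_doc = [(1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
filter_scores = [0.9, 0.4, 0.3, 0.1]

human_labels = [majority_label(v) for v in votes_per_doc]
auc = roc_auc(human_labels, filter_scores)
print(human_labels, auc)
```

A high value only says the filter ranks documents the way annotators do. It says nothing about whether training on the selected documents produces a better model, which is the gap this finding highlights.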

7. Dataset rankings are robust to some training changes

DCLM-Baseline remains ahead of RedPajama and C4 across tested learning-rate and weight-decay settings. Dataset improvements also correlate across the baseline Transformer, a Gemma-like architecture, and a Mamba-like architecture. These tests reduce, but do not eliminate, the concern that results are artifacts of one training recipe.

8. Decontamination does not explain the reported MMLU gains

Removing pages that match MMLU or HellaSwag questions and answer options does not reduce performance:

MMLU: 51.8 before removal, 52.7 after.
HellaSwag: 77.9 before removal, 78.4 after.

The authors acknowledge that contamination is difficult to define and that their rules trade precision against recall.
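
One possible matching rule is sketched below to show where that trade-off arises. It flags a page only when the normalized page text contains the full question and every answer option. This rule, the normalization, and all names are assumptions for illustration, not the paper's exact procedure.

```python
import re


def normalize(text):
    """Lowercase and keep only alphanumeric tokens separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def contains_phrase(page_norm, phrase):
    """Whole-token containment check on already-normalized page text."""
    phrase_norm = normalize(phrase)
    return bool(phrase_norm) and f" {phrase_norm} " in f" {page_norm} "


def is_contaminated(page, question, options):
    """Flag a page that contains the question and all of its answer options."""
    page_norm = normalize(page)
    if not contains_phrase(page_norm, question):
        return False
    return all(contains_phrase(page_norm, option) for option in options)


def remove_contaminated(pages, benchmark_items):
    """Keep pages that match none of the (question, options) benchmark items."""
    return [
        page
        for page in pages
        if not any(is_contaminated(page, q, opts) for q, opts in benchmark_items)
    ]


# Hypothetical benchmark item and pages.
items = [("What is the capital of France?", ["Paris", "Lyon", "Nice", "Lille"])]
pages = [
    "Quiz: What is the capital of France? A) Paris B) Lyon C) Nice D) Lille",
    "What is the capital of France? It is a common trivia question.",
]
clean = remove_contaminated(pages, items)
print(clean)
```

Requiring every option to appear favors precision: paraphrased or partially quoted test items slip through. Matching on the question alone would raise recall but also remove pages that merely discuss the same topic.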

Scaled model result

For the largest model, the authors augment DCLM-Baseline with StarCoder and ProofPile2 for code and mathematics. The model is trained for 2T tokens, followed by two cooldown runs on a distribution containing 70% more-strictly filtered DCLM data and 30% math data. The cooldown checkpoints are weight-averaged, then adapted from a 2,048-token to an 8,192-token context window.
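
The weight-averaging step can be illustrated with a small sketch. It assumes uniform averaging over checkpoints represented as plain dictionaries of parameter lists. The summary above does not specify the averaging weights, so uniform weights, the function name, and the toy values are all assumptions.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints.

    Each checkpoint maps a parameter name to a flat list of floats; all
    checkpoints must share the same names and list lengths.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = list(checkpoints[0].keys())
    for ckpt in checkpoints[1:]:
        if set(ckpt.keys()) != set(names):
            raise ValueError("checkpoints have different parameter names")
    averaged = {}
    for name in names:
        tensors = [ckpt[name] for ckpt in checkpoints]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        averaged[name] = [sum(values) / len(tensors) for values in zip(*tensors)]
    return averaged


# Two toy "cooldown" checkpoints with made-up values.
cooldown_a = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
cooldown_b = {"layer.weight": [3.0, 5.0], "layer.bias": [1.0]}

souped = average_checkpoints([cooldown_a, cooldown_b])
print(souped)
```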

The resulting 7B model reports:

Core: 57.1
MMLU 5-shot: 63.7
Extended: 45.4

This exceeds the paper’s listed open-data 7B baselines and approaches several closed-data 7–8B models. The strongest model result is not attributable to DCLM-Baseline alone: it also includes code/math data, altered mixture weights, cooldown training, model souping, and long-context continual training. The controlled benchmark experiments provide cleaner evidence for the dataset claims than this final system comparison.

Strengths

Controlled comparisons: The paper directly addresses a major weakness in data-curation research by fixing training and evaluation.
Scale: The experimental program spans 416 runs and approximately 1.2 million estimated H100 hours.
Open artifacts: Pool, curated data, models, processing code, recipes, and experiment records are released.
Useful negative results: PageRank filtering, semantic deduplication, expensive LLM scoring, conventional source mixing, and agreement with human quality labels all fall short of what one might plausibly expect.
Proxy-scale validation: The cross-scale ranking evidence makes the benchmark practical for groups unable to run repeated 7B experiments.
Systems detail: Extraction throughput, deduplication scalability, sharding, token yield, and memory tradeoffs receive serious treatment.

Limitations and concerns

Benchmark-directed optimization: The learned filter is selected through downstream performance on a fixed evaluation suite. Repeated experimentation risks overfitting the dataset pipeline to that suite, even without literal test contamination.
English and general-knowledge focus: The evaluation design rewards English language understanding more than multilingual, code, math, safety, or domain-specific capabilities.
One-factor-at-a-time ablations: Most interventions are studied individually; interactions among extraction, heuristics, filtering, deduplication, mixing, and training length remain incompletely characterized.
Limited replication: The paper reports only limited run-to-run variance analysis, so some small differences may reflect training or sampling noise rather than the intervention.
Filter semantics are unclear: OpenHermes + ELI5 works, but the paper does not fully identify which textual properties the classifier selects. It may favor instruction-like organization, explanatory style, topic composition, or benchmark-adjacent formatting.
Aggressive selection changes coverage: Keeping only 10% of filtered web documents may suppress rare languages, minority viewpoints, niche domains, or unconventional writing styles. The paper’s bias and toxicity analysis is narrow and does not establish broad representational quality.
Data governance remains unresolved: DCLM-Pool may contain personal, copyrighted, or sensitive text. Common Crawl redaction support and robots.txt compliance do not settle consent, licensing, or downstream model-training rights.
Final-model comparison is less controlled: The hero run combines several data and training interventions, so it should not be treated as a clean estimate of DCLM-Baseline’s isolated effect.
High barrier to full reproduction: The benchmark permits low-cost entry at the smaller scales, but reproducing the paper’s full search requires compute on a scale available to few academic or industrial groups.

Bottom line

The paper’s most important result is not merely that DCLM-Baseline performs well. It is that dataset research becomes substantially more credible when curation is evaluated as a controlled systems variable. Empirically, the work shows that:

careful extraction beats convenient pre-extracted text;
deduplication helps, but scalable approximations can be sufficient;
a well-specified cheap classifier can beat expensive LLM judges;
severe filtering can outperform retaining more data;
adding prestigious sources can hurt after strong web filtering; and
human notions of document quality are weak proxies for training utility.

For this repository, the transferable contribution is the experimental discipline: isolate the infrastructure policy, evaluate it through downstream behavior, validate low-cost proxies against higher-scale settings, and keep end-to-end showcase results distinct from causal evidence.
