Neural Machine Translation of Rare Words with Subword Units

Paper: Neural Machine Translation of Rare Words with Subword Units
Authors: Rico Sennrich, Barry Haddow, and Alexandra Birch
arXiv: 1508.07909
Code: rsennrich/subword-nmt

One-sentence summary

The paper adapts byte pair encoding (BPE) to learn a fixed vocabulary of variable-length subword units, enabling neural machine translation systems to translate and generate rare or previously unseen words without an external dictionary.

Problem

Early neural machine translation systems used fixed word vocabularies, commonly containing 30,000–50,000 entries. Every word outside that vocabulary was represented by an unknown token. Translation, however, is an open-vocabulary task:

names may need copying or transliteration;
cognates and loanwords exhibit regular character transformations;
compounds and inflected words can be translated compositionally;
morphologically productive languages continuously produce new word forms.

Existing systems addressed this mismatch with very large vocabularies and dictionary-based fallback. Neither approach can generate genuinely unseen words, and dictionary fallback assumes that each source word maps cleanly to one target word. That assumption fails for compounding, morphology, and transliteration.

The paper proposes handling open vocabulary inside the neural model by representing words as sequences of reusable subword units.

Main contributions

It demonstrates that an attention-based neural translation model can directly translate rare and unseen words when its input and output consist of subword sequences.
It adapts byte pair encoding, originally a compression algorithm, into a data-driven word-segmentation method.
It compares BPE with character n-grams, compound splitting, Morfessor, hyphenation, large word vocabularies, and dictionary fallback.
It shows that subword models learn productive compounding and transliteration rather than merely replacing unknown tokens.

BPE for word segmentation

The algorithm begins with a vocabulary of individual characters. Each word is represented as characters plus an end-of-word marker. It then repeatedly:

counts adjacent symbol pairs across the training vocabulary, weighted by word frequency;
finds the most frequent pair;
merges that pair into a new symbol;
repeats for a chosen number of merge operations.
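
The merge loop above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation in rsennrich/subword-nmt: the function name `learn_bpe`, the toy word counts, and the tie-breaking rule (the first pair encountered wins) are assumptions made for the example.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} mapping (illustrative sketch)."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair that was counted first (assumption).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the pair with a single merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy counts, chosen only to make the merges easy to follow.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, 5)
print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
```

Because pairs are counted over the word-frequency dictionary rather than over running text, each iteration costs time proportional to the number of word types, not the number of tokens.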

For example, frequent merges may transform:

l o w </w>

into:

low </w>

Frequent words eventually become single symbols. Less frequent words remain decomposed into reusable character sequences. The vocabulary size is approximately:

\[ \text{vocabulary size} \approx \text{initial character vocabulary} + \text{number of merges} \]

The number of merge operations is therefore the principal granularity control:

fewer merges produce smaller vocabularies and longer sequences;
more merges produce larger vocabularies and shorter, more word-like sequences.

At inference time, a new word is split into characters and the learned merges are applied in order. Any word composed of known characters can therefore be represented without a word-level unknown token.
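
Segmenting an unseen word can be sketched as follows. The merge list here is a hypothetical toy list, and `apply_bpe` is an illustrative name; the sketch simply replays each merge in learned order, which is the procedure described above, not necessarily how a production tokenizer is optimized.

```python
def apply_bpe(word, merges):
    """Segment one word by replaying learned merges in order (illustrative sketch)."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            # Merge the pair wherever it occurs; leave other symbols untouched.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, assumed to have been learned from training data.
toy_merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]

print(apply_bpe("lowest", toy_merges))  # ['low', 'est</w>']
print(apply_bpe("slow", toy_merges))    # ['s', 'low', '</w>']
```

Neither "lowest" nor "slow" needs to appear in the training vocabulary: both are covered because every character is known and the merges only combine symbols that are already present.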

Why BPE is useful

BPE balances two competing costs:

Word models: short sequences but very large, sparse vocabularies with unknown words.
Character models: open vocabulary but substantially longer sequences and harder alignment.

BPE produces variable-length units: common patterns remain compact while rare words are decomposed. The units need not correspond to linguistically correct morphemes. The model can still learn from imperfect or over-segmented representations because useful character patterns recur across words.

In the German training corpus:

Segmentation	Tokens	Types	Unknown tokens (development set)
Words	100M	1.75M	1,079
Characters	550M	3K	0
Character bigrams	306M	20K	34
Morfessor	109M	544K	237
BPE	112M	63K	0
Joint BPE	111M	82K	32

BPE remains close to word-level sequence length while reducing the symbol vocabulary by more than an order of magnitude.
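
The tradeoff can be checked with simple arithmetic on the rounded figures from the table. The dictionary layout and the helper name `ratios` are assumptions for this sketch.

```python
# (tokens, types) taken from the rounded table values above.
corpus = {
    "Words": (100_000_000, 1_750_000),
    "Characters": (550_000_000, 3_000),
    "BPE": (112_000_000, 63_000),
}


def ratios(name):
    """Return (sequence-length ratio, type reduction factor) relative to words."""
    word_tokens, word_types = corpus["Words"]
    tokens, types = corpus[name]
    return tokens / word_tokens, word_types / types


for name in ("Characters", "BPE"):
    length_ratio, type_reduction = ratios(name)
    print(f"{name}: {length_ratio:.2f}x tokens, {type_reduction:.1f}x fewer types")
# Characters: 5.50x tokens, 583.3x fewer types
# BPE: 1.12x tokens, 27.8x fewer types
```

On these figures, BPE lengthens sequences by about 12% while shrinking the type inventory by a factor of roughly 28, whereas character segmentation makes sequences 5.5 times longer.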

Independent versus joint BPE

The paper evaluates two variants:

Independent BPE: Learn separate merge rules for source and target languages.
Joint BPE: Learn one merge vocabulary from the union of source and target text.

Joint BPE encourages related source and target words to be segmented consistently, making copying, cognate translation, and transliteration easier to learn.
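
The difference between the two variants can be illustrated with a contrived example. Everything here is an assumption made for illustration: the helper names, the tiny word counts, and the tie-breaking rule (first pair counted wins), which matters a great deal at this scale. The only point is that independently learned merges can split a shared name differently on each side, while a jointly learned merge list is applied identically to both.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Merge every occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Compact BPE learner; ties go to the first pair counted (assumption)."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Apply merges in learned order to a single word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical source-side and target-side word counts.
src = Counter({"bar": 5, "barack": 1})
tgt = Counter({"rack": 5, "barack": 1})

src_merges = learn_bpe(src, 2)          # independent: source side only
tgt_merges = learn_bpe(tgt, 2)          # independent: target side only
joint_merges = learn_bpe(src + tgt, 2)  # joint: counts summed over both sides

print(segment("barack", src_merges))    # ['bar', 'a', 'c', 'k', '</w>']
print(segment("barack", tgt_merges))    # ['b', 'a', 'rac', 'k', '</w>']
print(segment("barack", joint_merges))  # ['bar', 'a', 'c', 'k', '</w>'] on both sides
```

With independent merges the model must map "bar a c k" to "b a rac k"; with joint merges the name has the same units on both sides, so copying it is a unit-by-unit identity mapping.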

For English–Russian, the authors transliterate the Russian vocabulary into Latin characters before learning joint merges, then map the merge rules back to Cyrillic. This aligns corresponding character sequences despite the different scripts.

Among the subword systems, joint BPE generally performs best on rare and unseen words. The qualitative analysis shows that independent segmentation can create mismatched boundaries that induce character insertion or deletion errors during transliteration.

Experimental setup

The experiments use WMT 2015:

English→German: 4.2M sentence pairs, approximately 100M tokens.
English→Russian: 2.6M sentence pairs, approximately 50M tokens.
Development: newstest2013.
Evaluation: newstest2014 and newstest2015.

The model is an attention-based recurrent encoder-decoder with gated recurrent units:

hidden size: 1,000;
embedding size: 620;
beam size: 12;
Adadelta optimization;
eight-model ensembles for the main comparisons.

Systems include:

WUnk: word model that emits unknown tokens;
WDict: word model with dictionary fallback;
C2-50k: common words plus character bigrams;
BPE-60k: independently learned BPE;
BPE-J90k: joint BPE.

Evaluation uses BLEU, chrF3, and unigram F1. Unigram F1 is reported separately for all words, for rare words, and for words that never occur in the training set (OOV).
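
A category-restricted unigram F1 can be sketched as the harmonic mean of clipped unigram precision and recall. The function name, the `keep` filter, and the way the category restriction is applied to both hypothesis and reference are assumptions for this sketch, not a reproduction of the authors' evaluation script.

```python
from collections import Counter


def unigram_f1(hypothesis, reference, keep=lambda word: True):
    """Clipped unigram F1 over the words selected by `keep` (illustrative sketch)."""
    hyp = Counter(w for w in hypothesis if keep(w))
    ref = Counter(w for w in reference if keep(w))
    # Counter intersection takes the minimum count per word, i.e. clipped matches.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()

# Hypothetical set of "rare" words; in practice this would come from training frequencies.
rare = {"sat", "is", "mat"}

print(round(unigram_f1(hyp, ref), 3))                 # 0.833
print(unigram_f1(hyp, ref, keep=lambda w: w in rare)) # 0.5
```

The toy sentences show why a targeted metric is informative: the overall score is high because frequent words match, while the score restricted to the rare category is much lower.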

Results

English→German

System	BLEU, ensemble	chrF3, ensemble	Rare-word F1	OOV F1
Word + dictionary	24.2	52.4	36.8	36.8
Character bigrams	25.3	53.5	40.5	30.9
Independent BPE	24.5	53.9	40.9	29.3
Joint BPE	24.7	54.1	41.8	33.6

Dictionary fallback remains strong for German OOVs because many are names that can be copied unchanged. Subword systems nevertheless perform better on rare words and overall translation metrics.

English→Russian

System	BLEU, ensemble	chrF3, ensemble	Rare-word F1	OOV F1
Word + dictionary	22.8	51.0	26.5	6.6
Character bigrams	24.1	51.6	27.8	17.4
Independent BPE	23.6	52.7	29.7	15.6
Joint BPE	24.1	53.0	29.7	18.3

Subword models provide a larger OOV improvement when scripts differ because copying a Latin-script source name is not a valid Russian translation. Joint BPE improves both segmentation consistency and transliteration.

Relative to the dictionary baseline, subword ensembles improve by:

up to 1.1 BLEU for English→German;
up to 1.3 BLEU for English→Russian;
5.0 rare-word F1 points for English→German;
3.2 rare-word F1 points for English→Russian.
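
These differences follow directly from the two result tables. The short check below recomputes them; the dictionary layout and the helper name `gains` are assumptions for the sketch, and the numbers are the ensemble BLEU and rare-word F1 values listed above.

```python
# (ensemble BLEU, rare-word F1) per system, copied from the tables above.
results = {
    "en-de": {
        "WDict": (24.2, 36.8),
        "C2-50k": (25.3, 40.5),
        "BPE-60k": (24.5, 40.9),
        "BPE-J90k": (24.7, 41.8),
    },
    "en-ru": {
        "WDict": (22.8, 26.5),
        "C2-50k": (24.1, 27.8),
        "BPE-60k": (23.6, 29.7),
        "BPE-J90k": (24.1, 29.7),
    },
}


def gains(pair):
    """Best subword score minus the dictionary baseline, for BLEU and rare-word F1."""
    base_bleu, base_f1 = results[pair]["WDict"]
    subword = [scores for name, scores in results[pair].items() if name != "WDict"]
    best_bleu = max(bleu for bleu, _ in subword)
    best_f1 = max(f1 for _, f1 in subword)
    return round(best_bleu - base_bleu, 1), round(best_f1 - base_f1, 1)


print(gains("en-de"))  # (1.1, 5.0)
print(gains("en-ru"))  # (1.3, 3.2)
```

Note that the best BLEU and the best rare-word F1 do not always come from the same system: for English→German, character bigrams give the highest BLEU while joint BPE gives the highest rare-word F1.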

Key empirical observations

Rare in-vocabulary words are also a problem

The failure mode is not limited to unseen words. Word representations become poorly estimated as frequency decreases. A model that keeps 500,000 whole words in its shortlist becomes progressively less accurate on its low-frequency entries, and accuracy recovers for words outside the shortlist, which are represented as subword units.

This supports a broader principle: retaining a rare item as an atomic token can be worse than decomposing it into frequent reusable units.

Linguistically imperfect segmentation can still work

BPE does not reliably recover morpheme boundaries. The paper shows correct compound translations even when words are over-segmented or split at linguistically implausible positions. Statistical reuse and alignment consistency matter more than morphological purity for this model.

Joint segmentation is a form of alignment prior

Joint BPE does more than compress text. It biases related source and target strings toward compatible decompositions. This makes cross-language mappings easier to learn, particularly for names and cognates.

Better rare-word handling only modestly changes BLEU

Rare and OOV words account for approximately 9–11% of the test data. They often carry important semantic content, but aggregate metrics such as BLEU dilute their contribution. The paper therefore reports targeted rare-word and OOV metrics rather than relying only on BLEU.

Strengths

Simple intervention: BPE requires preprocessing changes rather than a specialized translation architecture or external fallback model.
Open-vocabulary generation: The model can construct words absent from training instead of only copying or looking them up.
Favorable efficiency tradeoff: It achieves near-word-level sequence lengths with a much smaller vocabulary.
Targeted evaluation: Rare and OOV word accuracy directly tests the claimed benefit.
Qualitative mechanism evidence: Examples demonstrate learned compounding and transliteration.
Broad applicability: The method is architecture-independent and became foundational for later neural NLP tokenizers.

Limitations

Limited language coverage: The experiments cover only translation from English into German and Russian.
Old model architecture: Results use recurrent encoder-decoder systems; absolute performance does not directly transfer to modern Transformers.
No controlled merge-count search: Vocabulary sizes are selected largely for comparison with prior work rather than optimized systematically.
Training variability: Development results vary by as much as one BLEU point, and the authors select the best model among eight runs.
Weak causal isolation: Systems differ in segmentation and effective vocabulary size, making the exact source of each gain difficult to separate.
Character coverage is not universal: The original method can still encounter unknown characters or source-only joint-BPE symbols.
Frequency is not linguistic structure: BPE can split inside morphemes, merge across useful boundaries, and encode corpus-specific artifacts.
Sequence fragmentation: Rare words can still become long symbol sequences, increasing computation and making long-range modeling harder.
Static vocabulary: The learned merge list does not adapt after training and may poorly serve new domains, scripts, or terminology.
Aggregate translation gains are modest: The strongest evidence is on rare-word behavior rather than large overall BLEU improvements.

Historical significance

The paper transformed BPE from a compression technique into a practical neural text-tokenization method. Its central abstraction—a fixed inventory of variable-length units learned from symbol frequency—became standard in neural machine translation and later language models.

Modern tokenizers often differ in important details:

byte-level BPE starts from bytes to guarantee arbitrary input coverage;
WordPiece selects merges using a likelihood-related objective;
Unigram language-model tokenization begins with a large inventory and prunes it;
SentencePiece learns directly from raw text and treats whitespace explicitly.

The enduring contribution is therefore broader than the exact implementation: vocabulary should be learned at a level between words and characters to balance coverage, statistical efficiency, and sequence length.

Bottom line

The paper’s main result is that open vocabulary does not require an unlimited word dictionary. A model can represent text with a fixed vocabulary while retaining the ability to produce unseen words by composing frequent subword units.

The lasting lessons are:

atomic word vocabularies waste capacity on sparse rare forms;
pure character models solve coverage at substantial sequence-length cost;
learned variable-length units provide a practical middle ground;
consistent segmentation can encode useful cross-language alignment;
targeted metrics are needed when an intervention affects rare but semantically important cases; and
“token count” is meaningful only relative to a specific tokenizer.
