Byte Pair Encoding (BPE): Unlocking the Power of Subwords in Modern Language Models

Soumyadip Sarkar
Feb 6, 2025 · 8 min read

Have you ever wondered how modern language models like GPT or BERT can understand such a broad range of words, phrases, and even internet slang? One of the secrets behind their success is Byte Pair Encoding, commonly referred to as BPE. Although the name might sound a bit technical, BPE is actually a simple yet powerful technique that helps models handle both common and rare words by breaking them down into smaller chunks called subwords. In this article, we will explore what BPE is, why it plays such a significant role in natural language processing, and how you can implement it in Python with a hands-on example.

Why Do We Need Byte Pair Encoding?

The Vocabulary Explosion Problem

In the early days of language modeling, one common approach was to treat every word in a text as a distinct token. This seems logical at first glance, but it causes two big headaches:

  1. Enormous Vocabulary Size
    Languages are brimming with unique words, and many domains (like biomedical literature or legal documents) are filled with specialized terms. Including every possible word in a vocabulary would make it extremely large, driving up memory usage and slowing down both training and inference.
  2. Handling Rare and Unknown Words
    With purely word-level tokenization, any word that does not appear in the training set (or appears very infrequently) ends up as an unknown token (often represented as <unk>). This means the model cannot learn anything about that word, even if it appears in the test set.

How Subword Tokenization Solves These Issues

Subword tokenization is a middle ground between character-level and word-level tokenization. Instead of creating tokens for every single word, we split words into smaller pieces known as subwords. For frequently occurring words or phrases, these subwords might be almost the entire word. However, for rare or unfamiliar words, subwords end up being shorter fragments that still capture meaningful linguistic patterns.

By doing this, we get:

  • No More Unknown Words: Any word, even if it appears for the first time, can be broken into subwords that the model already knows.
  • Compact Yet Flexible Vocabulary: We can set a manageable vocabulary size, striking a balance between storing every word and just working with individual characters.

[Figure: From <unk> to Nirvana, the four stages of handling rare words]

What Exactly is Byte Pair Encoding?

Byte Pair Encoding, or BPE, is not a new technique. It was originally introduced by Philip Gage in February 1994 as a data compression algorithm. In NLP, it became famous thanks to a 2016 paper titled “Neural Machine Translation of Rare Words with Subword Units” by Sennrich, Haddow, and Birch. The core idea is straightforward:

  1. Start with Individual Characters
    Imagine you have the word “coffee.” You start by splitting it into its individual characters: c o f f e e.
  2. Count the Frequency of Adjacent Pairs
    You look through your entire text (or corpus) to find which pairs of symbols (initially individual characters) appear most often. Let us say f f is extremely common in your dataset.
  3. Merge the Most Frequent Pair
    If f f is the most frequent pair, you merge it into a single token ff. Now “coffee” transforms from c o f f e e to c o ff e e.
  4. Iterate the Process
    You repeat the frequency counting step and merge the next most common pair of symbols. Over time, the subwords grow longer. If e e is frequently seen, it might merge into ee, making “coffee” look like c o ff ee.

Through multiple rounds of merging, the tokenizer gradually develops a vocabulary of subwords. Common words or sequences remain relatively intact as single tokens, while rare or unfamiliar sequences are broken down further into smaller pieces.
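
To make that loop concrete, here is a minimal sketch of the training procedure in plain Python, in the spirit of the reference implementation from the Sennrich et al. paper. The toy word-frequency table, the number of merge steps, and the helper names (get_pair_counts, merge_pair) are illustrative choices for this article, not part of any library.

from collections import Counter
import re

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every word so the chosen pair becomes a single symbol
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: each word is stored as space-separated symbols with its frequency
vocab = {"c o f f e e": 5, "t o f f e e": 2, "t e a": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # ties are broken arbitrarily here
    vocab = merge_pair(best, vocab)
    print(f"Merge {step + 1}: {best} -> {''.join(best)}")

Because ties are broken arbitrarily in this sketch, the exact merge order may differ from the hand-worked example above, but the effect is the same: frequent character sequences collapse into progressively longer subwords shared by words like “coffee” and “toffee.”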

Step-by-Step Example: From Raw Text to BPE Vocabulary

Let us walk through a simple example with a tiny corpus of text.

Sample Text

I love coffee
I love tea

Split Words into Characters:

  • “I” → I
  • “love” → l o v e
  • “coffee” → c o f f e e
  • “tea” → t e a

Count Pair Frequencies: Look at every adjacent pair of symbols (characters at first). For instance:

  • l o (in “love”)
  • o v (in “love”)
  • v e (in “love”)
  • c o (in “coffee”)
  • o f (in “coffee”)
  • f f (in “coffee”)
  • f e (in “coffee”)
  • e e (in “coffee”)
  • t e (in “tea”)
  • e a (in “tea”)

Suppose f f appears most often (especially if you have more text featuring words like “coffee,” “toffee,” etc.).

Merge the Most Frequent Pair: If f f is the most common pair, we replace f f with ff. The word “coffee” transforms from c o f f e e to c o ff e e.

Repeat the Process: We keep repeating this procedure until we have made a certain number of merges or we decide no more merges are beneficial. Over multiple iterations, we might end up with subwords like co, ff, and ee, or even entire words like “coffee” if it appears extremely frequently.

This process results in a subword vocabulary that is more flexible than word-level tokenization and more compact than character-level tokenization. No matter which new words appear in your text, the tokenizer can handle them by combining known subwords in different ways.
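
Encoding a new word is the mirror image of training: you walk through the learned merges and apply each one wherever it fits. Below is a small illustrative sketch; the merge list is a hypothetical one that a coffee-heavy corpus might produce, and the segment helper is written for this article rather than taken from any library.

def segment(word, merges):
    # Apply each learned merge, in the order it was learned, to a new word
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

# Hypothetical merge list learned from a coffee-heavy corpus
merges = [("f", "f"), ("e", "e"), ("ff", "ee"), ("c", "o"), ("co", "ffee")]

print(segment("coffee", merges))  # ['coffee']
print(segment("toffee", merges))  # ['t', 'o', 'ffee'], an unseen word handled without <unk>

Notice that “toffee” never appeared as a whole word in the toy corpus, yet it is segmented into pieces the vocabulary already knows.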

Why Do Modern Models Rely on BPE?

  1. Efficient Vocabulary Size
    Instead of dealing with millions of unique words, we deal with a fixed set of subwords. This keeps memory usage and computational demands more manageable.
  2. Handling Rare and New Words
    You never have an unknown token issue with subwords. Even brand-new words that did not appear in the training set can be segmented into recognized subwords, allowing the model to still make educated guesses.
  3. Consistent and Reversible Tokenization
    BPE is deterministic. Once you train the subword vocabulary, the same text will always produce the same sequence of tokens. This makes training and inference more stable.
  4. Language Agnostic
    BPE does not rely on the linguistic properties of any single language. It merges frequent character sequences regardless of whether they come from English, Spanish, or other languages. This flexibility is especially helpful for multilingual models.

Hands-On Tutorial: Implementing BPE in Python

There are multiple libraries for training and using BPE tokenizers. One popular choice is the Hugging Face tokenizers library.

1. Installation

pip install tokenizers

2. Preparing a Corpus

Let us say we create a file named sample_corpus.txt with some text in it:

I love coffee.
Coffee is the best way to start my day.
I also love tea, but coffee is my favorite.

This text is short, but it should be enough to demonstrate the process.

3. Training a BPE Tokenizer

from tokenizers import ByteLevelBPETokenizer
import os

# Initialize the tokenizer
tokenizer = ByteLevelBPETokenizer()

# Train on our corpus
# We set vocab_size to a small number for demo purposes
# min_frequency ensures tokens must appear at least that many times
# special_tokens are extra tokens we might need in models like GPT or BERT
tokenizer.train(
    files=["sample_corpus.txt"],
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Save the tokenizer files (vocab.json and merges.txt)
# The output directory must exist before saving
os.makedirs("bpe_tokenizer", exist_ok=True)
tokenizer.save_model("bpe_tokenizer")

What is happening here?

  • We create an instance of ByteLevelBPETokenizer.
  • We train it on our text file with a maximum of 1000 tokens (vocab_size=1000).
  • We only keep tokens that appear at least twice (min_frequency=2).
  • We add a few common special tokens used in transformer models.

4. Using the Trained Tokenizer

Once training is finished, the tokenizer produces two main files: vocab.json and merges.txt. Here is how to load the tokenizer and use it:

from tokenizers import ByteLevelBPETokenizer

# Initialize the tokenizer with the files we just saved
tokenizer = ByteLevelBPETokenizer(
    "bpe_tokenizer/vocab.json",
    "bpe_tokenizer/merges.txt",
)

# Encode a piece of text
sample_text = "I love coffee"
encoded = tokenizer.encode(sample_text)

print("Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)

# Decode the tokens back to text
decoded_text = tokenizer.decode(encoded.ids)
print("Decoded text:", decoded_text)

You might see output like:

Tokens: ['I', 'Ġlove', 'Ġcoffee']
Token IDs: [45, 279, 275]
Decoded text: I love coffee

The Ġ symbol is just how the byte-level tokenizer marks a token that begins with a space.

When exploring Byte Pair Encoding, you might wonder how it compares to other subword methods, especially WordPiece. Both techniques break words into smaller units to manage large vocabularies and unknown tokens, but they choose their merges differently. BPE greedily merges the most frequent pair of symbols, while WordPiece, commonly seen in models like BERT, picks the merge that most increases the likelihood of the training data under the current vocabulary. The ultimate goal is the same, a more flexible way to handle text, but the two methods take distinct paths to get there.
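
If you want to see the difference for yourself, the same Hugging Face tokenizers library also ships a WordPiece trainer. The snippet below is a minimal sketch that reuses the sample_corpus.txt file from earlier; the exact tokens you get will depend on your corpus and settings.

from tokenizers import BertWordPieceTokenizer

# Train a WordPiece tokenizer on the same corpus for comparison
wp_tokenizer = BertWordPieceTokenizer(lowercase=True)
wp_tokenizer.train(files=["sample_corpus.txt"], vocab_size=1000, min_frequency=2)

encoded = wp_tokenizer.encode("I love coffee")
print("WordPiece tokens:", encoded.tokens)
# Word-internal pieces are marked with a "##" prefix instead of the Ġ marker used by byte-level BPE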

Another topic that often comes up is selecting the best vocabulary size. This choice depends heavily on the language you are working with and how large or specialized your dataset is. Many practitioners choose a vocabulary size between 8,000 and 50,000 tokens. A bigger vocabulary can capture more subtle language details, but it also requires more memory and computational effort. In most cases, the only way to find the best fit is to experiment with different sizes and see what produces the best outcomes for your specific application.

You might also wonder if you will need to retrain your tokenizer frequently. That depends on how stable your domain is. If you shift from everyday social media text to specialized fields like medical literature, you may need to retrain the tokenizer so it can learn new terminology. However, if your domain remains relatively consistent, one well-trained tokenizer can serve you for quite some time.

Lastly, it is natural to ask whether BPE can handle multiple languages at once. The simplicity of BPE is one of its strongest qualities because it relies on frequency counts of character sequences rather than language rules. This makes it possible for many multilingual models to use a single shared subword vocabulary across different languages. If you are dealing with multiple writing systems, you might need to increase the size of your vocabulary so the tokenizer can accurately capture the unique features of each language.

Conclusion

Byte Pair Encoding might sound intimidating at first, but it is actually a straightforward and powerful method that helps modern language models handle words of all shapes and sizes. By breaking down words into subwords, BPE solves the challenge of managing enormous vocabularies and unknown terms. This method strikes a practical balance, ensuring that language models remain both efficient and highly adaptable.

Whether you are working on text classification, machine translation, or any other NLP task, having a solid understanding of BPE will serve you well. It is one of the foundational techniques used in almost every state-of-the-art model you encounter today. If you are new to subword tokenization, try training your own tokenizer on a custom dataset using the Python example above, and see firsthand how your model’s performance and flexibility improve.

Experiment with BPE, tweak vocabulary sizes, compare its performance against other tokenizers like WordPiece, and watch how these seemingly small details can have a big impact on your NLP project.

Check out this awesome video on Tokenization by Andrej Karpathy!

Have fun exploring the world of subwords and happy tokenizing!
