Artificial Intelligence and Machine Learning have taken the world by storm, revolutionizing the way we interact with technology. Within this field, large language models stand out, and among the most important components shaping their capabilities is Byte Pair Encoding (BPE). The roots of Byte Pair Encoding trace back to an article in the February 1994 issue (Volume 12, Issue 2) of the C/C++ Users Journal. Authored by Philip Gage, the article, titled "A New Algorithm for Data Compression", introduced the concept. This algorithm, originally conceived three decades ago, has found new relevance as a means to enhance Neural Machine Translation (NMT), as showcased in the paper "Neural Machine Translation of Rare Words with Subword Units".
Machine translation is an important part of the human experience. While ~70% of all humans can speak one of the top 10 most spoken languages in the world, there are thousands of languages spoken on this earth. Breaking down the language barrier is critical to developing a deeper understanding of our fellow humans, or at least that is what Ludwig Lazar Zamenhof, an Ashkenazi Jewish ophthalmologist, believed when he created Esperanto in 1887.
"The internal idea of Esperanto is: the foundation of a neutral language will help break down barriers between peoples and help people get used to the idea that each one of them should see their neighbors only as a human being and a brother."
- L. L. Zamenhof, 1912
Esperanto may not have achieved the status of the "Lingvo Internacia" that Zamenhof dreamed of, but machine translation has made significant strides in enhancing communication and allowing us to share our ideas globally. The Bilingual Evaluation Understudy (BLEU) score is how we empirically gauge a machine translation against a human reference translation. By modifying the Byte Pair Encoding algorithm, the authors of the NMT paper were able to create an algorithm that increased the effectiveness of machine translation from English → German and English → Russian.
To understand how it was modified, it's important to look at the algorithm's original implementation. The original C code for BPE can be found here, but the gist is to replace frequently occurring pairs of bytes in the data with a single byte that doesn't occur in the original data. This compresses the data, reducing its overall size. The general steps of this method are as follows (a rough Python sketch follows the list)…
Count Pairs: Start by counting the frequency of all byte pairs in the input data.
Create Dictionary: Maintain a dictionary (a substitution table) that maps each new, unused byte to the pair it replaces, so the data can be decompressed later.
Replace Most Frequent Pair: Find the most frequently occurring pair of bytes and replace every occurrence of it with the new byte assigned to it in the dictionary.
Repeat: Repeat steps 1 to 3 until no more replacements can be done or until you have reached a specified number of iterations.
Decompression: For decompression, you would use the dictionary to replace the unique bytes with their original byte pairs.
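To make that concrete, here is a rough Python sketch of the byte-level compression idea. It is not Gage's original C implementation, just an illustration that assumes plain bytes objects as input and a small, fixed number of merges: it repeatedly swaps the most common byte pair for an unused byte value and records each substitution so the data can be expanded again.

import collections

def bpe_compress(data: bytes, max_merges: int = 10):
    # Repeatedly replace the most frequent byte pair with a byte value
    # that never occurs in the original data.
    data = list(data)
    table = {}  # new byte -> the pair of bytes it replaced
    present = set(data)
    unused = [b for b in range(256) if b not in present]
    for _ in range(max_merges):
        if not unused or len(data) < 2:
            break
        counts = collections.Counter(zip(data, data[1:]))   # step 1: count pairs
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break                                            # nothing worth replacing
        new_byte = unused.pop()
        table[new_byte] = pair                               # step 2: record it in the dictionary
        out, i = [], 0
        while i < len(data):                                 # step 3: replace every occurrence
            if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
                out.append(new_byte)
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
    return bytes(data), table

def bpe_expand(data: bytes, table: dict) -> bytes:
    # Decompression: undo the substitutions in reverse order of creation.
    data = list(data)
    for new_byte, pair in reversed(list(table.items())):
        out = []
        for b in data:
            if b == new_byte:
                out.extend(pair)
            else:
                out.append(b)
        data = out
    return bytes(data)

compressed, table = bpe_compress(b"abababcabab")
assert bpe_expand(compressed, table) == b"abababcabab"

The dictionary travels with the compressed data; without it there is no way to know which bytes stand in for which pairs.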
In the NMT paper, the steps are much the same, but instead of working with bytes you work with individual characters. This is what that looks like…
Initial Vocabulary: Initially, the vocabulary is built with all individual characters present in the training corpus.
Pair Frequencies: Like in the traditional BPE, the frequencies of all pairs are calculated, but this time instead of at the byte level, it’s at the character level.
Merge Operations: Based on the calculated frequencies, a number of merge operations are performed. A merge operation is where the most frequently occurring character pair is merged to form a new symbol (or subword unit). For example, ‘l’ and ‘o’ might be merged into the new symbol ‘lo’.
Iterative Merging: This process is carried out iteratively: in each iteration, the most frequent pair of symbols (which initially are characters, but later can be subwords formed from previous merges) is merged to form a new symbol. This continues until a predetermined number of merge operations has been completed, constructing a subword vocabulary that contains both individual characters and merged symbols.
Subword Segmentation: Once the vocabulary is built, words in the training data are segmented into subword units present in the vocabulary. This segmentation helps in reducing the number of out-of-vocabulary words, as rare words can be represented as sequences of subword units.
The minimal Python implementation of this ends up being surprisingly small. Here is the Python code from figure 1 of the paper.
import re, collections

def get_stats(vocab):
    # Count how often each adjacent pair of symbols occurs, weighted by word frequency.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Rewrite the vocabulary, merging every occurrence of `pair` into a single symbol.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# Toy vocabulary: each word is a space-separated sequence of characters
# plus an end-of-word marker, mapped to its frequency in the corpus.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)
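Each line this prints is the pair chosen at that merge step. With the toy vocabulary above, the first merges you should see are ('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w'), and so on: frequent word endings like 'est</w>' quickly become single symbols. As the paper points out, that is what makes rare words tractable, since a word like "lowest" that never appears in the training vocabulary can still be segmented into the known pieces "low" and "est</w>".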
Probably the easiest way to visualize what this looks like in practice is to use OpenAI's token visualizer. Each colored section of the text is a token, or subword. In the example below, the text is highlighted from an excerpt of The Strange Case of Dr. Jekyll and Mr. Hyde.
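If you would rather poke at tokenization locally instead of in the browser, OpenAI's tiktoken package can run the same style of byte-level BPE tokenizer on your own text. A minimal sketch, assuming tiktoken is installed (pip install tiktoken) and using the GPT-2 vocabulary:

import tiktoken

# Load a pretrained byte-level BPE vocabulary (GPT-2's, in this case).
enc = tiktoken.get_encoding("gpt2")

text = "Mr. Utterson the lawyer was a man of a rugged countenance"
token_ids = enc.encode(text)

# Decode each id on its own to see the subword it stands for.
print([enc.decode([t]) for t in token_ids])

Common words come out as single tokens, while rarer strings (a name like Utterson, for example) typically get split into several subword pieces.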
So what is the advantage of all of this? Normally, Neural Machine Translation models are limited to vocabularies of 30,000–50,000 words. This proves to be an issue because translation is an “open-vocabulary” problem: when dealing with language, there is a nigh unlimited set of novel words, sentence constructions, and semantic meanings to wade through. This is especially difficult when translating agglutinative languages like Esperanto, which can create new words by chaining affixes and roots onto existing ones. For instance, in Esperanto, adding the prefix "mal-" negates the meaning of the subsequent word: "hela" translates to "light", but "malhela" conveys "dark". This flexibility allows for creative word combinations like kontraŭmalsanterapio (“therapy against sickness”), which you will probably never find in any Esperanto dictionary. This adds a lot of complexity to translation.
When a model encounters an out-of-vocabulary word, the typical approach is to use a back-off dictionary that contains translations of rare words. But this assumes the dictionary will have the rare word in the first place. Subwords created through the modified BPE algorithm improve the model's ability to handle these situations. For example, this allows models to handle different forms of a word (e.g., "run," "ran," "running") by breaking them into their corresponding subword units and then generating the target translation based on those subunits. This proves helpful for languages with rich inflectional morphology, where words change form based on tense, case, gender, etc. Instead of storing all possible inflected forms in the vocabulary, subword units allow the model to generate word forms from a combination of subunits, making the model more efficient and capable of handling a wide range of words.
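To make that last point concrete, here is a rough sketch (not the paper's exact segmenter, just an illustration) of how a learned list of merges, like the ones printed by the training loop above, could be replayed to split an unseen word into subword units:

def segment(word, merges):
    # Replay the learned merges, in the order they were learned, on a single word.
    # `merges` is a list of pairs such as [('e', 's'), ('es', 't'), ('est', '</w>'), ...].
    symbols = list(word) + ['</w>']         # start from characters plus the end-of-word marker
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair into one symbol
            else:
                i += 1
    return symbols

# "lowest" never appears in the toy vocabulary above, but its pieces do.
print(segment('lowest', [('e', 's'), ('es', 't'), ('est', '</w>'),
                         ('l', 'o'), ('lo', 'w')]))
# -> ['low', 'est</w>']

The model then translates those familiar pieces instead of falling back to a dictionary lookup for a word it has never seen.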
But while Byte Pair Encoding has proved to be a very effective way to get NMTs and LLMs to understand human language, people are continuing to test other ways text can be encoded. I think it's fascinating how an algorithm conceived three decades ago could breathe new life into such cutting-edge technology. It makes me wonder what other breakthroughs discovered many years ago will see new life in future applications.
Call To Action 📣
If you made it this far thanks for reading! If you are new welcome! I like to talk about technology, niche programming languages, AI, and low-level coding. I’ve recently started a Twitter and would love for you to check it out. I also have a Mastodon if that is more your jam. If you liked the article, consider liking and subscribing. And if you haven’t why not check out another article of mine! Thank you for your valuable time.