Artificial Intelligence and Machine Learning have taken the world by storm, revolutionizing the way we interact with technology. Within this field, large language models stand out, and among the most important components shaping their capabilities is Byte Pair Encoding (BPE). The roots of Byte Pair Encoding trace back to an article in the February 1994 issue (Volume 12, Issue 2) of the C/C++ Users Journal. Authored by Philip Gage and titled A New Algorithm for Data Compression, this article introduced the technique. Originally conceived three decades ago as a compression scheme, the algorithm has found new relevance as a means of improving Neural Machine Translation (NMT), as showcased in the paper Neural Machine Translation of Rare Words with Subword Units.
Machine translation is an important part of the human experience. While roughly 70% of all humans speak one of the ten most spoken languages in the world, there are thousands of languages spoken on this earth. Breaking down the language barrier is critical to developing a deeper understanding of our fellow humans, or at least that is what L. L. Zamenhof, the Ashkenazi Jewish ophthalmologist, believed when he created Esperanto in 1887.
“The internal idea of Esperanto is: the foundation of a neutral language will help break down barriers between peoples and help people get used to the idea that each one of them should see their neighbors only as a human being and a brother.”
– L. L. Zamenhof, 1912
Esperanto may not have achieved the status of the “Lingvo Internacia” that Zamenhof had dreamed of, but machine translation has made significant strides in enhancing communication and allowing us to share our ideas globally. The Bilingual Evaluation Understudy score (BLEU) is how we empirically gauge a machine translation against a human reference translation. By modifying the Byte Pair Encoding algorithm, the authors of the NMT paper created a subword segmentation scheme that improved machine translation quality, as measured by BLEU, for English → German and English → Russian.
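As a rough illustration of what BLEU measures, here is a minimal single-reference, sentence-level sketch (the real metric is typically computed at the corpus level with multiple references and smoothing; the `bleu` function name and its structure are my own simplification):

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a correct word cannot inflate the score.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages translations shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to the reference scores 1.0, while one sharing no words with it scores 0.0; real systems land somewhere in between.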
To understand how it was modified, it’s important to look at the algorithm’s original implementation. The original C code for BPE can be found here, but the gist is to replace frequently occurring pairs of bytes in the data with a single byte that doesn’t occur in the original data. This reduces the overall size of the data. The general steps of this method are as follows:
- Count Pairs: Start by counting the frequency of all byte pairs in the input data.
- Create Dictionary: Create a dictionary to record the most frequent pairs and assign them a new unique byte.
- Replace Most Frequent Pair: Find the most frequently occurring pair of bytes, and replace all occurrences of this pair with the new unique byte that you assigned to it in the dictionary.
- Repeat: Repeat steps 1 to 3 until no more replacements can be done or until you have reached a specified number of iterations.
- Decompress: To restore the original data, reverse the replacements by walking the dictionary in the opposite order, expanding each substituted byte back into its original pair.
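The steps above can be sketched in Python (a simplified illustration of the scheme, not Gage’s original C implementation; the function names and the `max_merges` parameter are my own):

```python
from collections import Counter

def bpe_compress(data: bytes, max_merges: int = 100):
    """Greedy byte-pair replacement: repeatedly swap the most frequent
    pair of bytes for a byte value unused in the original data."""
    symbols = list(data)
    used = set(symbols)
    unused = [b for b in range(256) if b not in used]  # candidate new bytes
    table = {}  # new byte -> the pair it replaced, in merge order
    for _ in range(max_merges):
        if not unused:
            break  # no spare byte values left
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # replacing a pair seen once saves nothing
        new_byte = unused.pop()
        table[new_byte] = pair
        # Replace non-overlapping occurrences, left to right.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_byte)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return bytes(symbols), table

def bpe_decompress(data: bytes, table: dict) -> bytes:
    """Undo the merges in reverse order of application."""
    out = list(data)
    for new_byte, pair in reversed(list(table.items())):
        expanded = []
        for b in out:
            expanded.extend(pair if b == new_byte else (b,))
        out = expanded
    return bytes(out)
```

Running `bpe_compress(b"aaabdaaabac")` shrinks the input by merging the frequent pair `aa` (and then pairs built from it), and `bpe_decompress` recovers the original bytes exactly. The NMT paper’s twist, covered next, is to merge characters into subword units rather than bytes into shorter files.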