Introduction

Language models are mathematical functions; they operate on numbers, not raw text. Tokenization is the crucial first step in converting human-readable text into a sequence of integers (tokens) that a model can process. These tokens are then mapped to embedding vectors.


1. Naive Approaches and Their Flaws

Word-Level Tokenization

The most intuitive approach: split text by spaces and punctuation.

  • Problems:
    1. Vocabulary Explosion: A language like English has hundreds of thousands of words. The model’s vocabulary would be enormous, making the final embedding and output layers computationally massive.
    2. Out-of-Vocabulary (OOV) Words: If the model encounters a word not seen during training (e.g., a new slang term, a typo, or a technical name), it has no token for it. It typically maps it to an <UNK> (unknown) token, losing all semantic meaning.
    3. Poor Generalization: The model treats eat, eating, and eaten as three completely separate, unrelated tokens. It fails to capture the shared root eat, making it harder to learn morphological relationships.

Character-Level Tokenization

The opposite extreme: split text into individual characters.

  • Advantage: The vocabulary is tiny and fixed (e.g., ~256 for bytes). There are no OOV words.
  • Problems:
    1. Loss of Semantic Meaning: A single character like t has very little semantic value on its own. The model must expend significant capacity just to learn the structure of common words from characters.
    2. Long Sequences: A simple sentence becomes a very long sequence of tokens. This makes it computationally difficult for the model to learn long-range dependencies and can quickly exceed the context window.

2. The Solution: Subword Tokenization

Modern LLMs use a hybrid approach called subword tokenization, which combines the best of both worlds.

  • Core Idea: Common words are kept as single, unique tokens (e.g., the, is, and), while rare or complex words are broken down into smaller, meaningful sub-units (e.g., tokenization -> token + ization).
  • Benefits:
    • It gracefully handles OOV words by breaking them into known subwords.
    • It captures morphological relationships (e.g., eating -> eat + ing).
    • It keeps the vocabulary size manageable while maintaining semantic richness.

Algorithm: Byte-Pair Encoding (BPE)

BPE is a data-driven algorithm used to create a subword vocabulary. It starts with a base vocabulary of individual characters (or bytes) and learns merge rules from a large text corpus.

  1. Initialization: The initial vocabulary consists of all single bytes (0-255).
  2. Iteration:
     a. Count the frequency of all adjacent token pairs in the training corpus.
     b. Find the most frequent pair (e.g., ('e', 'r')).
     c. Merge this pair into a new, single token ('er').
     d. Add the new token to the vocabulary and replace all occurrences of the pair in the corpus with it.
  3. Repeat: Continue this process for a predetermined number of merges, which defines the final vocabulary size.

The byte_pair_encoding_algo.py script in this repository provides a simple, hands-on example of this process.
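
For readers who want to see the loop before opening that script, here is a minimal, self-contained sketch of the merge-learning step. It starts from characters rather than raw bytes for readability, and the toy corpus and function name are illustrative choices, not code taken from the repository.

from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    # Represent the corpus as a list of token sequences, one per word.
    words = [list(word) for word in corpus.split()]
    merges = []

    for _ in range(num_merges):
        # Step 2a: count the frequency of all adjacent token pairs.
        pair_counts = Counter()
        for word in words:
            pair_counts.update(zip(word, word[1:]))
        if not pair_counts:
            break

        # Step 2b: find the most frequent pair.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Steps 2c/2d: replace every occurrence of that pair with one merged token.
        new_words = []
        for word in words:
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words.append(merged)
        words = new_words

    return merges, words

merges, segmented = train_bpe("low lower lowest newer newest", num_merges=5)
print(merges)     # learned merge rules, most frequent pair first
print(segmented)  # the toy corpus after all merges are applied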

BPE Training Process Visualization

Our implementation shows the BPE training process in real time:

  1. Pre-processing: The tokenizer processes ~1MB of Shakespeare text
  2. Word Tokenization: Identifies 15,057 unique word tokens
  3. Pair Counting: Analyzes frequency of adjacent character/token pairs
  4. Merge Computation: Learns 12,455 merge rules to build the final vocabulary

This creates a vocabulary that efficiently represents Shakespeare’s linguistic patterns, learning common words like “the”, “and”, “to” as single tokens while breaking down rare words into meaningful subparts.


3. Modern Tokenizers in GPT Models

OpenAI’s models use a library called tiktoken for high-performance BPE tokenization.

  • cl100k_base: The tokenizer used for models like GPT-3.5 and GPT-4. It has a vocabulary size of ~100,000.
  • o200k_base: A newer tokenizer used by GPT-4o and later models. It has a larger vocabulary of ~200,000 and includes special tokens designed for chat and instruction-following.
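
The easiest way to inspect these encodings is through tiktoken itself. The snippet below is a minimal sketch; the exact ids and vocabulary sizes are whatever your installed tiktoken version reports.

import tiktoken

# Encoding used by GPT-3.5 / GPT-4 era models.
enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)                        # vocabulary size, roughly 100k

ids = enc.encode("Tokenization is fun!")
print(ids)                                # integer token ids
print([enc.decode([i]) for i in ids])     # the text piece behind each id

# The newer, larger encoding.
print(tiktoken.get_encoding("o200k_base").n_vocab)   # roughly 200k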

Special Tokens for Chat

Modern tokenizers reserve special tokens to encode conversational structure, which is critical for chat models.

  • Chat Formatting: Tokens like <|im_start|> and <|im_end|> are used to delineate turns in a conversation, allowing the model to understand the flow of a multi-speaker dialogue.
  • Role-Based Prompting: Tokens are used to specify roles (e.g., system, user, assistant), preserving the instructions and context for how the model should behave.

These special tokens are not part of the regular text but are essential metadata that the model uses to understand its task. As shown in tokenizer.py, you can enable them with allowed_special="all".
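
By default, tiktoken refuses to encode text that contains a registered special token (a guard against prompt injection through user text); passing allowed_special lifts that restriction. A small sketch using <|endoftext|>, which cl100k_base registers as a special token:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "First document.<|endoftext|>Second document."

# enc.encode(text) would raise a ValueError because of the special token.
ids = enc.encode(text, allowed_special="all")
print(ids)              # the special token is encoded as a single reserved id
print(enc.decode(ids))  # round-trips back to the original string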

Comprehensive Tokenization Comparison

Let’s compare different tokenization approaches with practical examples:

Example 1: Simple Text

Input: "Hello world!"

  • Character-level: 12 tokens ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']
  • Word-level: 2 tokens ['Hello', 'world!']
  • GPT-4 BPE: 3 tokens ['Hello', ' world', '!']
  • Custom BPE: 4 tokens ['H', 'ello', ' world', '!']
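
The first three rows can be reproduced with a few lines of Python, using the naive splits from Section 1 and tiktoken's cl100k_base for the GPT-4 column (the custom BPE row depends on the Shakespeare tokenizer trained later in this document):

import tiktoken

text = "Hello world!"

# Character-level: one token per character.
char_tokens = list(text)

# Word-level: a naive whitespace split (punctuation stays attached).
word_tokens = text.split()

# GPT-4-style BPE via tiktoken's cl100k_base encoding.
enc = tiktoken.get_encoding("cl100k_base")
bpe_ids = enc.encode(text)
bpe_pieces = [enc.decode([i]) for i in bpe_ids]

print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens)
print(len(bpe_ids), bpe_pieces)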

Example 2: Complex Words

Input: "Supercalifragilisticexpialidocious"

  • Character-level: 34 tokens (one per character)
  • Word-level: 1 token (entire word, likely OOV)
  • GPT-4 BPE: 11 tokens ['Sup', 'erc', 'al', 'if', 'rag', 'il', 'istic', 'exp', 'ial', 'id', 'ocious']

This demonstrates BPE’s key advantage: it gracefully handles unknown words by breaking them into meaningful subparts.

Example 3: Modern Terms

Input: "COVID-19 vaccination appointments"

  • GPT-4 BPE: 5 tokens ['COVID', '-', '19', ' vaccination', ' appointments']
  • Custom Shakespeare BPE: 10 tokens (struggles with modern terms)

This shows how tokenizer training data affects performance on different domains.

BPE Benefits Analysis

Morphological Understanding

  • "running" → 1 token (common word, learned as whole)
  • "unhappiness" → 3 tokens ['un', 'h', 'appiness'] (breaks down prefix/suffix)
  • "tokenization" → 2 tokens ['token', 'ization'] (recognizes root + suffix)

Compression Efficiency

From our analysis of a technical paragraph:

  • Character-level: 344 tokens (1:1 ratio)
  • Word-level: 49 tokens (7x compression)
  • BPE: 67 tokens (5.1x compression vs characters, better vocabulary management)
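
The ratios are simply the character-level count divided by each token count; a quick check of the arithmetic:

# Token counts reported above for the same technical paragraph.
char_count, word_count, bpe_count = 344, 49, 67

print(f"word-level: {char_count / word_count:.1f}x compression")  # ~7.0x
print(f"BPE:        {char_count / bpe_count:.1f}x compression")   # ~5.1x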

Key Insights

  1. Vocabulary Size: BPE requires a much smaller vocabulary than word-level tokenization
  2. OOV Handling: BPE never produces unknown tokens, since any input can fall back to known subwords or individual bytes
  3. Compression: Balances sequence length with vocabulary size
  4. Domain Adaptation: Learns patterns specific to training data

Practical Demonstration: Shakespeare Tokenizer in Action

Now that we understand the theory, let’s see tokenization in action using our custom BPE tokenizer trained on Shakespeare’s complete works:

Shakespeare-Optimized Tokenization

Our custom tokenizer shows interesting domain-specific behavior:

Example 1: Classic Shakespeare

Input: "To be or not to be, that is the question"
Tokens: ['To', 'Ġbe', 'Ġor', 'Ġnot', 'Ġto', 'Ġbe', ',', 'Ġthat', 'Ġis', 'Ġthe', 'Ġquestion']
Token Count: 11 tokens

Example 2: Shakespearean Language

Input: "Thou art more lovely and more temperate"
Tokens: ['Thou', 'Ġart', 'Ġmore', 'Ġlovely', 'Ġand', 'Ġmore', 'Ġtemperate']
Token Count: 7 tokens

Example 3: Famous Shakespeare Quote

Input: "All the world's a stage, and all the men and women merely players"
Tokens: ['All', 'Ġthe', 'Ġworld', "'s", 'Ġa', 'Ġstage', ',', 'Ġand', 'Ġall', 'Ġthe', 'Ġmen', 'Ġand', 'Ġwomen', 'Ġmerely', 'Ġplayers']
Token Count: 15 tokens
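
These token lists come from the trained tokenizer itself. Loading it with the Hugging Face tokenizers library and encoding a line looks roughly like this; the file name is a placeholder for wherever the trained tokenizer is saved in this repository:

from tokenizers import Tokenizer

# Placeholder path; point this at the trained tokenizer file produced below.
tokenizer = Tokenizer.from_file("shakespeare_tokenizer.json")

encoding = tokenizer.encode("To be or not to be, that is the question")
print(encoding.tokens)    # e.g. ['To', 'Ġbe', 'Ġor', ...] -- 'Ġ' marks a leading space
print(len(encoding.ids))  # token count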

Implementation Code

Our tokenizer training implementation demonstrates the BPE process:

from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel as ByteLevelPreTokenizer
from tokenizers.processors import ByteLevel as ByteLevelPostProcessor
from tokenizers.trainers import BpeTrainer

def train_tokenizer(data_path: str, tokenizer_path: str) -> Tokenizer:
    # Initialize a byte-level BPE tokenizer
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = ByteLevelPreTokenizer(add_prefix_space=False)
    tokenizer.post_processor = ByteLevelPostProcessor(trim_offsets=False)
    tokenizer.decoder = decoders.ByteLevel()

    # Train with special tokens and a minimum pair frequency
    trainer = BpeTrainer(special_tokens=["<|endoftext|>"], min_frequency=2)
    tokenizer.train([data_path], trainer)

    # Persist the trained tokenizer so it can be reloaded later
    tokenizer.save(tokenizer_path)
    return tokenizer
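
Training and saving then reduce to a single call. The file names below are placeholders rather than the repository's actual paths:

# Placeholder paths; substitute the corpus and output file used in this repository.
tokenizer = train_tokenizer("shakespeare.txt", "shakespeare_tokenizer.json")
print(tokenizer.get_vocab_size())  # roughly 12.5k for the Shakespeare corpus described above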

Analysis of Results

Domain Adaptation Success:

  • Archaic Words: “Thou”, “art” are learned as single tokens (common in Shakespeare)
  • Complex Words: “temperate” stays whole, showing efficient learning from corpus
  • Vocabulary Efficiency: Our tokenizer has ~12,521 tokens, optimized for Early Modern English

Technical Insights:

  • Space Encoding: The 'Ġ' prefix marks a token that begins with a space (the byte-level BPE convention)
  • Punctuation Handling: Contractions like “’s” and punctuation are separate tokens
  • Compression: Achieves good balance between sequence length and vocabulary size

This demonstrates how BPE tokenizers adapt to their training domain, making them highly effective for specific text types while maintaining the flexibility to handle any input text through subword decomposition.