
Understanding Tokenization: How LLMs Read and Process Text

Learn how LLMs break text into tokens using BPE, WordPiece, and SentencePiece algorithms, and how tokenization impacts cost, performance, and application design.

Why Tokenization Matters

LLMs do not read text the way humans do. They cannot process raw characters or even whole words directly. Instead, every piece of text is first broken into tokens — subword units that the model uses as its vocabulary. Tokenization is the first step in every LLM interaction, and it affects everything from cost (you pay per token) to capability (context windows are measured in tokens) to behavior (how the model "sees" your text).

Understanding tokenization is not optional knowledge for anyone building with LLMs. It is foundational.

What Is a Token?

A token is a chunk of text that the model treats as a single unit. Tokens can be whole words, parts of words, single characters, or even punctuation. The exact split depends on the tokenizer.

import tiktoken

# Load a tokenizer. cl100k_base is the vocabulary used by GPT-4 and
# GPT-3.5-turbo; GPT-4o uses the newer o200k_base vocabulary, which
# produces different token IDs for the same text.
enc = tiktoken.get_encoding("cl100k_base")

# Tokenize a simple sentence
text = "Hello, world!"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode each token to see what it represents
for token_id in tokens:
    print(f"  {token_id} -> '{enc.decode([token_id])}'")

# Output:
# Text: Hello, world!
# Token IDs: [9906, 11, 1917, 0]
# Token count: 4
#   9906 -> 'Hello'
#   11 -> ','
#   1917 -> ' world'
#   0 -> '!'

Notice that "Hello" is one token, the comma is one token, " world" (with the leading space) is one token, and "!" is one token. Common words tend to be single tokens, while rare words get split into subwords.

Byte Pair Encoding (BPE): The Dominant Algorithm

Most modern LLMs use Byte Pair Encoding (BPE) or a variant of it. BPE builds a vocabulary by starting with individual bytes and iteratively merging the most frequent pairs:

def simple_bpe_training(text, num_merges=10):
    """Simplified BPE training to illustrate the algorithm."""
    # Start with individual characters
    tokens = list(text)
    vocab = set(tokens)

    for i in range(num_merges):
        # Count all adjacent pairs
        pairs = {}
        for j in range(len(tokens) - 1):
            pair = (tokens[j], tokens[j + 1])
            pairs[pair] = pairs.get(pair, 0) + 1

        if not pairs:
            break

        # Find the most frequent pair
        best_pair = max(pairs, key=pairs.get)
        merged = best_pair[0] + best_pair[1]
        vocab.add(merged)

        # Merge all occurrences of this pair
        new_tokens = []
        j = 0
        while j < len(tokens):
            if j < len(tokens) - 1 and (tokens[j], tokens[j + 1]) == best_pair:
                new_tokens.append(merged)
                j += 2
            else:
                new_tokens.append(tokens[j])
                j += 1
        tokens = new_tokens

        print(f"Merge {i+1}: '{best_pair[0]}' + '{best_pair[1]}' -> '{merged}' (count: {pairs[best_pair]})")

    return tokens, vocab

text = "the cat sat on the mat the cat ate the rat"
final_tokens, vocab = simple_bpe_training(text, num_merges=5)
print(f"Final tokens: {final_tokens}")
print(f"Vocabulary size: {len(vocab)}")

The key property of BPE is that common words become single tokens while rare words are broken into smaller pieces. This means the model never encounters an out-of-vocabulary word — it can always fall back to character-level or byte-level representations.
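Once the merge list has been learned, encoding new text simply replays those merges in priority order. The sketch below illustrates this with a small hypothetical merge list (not taken from any real tokenizer): a word covered by the merges collapses to one token, while an unseen word falls back toward individual characters.

```python
def bpe_encode(word, merges):
    """Tokenize a word by replaying learned BPE merges in training order."""
    tokens = list(word)  # start from individual characters
    for a, b in merges:
        new_tokens = []
        j = 0
        while j < len(tokens):
            # Merge every adjacent occurrence of this pair
            if j < len(tokens) - 1 and tokens[j] == a and tokens[j + 1] == b:
                new_tokens.append(a + b)
                j += 2
            else:
                new_tokens.append(tokens[j])
                j += 1
        tokens = new_tokens
    return tokens

# Merges a trained tokenizer might have learned, in priority order
merges = [("t", "h"), ("th", "e"), ("a", "t"), ("c", "at")]

print(bpe_encode("the", merges))      # ['the'] -- common word, one token
print(bpe_encode("that", merges))     # ['th', 'at'] -- two subwords
print(bpe_encode("cathode", merges))  # mostly single characters -- graceful fallback
```

This is exactly the fallback property in action: no word is ever "unknown", it just costs more tokens.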

WordPiece and SentencePiece

Other tokenization algorithms serve different models:

WordPiece (used by BERT) is similar to BPE but selects merges based on likelihood rather than frequency. It uses the "##" prefix to indicate subword continuation:


# WordPiece example (conceptual)
# "unbelievable" might be tokenized as:
# ["un", "##believ", "##able"]

# The ## prefix means "this token continues the previous word"
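At inference time, WordPiece segments each word by greedy longest-match-first lookup against its vocabulary. A minimal sketch with a made-up toy vocabulary (real WordPiece vocabularies hold roughly 30,000 entries):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, as WordPiece does at inference time."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocab hit
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker for non-initial pieces
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no valid segmentation exists
        tokens.append(match)
        start = end
    return tokens

vocab = {"un", "##believ", "##able", "cat"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]'] -- nothing matches
```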

SentencePiece (used by Llama, Mistral, and T5) treats the input as a raw stream of Unicode text, with no language-specific pre-tokenization. Whitespace is encoded into the tokens themselves (marked with the ▁ metasymbol), so it handles any language and does not require whitespace-separated words:

# SentencePiece works directly on raw text
# No need for language-specific pre-processing
# Useful for multilingual models

# Example with the sentencepiece library:
import sentencepiece as spm

# Load a pre-trained SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("path/to/model.model")

text = "This is a test sentence."
tokens = sp.encode(text, out_type=str)
print(tokens)  # e.g. ['▁This', '▁is', '▁a', '▁test', '▁sent', 'ence', '.'] -- exact split depends on the model
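The ▁ metasymbol is what makes SentencePiece lossless: detokenization is just "concatenate the pieces and turn the markers back into spaces", regardless of how the text was segmented. A stdlib sketch of that round trip (a simplification of what the library does internally):

```python
META = "\u2581"  # ▁, SentencePiece's whitespace marker

def sp_pretokenize(text):
    """Mark spaces with the metasymbol so segmentation can preserve them."""
    return META + text.replace(" ", META)

def sp_detokenize(pieces):
    """Lossless inverse: concatenate, then turn markers back into spaces."""
    return "".join(pieces).replace(META, " ").lstrip()

marked = sp_pretokenize("This is a test")
# Any segmentation of the marked text round-trips; chunk arbitrarily to show it
pieces = [marked[i:i + 4] for i in range(0, len(marked), 4)]
print(sp_detokenize(pieces))  # 'This is a test'
```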

How Tokenization Affects Your Applications

Tokenization has direct, practical consequences for everything you build with LLMs.

Cost is measured in tokens, not words. A rough rule of thumb for English text is that 1 token is approximately 4 characters or 0.75 words. But this varies dramatically:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

samples = {
    "English prose": "The quick brown fox jumps over the lazy dog.",
    "Python code": "def fibonacci(n):\n    return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "JSON data": '{"name": "Alice", "age": 30, "city": "New York"}',
    "URLs": "https://api.example.com/v2/users?page=1&limit=50",
    "Non-English": "量子コンピューティングは未来の技術です。",  # Japanese: "Quantum computing is the technology of the future."
}

for label, text in samples.items():
    token_count = len(enc.encode(text))
    ratio = len(text) / token_count
    print(f"{label:20s}: {token_count:3d} tokens, {ratio:.1f} chars/token")

Code and structured data tend to use more tokens per character than natural English. Non-Latin scripts often use significantly more tokens because they are less represented in the training data.

Context windows are token budgets. When a model has a 128K token context window, that includes both your input and the model's output. A system prompt, conversation history, and retrieved documents all compete for the same token budget:

def estimate_context_usage(system_prompt, history, retrieved_docs, model="gpt-4o"):
    """Estimate how much of the context window you are using."""
    enc = tiktoken.encoding_for_model(model)

    system_tokens = len(enc.encode(system_prompt))
    history_tokens = sum(len(enc.encode(msg["content"])) for msg in history)
    docs_tokens = sum(len(enc.encode(doc)) for doc in retrieved_docs)

    total = system_tokens + history_tokens + docs_tokens
    max_context = 128_000  # GPT-4o context window

    print(f"System prompt: {system_tokens:,} tokens")
    print(f"History:       {history_tokens:,} tokens")
    print(f"Documents:     {docs_tokens:,} tokens")
    print(f"Total input:   {total:,} / {max_context:,} tokens ({total/max_context*100:.1f}%)")
    print(f"Remaining for output: {max_context - total:,} tokens")

    return total
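When that budget fills up, something has to be dropped. A simple sketch that trims the oldest conversation turns first, using the rough 4-characters-per-token heuristic from above (swap in a real tokenizer such as tiktoken when you need accurate counts):

```python
def trim_history_to_budget(history, budget_tokens, chars_per_token=4):
    """Drop the oldest messages until the estimated token count fits the budget."""
    def estimate(msgs):
        # Rough heuristic: ~4 characters per token for English text
        return sum(len(m["content"]) // chars_per_token + 1 for m in msgs)

    trimmed = list(history)
    while trimmed and estimate(trimmed) > budget_tokens:
        trimmed.pop(0)  # oldest turn goes first
    return trimmed

history = [
    {"role": "user", "content": "x" * 400},       # ~101 estimated tokens
    {"role": "assistant", "content": "y" * 400},  # ~101 estimated tokens
    {"role": "user", "content": "z" * 40},        # ~11 estimated tokens
]
kept = trim_history_to_budget(history, budget_tokens=150)
print(len(kept))  # 2 -- the oldest turn was dropped to fit
```

Dropping oldest-first is only one policy; production systems often summarize old turns instead of discarding them outright.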

Tokenization Edge Cases and Pitfalls

Tokenization can produce surprising behavior. Here are common pitfalls:

enc = tiktoken.encoding_for_model("gpt-4o")

# Trailing spaces change tokenization
print(len(enc.encode("Hello")))      # 1 token
print(len(enc.encode("Hello ")))     # 2 tokens (the trailing space becomes its own token)

# Numbers tokenize inconsistently
print(len(enc.encode("100")))        # typically 1 token
print(len(enc.encode("1000")))       # may be 1 token or split into digit groups
print(len(enc.encode("123456789")))  # usually several tokens

# Repeated characters are expensive
print(len(enc.encode("aaa")))        # a short run may compress into a single token
print(len(enc.encode("aaaaaaaaaa"))) # longer runs often cost more tokens than you might expect

These edge cases matter when you are counting tokens for cost estimation, context window management, or when debugging why a model's response seems to cut off unexpectedly.
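For cost estimation it helps to have one counting helper that degrades gracefully in environments where tiktoken is not installed. A sketch that prefers the exact count and falls back to the ~4-characters-per-token heuristic:

```python
def count_tokens(text, model="gpt-4o"):
    """Exact token count via tiktoken when available, else a rough heuristic."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except ImportError:
        # Fallback: ~4 characters per token for English text (never report 0)
        return max(1, len(text) // 4)

print(count_tokens("Hello, world!"))  # exact if tiktoken is installed, estimated otherwise
```

The heuristic undercounts badly on code, non-Latin scripts, and repeated characters, so treat it as a floor for budgeting rather than a billing-grade number.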

FAQ

How do I count tokens before making an API call?

Use the tiktoken library from OpenAI. Call tiktoken.encoding_for_model("gpt-4o") to get the correct tokenizer for your model, then use enc.encode(text) to get the token list. The length of that list is your token count. For non-OpenAI models, use their respective tokenizer libraries or the transformers library from Hugging Face.

Why does the same text use different token counts across different models?

Each model family trains its own tokenizer with a different vocabulary. GPT-4o, Claude, Llama, and Gemini all have different tokenizers. A word that is a single token in one model might be two tokens in another. Always use the correct tokenizer for the specific model you are calling.

Does tokenization affect model accuracy?

Yes. If a word is split into subword tokens, the model must compose meaning from pieces, which can reduce accuracy on tasks involving rare words or specialized terminology. This is one reason why models perform better on common English than on technical jargon or low-resource languages — common text maps to fewer, more meaningful tokens.


#Tokenization #BPE #LLM #Tiktoken #NLP #AgenticAI #LearnAI #AIEngineering

CallSphere Team
