LLM Tokenization Deep Dive: BPE, SentencePiece, Tiktoken (2026)

Your model never sees the string “tokenization”. It sees the integer 83002 (in o200k_base), or 83118 (in cl100k_base), or three integers if you’re unlucky with a SentencePiece vocab trained on the wrong corpus. Tokenization is the single highest-leverage subject most application engineers skip: it controls your inference bill, your context-window math, your multilingual fairness, and a surprising fraction of your prompt-engineering surprises (yes, the trailing space matters).

Architecture at a glance

[Figure: architecture diagram]

This is a working-engineer deep dive into how the three dominant tokenizer families — Byte-Pair Encoding (BPE), SentencePiece, and OpenAI’s Tiktoken — actually work in 2026. We will train a BPE in 30 lines of Python, contrast it with SentencePiece’s Unigram LM, time Tiktoken’s Rust core against pure-Python alternatives, quantify the 6x cost penalty Tamil pays versus English, and look at the credible (but not yet dominant) tokenizer-free future being staked out by Meta’s Byte Latent Transformer. By the end you will know exactly which knob to turn when the bill spikes.

Why tokenization exists at all

Tokenization exists because transformers consume integer IDs into an embedding matrix, not Unicode code points or raw bytes. A tokenizer is the deterministic, lossless function from a string to a list of IDs (and back). The vocabulary size sets the width of the embedding/softmax matrices, so it directly trades model parameter count, sequence length, and out-of-vocabulary (OOV) behaviour.

There are three escape valves on the design space, and every tokenizer picks a point on them:

Granularity — characters, bytes, subwords, or whole words. Whole-word vocabularies blow up past 1M entries and still have OOV. Character/byte tokenizers have tiny vocabularies but multiply sequence length by 4–10x, blowing up attention’s quadratic cost.
Determinism vs. probabilistic — BPE merges are greedy and deterministic given a merge table. Unigram LM tokenizers (SentencePiece’s default) can sample alternative segmentations during training, which is a regularizer.
Pre-tokenization — does the algorithm assume the input has already been split on whitespace/punctuation (BPE, WordPiece), or does it operate on the raw stream including spaces (SentencePiece)?
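To make the granularity axis concrete, here is a minimal sketch that counts the same string at word, character, and byte granularity. The helper name granularity_counts is hypothetical, and whitespace splitting stands in for a real word tokenizer.

```python
def granularity_counts(text: str) -> dict[str, int]:
    """Sequence length of `text` at three tokenization granularities."""
    return {
        "words": len(text.split()),            # whitespace-delimited words
        "characters": len(text),               # Unicode code points
        "bytes": len(text.encode("utf-8")),    # UTF-8 bytes
    }


# "naïve café" written with escapes so the code points are unambiguous.
sample = "na\u00efve caf\u00e9"
print(granularity_counts(sample))
# {'words': 2, 'characters': 10, 'bytes': 12}
```

The two accented letters take two UTF-8 bytes each, which is why the byte count exceeds the character count. Subword tokenizers sit between the word and character rows.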

The first widely-deployed subword scheme was WordPiece, introduced by Schuster and Nakajima for Google’s voice search in 2012 and later picked up by BERT. The breakthrough that made subword tokenization the default for LLMs was Sennrich, Haddow and Birch’s 2016 BPE paper, which adapted a 1994 data-compression algorithm to neural machine translation. Every frontier model in 2026 — GPT-4o, o3, Claude 4.6, Gemini 2.0, LLaMA 3.3, DeepSeek V3, Qwen 3 — sits somewhere on the BPE / SentencePiece family tree.

The reference tokenization pipeline

Every production tokenizer is the same five-stage pipeline: normalize Unicode, pre-tokenize into spans, segment each span with a subword model, map subwords to IDs, then look up the embedding row. The differences between BPE, SentencePiece, and Tiktoken collapse into “what happens in stages 2 and 3”. Stages 1, 4, and 5 are nearly identical across implementations.

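Stage 1, normalization, is the source of more bugs than the algorithm itself. Whether and how you normalize decides whether a precomposed "café" (é as the single code point U+00E9) and a decomposed "café" (e followed by the combining accent U+0301) produce the same IDs. The Unicode Normalization Forms standard (UAX #15) defines four forms (NFC, NFD, NFKC, NFKD); SentencePiece defaults to an NFKC-based rule (nmt_nfkc) that individual models can override, while the GPT-4 family uses no explicit normalization (it relies on the byte-level fallback to make every input representable). Skip this and your retrieval pipeline will silently miss matches.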
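The normalization pitfall can be reproduced with the standard library alone. This is a minimal sketch using unicodedata; the helper name same_after is hypothetical.

```python
import unicodedata

composed = "caf\u00e9"      # é as one code point
decomposed = "cafe\u0301"   # e + combining acute accent


def same_after(form: str, a: str, b: str) -> bool:
    """True if the two strings are identical after normalizing to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)                   # False: different code points
print(same_after("NFC", composed, decomposed))  # True
print(same_after("NFKC", composed, decomposed)) # True

# NFKC additionally folds compatibility characters such as the "fi" ligature.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == "fi")   # False
print(unicodedata.normalize("NFKC", ligature) == "fi")  # True
```

Both strings render identically, so without normalization the mismatch only shows up as different token IDs downstream.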

Stage 2, pre-tokenization, is where philosophies diverge. GPT-2/3/4 use a regex that approximates “split on whitespace and group contractions”: the famous GPT-2 pattern opens with the contraction alternatives ('s|'t|'re|...) and continues with  ?\p{L}+| ?\p{N}+|... (note the optional leading space before each group). SentencePiece, by design, does not pre-tokenize. It treats spaces as ordinary characters (rendered as ▁, U+2581 LOWER ONE EIGHTH BLOCK) so that detokenization is lossless even for languages that don’t use spaces.
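The two pre-tokenization philosophies can be sketched side by side. The regex below is a standard-library approximation of the GPT-2 pattern, not the pattern itself: the real one uses \p{L} and \p{N}, which Python's re module does not support (the third-party regex module does). The names GPT2_LIKE, pre_tokenize, and sentencepiece_style are hypothetical.

```python
import re

# Approximation: [^\W\d_] stands in for \p{L}, \d for \p{N}.
GPT2_LIKE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # optional space + run of letters
    r"| ?\d+"                    # optional space + run of digits
    r"| ?(?:[^\s\w]|_)+"         # optional space + run of punctuation
    r"|\s+(?!\S)|\s+"            # remaining whitespace
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into spans; BPE merges never cross a span boundary."""
    return GPT2_LIKE.findall(text)


def sentencepiece_style(text: str) -> str:
    """No splitting: mark spaces with U+2581 and add a leading marker."""
    return "\u2581" + text.replace(" ", "\u2581")


text = "Hello world, it's 2026!"
print(pre_tokenize(text))
# ['Hello', ' world', ',', ' it', "'s", ' 2026', '!']
print(sentencepiece_style("Hello world"))
# ▁Hello▁world
```

Note that the leading space travels with the word that follows it, which is where leading-space tokens come from.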

Stage 3, subword segmentation, is the algorithm proper. We will walk through BPE, Unigram, and Tiktoken next.

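Stages 4–5 are mechanical: a hash map from subword string to integer ID, and an embedding lookup. The embedding matrix has shape V x d_model; with a 200,000-entry vocabulary and an assumed d_model of 12,288 (GPT-3’s published width; GPT-4o’s is not public), that is roughly 200,000 x 12,288 = ~2.5B parameters per matrix, and twice that if the input embedding and output projection are untied. That is why doubling the vocab is not free: the parameters cost memory before you train, and the output projection over the full vocabulary is computed on every forward pass.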
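The parameter arithmetic is easy to check. This sketch assumes the same illustrative shapes as the text (a 200,000-entry vocabulary and d_model of 12,288); the function name embedding_params is hypothetical.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters spent on token embeddings (and the LM head if untied)."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix


print(embedding_params(200_000, 12_288))              # 2457600000
print(embedding_params(200_000, 12_288, tied=False))  # 4915200000
print(embedding_params(100_000, 12_288))              # 1228800000
```

Halving the vocabulary halves this cost, but it lengthens every sequence, which is the trade the rest of the article quantifies.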

BPE byte-pair encoding, step by step

Byte-Pair Encoding starts from a character (or byte) vocabulary and greedily merges the most frequent adjacent pair until the vocabulary reaches the target size. The output is two artifacts: a vocab.json mapping tokens to IDs, and a merges.txt listing the ordered merge operations. At inference time, you apply the merges in order to any new string and read off the IDs.

Here is a runnable implementation in about 30 lines of Python (Python 3.11+). It follows the character-level training loop from the Sennrich et al. paper, with an end-of-word marker </w>, rather than GPT-2’s byte-level variant:

import re
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, corpus):
    """Fuse every occurrence of `pair` into a single symbol."""
    # Match only on whole-symbol boundaries. A plain str.replace would also
    # fire inside longer symbols (e.g. "a b" inside "xa b").
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    replacement = "".join(pair)
    return {pattern.sub(lambda m: replacement, w): f for w, f in corpus.items()}

def train_bpe(text, vocab_size):
    # word -> freq; each word represented as space-separated symbols + end marker
    words = Counter(text.split())
    corpus = {" ".join(w) + " </w>": f for w, f in words.items()}
    vocab = set(ch for w in corpus for ch in w.split())
    merges = []
    while len(vocab) < vocab_size:
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        # Ties go to the pair seen first (max returns the first maximum).
        best = max(pairs, key=pairs.get)
        corpus = merge_pair(best, corpus)
        merges.append(best)
        vocab.add(best[0] + best[1])
    return merges, vocab

# Note the trailing space: without it, repetition glues "widest" to "low".
merges, vocab = train_bpe("low low low lowest newest widest " * 100, vocab_size=30)
print(merges[:5])
# [('l', 'o'), ('lo', 'w'), ('low', '</w>'), ('e', 's'), ('es', 't')]

That is the entire core idea. Production BPE in Hugging Face tokenizers is a Rust implementation of the same algorithm with three additions: a priority queue so merges are O(N log N) not O(N^2), parallelism across documents, and byte-level fallback.
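Training is only half of BPE; at inference time the ordered merge list is replayed on each new word. The following is a minimal sketch of that step. The MERGES list is hand-written for illustration (it is the kind of list a trainer might emit for a corpus of "low", "lowest", "newest"), and encode_word is a hypothetical helper, not a library API.

```python
# Illustrative merge list, earliest (highest-priority) merge first.
MERGES = [
    ("l", "o"), ("lo", "w"), ("low", "</w>"),
    ("e", "s"), ("es", "t"), ("est", "</w>"),
]


def encode_word(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment one word by repeatedly applying the earliest-learned merge."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(encode_word("low", MERGES))     # ['low</w>']
print(encode_word("lowest", MERGES))  # ['low', 'est</w>']
print(encode_word("newest", MERGES))  # ['n', 'e', 'w', 'est</w>']
```

The last line shows the typical degradation: a word whose prefix never earned a merge falls back to single characters, while the frequent suffix stays one token.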

Byte-level BPE, introduced by GPT-2, runs the same algorithm but on the 256 UTF-8 byte values instead of characters. This is the elegant trick that makes the tokenizer truly closed: any sequence of bytes can always be encoded, so there is no <UNK> token and no OOV. The cost is that non-Latin scripts get tokenized at roughly UTF-8 byte rates if they weren’t well-represented in training (we will quantify this in the multilingual section).
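The "closed vocabulary" property follows directly from starting at bytes. A minimal sketch, with the hypothetical helpers to_byte_ids and from_byte_ids standing in for the base layer of a byte-level tokenizer before any merges are applied:

```python
def to_byte_ids(text: str) -> list[int]:
    """Base-level encoding: one ID (0-255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Inverse of to_byte_ids."""
    return bytes(ids).decode("utf-8")


# Latin, emoji, and Tamil text, written with escapes for clarity.
sample = "na\u00efve \U0001F642 \u0ba4"
ids = to_byte_ids(sample)

print(all(0 <= i < 256 for i in ids))   # True: nothing is out of vocabulary
print(from_byte_ids(ids) == sample)     # True: lossless round trip
print(len(to_byte_ids("\U0001F642")))   # 4 bytes for one emoji
```

Merges only ever shorten this sequence, so the byte count is the worst case for any input.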

Vocab sizes in 2026 cluster in two bands. The 32k–50k band (LLaMA 2 at 32k; GPT-2 and GPT-3 at 50,257 with r50k_base, the predecessor of cl100k_base) was optimized for English-dominant corpora. The 100k–200k band (cl100k_base at 100,277; o200k_base at 200,019, used in GPT-4o/o1/o3; LLaMA 3 at 128k; Gemini’s internal vocab at ~256k; Qwen 3 at 151k) reflects the shift to truly multilingual training data, where a bigger vocab buys back the per-token efficiency loss on non-English text.

SentencePiece — Unigram, BPE mode, and the underscore trick

SentencePiece is Google’s tokenizer library, presented in Kudo and Richardson’s 2018 paper. It is the tokenizer behind T5, mT5, ALBERT, XLNet, LLaMA 1/2, Mistral, Gemma 2/3, and most multilingual research models (LLaMA 3 switched to a Tiktoken-style byte-level BPE). Two things make it distinct from textbook BPE.

First, no pre-tokenization. The input is treated as a raw Unicode stream including spaces. Spaces are explicitly encoded as ▁ (U+2581) before training, so a sentence like "Hello world" becomes "▁Hello▁world". Detokenization is then trivially text.replace("▁", " ").lstrip() — lossless even for Chinese, Japanese, and Thai where whitespace is not a word boundary. This single design choice is why SentencePiece won the multilingual research community.

Second, Unigram LM tokenization is offered alongside BPE as a first-class mode. Introduced in Kudo 2018’s “Subword Regularization”, Unigram works in reverse to BPE: start with a huge seed vocabulary (say 1M subwords pulled from the corpus), assign each a probability, and iteratively prune the least-useful entries until you hit the target size. “Useful” is defined by the EM-estimated likelihood the entry contributes to the corpus.

The practical difference:

BPE picks one segmentation per word, deterministically, via the merge list.
Unigram scores every candidate segmentation. At inference it normally takes the single most probable one (a Viterbi search). During training it can instead sample from the top-k segmentations weighted by probability, which exposes the model to alternative segmentations of the same word and acts as a regularizer (the paper reports +1 to +2 BLEU on low-resource MT).
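The deterministic (most probable) Unigram segmentation can be sketched as a small Viterbi search. The piece probabilities below are invented toy values that do not sum to one, and viterbi_segment is a hypothetical helper, not the SentencePiece API.

```python
import math


def viterbi_segment(text: str, log_probs: dict[str, float]) -> tuple[list[str], float]:
    """Return the highest-probability segmentation of `text` and its log-prob."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best_score = [-math.inf] * (n + 1)  # best log-prob of text[:i]
    best_start = [0] * (n + 1)          # where the last piece of text[:i] begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, end = [], n
    while end > 0:  # walk the back-pointers from the end of the string
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]


# Toy vocabulary (assumed values for illustration only).
TOY_PROBS = {
    "\u2581low": 0.20, "est": 0.10, "\u2581lo": 0.05, "west": 0.04,
    "\u2581": 0.01, "l": 0.01, "o": 0.01, "w": 0.01,
    "e": 0.01, "s": 0.01, "t": 0.01,
}
TOY_LOG_PROBS = {piece: math.log(p) for piece, p in TOY_PROBS.items()}

pieces, score = viterbi_segment("\u2581lowest", TOY_LOG_PROBS)
print(pieces)  # ['▁low', 'est']
```

Here "▁low" + "est" (probability 0.02) beats "▁lo" + "west" (0.002). Subword regularization would occasionally sample the second option during training instead of always taking the first.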

A typical SentencePiece training command using the Google SentencePiece library:

spm_train \
  --input=corpus.txt \
  --model_prefix=my_tok \
  --vocab_size=32000 \
  --model_type=unigram \
  --character_coverage=0.9995 \
  --normalization_rule_name=nfkc \
  --byte_fallback=true \
  --split_digits=true \
  --user_defined_symbols='<|im_start|>,<|im_end|>,<|tool_call|>'

Two flags deserve attention. --character_coverage=0.9995 means “keep enough characters to cover 99.95% of the corpus; anything rarer falls back to bytes”. The SentencePiece README suggests 0.9995 for languages with rich character sets such as Japanese or Chinese, and 1.0 for languages with small character sets. --byte_fallback=true is what makes the tokenizer closed-vocabulary like byte-level BPE: any character outside the kept set is decomposed into its UTF-8 bytes, which are guaranteed slots in the vocab. Note the quotes around the --user_defined_symbols value: unquoted, the shell would treat < and | as redirection and pipe operators.

Tiktoken — OpenAI’s BPE done in Rust

Tiktoken is OpenAI’s open-source BPE tokenizer, with the merge tables for the proprietary OpenAI models baked in. The repo is at github.com/openai/tiktoken. It is a vanilla byte-level BPE — algorithmically nothing new — but the implementation is a Rust core wrapped in Python, and the merge tables for cl100k_base and o200k_base are the only public source of truth for how GPT-4 and GPT-4o count tokens. If you need to predict your OpenAI bill or fit a prompt under a context-window cap, you use Tiktoken.

The encoder mapping in 2026:

Encoding	Vocab size	Models
`r50k_base` (gpt2)	50,257	GPT-2, GPT-3 davinci
`p50k_base`	50,281	code-davinci-002, text-davinci-003
`cl100k_base`	100,277	GPT-3.5-turbo, GPT-4, text-embedding-3-*
`o200k_base`	200,019	GPT-4o, GPT-4o-mini, o1, o3, o3-mini

A minimal usage example, straight from the OpenAI Cookbook token-counting recipe:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")
# or: enc = tiktoken.encoding_for_model("gpt-4o")

tokens = enc.encode("The quick brown fox jumps over the lazy dog.")
print(len(tokens), tokens)
# 10 [976, 5701, 19705, 39935, 65499, 1072, 290, 30099, 6446, 13]

decoded = enc.decode(tokens)
print(decoded)
# The quick brown fox jumps over the lazy dog.

# Reverse-lookup individual pieces for debugging
print([enc.decode_single_token_bytes(t) for t in tokens])
# [b'The', b' quick', b' brown', b' fox', b' jumps', b' over', b' the', b' lazy', b' dog', b'.']

Two implementation details matter when you scale this. First, Rust BPE with regex pre-tokenization clocks roughly 1–2 million tokens/sec/core on commodity hardware — about 4–6x faster than Hugging Face’s BPE on the same data, primarily because Tiktoken’s regex split is in fancy-regex and its merge step uses a custom hash table. Second, the regex pattern itself is the cl100k vs o200k difference as much as the vocab is — o200k_base uses a pattern that better preserves multi-character punctuation and numbers, giving roughly 10–15% fewer tokens on the same English text.

Tokenizer training — the knobs that matter

Training a tokenizer is a one-shot offline process, but the choices baked in are nearly impossible to change once the base model is pre-trained. Four decisions dominate.

Corpus sampling. The token frequencies in your tokenizer training corpus directly become merge priorities. If you sample 99% English Common Crawl and 1% Chinese, BPE will spend almost no merges on CJK characters and Chinese inference cost will be 2–3x English. The standard fix, used by mT5 among others, is exponent-smoothed (temperature-based) oversampling: p(lang) ∝ count(lang)^α with α around 0.3, equivalently a temperature T = 1/α of about 3.3. An exponent below 1 flattens the distribution, which boosts low-resource languages without letting them overwhelm high-resource ones.
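The effect of smoothing is easy to see on the 99%/1% example. A minimal sketch, assuming an exponent of 0.3 (the value the mT5 paper reports); the function name sampling_probs and the document counts are illustrative.

```python
def sampling_probs(counts: dict[str, int], alpha: float) -> dict[str, float]:
    """p(lang) proportional to count(lang) ** alpha; alpha=1 is raw frequency."""
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


counts = {"en": 990_000, "zh": 10_000}  # hypothetical document counts

raw = sampling_probs(counts, alpha=1.0)
smoothed = sampling_probs(counts, alpha=0.3)

print(f"raw:      zh = {raw['zh']:.3f}")       # raw:      zh = 0.010
print(f"smoothed: zh = {smoothed['zh']:.3f}")  # smoothed: zh = 0.201
```

Chinese goes from 1% to roughly 20% of the tokenizer's training sample, which is enough for BPE to spend real merges on it.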

Vocab size. The classic 2024 trade-off curves cross around 100k for English-dominant models and 200k+ for multilingual. Below 32k, you pay in sequence length. Above 256k, the input/output embedding matrices start to dominate parameter count for sub-7B models. For a 70B model the cost is negligible; for a 1B edge model it is the difference between fitting in 1GB and 1.4GB.

Number handling. --split_digits=true (SentencePiece) or its BPE equivalent forces every digit into its own token. The LLaMA papers describe splitting all numbers into individual digits, and the motivation is arithmetic: without digit splitting, the tokenizer learns “1024” as one token, “1025” as another, and the model never sees the place-value structure.
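Digit splitting is a pre-tokenization rule, so it can be sketched with a single regex. This is an illustration of the idea only, not SentencePiece's implementation; split_digits is a hypothetical helper.

```python
import re


def split_digits(text: str) -> list[str]:
    """Isolate every digit as its own span; keep non-digit runs intact."""
    return re.findall(r"\d|\D+", text)


print(split_digits("1024 + 1025"))
# ['1', '0', '2', '4', ' + ', '1', '0', '2', '5']
```

Because no span contains more than one digit, no merge can ever fuse "1024" into a single opaque token.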

Whitespace policy. Leading-space tokens are not interchangeable with their bare counterparts: ▁the vs the in SentencePiece, or " the" vs "the" in byte-level BPE, are different IDs. This is the root cause of the famous “trailing space breaks completion” bug: if you end your prompt with "Q: What is the capital of France? A: " (trailing space), the model picks from Paris-style tokens that lack a leading space; if you end with "A:" (no trailing space), it picks from ▁Paris-style tokens. The probabilities differ, and the trailing-space form is usually the rarer one in training data. Test both during eval.
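The trailing-space effect can be demonstrated without a real tokenizer. The sketch below uses a five-entry toy vocabulary and greedy longest-match segmentation as a stand-in for BPE (an assumption made to keep the example small); greedy_tokenize is a hypothetical helper.

```python
TOY_VOCAB = ["A", ":", " ", "Paris", " Paris"]


def greedy_tokenize(text: str, vocab: list[str]) -> list[str]:
    """Segment by always taking the longest vocabulary entry that matches."""
    pieces = sorted(vocab, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                out.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return out


# How the full string is normally tokenized (what training data looks like):
canonical = greedy_tokenize("A: Paris", TOY_VOCAB)

# Prompt ending in a space: the space is already consumed as its own token,
# so the model must continue with the bare "Paris" token.
with_trailing_space = greedy_tokenize("A: ", TOY_VOCAB) + ["Paris"]

print(canonical)            # ['A', ':', ' Paris']
print(with_trailing_space)  # ['A', ':', ' ', 'Paris']
```

Both sequences decode to the same string, but only the first matches the way the text was tokenized during training.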

Multilingual gotchas — the Tamil tax

A single tokenizer choice can introduce a 6x cost difference across languages for users running the same product. This is not a corner case — it is the median experience for non-Latin-script users of GPT-class APIs. The phenomenon was quantified by Petrov et al. 2023, “Language Model Tokenizers Introduce Unfairness Between Languages”, and the gap has narrowed but not closed in o200k_base.

A reproducible measurement using Tiktoken o200k_base on the same sentence:

Language	String	Tokens	Ratio vs English
English	“The quick brown fox jumps over the lazy dog”	9	1.0x
Spanish	“El zorro marrón rápido salta sobre el perro perezoso”	11	1.2x
German	“Der schnelle braune Fuchs springt über den faulen Hund”	13	1.4x
Chinese (Simplified)	“敏捷的棕色狐狸跳过懒狗”	18	2.0x
Hindi (Devanagari)	“तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है”	41	4.6x
Tamil	“விரைவான பழுப்பு நரி சோம்பேறி நாயின் மீது குதிக்கிறது”	54	6.0x

The mechanism is straightforward: BPE merges only happen for byte sequences that appeared frequently in training. If Tamil was 0.01% of training data, the only merges learned for Tamil bytes are very common bigrams, and most characters fall back to per-byte UTF-8 encoding (3 bytes per character for the Tamil Unicode block U+0B80–U+0BFF). At the API rate card, a Tamil customer pays 6x for the same conversation as an English one, and they hit context-window limits 6x sooner.
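The per-byte fallback cost can be measured directly from the UTF-8 encoding. This sketch computes bytes per character for the word "dog" in several scripts; it shows the worst case a byte-level tokenizer can hit for each script, not actual token counts. The function name utf8_bytes_per_char is hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "dog",
    "Chinese": "狗",
    "Hindi": "कुत्ता",
    "Tamil": "நாய்",
}

for language, word in samples.items():
    print(f"{language:8s} {utf8_bytes_per_char(word):.1f} bytes/char")
# English  1.0 bytes/char
# Chinese  3.0 bytes/char
# Hindi    3.0 bytes/char
# Tamil    3.0 bytes/char
```

Chinese, Hindi, and Tamil start from the same 3-byte baseline; what separates them in the table above is how many merges each script earned during tokenizer training.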

The 2026 mitigations are partial. o200k_base cut the Hindi penalty from ~8x (cl100k_base) to ~4.6x by including more Devanagari in the merge training. Gemma 3’s tokenizer reportedly uses 256k vocab specifically to keep Indic and African languages under 3x. The only true fix is either training your own SentencePiece on a balanced corpus, or moving to a byte-latent architecture (next section).

Practical impacts — your bill, your context, your KV cache

Token count is the single line item on your inference bill. It also dictates two other things engineers under-account for: context-window headroom and KV-cache memory. Cutting average tokens per request by 15% is a 15% bill cut, a 15% latency cut (because attention is O(N^2) in input length and KV-cache I/O is O(N) per output token), and 15% more headroom before you hit the model’s hard limit.

The math, made concrete for a chatbot serving 1M requests/day with average 1500 input tokens and 500 output tokens, on GPT-4o pricing ($2.50/M input, $10.00/M output):

daily_input_cost  = 1_000_000 * 1500 * $2.50 / 1_000_000 = $3,750
daily_output_cost = 1_000_000 *  500 * $10.00 / 1_000_000 = $5,000
daily_total       = $8,750  ->  monthly $262,500

Move the same prompts from a cl100k_base model to an o200k_base model at the same per-token prices (roughly 12% fewer tokens on mixed English text) and the bill drops to ~$231,000/month: about $31k/month saved by the tokenizer alone. The encoding is fixed by the model, so changing the argument you pass to Tiktoken only changes your estimate, not the invoice. Moving the system prompt and chat history into prompt caching on Claude 4.6 (cached tokens at $0.30/M, a 10x discount) cuts the input line further. Tokenization is not a back-office concern.
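The bill arithmetic above fits in one function, which makes it easy to rerun with your own traffic. The prices and volumes are the ones quoted in the text; monthly_cost is a hypothetical helper and the 30-day month is an assumption.

```python
def monthly_cost(
    requests_per_day: int,
    input_tokens: float,
    output_tokens: float,
    input_price_per_m: float,
    output_price_per_m: float,
    days: int = 30,
) -> float:
    """Monthly spend in dollars for a fixed per-request token profile."""
    per_request = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return requests_per_day * per_request * days


baseline = monthly_cost(1_000_000, 1500, 500, 2.50, 10.00)
# Same traffic with 12% fewer tokens on both input and output.
reduced = monthly_cost(1_000_000, 1500 * 0.88, 500 * 0.88, 2.50, 10.00)

print(f"baseline: ${baseline:,.0f}")          # baseline: $262,500
print(f"reduced:  ${reduced:,.0f}")           # reduced:  $231,000
print(f"saved:    ${baseline - reduced:,.0f}")  # saved:    $31,500
```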

Context windows scale the same way. Claude 4.6 Sonnet’s 200k context is 200k Anthropic tokens, not 200k OpenAI tokens, and Anthropic’s tokenizer is closer to GPT-3.5 in efficiency than to o200k_base. Gemini 2.0 Pro’s 2M context is huge in tokens, but if Tamil pays a 6x penalty there too, a 2M-token Tamil conversation carries about as much text as 330k English tokens. Always quote context budgets in characters or bytes when comparing across providers.

The KV cache effect is the third axis. For a 70B model with d_model=8192 and n_layers=80, and assuming full multi-head attention, the KV-cache footprint is roughly 2 * n_layers * d_model * 2 bytes (fp16) = 2.6 MB per token. Grouped-query attention, which many models of this size use, shrinks that figure by the ratio of query heads to KV heads. At the full-attention figure, a 100k-token context allocates 260 GB of cache across the GPU fleet; cutting that to 85k tokens (15% tokenizer win) frees 39 GB, which you can spend on batch size. For the gritty details on how production inference servers manage that cache, see our KV cache optimization guide for LLM inference.
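The KV-cache figures follow from one multiplication. This sketch uses the shapes from the text and assumes full multi-head attention with fp16 values; kv_bytes_per_token is a hypothetical helper.

```python
def kv_bytes_per_token(n_layers: int, d_model: int, bytes_per_value: int = 2) -> int:
    """Keys + values, for every layer, for one token (full multi-head attention)."""
    return 2 * n_layers * d_model * bytes_per_value


per_token = kv_bytes_per_token(n_layers=80, d_model=8192)
full_context = per_token * 100_000
freed = per_token * (100_000 - 85_000)

print(f"per token:    {per_token / 1e6:.1f} MB")   # per token:    2.6 MB
print(f"100k context: {full_context / 1e9:.0f} GB")  # 100k context: 262 GB
print(f"freed at 85k: {freed / 1e9:.0f} GB")       # freed at 85k: 39 GB
```

With grouped-query attention you would replace d_model with n_kv_heads * head_dim, which divides every number above by the query-to-KV head ratio.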

Vocab size also affects throughput differently across architectures. The mixture-of-experts architecture deep dive covers the case where the output softmax over a 200k vocab becomes a measurable fraction of the per-token compute on MoE models, because the expert routing reduces FFN cost but does nothing for the LM head. And if you’re running benchmarks across vLLM, SGLang, and TensorRT-LLM, you’ll want our H100 inference framework benchmark — tokenizer overhead shows up there as a 2–5% throughput delta between Rust-backed and Python-backed encoders.

Trade-offs and failure modes

No tokenizer is uniformly correct. Choose for your workload and know the failure modes.

BPE failure modes. Greedy merges produce sub-optimal segmentations for OOD text: a domain-specific word like "glomerulonephritis" may shatter into 8 tokens even when more efficient segmentations exist in the vocab, because the greedy merge order doesn’t see them. BPE also has the “glitch token” problem: tokens like SolidGoldMagikarp exist in the GPT-2/GPT-3 vocabulary (r50k_base) because the string (a Reddit username) appeared frequently in the tokenizer’s training corpus, but the model itself almost never saw it in real context, so prompting with it triggers undefined behaviour.

SentencePiece / Unigram failure modes. Unigram is computationally heavier to train (EM iterations), and the probabilistic segmentation makes some downstream tasks (e.g., exact-string retrieval, span extraction) harder. The ▁ underscore convention breaks naive string-matching code: searching for "hello" in the token stream misses ▁hello. Always decode before matching.

Tiktoken failure modes. It is locked to OpenAI’s merge tables. You cannot train a new vocabulary with Tiktoken; for that you need tokenizers (HF) or sentencepiece (Google). Tiktoken also has no streaming decode API that handles partial UTF-8 cleanly — you have to buffer until you get a complete code point, which is a footgun in chat UIs.

Byte-level BPE failure modes. Cost penalty on non-Latin scripts (the Tamil tax above). Also: byte sequences that don’t form valid UTF-8 can be generated by the model and crash naive decoders. Use errors="replace" or errors="surrogateescape" in decode.
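Both decoding footguns (partial UTF-8 while streaming, and invalid byte sequences) can be handled with the standard library's incremental decoder. A minimal sketch: the byte chunks below stand in for the per-token bytes a tokenizer would hand you, and stream_decode is a hypothetical helper.

```python
import codecs
from collections.abc import Iterable, Iterator


def stream_decode(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text as soon as complete code points are available."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)  # incomplete sequences stay buffered
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush anything left over
    if tail:
        yield tail


# The Tamil letter U+0BA4 is the bytes E0 AE A4, split here across two chunks.
chunks = [b"\xe0\xae", b"\xa4", b"!"]
print(list(stream_decode(chunks)))  # ['த', '!']

# An invalid byte becomes U+FFFD instead of raising UnicodeDecodeError.
print(b"ok\xff".decode("utf-8", errors="replace"))  # ok�
```

The first chunk produces no output at all, which is the behaviour a chat UI wants: nothing is rendered until the character is complete.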

Character-level failure modes. Sequence length explodes. A 2000-character document is a 2000-token sequence, and attention is O(N^2). Not viable for general-purpose LLMs at current architectures.

Byte-latent / tokenizer-free failure modes. Meta’s Byte Latent Transformer (BLT) paper shows that dynamic byte-patching can match BPE on perplexity at equivalent compute, but the routing network adds engineering complexity and the inference stack is research-grade. BLT is the credible future, not the production present, in 2026.

Practical recommendations

A short checklist for the next time you touch tokenization in production.

Always compute token counts with the exact tokenizer the model will use. len(text.split()) is not a token count. tiktoken.encoding_for_model("gpt-4o") is.
For non-English workloads, measure the per-language token ratio against your billing assumptions. Budget for 2–3x English on CJK, 4–6x on Indic scripts unless you confirm otherwise.
Normalize Unicode before encoding if you do any string comparison post-tokenization. NFC for general use, NFKC for search/dedup.
Pin the tokenizer version with your model version. tiktoken upgrades have changed counts before. Cache the encoder object; constructing it parses the merge table.
Reserve special tokens before pre-training. Adding them later means partially-trained embeddings and pain in tool-use evals.
Profile tokenizer overhead on the hot path. For very short prompts at very high QPS, Python-level tokenizer construction or string copying can dominate model latency. Use the Rust backends.
Test your prompts with and without trailing whitespace. This is the single highest-yield prompt-engineering check and most teams skip it.
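For multilingual products where you control pre-training, train a custom SentencePiece on a balanced corpus rather than inheriting an English-centric vocabulary; a tokenizer cannot be swapped under a model that is already pre-trained. The 2–4x cost improvement pays for the engineering in weeks.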

FAQ

What is the difference between BPE, SentencePiece, and Tiktoken?

BPE (Byte-Pair Encoding) is the underlying algorithm — greedy merging of frequent adjacent pairs. SentencePiece is Google’s tokenizer library that offers BPE and Unigram LM modes, treats spaces as ordinary characters, and skips pre-tokenization (great for multilingual and CJK). Tiktoken is OpenAI’s Rust implementation of byte-level BPE with the specific merge tables (cl100k_base, o200k_base) used by GPT-3.5, GPT-4, GPT-4o, o1, and o3. You use Tiktoken to count and predict tokens against OpenAI APIs; you use SentencePiece or HF Tokenizers to train your own vocabulary.

How many tokens is one English word on GPT-4o?

For typical English prose tokenized with o200k_base, the ratio is approximately 0.75 words per token, or equivalently ~1.33 tokens per word. A 1000-word article is therefore around 1300 tokens. Code, URLs, JSON, and non-English text shift the ratio significantly — code can hit 2–3 tokens per “word” because of punctuation, and Tamil hits 6 tokens per English-equivalent word. Always measure with tiktoken.encoding_for_model("gpt-4o").encode(text) for an exact count.

Why does adding a trailing space change my model’s output?

Because tokens with a leading space (▁the in SentencePiece, " the" in byte-level BPE) are different IDs from tokens without one (the). If your prompt ends with a space, the model’s next-token distribution is over tokens that lack a leading space, biasing toward word continuations. If your prompt ends without a space, the next token will likely have a leading space, biasing toward fresh words. The completion API behaviour is well-documented; the chat completions API hides this with templating but it can still leak through in tool-call arguments. Test both.

Can I change a model’s tokenizer after pre-training?

Practically, no — not without retraining. The tokenizer determines the input embedding matrix and the output softmax weights, so swapping tokenizers means re-initializing both and effectively re-pre-training. There is research on tokenizer transplantation (e.g., zero-shot tokenizer transfer via embedding alignment), but accuracy drops significantly. You can extend a tokenizer with added special tokens — the embeddings for those tokens start untrained and learn during fine-tuning — but you cannot replace it wholesale.

Are tokenizer-free models like Byte Latent Transformer ready for production?

Not in 2026. Meta’s BLT paper (December 2024, updated 2025) demonstrates that dynamic byte patching can match BPE at equivalent FLOPs and beat it on robustness to noisy input, but published models top out at the 8B scale and the inference stack lacks the maturity of vLLM/SGLang/TensorRT-LLM. Expect BLT-style approaches to ship in research releases through 2026 and start appearing in production frontier models by 2027–2028 once the patching network is hardened and the inference servers catch up.

Does Claude use the same tokenizer as GPT-4?

No. Anthropic’s tokenizer is proprietary and is closer to GPT-3.5’s cl100k_base in efficiency than to OpenAI’s newer o200k_base. For the same English text, Claude typically reports 5–15% more tokens than GPT-4o. This matters when comparing context-window claims (Claude 4.6 Sonnet 200k vs GPT-4o 128k) and per-token pricing — always normalize on the same input string with each provider’s official tokenizer or token-counting endpoint before drawing a cost conclusion.
