I built a tool to benchmark tokenizers across 100+ languages and found some wild disparities [R]


TL;DR: Created tokka-bench to compare tokenizers across languages. Turns out your fine-tune's multilingual performance might suck because of tokenization, not architecture. It may also help explain why proprietary models (Claude, GPT, Gemini) are so much better at non-English tasks.

The Problem Nobody Talks About

I started this as a side quest while pretraining a multilingual model, but tokenization turned out to be way more important than expected. There are two hidden factors creating massive efficiency gaps:

UTF-8 encoding differences:

  • English: ~1 byte per character
  • Arabic: 2+ bytes per character
  • Chinese: 3+ bytes per character
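
You can sanity-check those byte counts with nothing but the Python standard library:

```python
# UTF-8 byte cost per character varies by script.
for ch in ("a", "ع", "中", "ក"):  # Latin, Arabic, Chinese, Khmer
    print(f"U+{ord(ch):04X} {ch!r}: {len(ch.encode('utf-8'))} byte(s)")

# Output:
# U+0061 'a': 1 byte(s)
# U+0639 'ع': 2 byte(s)
# U+4E2D '中': 3 byte(s)
# U+1780 'ក': 3 byte(s)
```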

Tokenization bias: Most tokenizers are trained on English-heavy data, so they allocate way more vocabulary to English patterns. The two effects compound into serious problems.

Why This Affects Performance

During training: If you allocate tokens proportionally (10M English, 1M Khmer), the Khmer text carries WAY less semantic content because it needs more tokens per word: at, say, ~4 tokens per word for Khmer versus ~1.3 for English, 1M Khmer tokens cover roughly 250K words while 10M English tokens cover nearly 8M. Plus Khmer tokens end up being character-level fragments instead of semantic units, making concept storage much harder.

During inference: Low-resource languages need 2-3x more tokens per sentence (see the sketch after this list), which means:

  • Slower throughput (costs more to serve)
  • Context windows fill up faster
  • More chances to mess up during generation
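
To make that concrete, here's a minimal sketch using a Hugging Face tokenizer. The model ID and sample sentences are just illustrations on my part, not what tokka-bench ships with:

```python
# Minimal token-count comparison; assumes `pip install transformers`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # swap in any tokenizer you like

samples = {
    "English": "The weather is nice today.",
    "Arabic": "الطقس جميل اليوم.",  # rough translation of the above
    "Chinese": "今天天气很好。",      # rough translation of the above
}

for lang, text in samples.items():
    n_tokens = len(tok.encode(text))
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang:8s} {n_tokens:3d} tokens  {n_bytes:3d} bytes  "
          f"{n_bytes / n_tokens:.2f} bytes/token")
```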

What I Built

tokka-bench measures four key things:

  1. Efficiency – bytes per token (compression quality)
  2. Coverage – unique tokens used (script representation)
  3. Word splitting – how often semantic units get fragmented
  4. Subword fertility – average tokens per semantic unit
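
In rough Python, the four metrics boil down to something like this (a simplified sketch, not the repo's actual implementation; naive whitespace splitting stands in for proper, script-aware word segmentation):

```python
from transformers import AutoTokenizer

def tokenizer_metrics(tok, text: str) -> dict:
    ids = tok.encode(text)
    words = text.split()  # naive; scripts without spaces need real segmentation
    # Encoding words in isolation ignores leading-space merge rules in some
    # BPE tokenizers, so treat the word-level numbers as approximations.
    per_word = [tok.encode(w, add_special_tokens=False) for w in words]
    return {
        "bytes_per_token": len(text.encode("utf-8")) / len(ids),  # efficiency
        "unique_tokens": len(set(ids)),                           # coverage
        "split_word_rate": sum(len(p) > 1 for p in per_word) / len(words),
        "fertility": sum(len(p) for p in per_word) / len(words),  # tokens/word
    }

print(tokenizer_metrics(AutoTokenizer.from_pretrained("gpt2"),
                        "Tokenizers treat different scripts very differently."))
```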

Interesting Findings

You can actually reverse-engineer training data from tokenizer performance:

  • Kimi K2: Exceptional Mandarin coverage (obviously Chinese-trained)
  • Gemma 3: Strong Urdu/Hindi performance
  • gpt-oss: Good Arabic/Gujarati coverage

Weirdest finding: Programming languages show almost identical efficiency across all tokenizers. Probably because everyone trains on GitHub with similar language distributions.

Technical Details

Built on high-quality datasets (FineWeb, FineWeb-2, StarCoder). It samples 2MB of text per language and calculates per-language metrics. Cross-linguistic comparisons come with caveats because of the UTF-8 differences above, but it's great for comparing tokenizers on the same language.
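
For flavor, sampling a fixed byte budget from a streaming dataset looks something like this; the dataset ID and config name below are my guesses, not necessarily what tokka-bench actually uses:

```python
# Assumes `pip install datasets`; dataset/config names are assumptions.
from datasets import load_dataset

def sample_language(dataset_id: str, config: str,
                    byte_budget: int = 2_000_000) -> str:
    """Stream documents until ~byte_budget UTF-8 bytes are collected."""
    stream = load_dataset(dataset_id, config, split="train", streaming=True)
    chunks, total = [], 0
    for row in stream:
        chunks.append(row["text"])
        total += len(row["text"].encode("utf-8"))
        if total >= byte_budget:
            break
    return "\n".join(chunks)

# e.g. khmer_sample = sample_language("HuggingFaceFW/fineweb-2", "khm_Khmr")
```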

Shoutout to Judit Ács for the original subword fertility metrics and to Rust et al.'s ACL paper that laid the groundwork.

PS: if you're from an AI lab and want to contribute your tokenizer's metrics (even if proprietary), please reach out! The community would benefit a lot from understanding how SOTA systems handle this stuff.

Posted this on LinkedIn/Twitter already but figured r/MachineLearning would appreciate the technical details. Happy to answer questions about methodology or findings!
