I built a tool to benchmark tokenizers across 100+ languages and found some wild disparities [R]


TL;DR: Created tokka-bench to compare tokenizers across languages. Turns out your fine-tune's multilingual performance might suck because of tokenization, not architecture. It may also help explain why proprietary models (Claude, GPT, Gemini) are so much better at non-English tasks.

The Problem Nobody Talks About

I started this as a side quest while pretraining a multilingual model, but tokenization turned out to be way more important than expected. There are two hidden factors creating massive efficiency gaps:

UTF-8 encoding differences:

  • English: ~1 byte per character
  • Arabic: 2+ bytes per character
  • Chinese: 3+ bytes per character
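
You can sanity-check those byte counts with nothing but the Python standard library:

```python
# UTF-8 byte cost per character varies by script.
for ch in ("a", "ع", "中", "ក"):  # Latin, Arabic, Chinese, Khmer
    print(f"U+{ord(ch):04X} {ch!r}: {len(ch.encode('utf-8'))} byte(s)")

# Output:
# U+0061 'a': 1 byte(s)
# U+0639 'ع': 2 byte(s)
# U+4E2D '中': 3 byte(s)
# U+1780 'ក': 3 byte(s)
```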

Tokenization bias: Most tokenizers are trained on English-heavy data, so they allocate way more vocabulary to English patterns. The two effects compound into serious problems.

Why This Affects Performance

During training: If you allocate tokens proportionally (10M English, 1M Khmer), the Khmer text carries WAY less semantic content because it needs more tokens per word: at, say, ~4 tokens per word for Khmer versus ~1.3 for English, 1M Khmer tokens cover roughly 250K words while 10M English tokens cover nearly 8M. Plus Khmer tokens end up being character-level fragments instead of semantic units, making concept storage much harder.

During inference: Low-resource languages need 2-3x more tokens per sentence (see the sketch after this list), which means:

  • Slower throughput (costs more to serve)
  • Context windows fill up faster
  • More chances to mess up during generation
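
To make that concrete, here's a minimal sketch using a Hugging Face tokenizer. The model ID and sample sentences are just illustrations on my part, not what tokka-bench ships with:

```python
# Minimal token-count comparison; assumes `pip install transformers`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # swap in any tokenizer you like

samples = {
    "English": "The weather is nice today.",
    "Arabic": "الطقس جميل اليوم.",  # rough translation of the above
    "Chinese": "今天天气很好。",      # rough translation of the above
}

for lang, text in samples.items():
    n_tokens = len(tok.encode(text))
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang:8s} {n_tokens:3d} tokens  {n_bytes:3d} bytes  "
          f"{n_bytes / n_tokens:.2f} bytes/token")
```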

What I Built

tokka-bench measures four key things:

  1. Efficiency – bytes per token (compression quality)
  2. Coverage – unique tokens used (script representation)
  3. Word splitting – how often semantic units get fragmented
  4. Subword fertility – average tokens per semantic unit
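
In rough Python, the four metrics boil down to something like this (a simplified sketch, not the repo's actual implementation; naive whitespace splitting stands in for proper, script-aware word segmentation):

```python
from transformers import AutoTokenizer

def tokenizer_metrics(tok, text: str) -> dict:
    ids = tok.encode(text)
    words = text.split()  # naive; scripts without spaces need real segmentation
    # Encoding words in isolation ignores leading-space merge rules in some
    # BPE tokenizers, so treat the word-level numbers as approximations.
    per_word = [tok.encode(w, add_special_tokens=False) for w in words]
    return {
        "bytes_per_token": len(text.encode("utf-8")) / len(ids),  # efficiency
        "unique_tokens": len(set(ids)),                           # coverage
        "split_word_rate": sum(len(p) > 1 for p in per_word) / len(words),
        "fertility": sum(len(p) for p in per_word) / len(words),  # tokens/word
    }

print(tokenizer_metrics(AutoTokenizer.from_pretrained("gpt2"),
                        "Tokenizers treat different scripts very differently."))
```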

Interesting Findings

You can actually reverse-engineer training data from tokenizer performance:

  • Kimi K2: Exceptional Mandarin coverage (obviously Chinese-trained)
  • Gemma 3: Strong Urdu/Hindi performance
  • gpt-oss: Good Arabic/Gujarati coverage

Weirdest finding: Programming languages show almost identical efficiency across all tokenizers. Probably because everyone trains on GitHub with similar language distributions.

Technical Details

Built on high-quality datasets (FineWeb, FineWeb-2, StarCoder). It samples 2MB of text per language and calculates per-language metrics. Cross-linguistic comparisons come with caveats because of the UTF-8 differences above, but it's great for comparing tokenizers on the same language.
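
For flavor, sampling a fixed byte budget from a streaming dataset looks something like this; the dataset ID and config name below are my guesses, not necessarily what tokka-bench actually uses:

```python
# Assumes `pip install datasets`; dataset/config names are assumptions.
from datasets import load_dataset

def sample_language(dataset_id: str, config: str,
                    byte_budget: int = 2_000_000) -> str:
    """Stream documents until ~byte_budget UTF-8 bytes are collected."""
    stream = load_dataset(dataset_id, config, split="train", streaming=True)
    chunks, total = [], 0
    for row in stream:
        chunks.append(row["text"])
        total += len(row["text"].encode("utf-8"))
        if total >= byte_budget:
            break
    return "\n".join(chunks)

# e.g. khmer_sample = sample_language("HuggingFaceFW/fineweb-2", "khm_Khmr")
```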

Shoutout to Judit Ács for the original subword fertility metrics and to Rust et al.'s ACL paper that laid the groundwork.

PS: if you're from an AI lab and want to contribute your tokenizer's metrics (even if proprietary), please reach out! The community would benefit a lot from understanding how SOTA systems handle this stuff.

Posted this on LinkedIn/Twitter already but figured r/MachineLearning would appreciate the technical details. Happy to answer questions about methodology or findings!
