Ask a frontier model to count the letter 'r' in "strawberry." Often you'll get two. Sometimes three. Rarely with consistent accuracy across a hundred trials. This isn't a bug somebody hasn't fixed. It's a direct consequence of how text enters the model in the first place, and it explains a dozen other strange behaviours that cluster around counting, spelling, and arithmetic.
Language models don't read text. They read tokens, which are integers that index a lookup table. The algorithm most of them use to build that table is called Byte-Pair Encoding, or BPE. It was originally designed as a data compression scheme. Philip Gage wrote it in 1994 for compressing files. Sennrich, Haddow, and Birch adapted it for machine translation in 2016, and it's been the default across most major model families since.
The idea is simple. Start with individual characters. Find the most frequent adjacent pair in your training corpus. Merge them into a new token. Find the next most frequent pair. Merge. Repeat for tens of thousands of iterations. The result is a vocabulary full of common words and common subword pieces. "The" is one token. "ing" is one token. "Strawberry" might split into ["straw", "berry"]. Rare words fragment into more pieces.
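That merge loop fits in a few lines of Python. This is a toy trainer on a toy corpus, not a production tokenizer, but it is the same count-merge-repeat procedure:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a toy corpus: repeatedly merge the most
    frequent adjacent pair of symbols. A sketch, not a real tokenizer."""
    # Represent each word as a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges, words

merges, words = train_bpe("straw berry straw straw berry berry straw strawberry", 4)
```

After four merges on this corpus, "straw" has fused into a single symbol while "berry" is still five characters: frequent strings earn whole tokens, rarer ones stay fragmented.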
Once the model is trained, it sees "strawberry" as two integers, not ten characters. No mechanism inside the transformer can reach inside a token to ask how many r's it contains. The letters are sealed inside the token the way pages are sealed inside a book you can only see the cover of. The model has, statistically, learned that strawberry contains three r's. It just hasn't learned it from the token sequence. It's learned it from surrounding text that happened to mention the fact. That knowledge is fragile, and it decays on uncommon words.
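To make that concrete, here is a toy illustration (the vocabulary and token IDs are made up) of what the model actually receives versus what a character count needs:

```python
# Hypothetical vocabulary with made-up IDs. The model's input is the
# integer list, not the strings.
vocab = {"straw": 3504, "berry": 19772}
inv_vocab = {v: k for k, v in vocab.items()}

tokens = [3504, 19772]          # what the transformer sees for "strawberry"

# Counting 'r' requires decoding back to characters -- a step that
# happens outside the model, never inside it.
decoded = "".join(inv_vocab[t] for t in tokens)
assert decoded == "strawberry"
assert decoded.count("r") == 3  # trivial at the character level, opaque at the token level
```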
The arithmetic failures come from the same place. Numbers don't tokenize uniformly. "480" might be one token. "481" might be two. A four-digit number can split one way and the same digits rearranged can split another. Researchers using arithmetic as a diagnostic have found that when an answer has more digits than either input, accuracy on certain tasks collapses to under 10%. The model isn't bad at maths. It's being handed digit sequences in a shape it wasn't trained to work with.
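A toy greedy longest-match tokenizer over a hypothetical vocabulary shows how two adjacent numbers can come out in different shapes. The vocabulary here is invented purely to demonstrate the effect; real merge tables differ, but the non-uniformity is the same:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization: at each position, take the
    longest substring present in the vocabulary. A sketch of the effect,
    not how BPE inference actually applies merges."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to itself
            i += 1
    return tokens

# Hypothetical vocabulary in which "480" merged during training but "481" didn't.
vocab = {"480", "48", "4", "8", "1"}
assert greedy_tokenize("480", vocab) == ["480"]      # one token
assert greedy_tokenize("481", vocab) == ["48", "1"]  # two tokens, a different shape
```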
The fix, in principle, is byte-level tokenization. Every byte becomes a token. No merging, no hidden letters. The tradeoff is sequence length. The same passage takes more tokens, sometimes many more. That means more compute, longer context windows, slower inference. GPT-2's tokenizer works over raw bytes rather than Unicode characters, but it still merges them, so letters stay hidden inside frequent tokens; models like ByT5 went fully byte-level and paid the sequence-length cost. Recent models use hybrid approaches: BPE for efficiency, special handling for digits, sometimes per-character processing inside the chain-of-thought. None of it is free.
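The length cost is easy to see directly. In the sketch below, the 4-characters-per-token figure is a rough rule of thumb for English BPE, not a property of any specific tokenizer:

```python
# Byte-level tokenization: every UTF-8 byte is a token, so nothing is
# hidden, but sequences grow relative to BPE.
passage = "Ask a frontier model to count the letter 'r' in strawberry."
byte_tokens = list(passage.encode("utf-8"))

# Rough rule of thumb for English BPE: ~4 characters per token.
approx_bpe_tokens = len(passage) / 4

assert len(byte_tokens) == len(passage)        # ASCII: one byte per character
assert all(0 <= b < 256 for b in byte_tokens)  # the whole vocabulary is 256 entries
assert len(byte_tokens) > 3 * approx_bpe_tokens
```

Every letter is now visible to the model, at roughly four times the sequence length.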
What strikes me is how much of the model's apparent cognitive style is downstream of this one preprocessing choice. The blind spot for letters isn't a reasoning failure. It's the data format not including letters, most of the time. Change the tokenizer and you change what the model can notice.
The cost-per-task shift playing out across the industry is partly a reasoning-modes story, but it's also a tokenization story. The reasoning variants often break digits into individual character tokens during arithmetic steps, which inflates token counts for the same sum. The reason your cheap tier struggles with long division is the same reason your expensive tier's bill keeps climbing when you ask it to do the job properly. Both are paying, differently, for the fact that the default encoding hides the thing you want the model to count.
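A back-of-the-envelope comparison makes the inflation visible. Both tokenizations below are hypothetical stand-ins (whitespace splitting for the coarse case, one character per token for the per-digit case), but the ratio is the point:

```python
# The same arithmetic step, costed two ways.
expr = "123456 + 654321 = 777777"

coarse = expr.split()                             # hypothetical coarse tokenization
per_digit = [c for c in expr if not c.isspace()]  # one token per character

assert len(coarse) == 5        # each number and operator is one token
assert len(per_digit) == 20    # every digit billed separately
assert len(per_digit) == 4 * len(coarse)
```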
Sources:

- BPE vs Byte-level Tokenization: Why LLMs Struggle with Counting — SOTAAZ
- The "Strawberry R Counting" Problem in LLMs: Causes and Solutions — secwest.net
- Tokenizer Arithmetic: The Hidden Layer That Bites You in Production — Tian Pan
- How LLM Tokenization Actually Works Under the Hood — Let's Data Science
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations — arXiv