Why Google AI Still Fails at Simple Letter Counting

Artificial intelligence has made remarkable progress in reasoning, coding, summarization, and multimodal understanding. Yet modern large language models (LLMs) continue to fail at surprisingly trivial tasks.

Classic examples have become internet folklore:

Failing to count how many times the letter r appears in strawberry
Refusing to output known celebrity names
Generating confident hallucinations for simple factual prompts

Now, the same limitation has surfaced inside Google Search itself.

After Google integrated AI more deeply into Search through AI Overviews and AI Mode, users discovered a bizarre failure mode: asking Google Search “How many Ps are in the word google?” often produced incorrect answers.

Even more strangely, the AI occasionally compounded the error with further false claims, such as asserting that the word Pixel contains two Ps.

The issue quickly went viral because the failure involved Google misinterpreting its own brand name. However, the underlying problem extends far beyond Google Search. It exposes one of the deepest structural weaknesses shared by nearly every modern LLM architecture.

🔍 Google’s AI-Centric Search Overhaul

To understand why this failure matters, it is important to examine how Google recently redesigned Search around generative AI.

At Google I/O 2026, the company announced what it described as the largest transformation of Search in 25 years. The traditional search experience was restructured around a new AI-first workflow combining:

AI Overviews
AI Mode
Conversational follow-up interactions
Direct answer generation

Instead of acting primarily as a gateway to web pages, Search increasingly behaves like a conversational AI assistant.

Traditional blue links still exist, but they are no longer the centerpiece of the interaction model.

Why Google Made the Shift

The move was largely driven by pressure from AI-native competitors such as:

ChatGPT
Perplexity
Claude-powered search systems

Google’s challenge was no longer simply indexing information. It now had to compete on reasoning, summarization, and conversational retrieval.

However, embedding LLMs directly into Search dramatically raises user expectations for accuracy.

When users interact with standalone chatbots, occasional hallucinations are often tolerated. Search engines operate under a different standard. For decades, Google Search has represented authoritative factual retrieval. Once AI-generated responses appear with the same level of authority as indexed search results, even minor mistakes become highly visible.

This context shift magnified what would otherwise have been dismissed as a harmless AI quirk.

🧠 Why LLMs Cannot Reliably Count Letters

The core issue is surprisingly simple:

LLMs do not actually read text character by character the way humans do.

Humans naturally parse words as sequences of letters:

G - O - O - G - L - E

Language models process text differently.

Tokens, Not Characters

Modern LLMs operate using tokens, not letters.

A token is a chunk of language that may represent:

An entire word
Part of a word
Multiple words
Common subword fragments

For example, the word strawberry may be tokenized like this:

["str", "aw", "berry"]

To the model, the original character structure is partially abstracted away.

When asked:

“How many r’s are in strawberry?”

the model must internally reconstruct hidden character-level information from tokenized fragments. Since models are not explicitly optimized for symbolic character manipulation, they frequently fail.

The same problem occurs with words like Google, which may exist as a single high-frequency token in the vocabulary.

In effect, the AI sees a compressed semantic identifier rather than a sequence of letters.
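A minimal sketch can make the token view concrete. The vocabulary, the integer IDs, and the greedy longest-match rule below are all assumptions chosen for illustration; production tokenizers learn much larger vocabularies and may split these words differently.

```python
# Hypothetical toy vocabulary; real tokenizers learn far larger ones,
# and their splits for these words may differ.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103, "google": 104}
# Single-character fallbacks so any lowercase ASCII word can still be encoded.
TOY_VOCAB.update({ch: 200 + i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")})


def tokenize(text: str) -> list[str]:
    """Greedy longest-match split of `text` into pieces found in TOY_VOCAB."""
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(text), pos, -1):
            if text[pos:end] in TOY_VOCAB:
                pieces.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos]!r}")
    return pieces


def encode(text: str) -> list[int]:
    """Map text to the integer IDs a model would actually receive."""
    return [TOY_VOCAB[piece] for piece in tokenize(text.lower())]


if __name__ == "__main__":
    print(tokenize("strawberry"), encode("strawberry"))
    print(tokenize("google"), encode("google"))
```

In this toy setup, strawberry becomes three IDs and google becomes one. Nothing in those integers says how many r's or p's the original word contained.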

Meaning vs. Structure

LLMs are trained to predict the next token, which rewards modeling semantic relationships between tokens rather than analyzing the textual structure inside them.

This distinction is critical.

Modern transformers excel at:

Language prediction
Semantic reasoning
Context completion
Pattern association

They are comparatively weak at:

Character counting
Precise spelling analysis
Symbolic manipulation
Deterministic logical tracking

As AI researcher Matthew Guzdial explains:

“When the model sees the word ‘the’, it receives the holistic encoding for ‘the’. It has no inherent awareness that it comprises the letters T, H, and E.”

This is not a bug unique to Google. It is a direct consequence of how current LLM architectures process language.

⚠️ Jagged Intelligence: Why AI Feels Inconsistently Smart

One of the most fascinating aspects of modern LLMs is their uneven capability distribution.

An AI model may:

Solve advanced mathematics
Generate production-ready code
Write research summaries
Pass professional exams

while simultaneously failing at:

Counting letters
Comparing simple shapes
Tracking symbolic relationships

Former OpenAI researcher Andrej Karpathy refers to this phenomenon as Jagged Intelligence.

The term describes how AI systems exhibit highly non-linear competence. Their strengths and weaknesses do not scale uniformly.

Why Step-by-Step Prompting Helps

Interestingly, many models can solve character-counting tasks when prompted carefully.

For example:

List every letter in the word first, then count the r's.

This often produces the correct answer.

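The reason is that spelling the word out places each letter in the context as its own item, so the model can count explicit symbols instead of guessing from a compressed representation. More generally, chain-of-thought prompting pushes the model into slower, deliberate reasoning instead of fast probabilistic guessing.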

This behavior mirrors the psychological distinction between:

Fast intuitive thinking (System 1)
Slow analytical reasoning (System 2)

Without explicit prompting, models frequently default to low-effort inference shortcuts.
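The spell-first procedure from the prompt in this section can be sketched in plain Python. This is only an illustration of the two steps a model is being asked to perform. The function names and the prompt template are hypothetical, and no real model API is called.

```python
def spell_then_count(word: str, letter: str) -> tuple[list[str], int]:
    """Return the explicit letter listing and the tally for `letter`."""
    letters = list(word.lower())  # step 1: spell the word out, one letter per item
    matches = [ch for ch in letters if ch == letter.lower()]  # step 2: keep the matches
    return letters, len(matches)


def build_prompt(word: str, letter: str) -> str:
    """Hypothetical prompt template asking a model to follow the same two steps."""
    return (
        f"List every letter in the word '{word}' first, one per line, "
        f"then count the {letter}'s."
    )


if __name__ == "__main__":
    letters, total = spell_then_count("strawberry", "r")
    print(letters)
    print(total)
    print(build_prompt("strawberry", "r"))
```

Once the letters exist as separate items, counting becomes a lookup over visible symbols. That is the effect the step-by-step prompt tries to induce.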

🤖 Why Google Search Received Stronger Backlash

The technical limitation itself is well-known inside AI research circles.

What changed was the environment.

When ChatGPT makes a mistake, users perceive it as an AI assistant error. When Google Search delivers a wrong answer directly inside the search results page, the implications feel much larger.

Search Engines Carry Different Expectations

Google Search historically represented:

Objective retrieval
Indexed factual accuracy
Verifiable references
Reliable navigation

Generative AI changes this paradigm.

The moment Search transitions from:

“Here are relevant documents.”

to:

“Here is the answer.”

the burden of correctness increases dramatically.

The irony that Google AI failed specifically on the word Google amplified the incident into a viral example of AI unreliability.

AI Overviews Already Had a Troubled History

This is also not Google’s first public AI failure.

Earlier AI Overview incidents included recommendations lifted from joke and satirical web content, such as a joke Reddit comment and an article from the satirical site The Onion. Infamous examples include:

Advising users to add glue to pizza sauce
Suggesting people eat small rocks daily

Although Google deployed numerous safeguards afterward, recent failures involving spelling and instruction parsing demonstrate that foundational LLM weaknesses remain unresolved.

🧪 Can Tokenization Be Replaced?

Researchers are actively exploring alternatives to token-based architectures.

One of the most notable approaches comes from Meta AI’s Byte Latent Transformer (BLT) architecture.

Unlike models that depend on a fixed tokenizer vocabulary, BLT processes raw byte-level input directly.

BLT Architecture Overview

       ┌────────────────────────────────────────────────────────┐
       │                  Byte Latent Transformer               │
       └───────────────────────────┬────────────────────────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    │     Local Encoder (Bytes)   │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    │      Latent Transformer     │
                    │   (Processes Dynamic Chunks)│
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    │     Local Decoder (Bytes)   │
                    └─────────────────────────────┘

Instead of compressing words into semantic tokens, BLT allows the model to preserve character-level information throughout processing.

This dramatically improves tasks involving:

Spelling
Character tracking
Symbolic reasoning
Fine-grained text manipulation
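A rough sketch of the byte-level view shows why character information survives. The grouping rule here, which closes a patch after each space byte, is only a stand-in chosen for simplicity. It is not BLT's actual dynamic grouping, and the function names are hypothetical.

```python
def to_bytes(text: str) -> list[int]:
    """UTF-8 byte values, the raw input a byte-level model starts from."""
    return list(text.encode("utf-8"))


def space_patches(data: list[int]) -> list[list[int]]:
    """Group bytes into patches, closing a patch after each space byte.

    This whitespace rule is only a stand-in for BLT's dynamic grouping.
    """
    patches, current = [], []
    for b in data:
        current.append(b)
        if b == 0x20:  # ASCII space closes the current patch
            patches.append(current)
            current = []
    if current:
        patches.append(current)
    return patches


def count_char(patches: list[list[int]], char: str) -> int:
    """Count an ASCII character across all patches by its byte value."""
    if len(char) != 1 or ord(char) > 127:
        raise ValueError("this sketch only handles single ASCII characters")
    target = ord(char)
    return sum(patch.count(target) for patch in patches)


if __name__ == "__main__":
    patches = space_patches(to_bytes("a strawberry"))
    print(patches)
    print(count_char(patches, "r"))
```

Even after grouping, every letter is still present as its own byte value inside a patch, so a count remains recoverable from the input itself.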

The Computational Tradeoff

The downside is computational cost.

Tokenization exists largely for efficiency.

Without tokens:

Sequence lengths grow dramatically
Attention costs grow quadratically with sequence length
Training becomes significantly more expensive
Inference latency rises

Although BLT introduces dynamic chunk grouping to mitigate these costs, scaling byte-level architectures to frontier-model sizes remains extremely expensive.

Current BLT experiments are still far smaller than production systems deployed by companies such as Google or OpenAI.
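The quadratic effect is easy to estimate with back-of-the-envelope arithmetic. The sketch below assumes full self-attention with no patching and a rough average of four bytes per token. That figure is an illustrative assumption, not a measured value for any particular tokenizer.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in full self-attention over one sequence."""
    return seq_len * seq_len


def byte_vs_token_cost(num_bytes: int, bytes_per_token: float = 4.0) -> float:
    """Ratio of attention pairs at byte level to attention pairs at token level."""
    # bytes_per_token is an assumed average, used only for illustration.
    num_tokens = max(1, round(num_bytes / bytes_per_token))
    return attention_pairs(num_bytes) / attention_pairs(num_tokens)


if __name__ == "__main__":
    # A 4,000-byte passage: about 1,000 tokens under the assumed ratio.
    print(byte_vs_token_cost(4000))
```

Under these assumptions, a sequence that is four times longer costs sixteen times as many attention pairs. This is the pressure that grouping bytes into larger chunks is meant to relieve.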

🛠️ Practical Industry Workarounds

Because replacing tokenization at frontier scale is costly, many AI companies instead rely on layered mitigation strategies.

Tool Calling

One practical solution is allowing models to recognize their own limitations and delegate tasks externally.

Instead of estimating character counts internally, an AI can call:

A calculator
A code interpreter
A search index
A deterministic parser

This hybrid approach is increasingly common in modern agentic systems.
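A minimal sketch of delegation might look like the following. In real agentic systems the model itself usually decides when to call a tool. Here a hypothetical regular-expression router stands in for that decision, and `fallback_model` is a placeholder for whatever model would handle every other question.

```python
import re

# Hypothetical pattern for questions such as
# "How many Ps are in the word google?" or "How many r's are in strawberry?"
LETTER_COUNT_QUESTION = re.compile(
    r"how many\s+(?P<letter>[a-z])(?:['’]?s)?\s+(?:are\s+)?in\s+"
    r"(?:the\s+word\s+)?\W?(?P<word>[a-z]+)",
    re.IGNORECASE,
)


def count_letter(word: str, letter: str) -> int:
    """Deterministic tool: case-insensitive count of `letter` in `word`."""
    return word.lower().count(letter.lower())


def answer(question: str, fallback_model=None) -> str:
    """Route letter-counting questions to the tool; defer everything else."""
    match = LETTER_COUNT_QUESTION.search(question)
    if match:
        word, letter = match["word"], match["letter"]
        n = count_letter(word, letter)
        return f"'{word}' contains {n} occurrence(s) of '{letter.lower()}'."
    if fallback_model is None:
        return "No deterministic tool matched this question."
    return fallback_model(question)


if __name__ == "__main__":
    print(answer("How many Ps are in the word google?"))
    print(answer("How many r's are in strawberry?"))
```

The count comes from ordinary string operations on the raw characters, so it does not depend on how the model tokenized the word.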

Confidence-Aware Training

Meta also explored alignment techniques during Llama training that encourage models to:

Answer only when confidence is high
Refuse uncertain outputs
Avoid confident hallucinations

This reduces the frequency of obviously incorrect answers, although it does not fundamentally solve the underlying architectural issue.
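The intended behavior can be illustrated with an inference-time analogy. This is not the training technique itself. The sketch simply shows the "answer only when confident" policy, with hypothetical candidate probabilities and an arbitrary threshold.

```python
ABSTAIN = "I'm not sure."


def answer_or_abstain(candidates: dict[str, float], threshold: float = 0.8) -> str:
    """Return the top candidate only if its probability clears `threshold`."""
    if not candidates:
        return ABSTAIN
    best, prob = max(candidates.items(), key=lambda item: item[1])
    return best if prob >= threshold else ABSTAIN


if __name__ == "__main__":
    # Hypothetical probabilities for "How many Ps are in google?"
    print(answer_or_abstain({"0": 0.95, "1": 0.05}))
    print(answer_or_abstain({"0": 0.5, "1": 0.5}))
```

A policy like this trades coverage for fewer confident errors. It cannot make an answer correct when the model is confidently wrong.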

Localized Engineering Patches

Google itself is likely deploying targeted safeguards specifically for:

Letter counting
Prompt interpretation
System instruction boundaries

However, these are effectively engineering band-aids layered on top of deeper structural constraints.

📌 Conclusion

The viral “How many Ps are in Google?” incident was not simply an embarrassing product bug. It exposed a fundamental truth about modern AI systems:

Large language models understand language statistically, not symbolically.

Despite their impressive reasoning capabilities, most LLMs still lack robust character-level awareness because they process text through semantic token abstractions rather than raw letters.

As AI becomes increasingly integrated into core infrastructure such as search engines, user expectations for reliability continue to rise. This creates tension between the probabilistic nature of generative models and the deterministic accuracy people expect from systems like Google Search.

Future architectures such as Byte Latent Transformers may eventually reduce these weaknesses, but the computational tradeoffs remain enormous. Until then, AI systems will likely continue relying on a mixture of prompting techniques, tool calling, alignment engineering, and targeted safeguards to patch over their symbolic blind spots.

The irony is hard to ignore: modern AI can generate complex software systems, summarize scientific papers, and reason across vast datasets, yet it still occasionally struggles to spell the word “Google.”