
Why Memory Bandwidth, Not Compute, Determines LLM Inference Speed


In an in-depth discussion with Dwarkesh Patel, Reiner Pope, MatX founder and former Google TPU architect, sheds light on why stacking raw compute power (FLOPS) fails to reduce single-user LLM latency. He explores memory bandwidth limits, Mixture of Experts (MoE) routing challenges, KV cache management, and the economic implications for API pricing in 2026.


💰 The Economics of LLM API Pricing: High Costs for Low Concurrency

Leading LLM platforms often charge up to a 6x premium for faster token streaming. That premium buys different concurrency and scheduling settings, not faster chips.

Concurrency vs Efficiency

  • Fast Mode (Low Concurrency): Assigns only a few users per GPU cluster, shortening each user's queue but drastically lowering overall hardware efficiency.
  • Standard Mode (High Concurrency): Batches thousands of user requests together, maximizing throughput at the cost of higher per-user latency (see the cost sketch below).
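
A toy cost model shows where a premium of this size can come from. The GPU price, step time, and batch sizes below are assumptions for illustration, not figures from the discussion:

```python
# Toy cost sketch (assumed numbers, not from the interview).
# In a memory-bound decode step the weights are streamed once regardless of
# batch size, so serving 1 user per step costs roughly N times more per token
# than serving N users in the same ~40 ms step.

def cost_per_million_tokens(gpu_hourly_usd, users_per_step, step_ms):
    steps_per_hour = 3_600_000 / step_ms
    tokens_per_hour = steps_per_hour * users_per_step  # one token per user per step
    return gpu_hourly_usd / tokens_per_hour * 1e6

# "Fast mode" (1 user/step) vs. "standard mode" (64 users/step)
# on a hypothetical $4/hr GPU with a ~40 ms memory-bound decode step:
print(cost_per_million_tokens(4, 1, 40))    # ~$44 per 1M output tokens
print(cost_per_million_tokens(4, 64, 40))   # ~$0.69 per 1M output tokens
```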

The Memory Bandwidth Wall

For single-user inference, memory (HBM/VRAM) bandwidth is the limiting factor, not raw compute. LLMs decode autoregressively, so the full set of model weights must be read from memory for every generated token:

Inference Latency Formula (per generated token):
$$t_{\text{token}} = \max(t_{\text{compute}}, t_{\text{memory}})$$
With $t_{\text{compute}}$ negligible at low concurrency, $t_{\text{memory}}$ dominates:
$$t_{\text{memory}} = \frac{\text{Model Size (bytes)}}{\text{Memory Bandwidth (bytes/s)}}$$

Simply increasing FLOPS therefore cannot reduce single-user latency. Pope suggests the optimal concurrency is roughly $300 \times$ the model’s sparsity ratio, the point at which memory traffic and computation are balanced.
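
A minimal roofline sketch of the formula above, using assumed numbers (a 70B-parameter dense model in FP16 on a ~3.35 TB/s, 1000-TFLOPS accelerator), shows why batch size 1 is memory-bound and why extra FLOPS do not shorten the step:

```python
# Minimal roofline sketch of per-token decode latency (illustrative numbers).
def decode_token_latency_s(params_billion, bytes_per_param, mem_bw_gb_s,
                           compute_tflops, batch_size=1):
    model_bytes = params_billion * 1e9 * bytes_per_param
    t_memory = model_bytes / (mem_bw_gb_s * 1e9)        # weights streamed once per step
    flops_per_token = 2 * params_billion * 1e9          # ~2 FLOPs per parameter per token
    t_compute = batch_size * flops_per_token / (compute_tflops * 1e12)
    return max(t_compute, t_memory)

# Assumed hardware/model: 70B dense params, FP16 weights, 3.35 TB/s HBM, 1000 TFLOPS.
# At batch 1 the step is ~42 ms and entirely memory-bound; larger batches raise
# throughput "for free" until t_compute overtakes t_memory (here near batch ~300,
# in line with the concurrency rule of thumb above for a sparsity ratio of 1).
print(decode_token_latency_s(70, 2, 3350, 1000, batch_size=1))
```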


⚡ Prefill vs Decoding: Memory Utilization & KV Cache Tiering

Hardware utilization varies between input (prefill) and output (decoding) stages:

  • Input Stage (Prefill): Prompt processed in parallel, achieving high utilization.
  • Output Stage (Decoding): Sequential token generation creates a memory-bound bottleneck.
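
The gap can be expressed as arithmetic intensity, the FLOPs performed per byte of weights read. The sketch below uses an assumed 70B-parameter dense model, not measured figures:

```python
# Arithmetic-intensity sketch: why prefill saturates compute but decode does not.
# Weights are streamed from memory once per forward pass either way; prefill
# amortizes that read over every prompt token, decode over a single token.
def flops_per_byte(params_billion, tokens_per_pass, bytes_per_param=2):
    flops = 2 * params_billion * 1e9 * tokens_per_pass   # ~2 FLOPs per weight per token
    bytes_read = params_billion * 1e9 * bytes_per_param  # weights read once
    return flops / bytes_read

print(flops_per_byte(70, 2048))  # prefill of a 2048-token prompt: ~2048 FLOPs/byte
print(flops_per_byte(70, 1))     # decode, one token at a time:    ~1 FLOP/byte
```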

Tiered KV Cache Architecture

Providers manage active conversation memory across multiple tiers:

| Memory Tier | Baseline Drain Time | Use Case |
| --- | --- | --- |
| HBM (High Bandwidth Memory) | ~20 ms | Active token generation / immediate processing |
| Host DDR Memory | 1–10 s | Mid-tier caching for paused sessions |
| Flash Storage | ~1 min | Long-term context archiving |

Tiered caching allows cost-efficient long-context support. However, ultra-long prompts (e.g., Gemini 3.1, >200,000 tokens) hit steep cost spikes due to slower retrieval times, reinforcing memory bandwidth as the primary bottleneck.
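
To see why long contexts spill out of HBM, a rough KV-cache sizing sketch helps; the model shape used here (80 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption for illustration:

```python
# Rough KV-cache sizing sketch (assumed model shape, FP16 cache).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values (factor of 2) are stored per layer for every cached token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# At 200,000 tokens of context this shape needs ~65 GB of KV cache --
# far too much to pin in HBM for every paused session, hence DDR/flash tiers.
print(kv_cache_bytes(80, 8, 128, 200_000) / 1e9, "GB")
```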


🔄 Pipeline Parallelism Breakdown in Mixture of Experts (MoE) Models

Massive MoE architectures, such as DeepSeek-V3, require parameter distribution across multiple nodes due to VRAM limits, creating a communication wall.

  • MoE Paradox: High per-token compute efficiency is offset by inter-node data transfer overhead.
  • Dynamic Routing: Each token must be routed in real-time to the correct expert module across GPUs or servers.
  • Impact on Latency: Network transfer dominates, making additional FLOPS ineffective for single-user responsiveness.
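
A back-of-envelope sketch of the routing traffic illustrates this communication wall. The hidden size, expert count, and link bandwidth below are illustrative assumptions, not figures from the discussion:

```python
# Back-of-envelope sketch of per-token expert-routing traffic (assumed numbers).
def routing_wire_time_us(hidden_dim, top_k_experts, bytes_per_elem=2, link_gb_s=50):
    # Each token's activation is dispatched to its top-k experts and the partial
    # results are combined back, so the vector crosses the link roughly twice.
    bytes_moved = 2 * top_k_experts * hidden_dim * bytes_per_elem
    return bytes_moved / (link_gb_s * 1e9) * 1e6

# hidden_dim 7168, 8 routed experts, ~50 GB/s effective per-GPU link:
# ~4.6 us of pure wire time per token per MoE layer, before network latency
# and serialization -- a cost that extra FLOPS on each node cannot remove.
print(routing_wire_time_us(7168, 8))
```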

Key Implications for LLM Deployment

  1. Memory Optimization: High-bandwidth memory and efficient tiering are essential.
  2. Distributed Expert Coordination: Minimizing inter-node communication overhead is essential for MoE inference.
  3. Economic Efficiency: API pricing reflects the trade-off between low-latency single-user access and hardware utilization.

Key Takeaway: Single-user LLM latency is determined by memory bandwidth, KV cache tiering, and MoE routing, not raw FLOPS. Effective deployment strategies require hardware-aware LLM optimization, memory hierarchy management, and careful orchestration of distributed experts to ensure responsive AI services.
