Google TPU v8 Explained: Training vs Inference Split
On April 22, 2026, Google officially unveiled its eighth-generation TPU (v8) at Google Cloud Next.
For the first time, Google has split its TPU roadmap into two specialized chips:
- TPU 8t (Sunfish) for large-scale training
- TPU 8i (Zebrafish) for high-efficiency inference
This architectural shift reflects a deeper industry transition: modern AI workloads—especially agentic AI swarms and trillion-parameter models—no longer fit a “one-size-fits-all” accelerator design. The previous generation, TPU v7 (Ironwood), began to show limits under these emerging workloads.
🚀 The Two-Pronged TPU Strategy: 8t vs. 8i #
Google’s TPU v8 marks a decisive move toward specialization—separating training and inference into independently optimized systems.
| Feature | TPU 8t (Training) | TPU 8i (Inference) |
|---|---|---|
| Codename | Sunfish | Zebrafish |
| Core Design | High-throughput (Broadcom partner) | Cost-efficient (MediaTek partner) |
| Performance | ~2.8× v7 | +80% (~1.8×) vs. v7 |
| Topology | 3D Torus (massive clusters) | Boardfly (low-latency, high-radix) |
| Scale | Up to 9,600 chips (121 ExaFlops) | Optimized for agent swarms |
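A quick back-of-the-envelope check of the headline cluster figure helps put the per-chip numbers in perspective. This sketch assumes the 121 ExaFlops is the aggregate peak of a fully populated 9,600-chip pod (an assumption — Google has not published a per-chip breakdown):

```python
# Back-of-the-envelope: per-chip peak throughput implied by the
# headline cluster figures. Assumes 121 ExaFlops is the aggregate
# peak across all 9,600 chips -- an assumption, not a disclosed spec.

cluster_exaflops = 121          # ExaFlops, whole pod
chips = 9_600

per_chip_pflops = cluster_exaflops * 1_000 / chips  # 1 ExaFlop = 1,000 PFLOPs
print(f"Implied per-chip peak: ~{per_chip_pflops:.1f} PFLOPs")
```

Roughly 12.6 PFLOPs per chip — plausible for a next-generation accelerator at low precision, though the actual datatype behind the figure is not stated.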
Instead of forcing a compromise, Google now optimizes:
- Throughput and scale → training (8t)
- Latency and efficiency → inference (8i)
⚙️ Technical Innovations: Breaking the Memory Wall #
Both TPU v8 variants are built around a vertically integrated stack, including Google’s custom Axion ARM CPUs, enabling tighter coupling between compute, memory, and networking.
TPU 8t: The Training Powerhouse #
The 8t (Sunfish) is designed for extreme-scale distributed training:
- **Massive Interconnect Bandwidth:** Inter-chip interconnect (ICI) bandwidth is doubled, while TPUDirect boosts storage access speeds by 10× over v7.
- **Virgo Network Architecture:** A new network fabric enables scaling to 1 million chips in a single logical cluster, with near-linear scaling efficiency.
- **Autonomous Reconfiguration:** Using Optical Circuit Switching (OCS), the system dynamically reroutes around failures, allowing long-running training jobs to continue uninterrupted.
This makes 8t particularly suited for frontier model training where uptime and scaling efficiency are critical.
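The appeal of a 3D torus at this scale is that every chip has six direct neighbors with wraparound links, so worst-case hop distance grows only with half the axis length. A toy routing sketch (the dimensions and routing logic are illustrative assumptions, not the actual TPU 8t pod layout):

```python
# Toy 3D-torus model: neighbors and shortest hop distance.
# Illustrative only -- pod shape and routing are assumptions.

def torus_neighbors(coord, dims):
    """Six direct neighbors of a chip in a 3D torus (wraparound links)."""
    x, y, z = coord
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

def torus_hops(a, b, dims):
    """Shortest hop count between two chips (per-axis wraparound minimum)."""
    return sum(min(abs(p - q), d - abs(p - q))
               for p, q, d in zip(a, b, dims))

dims = (16, 16, 16)  # 4,096 chips -- an illustrative pod shape
print(len(torus_neighbors((0, 0, 0), dims)))   # 6 direct neighbors
print(torus_hops((0, 0, 0), (8, 8, 8), dims))  # worst case: 8 per axis = 24
```

The wraparound term `d - abs(p - q)` is what distinguishes a torus from a plain mesh: chips at opposite edges are one hop apart, not a full traversal away.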
TPU 8i: The Inference & Reasoning Engine #
The 8i (Zebrafish) is purpose-built for modern inference workloads, especially reasoning-heavy models:
- **Boardfly Topology:** A hierarchical high-radix network that reduces hop count by over 50%, cutting all-to-all latency by ~50%, a key requirement for Mixture-of-Experts (MoE) models.
- **Massive On-Chip SRAM (384MB):** Roughly 3× larger than v7, enabling full KV cache residency on-chip and effectively eliminating memory bottlenecks during inference.
- **Collectives Acceleration Engine (CAE):** A dedicated hardware block for global operations, reducing latency for reasoning workflows (e.g., chain-of-thought) by up to 5×.
This design directly targets real-time AI systems where latency—not raw FLOPs—is the bottleneck.
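Whether a KV cache actually fits in 384 MB of SRAM depends entirely on model shape. A hedged sizing sketch — every model dimension below is an illustrative assumption, not a disclosed TPU 8i target workload:

```python
# KV-cache sizing sketch: does a model's cache fit in 384 MB of
# on-chip SRAM? All model dimensions below are assumptions chosen
# for illustration, not a published TPU 8i reference workload.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=1):
    # 2x accounts for keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

SRAM_BYTES = 384 * 1024 * 1024  # 384 MB per chip, from the spec above

# Hypothetical 8B-class model with grouped-query attention, int8 KV cache:
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4_096)
print(f"KV cache: {cache / 2**20:.0f} MB, fits on-chip: {cache <= SRAM_BYTES}")
```

At this shape the cache comes to 256 MB and fits comfortably; at longer context lengths it quickly exceeds a single chip's SRAM, which is presumably where the Boardfly fabric's fast all-to-all matters for sharding the cache across chips.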
⚡ Power and Efficiency in the TPU v8 Era #
Efficiency is no longer optional at hyperscale—it’s foundational. TPU v8 delivers a 2× performance-per-watt improvement over v7 through:
- **Axion CPU Integration:** Custom ARM-based CPUs enable system-level NUMA optimization, improving memory locality and reducing overhead.
- **Liquid Cooling v4:** Advanced liquid cooling supports significantly higher power densities than traditional air-cooled systems.
- **Real-Time Power Management:** Hardware dynamically adjusts power usage based on workload phases (training, inference, communication), minimizing waste.
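The phase-aware power behavior described above can be sketched as a simple lookup-driven controller. This is purely illustrative — the TDP figure, phase names, and power fractions are all assumptions, not measured TPU v8 behavior:

```python
# Toy phase-aware power controller: scale a chip's power cap by the
# current workload phase. TDP, phase names, and fractions are
# assumptions for illustration, not TPU v8 specifications.

TDP_WATTS = 700  # hypothetical per-chip thermal design power

# Compute-bound phases run near the cap; communication phases, where
# compute units idle waiting on the network, can be clocked down.
PHASE_POWER_FRACTION = {
    "training_compute": 1.00,
    "inference_decode": 0.60,
    "communication":    0.35,
    "idle":             0.10,
}

def power_cap(phase):
    return TDP_WATTS * PHASE_POWER_FRACTION[phase]

for phase in PHASE_POWER_FRACTION:
    print(f"{phase:18s} -> {power_cap(phase):5.0f} W")
```

The point of the sketch is the asymmetry: communication-heavy phases waste most of a static power budget, so even a coarse phase table recovers a large share of the claimed efficiency gain.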
🤖 The Bigger Picture: Enter the Agentic AI Era #
With TPU v8, Google is clearly aligning its infrastructure with the rise of agentic AI systems—distributed, collaborative AI agents operating at scale.
By offering:
- Bare-metal access
- Native support for frameworks like SGLang, vLLM, and JAX
Google positions TPU v8 as a direct competitor to next-generation GPU architectures such as NVIDIA’s Rubin platform.
🧠 Final Takeaway #
TPU v8 isn’t just a performance upgrade—it’s a philosophical shift in AI hardware design:
- Training and inference are now fundamentally different problems
- Specialized silicon delivers better efficiency than general-purpose accelerators
- Infrastructure is evolving to support AI systems, not just models
In short, Google’s TPU v8 signals the transition from the model-centric era to the agent-centric era of computing.