Google TPU v8 Explained: Training vs Inference Split
On April 22, 2026, Google officially unveiled its eighth-generation TPU (v8) at Google Cloud Next.
For the first time, Google has split its TPU roadmap into two specialized chips:
- TPU 8t (Sunfish) for large-scale training
- TPU 8i (Zebrafish) for high-efficiency inference
This architectural shift reflects a deeper industry transition: modern AI workloads—especially agentic AI swarms and trillion-parameter models—no longer fit a “one-size-fits-all” accelerator design. The previous generation, TPU v7 (Ironwood), began to show limits under these emerging workloads.
🚀 The Two-Pronged TPU Strategy: 8t vs. 8i #
Google’s TPU v8 marks a decisive move toward specialization—separating training and inference into independently optimized systems.
| Feature | TPU 8t (Training) | TPU 8i (Inference) |
|---|---|---|
| Codename | Sunfish | Zebrafish |
| Core Design | High-throughput (Broadcom partner) | Cost-efficient (MediaTek partner) |
| Performance | ~2.8× v7 | +80% (~1.8×) vs. v7 |
| Topology | 3D Torus (massive clusters) | Boardfly (low-latency, high-radix) |
| Scale | Up to 9,600 chips (121 ExaFlops) | Optimized for agent swarms |
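A quick back-of-the-envelope check of the headline cluster figure helps put the per-chip numbers in perspective. This sketch assumes the 121 ExaFlops is the aggregate peak of a fully populated 9,600-chip pod (an assumption — Google has not published a per-chip breakdown):

```python
# Back-of-the-envelope: per-chip peak throughput implied by the
# headline cluster figures. Assumes 121 ExaFlops is the aggregate
# peak across all 9,600 chips -- an assumption, not a disclosed spec.

cluster_exaflops = 121          # ExaFlops, whole pod
chips = 9_600

per_chip_pflops = cluster_exaflops * 1_000 / chips  # 1 ExaFlop = 1,000 PFLOPs
print(f"Implied per-chip peak: ~{per_chip_pflops:.1f} PFLOPs")
```

Roughly 12.6 PFLOPs per chip — plausible for a next-generation accelerator at low precision, though the actual datatype behind the figure is not stated.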
Instead of forcing a compromise, Google now optimizes:
- Throughput and scale → training (8t)
- Latency and efficiency → inference (8i)
⚙️ Technical Innovations: Breaking the Memory Wall #
Both TPU v8 variants are built around a vertically integrated stack, including Google’s custom Axion ARM CPUs, enabling tighter coupling between compute, memory, and networking.
TPU 8t: The Training Powerhouse #
The 8t (Sunfish) is designed for extreme-scale distributed training:
- **Massive Interconnect Bandwidth:** Inter-chip interconnect (ICI) bandwidth is doubled, while TPUDirect boosts storage access speeds by 10× over v7.
- **Virgo Network Architecture:** A new network fabric enables scaling to 1 million chips in a single logical cluster, with near-linear scaling efficiency.
- **Autonomous Reconfiguration:** Using Optical Circuit Switching (OCS), the system dynamically reroutes around failures, allowing long-running training jobs to continue uninterrupted.
This makes 8t particularly suited for frontier model training where uptime and scaling efficiency are critical.
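The appeal of a 3D torus at this scale is that every chip has six direct neighbors with wraparound links, so worst-case hop distance grows only with half the axis length. A toy routing sketch (the dimensions and routing logic are illustrative assumptions, not the actual TPU 8t pod layout):

```python
# Toy 3D-torus model: neighbors and shortest hop distance.
# Illustrative only -- pod shape and routing are assumptions.

def torus_neighbors(coord, dims):
    """Six direct neighbors of a chip in a 3D torus (wraparound links)."""
    x, y, z = coord
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

def torus_hops(a, b, dims):
    """Shortest hop count between two chips (per-axis wraparound minimum)."""
    return sum(min(abs(p - q), d - abs(p - q))
               for p, q, d in zip(a, b, dims))

dims = (16, 16, 16)  # 4,096 chips -- an illustrative pod shape
print(len(torus_neighbors((0, 0, 0), dims)))   # 6 direct neighbors
print(torus_hops((0, 0, 0), (8, 8, 8), dims))  # worst case: 8 per axis = 24
```

The wraparound term `d - abs(p - q)` is what distinguishes a torus from a plain mesh: chips at opposite edges are one hop apart, not a full traversal away.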
TPU 8i: The Inference & Reasoning Engine #
The 8i (Zebrafish) is purpose-built for modern inference workloads, especially reasoning-heavy models:
- **Boardfly Topology:** A hierarchical high-radix network that reduces hop count by over 50%, cutting all-to-all latency by ~50%, a key requirement for Mixture-of-Experts (MoE) models.
- **Massive On-Chip SRAM (384MB):** Roughly 3× larger than v7, enabling full KV cache residency on-chip and effectively eliminating memory bottlenecks during inference.
- **Collectives Acceleration Engine (CAE):** A dedicated hardware block for global operations, reducing latency for reasoning workflows (e.g., chain-of-thought) by up to 5×.
This design directly targets real-time AI systems where latency—not raw FLOPs—is the bottleneck.
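Whether a KV cache actually fits in 384 MB of SRAM depends entirely on model shape. A hedged sizing sketch — every model dimension below is an illustrative assumption, not a disclosed TPU 8i target workload:

```python
# KV-cache sizing sketch: does a model's cache fit in 384 MB of
# on-chip SRAM? All model dimensions below are assumptions chosen
# for illustration, not a published TPU 8i reference workload.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=1):
    # 2x accounts for keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

SRAM_BYTES = 384 * 1024 * 1024  # 384 MB per chip, from the spec above

# Hypothetical 8B-class model with grouped-query attention, int8 KV cache:
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4_096)
print(f"KV cache: {cache / 2**20:.0f} MB, fits on-chip: {cache <= SRAM_BYTES}")
```

At this shape the cache comes to 256 MB and fits comfortably; at longer context lengths it quickly exceeds a single chip's SRAM, which is presumably where the Boardfly fabric's fast all-to-all matters for sharding the cache across chips.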
⚡ Power and Efficiency in the TPU v8 Era #
Efficiency is no longer optional at hyperscale—it’s foundational. TPU v8 delivers a 2× performance-per-watt improvement over v7 through:
- **Axion CPU Integration:** Custom ARM-based CPUs enable system-level NUMA optimization, improving memory locality and reducing overhead.
- **Liquid Cooling v4:** Advanced liquid cooling supports significantly higher power densities than traditional air-cooled systems.
- **Real-Time Power Management:** Hardware dynamically adjusts power usage based on workload phases (training, inference, communication), minimizing waste.
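The phase-aware power behavior described above can be sketched as a simple lookup-driven controller. This is purely illustrative — the TDP figure, phase names, and power fractions are all assumptions, not measured TPU v8 behavior:

```python
# Toy phase-aware power controller: scale a chip's power cap by the
# current workload phase. TDP, phase names, and fractions are
# assumptions for illustration, not TPU v8 specifications.

TDP_WATTS = 700  # hypothetical per-chip thermal design power

# Compute-bound phases run near the cap; communication phases, where
# compute units idle waiting on the network, can be clocked down.
PHASE_POWER_FRACTION = {
    "training_compute": 1.00,
    "inference_decode": 0.60,
    "communication":    0.35,
    "idle":             0.10,
}

def power_cap(phase):
    return TDP_WATTS * PHASE_POWER_FRACTION[phase]

for phase in PHASE_POWER_FRACTION:
    print(f"{phase:18s} -> {power_cap(phase):5.0f} W")
```

The point of the sketch is the asymmetry: communication-heavy phases waste most of a static power budget, so even a coarse phase table recovers a large share of the claimed efficiency gain.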
🤖 The Bigger Picture: Enter the Agentic AI Era #
With TPU v8, Google is clearly aligning its infrastructure with the rise of agentic AI systems—distributed, collaborative AI agents operating at scale.
By offering:
- Bare-metal access
- Native support for frameworks like SGLang, vLLM, and JAX
Google positions TPU v8 as a direct competitor to next-generation GPU architectures such as NVIDIA’s Rubin platform.
🧠 Final Takeaway #
TPU v8 isn’t just a performance upgrade—it’s a philosophical shift in AI hardware design:
- Training and inference are now fundamentally different problems
- Specialized silicon delivers better efficiency than general-purpose accelerators
- Infrastructure is evolving to support AI systems, not just models
In short, Google’s TPU v8 signals the transition from the model-centric era to the agent-centric era of computing.