Google TPU Evolution: From TPU v2 to Ironwood in 8 Years

When discussing advanced chip design, companies such as Intel, AMD, and NVIDIA traditionally dominate the conversation. However, in 2016, Google fundamentally altered the landscape by introducing a custom AI accelerator known as the Tensor Processing Unit (TPU).

What began as a specialized chip for AI inference has evolved into one of the world’s most sophisticated AI training platforms. Between TPU v2 in 2017 and Ironwood in 2025, Google’s TPU supercomputing infrastructure achieved an extraordinary increase in peak system performance of approximately 3,600×.

More importantly, Google’s TPU journey demonstrates a powerful engineering lesson: architectural stability can outperform constant reinvention.

🚀 Why Google Decided to Build Its Own AI Chips

By 2016, Google’s core services, including Search, Translate, Photos, and Ads, were increasingly dependent on deep learning models.

At the time, general-purpose CPUs and GPUs faced several challenges:

  • High power consumption
  • Limited inference efficiency
  • Rising infrastructure costs
  • Increasing AI deployment demands

Google responded by designing TPU v1, a custom ASIC optimized specifically for neural network inference.

The results were dramatic:

  • Up to 30× better energy efficiency than contemporary GPUs
  • Up to 80× better energy efficiency than CPUs
  • Significantly lower operational costs for large-scale AI services

The success of TPU v1 inspired a wave of custom AI silicon initiatives across the industry. Companies including Intel, Amazon, Alibaba, and numerous startups soon began investing in dedicated AI accelerators.

However, TPU v1 was only the beginning. Google’s long-term competitive advantage emerged with TPU v2, which introduced large-scale AI training capabilities.

๐Ÿ—๏ธ The Remarkable Stability of TPU Architecture
#

One of the most surprising aspects of the TPU story is how little its core architecture has changed.

In the early years, many hardware experts questioned whether custom AI ASICs could survive in a field evolving as rapidly as machine learning. Chip development cycles often require multiple years, while AI model architectures can change within months.

Google’s experience proved otherwise.

Over eight years and five TPU generations, the foundational TPU architecture has remained largely intact while continuing to support every major wave of AI innovation.

The AI Landscape Changed Completely

When TPU v2 was introduced:

  • Multi-Layer Perceptrons (MLPs) and DLRM models dominated workloads
  • Recurrent Neural Networks (RNNs) were widely deployed
  • Transformers had not yet become mainstream
  • Diffusion models did not exist in production systems

Fast forward to 2026:

  • RNN workloads have effectively disappeared
  • Transformer models account for approximately 74% of Google’s training workloads
  • Diffusion models power image and video generation systems
  • Multimodal foundation models dominate AI research

Despite these fundamental shifts, TPU’s core hardware design remains relevant.

This architectural longevity is one of the platform’s greatest achievements.

โš™๏ธ The Core TPU Architecture
#

The TPU architecture follows a simple philosophy: maximize throughput for matrix operations while minimizing programming complexity.

Across multiple generations, improvements have largely focused on scale, precision, memory capacity, bandwidth, and reliability rather than redesigning the entire architecture.

TensorCore: The Foundation of TPU Compute

Each TPU contains two large TensorCores.

Rather than relying on thousands of small programmable cores, Google chose a small number of extremely powerful compute engines capable of processing massive blocks of data efficiently.

Benefits include:

  • Simpler programming models
  • Lower scheduling overhead
  • Higher computational efficiency
  • Easier compiler optimization

Internally, TensorCores utilize Very Long Instruction Word (VLIW) execution, allowing multiple operations to be bundled and executed simultaneously.

A dedicated 128-lane vector processing unit handles operations such as:

  • Activation functions
  • Layer normalization
  • Quantization
  • Non-matrix arithmetic

This separation enables matrix and vector workloads to execute concurrently.

🧮 MXU: The Heart of TPU Compute

The Matrix Multiplication Unit (MXU) is the engine that powers nearly all modern AI workloads.

Large language models, diffusion models, and recommendation systems all depend heavily on matrix multiplication.

Evolution of the MXU

TPU v2 featured:

2 × 128 × 128 systolic arrays

Ironwood expands this significantly:

4 × 256 × 256 BF16 arrays
FP8 support
Equivalent to four 512 × 512 FP8 matrix multiplications

The systolic array architecture remains one of Google’s most important innovations because it delivers extremely high computational density with predictable data movement patterns.
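
To make the "predictable data movement" concrete, here is a minimal pure-Python sketch of a systolic-style matrix multiply. It models an output-stationary array, chosen only because it is simple to simulate; the article does not say which dataflow the MXU uses, and the name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    Illustrative model only, not the MXU's actual dataflow.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]  # one accumulator per processing element (PE)

    # Operands enter skewed: row i of `a` is delayed by i cycles and column j
    # of `b` by j cycles, so PE (i, j) sees (a[i][s], b[s][j]) at cycle s + i + j.
    total_cycles = k + n + m - 2
    for t in range(total_cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]
    return acc, total_cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)
```

Every PE performs the same multiply-accumulate each cycle, and the operand it needs is fixed by the cycle number alone, so the schedule is known in advance.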

BF16 Changed AI Computing

Google also pioneered the Bfloat16 (BF16) numerical format.

BF16 structure:

1-bit Sign
8-bit Exponent
7-bit Mantissa

Compared with FP32, BF16 keeps the same 8-bit exponent, and therefore the same dynamic range, but cuts the mantissa from 23 bits to 7.

This trade-off suits deep learning workloads, where numerical range typically matters more than exact precision.

The industry’s later adoption of FP8 and FP4 formats largely follows the same philosophy.
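
The range-versus-precision trade-off can be seen by converting FP32 values to BF16 by hand. The sketch below uses only Python's `struct` module; the helper names are hypothetical, rounding is round-to-nearest-even, and NaN handling is deliberately left out.

```python
import struct


def fp32_to_bf16_bits(x):
    """Return the 16-bit BF16 pattern for x (round-to-nearest-even, NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 is the top 16 bits of FP32: 1 sign, 8 exponent, 7 mantissa bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Expand a 16-bit BF16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


pi_bf16 = bf16_bits_to_float(fp32_to_bf16_bits(3.14159265))  # 3.140625
big_bf16 = bf16_bits_to_float(fp32_to_bf16_bits(1e38))       # still finite
print(pi_bf16, big_bf16)
```

The value of pi keeps only about three significant digits, while a magnitude near 1e38 still survives the conversion because the exponent field is unchanged.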

🧩 SparseCore: Specialized Acceleration for Sparse Workloads

SparseCore is one of the TPU architecture’s most distinctive features.

Unlike TensorCores, SparseCore focuses on sparse computations such as:

  • Recommendation systems
  • Embedding lookups
  • Transformer communication
  • Top-K selection
  • Token decoding

Despite consuming only approximately 5% of chip area and power, SparseCore delivers substantial acceleration for workloads that would otherwise be inefficient on dense matrix hardware.
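
Two of the workloads listed above, embedding lookups and Top-K selection, are sketched below in plain Python to show why they fit dense matrix hardware poorly: the work is dominated by irregular gathers and comparisons rather than multiply-accumulates. The function names are hypothetical, and this illustrates the workload shape only, not how SparseCore implements it.

```python
import heapq


def embedding_bag(table, ids):
    """Gather the rows named by `ids` and sum them into one pooled vector."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for row_id in ids:          # data-dependent, scattered memory reads
        row = table[row_id]
        for d in range(dim):
            pooled[d] += row[d]
    return pooled


def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
print(embedding_bag(table, [0, 2, 3]))
print(top_k([0.1, 0.9, 0.5], 2))
```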

SparseCore Evolution

| Generation | SparseCore Count |
| --- | --- |
| Early TPUs | 2 |
| Ironwood | 4 |

Initially designed for search and advertising systems, SparseCore has evolved into an important communication acceleration engine for large-scale foundation models.

💾 Memory Architecture: Scaling Bandwidth for AI

One of TPU’s defining architectural choices is the elimination of traditional CPU-style cache hierarchies.

Instead, Google relies on software-managed memory movement through DMA scheduling.

TPU Memory Hierarchy

HBM (Global Shared Memory)
        ↑
DMA Scheduling
        ↑
VMEM (On-Chip Vector Memory)
        ↑
Compute Units

This design gives the compiler direct control over data movement, enabling predictable performance at scale.
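
As a rough illustration of software-managed data movement, the sketch below processes a large array through a fixed-size scratchpad, with every transfer issued explicitly instead of relying on a cache. The names `dma_copy`, `hbm`, and `vmem` are hypothetical stand-ins, not real TPU or XLA APIs.

```python
def dma_copy(hbm, start, vmem):
    """Hypothetical stand-in for a DMA transfer: copy one tile from hbm into vmem."""
    count = min(len(vmem), len(hbm) - start)
    vmem[:count] = hbm[start:start + count]
    return count


def tiled_sum_of_squares(hbm, tile_size):
    """Reduce over `hbm` while compute only ever touches the scratchpad."""
    vmem = [0.0] * tile_size  # fixed-capacity on-chip buffer
    total = 0.0
    transfers = 0
    for start in range(0, len(hbm), tile_size):
        count = dma_copy(hbm, start, vmem)  # explicit, scheduled movement
        transfers += 1
        total += sum(value * value for value in vmem[:count])
    return total, transfers


print(tiled_sum_of_squares([1.0, 2.0, 3.0, 4.0, 5.0], tile_size=2))
```

Because the number and size of transfers are fixed by the loop structure, the cost of data movement can be calculated ahead of time, which is the property a cache hierarchy cannot guarantee.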

Growth Across Generations

| Component | TPU v2 | Ironwood |
| --- | --- | --- |
| On-Chip Memory | 32 MB | 128 MB |
| HBM Capacity | 16 GiB | 192 GiB |
| HBM Stacks | 2 | 8 |
| Bandwidth | 700 GB/s | 7,300 GB/s |

Memory bandwidth has increased by roughly 10×, helping keep pace with ever-growing model sizes.
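
The growth factors can be checked directly from the table values. The dictionary layout below is only illustrative.

```python
# (TPU v2, Ironwood) pairs taken from the table above.
specs = {
    "on_chip_memory_mb": (32, 128),
    "hbm_capacity_gib": (16, 192),
    "hbm_stacks": (2, 8),
    "bandwidth_gb_s": (700, 7300),
}

growth = {name: ironwood / v2 for name, (v2, ironwood) in specs.items()}

for name, factor in growth.items():
    print(f"{name}: {factor:.1f}x")
```

By these figures, HBM capacity grew 12 times, slightly faster than bandwidth, while on-chip memory and stack count each grew 4 times.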

๐ŸŒ Interconnect Evolution
#

Scaling AI increasingly depends on connecting thousands of chips into a unified system.

Google’s Inter-Chip Interconnect (ICI) has steadily evolved over successive TPU generations.

| Generation | Configuration |
| --- | --- |
| TPU v2 | 4 × 62 GB/s |
| TPU v4 | 6 × 50 GB/s |
| TPU v5p | 6 × 100 GB/s |
| Ironwood | 6 × 100 GB/s |

ICI allows TPU clusters to function as a single distributed supercomputer rather than a collection of isolated accelerators.
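
A short sketch of what the link counts imply, under two assumptions the article does not state: that "N × B GB/s" means N links of B GB/s each, and that 4 and 6 links correspond to 2D and 3D torus topologies. The function name `torus_neighbors` is hypothetical.

```python
# (links per chip, GB/s per link) taken from the table above.
ici = {"TPU v2": (4, 62), "TPU v4": (6, 50), "TPU v5p": (6, 100), "Ironwood": (6, 100)}
aggregate_gb_s = {gen: links * rate for gen, (links, rate) in ici.items()}


def torus_neighbors(coord, shape):
    """Return the directly linked chips of `coord` in a torus with wraparound."""
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % size  # wrap at the edges
            neighbors.append(tuple(nxt))
    return neighbors


print(aggregate_gb_s)
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

Under these assumptions, a chip in a 3D torus has exactly six direct neighbors, one per link, and even a corner chip is fully connected because of the wraparound.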

🔧 Managing Clusters With Tens of Thousands of Chips

Training modern Gemini-scale models requires enormous infrastructure.

At this scale, hardware failures are not exceptional events; they are guaranteed.

Google’s solution combines architectural resilience with optical networking.

Optical Circuit Switches (OCS)

Beginning with TPU v4, Google introduced Optical Circuit Switches.

The basic deployment unit is a:

4 × 4 × 4 Cube
= 64 TPUs

Each cube connects independently to an optical switch.

When hardware fails:

  • Faulty nodes can be bypassed
  • Remaining hardware continues operating
  • Entire clusters do not require shutdown

This dramatically improves availability and deployment flexibility.
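
The bypass idea can be sketched as a scheduling problem: a job needs some number of chips, and the scheduler assembles it from whichever 64-chip cubes are currently healthy. This is a toy model under that assumption; `compose_slice` and the health map are hypothetical and do not represent Google's scheduler.

```python
CUBE_CHIPS = 4 * 4 * 4  # 64 TPUs per cube, as stated above


def compose_slice(cube_health, chips_needed):
    """Pick enough healthy cubes for a job, or return None if too few remain."""
    cubes_needed = -(-chips_needed // CUBE_CHIPS)  # ceiling division
    healthy = [cube_id for cube_id, ok in sorted(cube_health.items()) if ok]
    if len(healthy) < cubes_needed:
        return None
    return healthy[:cubes_needed]


health = {0: True, 1: False, 2: True, 3: True}  # cube 1 has a failed node
print(compose_slice(health, 128))
print(compose_slice(health, 256))
```

Since each cube attaches to the optical switch independently, skipping cube 1 in this model does not require the chosen cubes to be physically adjacent.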

Flexible Cluster Scheduling

OCS enables:

  • Dynamic cluster composition
  • Fault-tolerant scheduling
  • Efficient resource utilization
  • Rapid hardware replacement

Even partially degraded clusters can continue training large models efficiently.

๐Ÿ›ก๏ธ Ironwood’s Hardware Reliability Innovations
#

As cluster sizes grow, silent hardware errors become a major concern.

Ironwood introduces dedicated mechanisms to detect and mitigate these issues.

FBIST: Functional Built-In Self-Test

FBIST continuously validates hardware throughout its lifecycle:

  • Manufacturing
  • Burn-in testing
  • Data center deployment
  • Production operation

Potential failures can be identified before impacting training jobs.

Vector Unit Hardware Replay

Ironwood introduces hardware-level replay verification.

The mechanism:

  1. Uses idle execution slots
  2. Re-executes selected calculations
  3. Verifies computational correctness
  4. Identifies defective compute units

Because verification occurs during otherwise unused cycles, performance impact is effectively zero.
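
The four steps above can be sketched as a small simulation. The article does not describe how replay is implemented, so this model makes an assumption: selected operations are re-executed on a second, trusted path and compared with the original result. All names are hypothetical.

```python
def replay_check(primary, reference, inputs, idle_budget):
    """Run inputs on `primary`, then replay up to `idle_budget` of them on `reference`.

    Returns the primary results and the indices whose replay disagreed.
    """
    results = [primary(x) for x in inputs]
    mismatches = []
    # Only as many replays as there are idle slots; the rest go unchecked.
    for index in range(min(idle_budget, len(inputs))):
        if reference(inputs[index]) != results[index]:
            mismatches.append(index)
    return results, mismatches


def faulty_square(x):
    """A compute unit with a defect that corrupts one specific input."""
    return x * x + (1 if x == 3 else 0)


def good_square(x):
    return x * x


print(replay_check(faulty_square, good_square, [1, 2, 3, 4], idle_budget=4))
```

With an idle budget of 4 the defect at index 2 is caught; with a budget of 2 it would go unnoticed in this pass, which is why sampling over time matters in such a scheme.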

📈 Achieving High Effective Throughput

Raw peak performance matters far less than effective throughput.

Google measures effective throughput by accounting for:

  • Recovery operations
  • Fault handling
  • Idle time
  • Communication overhead

Results are impressive:

| System | Effective Throughput |
| --- | --- |
| TPU v4 | 97% |
| TPU v5p (Gemini Training) | 93% |

Maintaining effective throughput above 90% across tens of thousands of chips is a significant engineering achievement.
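
One plausible way to account for the factors listed above is to subtract the time lost to each from total job time. The article does not give Google's exact definition, so the formula and the sample hours below are assumptions for illustration only.

```python
def effective_throughput(total_hours, recovery, fault_handling, idle, communication):
    """Fraction of total job time spent on useful work (assumed definition)."""
    lost = recovery + fault_handling + idle + communication
    if total_hours <= 0 or lost > total_hours:
        raise ValueError("lost time must fit within a positive total")
    return (total_hours - lost) / total_hours


# Hypothetical 100-hour job with 7 hours of overhead in total.
print(effective_throughput(100, recovery=2, fault_handling=1, idle=2, communication=2))
```

Under this definition, a 93% figure means only about 7 hours in every 100 are lost to all sources of overhead combined.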

🎯 Six Design Principles Behind TPU’s Success

Over five generations, Google has distilled six core principles that continue to define TPU architecture.

1. Systolic Arrays for Matrix Computation

Large matrix multiplications remain the dominant workload in modern AI.

2. Low-Precision, Large-Range Formats

BF16, FP8, and FP4 prioritize dynamic range over unnecessary precision.

3. HBM as Primary External Memory

High-bandwidth memory relieves the bandwidth bottleneck of conventional external memory.

4. Proprietary High-Speed Interconnects

Thousands of chips can operate as a unified distributed system.

5. Software-Controlled Memory Management

DMA scheduling replaces hardware cache complexity.

6. Dedicated Vector Processing Units

Matrix and non-matrix workloads execute independently without resource contention.

💡 TPU Innovations Rarely Found Elsewhere

Two TPU innovations remain uncommon elsewhere in the industry.

Optical Circuit Switches

OCS enables:

  • Modular deployments
  • Fault isolation
  • Incremental cluster expansion
  • Simplified maintenance

SparseCore

SparseCore provides specialized acceleration for:

  • Embedding operations
  • Recommendation systems
  • Distributed communication
  • Decoding workloads

Few competing AI accelerators implement a dedicated sparse-processing engine at this scale.

๐Ÿ” Why TPU Continues to Matter
#

After eight years of continuous evolution, Google’s TPU ecosystem has developed several advantages that extend beyond raw hardware performance.

Simplified Programming Model

Developers work with a small number of large compute engines rather than thousands of independent cores.

Hardware and Software Co-Design

The XLA compiler and JAX ecosystem evolve alongside TPU hardware, reducing migration costs between generations.

Massive Unified Clusters

Optical switching enables training jobs to span tens of thousands of chips while maintaining high utilization.

Predictable Upgrade Path

Each generation improves:

  • Compute performance
  • Memory capacity
  • Bandwidth
  • Reliability
  • Cluster scale

without disrupting the software ecosystem.

Sustainability and Energy Efficiency

Every TPU generation improves performance per watt, reducing the environmental impact of large-scale AI training.

๐Ÿ Conclusion
#

From TPU v2 to Ironwood, Google’s AI hardware strategy demonstrates that long-term architectural consistency can outperform constant reinvention.

While AI workloads evolved from RNNs to Transformers, diffusion models, and multimodal foundation models, TPU’s core design principles remained remarkably stable. Systolic arrays, BF16 arithmetic, software-managed memory, HBM, and large-scale distributed interconnects have continued to scale successfully across five generations.

Today, TPU serves as the computational backbone behind Gemini and many of Google’s most advanced AI systems. More importantly, it offers a blueprint for future AI infrastructure: build a strong foundation, evolve it systematically, and optimize relentlessly rather than rebuilding from scratch every few years.
