Google TPU Evolution: From TPU v2 to Ironwood in 8 Years

When discussing advanced chip design, companies such as Intel, AMD, and NVIDIA traditionally dominate the conversation. However, in 2016, Google fundamentally altered the landscape by introducing a custom AI accelerator known as the Tensor Processing Unit (TPU).

What began as a specialized chip for AI inference has evolved into one of the world’s most sophisticated AI training platforms. Between TPU v2 in 2017 and Ironwood in 2025, Google’s TPU supercomputing infrastructure achieved an extraordinary increase in peak system performance of approximately 3,600×.

More importantly, Google’s TPU journey demonstrates a powerful engineering lesson: architectural stability can outperform constant reinvention.

🚀 Why Google Decided to Build Its Own AI Chips
#

By 2016, Google’s core services—including Search, Translate, Photos, and Ads—were increasingly dependent on deep learning models.

At the time, general-purpose CPUs and GPUs faced several challenges:

High power consumption
Limited inference efficiency
Rising infrastructure costs
Increasing AI deployment demands

Google responded by designing TPU v1, a custom ASIC optimized specifically for neural network inference.

The results were dramatic:

Up to 30× better energy efficiency than contemporary GPUs
Up to 80× better energy efficiency than CPUs
Significantly lower operational costs for large-scale AI services

The success of TPU v1 inspired a wave of custom AI silicon initiatives across the industry. Companies including Intel, Amazon, Alibaba, and numerous startups soon began investing in dedicated AI accelerators.

However, TPU v1 was only the beginning. Google’s long-term competitive advantage emerged with TPU v2, which introduced large-scale AI training capabilities.

🏗️ The Remarkable Stability of TPU Architecture
#

One of the most surprising aspects of the TPU story is how little its core architecture has changed.

In the early years, many hardware experts questioned whether custom AI ASICs could survive in a field evolving as rapidly as machine learning. Chip development cycles often require multiple years, while AI model architectures can change within months.

Google’s experience proved otherwise.

Over eight years and five TPU generations, the foundational TPU architecture has remained largely intact while continuing to support every major wave of AI innovation.

The AI Landscape Changed Completely
#

When TPU v2 was introduced:

Multi-Layer Perceptrons (MLPs) and deep learning recommendation models (DLRMs) dominated workloads
Recurrent Neural Networks (RNNs) were widely deployed
Transformers had not yet become mainstream
Diffusion models did not exist in production systems

Fast forward to 2026:

RNN workloads have effectively disappeared
Transformer models account for approximately 74% of Google’s training workloads
Diffusion models power image and video generation systems
Multimodal foundation models dominate AI research

Despite these fundamental shifts, TPU’s core hardware design remains relevant.

This architectural longevity is one of the platform’s greatest achievements.

⚙️ The Core TPU Architecture
#

The TPU architecture follows a simple philosophy: maximize throughput for matrix operations while minimizing programming complexity.

Across multiple generations, improvements have largely focused on scale, precision, memory capacity, bandwidth, and reliability rather than redesigning the entire architecture.

TensorCore: The Foundation of TPU Compute
#

Each TPU contains two large TensorCores.

Rather than relying on thousands of small programmable cores, Google chose a small number of extremely powerful compute engines capable of processing massive blocks of data efficiently.

Benefits include:

Simpler programming models
Lower scheduling overhead
Higher computational efficiency
Easier compiler optimization

Internally, TensorCores utilize Very Long Instruction Word (VLIW) execution, allowing multiple operations to be bundled and executed simultaneously.

A dedicated 128-lane vector processing unit handles operations such as:

Activation functions
Layer normalization
Quantization
Non-matrix arithmetic

This separation enables matrix and vector workloads to execute concurrently.

🧮 MXU: The Heart of TPU Compute
#

The Matrix Multiplication Unit (MXU) is the engine that powers nearly all modern AI workloads.

Large language models, diffusion models, and recommendation systems all depend heavily on matrix multiplication.

Evolution of the MXU
#

TPU v2 featured:

2 × 128 × 128 systolic arrays

Ironwood expands this significantly:

4 × 256 × 256 BF16 arrays
FP8 support
Equivalent to four 512 × 512 FP8 matrix multiplications

The systolic array architecture remains one of Google’s most important innovations because it delivers extremely high computational density with predictable data movement patterns.
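
To make the "predictable data movement" point concrete, here is a minimal pure-Python sketch of a systolic-style matrix multiply. It is a toy model, not a description of the MXU itself: the output-stationary dataflow (each processing element keeps one accumulator) and the name `systolic_matmul` are assumptions chosen for simplicity.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated output-stationary systolic array."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per processing element (PE)
    cycles = n + m + k - 2               # the last PE finishes at step n + m + k - 3
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                # Operands are skewed: the s-th pair reaches PE (i, j) at step i + j + s.
                s = t - i - j
                if 0 <= s < k:
                    acc[i][j] += a[i][s] * b[s][j]
    return acc, cycles


product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, cycles)
```

Every PE does the same multiply-accumulate on a fixed schedule, so the cycle count depends only on the matrix shapes. That regularity is what lets a compiler plan data movement ahead of time.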

BF16 Changed AI Computing
#

Google also pioneered the Bfloat16 (BF16) numerical format.

BF16 structure:

1-bit Sign
8-bit Exponent
7-bit Mantissa

Compared with FP32, BF16 sacrifices precision while preserving dynamic range.

This trade-off suits deep learning workloads, where numerical range usually matters more than mantissa precision.

The industry’s later adoption of FP8 and FP4 formats largely follows the same philosophy.
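
The range-versus-precision trade-off is easy to see by converting values by hand. The sketch below uses only Python's `struct` module and relies on BF16 being the top 16 bits of an FP32 value. The helper names are hypothetical, rounding is round-to-nearest-even, and NaN handling is deliberately left out.

```python
import struct


def float_to_bf16_bits(x):
    """Return the 16-bit BF16 pattern for x (round-to-nearest-even, NaN not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # raw FP32 bit pattern
    rounding = 0x7FFF + ((bits >> 16) & 1)                # ties go to the even pattern
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_float(b):
    """Expand a BF16 pattern back to a Python float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def bf16_fields(b):
    """Split a BF16 pattern into (sign, exponent, mantissa): 1, 8, and 7 bits."""
    return b >> 15, (b >> 7) & 0xFF, b & 0x7F


big = bf16_bits_to_float(float_to_bf16_bits(3.0e38))            # near the top of FP32 range
fine = bf16_bits_to_float(float_to_bf16_bits(1.0 + 2.0 ** -9))  # small step above 1.0
print(big, fine, bf16_fields(float_to_bf16_bits(1.0)))
```

A value near the top of the FP32 range survives the round trip with a small relative error, while a tiny increment above 1.0 is rounded away. That is the dynamic range kept and the precision given up.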

🧩 SparseCore: Specialized Acceleration for Sparse Workloads
#

SparseCore is one of the TPU architecture’s most distinctive features.

Unlike TensorCores, SparseCore focuses on sparse computations such as:

Recommendation systems
Embedding lookups
Transformer communication
Top-K selection
Token decoding

Despite consuming only approximately 5% of chip area and power, SparseCore delivers substantial acceleration for workloads that would otherwise be inefficient on dense matrix hardware.
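
A rough way to see why embedding lookups fit poorly on dense matrix hardware is to count the work. The sketch below compares a lookup written as a one-hot matrix product with the same lookup written as a gather. It does not model SparseCore hardware; the function names and table sizes are hypothetical.

```python
def lookup_dense(table, ids):
    """Embedding lookup expressed as one-hot x table; touches every row of the table."""
    vocab, dim = len(table), len(table[0])
    out, macs = [], 0
    for i in ids:
        one_hot = [1.0 if r == i else 0.0 for r in range(vocab)]
        row = [0.0] * dim
        for r in range(vocab):
            for d in range(dim):
                row[d] += one_hot[r] * table[r][d]
                macs += 1                      # one multiply-accumulate per table element
        out.append(row)
    return out, macs


def lookup_sparse(table, ids):
    """Same lookup as a gather; reads only the requested rows."""
    rows = [list(table[i]) for i in ids]
    return rows, len(ids) * len(table[0])      # number of elements actually read


table = [[float(r * 10 + d) for d in range(4)] for r in range(1000)]
dense_rows, dense_work = lookup_dense(table, [3, 997])
sparse_rows, sparse_work = lookup_sparse(table, [3, 997])
print(dense_work, sparse_work)
```

Both paths return the same rows, but the dense version does work proportional to the whole vocabulary, while the gather scales only with the number of requested IDs.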

SparseCore Evolution
#

Generation	SparseCore Count
Early TPUs	2
Ironwood	4

Initially designed for search and advertising systems, SparseCore has evolved into an important communication acceleration engine for large-scale foundation models.

💾 Memory Architecture: Scaling Bandwidth for AI
#

One of TPU’s defining architectural choices is the elimination of traditional CPU-style cache hierarchies.

Instead, Google relies on software-managed memory movement through DMA scheduling.

TPU Memory Hierarchy
#

HBM (Global Shared Memory)
        ↑
DMA Scheduling
        ↑
VMEM (On-Chip Vector Memory)
        ↑
Compute Units

This design gives the compiler direct control over data movement, enabling predictable performance at scale.
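
The hierarchy above can be illustrated with a small double-buffering sketch: a large "HBM" list is reduced through two small "VMEM" buffers, with each transfer issued explicitly before the compute that needs it. This is only a sequential software analogy. The `dma_load` helper and the two-buffer layout are assumptions, not the actual TPU runtime or compiler output.

```python
def tiled_sum_of_squares(hbm, tile_size):
    """Reduce a large 'HBM' array through two small 'VMEM' buffers (double buffering)."""
    def dma_load(start):
        # Stand-in for a DMA transfer: copy one tile from HBM into on-chip memory.
        return hbm[start:start + tile_size]

    vmem = [dma_load(0), None]               # buffer 0 is filled before compute starts
    total, transfers, slot = 0, 1, 0
    for start in range(0, len(hbm), tile_size):
        nxt = start + tile_size
        if nxt < len(hbm):
            vmem[1 - slot] = dma_load(nxt)   # issue the next transfer before computing
            transfers += 1
        total += sum(x * x for x in vmem[slot])   # compute only touches on-chip data
        slot = 1 - slot
    return total, transfers


total, transfers = tiled_sum_of_squares(list(range(10)), tile_size=4)
print(total, transfers)
```

Because every transfer is scheduled by software, the number and order of memory movements are known in advance, with no cache hit rate to guess at.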

Growth Across Generations
#

Component	TPU v2	Ironwood
On-Chip Memory	32 MB	128 MB
HBM Capacity	16 GiB	192 GiB
HBM Stacks	2	8
HBM Bandwidth	700 GB/s	7,300 GB/s

HBM bandwidth has increased by roughly 10× and HBM capacity by 12×, helping keep pace with ever-growing model sizes.
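
The growth factors follow directly from the table. A short calculation, using only the figures listed above (the dictionary keys are just labels chosen for this sketch):

```python
tpu_v2 = {"on_chip_mb": 32, "hbm_gib": 16, "hbm_stacks": 2, "hbm_gb_per_s": 700}
ironwood = {"on_chip_mb": 128, "hbm_gib": 192, "hbm_stacks": 8, "hbm_gb_per_s": 7300}

# Ratio of Ironwood to TPU v2 for each row of the table.
growth = {name: ironwood[name] / tpu_v2[name] for name in tpu_v2}

for name, factor in growth.items():
    print(f"{name}: {factor:.1f}x")
```

On-chip memory and stack count grow 4×, while HBM capacity and bandwidth grow by an order of magnitude, so most of the memory scaling comes from faster and denser HBM rather than from more stacks alone.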

🌐 Interconnect Evolution
#

Scaling AI increasingly depends on connecting thousands of chips into a unified system.

Google’s Inter-Chip Interconnect (ICI) has steadily evolved over successive TPU generations.

Generation	ICI Links × Per-Link Bandwidth
TPU v2	4 × 62 GB/s
TPU v4	6 × 50 GB/s
TPU v5p	6 × 100 GB/s
Ironwood	6 × 100 GB/s

ICI allows TPU clusters to function as a single distributed supercomputer rather than a collection of isolated accelerators.
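
Multiplying out the table gives the aggregate ICI bandwidth per chip. The sketch below also shows one plausible reading of the link counts: in a wraparound (torus) grid each chip has two neighbors per dimension, so four links fit a 2D torus and six fit a 3D torus. The table does not state the topology, so treat that mapping as an assumption; `torus_neighbors` is a hypothetical helper.

```python
# (links per chip, GB/s per link), taken from the table above.
ici = {"TPU v2": (4, 62), "TPU v4": (6, 50), "TPU v5p": (6, 100), "Ironwood": (6, 100)}
aggregate = {gen: links * gb_per_s for gen, (links, gb_per_s) in ici.items()}


def torus_neighbors(coord, shape):
    """Neighbors of a chip in a wraparound (torus) grid: two per dimension."""
    out = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size   # wrap around at the edges
            out.append(tuple(n))
    return out


print(aggregate)
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```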

🔧 Managing Clusters With Tens of Thousands of Chips
#

Training modern Gemini-scale models requires enormous infrastructure.

At this scale, hardware failures are not exceptional events—they are guaranteed.

Google’s solution combines architectural resilience with optical networking.

Optical Circuit Switches (OCS)
#

Beginning with TPU v4, Google introduced Optical Circuit Switches.

The basic deployment unit is a 4 × 4 × 4 cube of 64 TPUs.

Each cube connects independently to an optical switch.

When hardware fails:

Faulty nodes can be bypassed
Remaining hardware continues operating
Entire clusters do not require shutdown

This dramatically improves availability and deployment flexibility.
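
The bypass idea can be sketched as a scheduling problem over 64-chip cubes: a job asks for some number of chips, and the scheduler assembles it from whichever cubes are currently healthy. The health map, the first-fit policy, and the function name below are all hypothetical. The real system reconfigures optical paths rather than Python lists.

```python
CUBE_CHIPS = 4 * 4 * 4   # 64 TPUs per cube


def compose_slice(cube_health, chips_needed):
    """Pick healthy cubes for a job, skipping any cube marked unhealthy."""
    cubes_needed = -(-chips_needed // CUBE_CHIPS)          # ceiling division
    healthy = [cube_id for cube_id, ok in cube_health.items() if ok]
    if len(healthy) < cubes_needed:
        raise RuntimeError("not enough healthy cubes for this job")
    return healthy[:cubes_needed]


health = {0: True, 1: False, 2: True, 3: True}   # cube 1 has a failed node
print(compose_slice(health, 128))
```

The job lands on cubes 0 and 2, and cube 1 can be repaired without stopping the rest of the cluster.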

Flexible Cluster Scheduling
#

OCS enables:

Dynamic cluster composition
Fault-tolerant scheduling
Efficient resource utilization
Rapid hardware replacement

Even partially degraded clusters can continue training large models efficiently.

🛡️ Ironwood’s Hardware Reliability Innovations
#

As cluster sizes grow, silent hardware errors become a major concern.

Ironwood introduces dedicated mechanisms to detect and mitigate these issues.

FBIST: Functional Built-In Self-Test
#

FBIST continuously validates hardware throughout its lifecycle:

Manufacturing
Burn-in testing
Data center deployment
Production operation

Potential failures can be identified before impacting training jobs.

Vector Unit Hardware Replay
#

Ironwood introduces hardware-level replay verification.

The mechanism:

Uses idle execution slots
Re-executes selected calculations
Verifies computational correctness
Identifies defective compute units

Because verification occurs during otherwise unused cycles, performance impact is effectively zero.
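
As a software analogy for the replay steps listed above, the sketch below re-executes a sample of operations on a second executor and flags any mismatch. Where and how Ironwood actually performs the replay is not described here, so the sampling interval, the two-unit setup, and every name in the code are assumptions.

```python
def run_with_replay(op, inputs, primary_unit, reference_unit, sample_every=4):
    """Run op on every input; replay a sample on a reference unit and flag mismatches."""
    results, suspect = [], []
    for i, x in enumerate(inputs):
        y = primary_unit(op, x)
        # Replay only selected calculations, standing in for "idle execution slots".
        if i % sample_every == 0 and reference_unit(op, x) != y:
            suspect.append(i)
        results.append(y)
    return results, suspect


def healthy_unit(op, x):
    return op(x)


def faulty_unit(op, x):
    # Simulated silent error: one specific input produces a wrong result.
    return op(x) + 1 if x == 8 else op(x)


results, suspect = run_with_replay(lambda v: v * v, range(12), faulty_unit, healthy_unit)
print(suspect)
```

Sampling trades detection latency for overhead: a fault that only affects unsampled inputs is caught later, when a sampled replay eventually lands on it.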

📈 Achieving High Effective Throughput
#

Raw peak performance matters far less than effective throughput.

Google measures effective throughput by accounting for:

Recovery operations
Fault handling
Idle time
Communication overhead

Results are impressive:

System	Effective Throughput
TPU v4	97%
TPU v5p (Gemini Training)	93%

Sustaining effective throughput above 90% across tens of thousands of chips is a significant engineering achievement.
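
One simple way to express this metric is as the fraction of wall-clock time spent on useful training work after subtracting the four loss categories listed above. The formula and the example hours below are illustrative assumptions, not Google's published methodology or data.

```python
def effective_throughput(total_hours, recovery, fault_handling, idle, comm_overhead):
    """Fraction of wall-clock time left for useful work after the listed losses."""
    lost = recovery + fault_handling + idle + comm_overhead
    return (total_hours - lost) / total_hours


# Hypothetical 100-hour window with 7 hours lost across the four categories.
ratio = effective_throughput(100, recovery=2, fault_handling=1, idle=1, comm_overhead=3)
print(f"{ratio:.0%}")
```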

🎯 Six Design Principles Behind TPU’s Success
#

Over five generations, Google has distilled six core principles that continue to define TPU architecture.

1. Systolic Arrays for Matrix Computation
#

Large matrix multiplications remain the dominant workload in modern AI.

2. Low-Precision, Large-Range Formats
#

BF16, FP8, and FP4 prioritize dynamic range over unnecessary precision.

3. HBM as Primary External Memory
#

High-bandwidth memory relieves the memory bottlenecks of conventional DRAM interfaces.

4. Proprietary High-Speed Interconnects
#

Thousands of chips can operate as a unified distributed system.

5. Software-Controlled Memory Management
#

DMA scheduling replaces hardware cache complexity.

6. Dedicated Vector Processing Units
#

Matrix and non-matrix workloads execute independently without resource contention.

💡 TPU Innovations Rarely Found Elsewhere
#

Two TPU features remain uncommon elsewhere in the industry.

Optical Circuit Switches
#

OCS enables:

Modular deployments
Fault isolation
Incremental cluster expansion
Simplified maintenance

SparseCore
#

SparseCore provides specialized acceleration for:

Embedding operations
Recommendation systems
Distributed communication
Decoding workloads

Few competing AI accelerators implement a dedicated sparse-processing engine at this scale.

🔍 Why TPU Continues to Matter
#

After eight years of continuous evolution, Google’s TPU ecosystem has developed several advantages that extend beyond raw hardware performance.

Simplified Programming Model
#

Developers work with a small number of large compute engines rather than thousands of independent cores.

Hardware and Software Co-Design
#

The XLA compiler and JAX ecosystem evolve alongside TPU hardware, reducing migration costs between generations.

Massive Unified Clusters
#

Optical switching enables training jobs to span tens of thousands of chips while maintaining high utilization.

Predictable Upgrade Path
#

Each generation improves compute performance, memory capacity, bandwidth, reliability, and cluster scale without disrupting the software ecosystem.

Sustainability and Energy Efficiency
#

Every TPU generation improves performance per watt, reducing the environmental impact of large-scale AI training.

🏁 Conclusion
#

From TPU v2 to Ironwood, Google’s AI hardware strategy demonstrates that long-term architectural consistency can outperform constant reinvention.

While AI workloads evolved from RNNs to Transformers, diffusion models, and multimodal foundation models, TPU’s core design principles remained remarkably stable. Systolic arrays, BF16 arithmetic, software-managed memory, HBM, and large-scale distributed interconnects have continued to scale successfully across five generations.

Today, TPU serves as the computational backbone behind Gemini and many of Google’s most advanced AI systems. More importantly, it offers a blueprint for future AI infrastructure: build a strong foundation, evolve it systematically, and optimize relentlessly rather than rebuilding from scratch every few years.
