Google TPU Evolution: From TPU v2 to Ironwood in 8 Years
When discussing advanced chip design, companies such as Intel, AMD, and NVIDIA traditionally dominate the conversation. However, in 2016, Google fundamentally altered the landscape by introducing a custom AI accelerator known as the Tensor Processing Unit (TPU).
What began as a specialized chip for AI inference has evolved into one of the world’s most sophisticated AI training platforms. Between TPU v2 in 2017 and Ironwood in 2025, the peak system performance of Google’s TPU supercomputers increased by approximately 3,600×.
More importantly, Google’s TPU journey demonstrates a powerful engineering lesson: architectural stability can outperform constant reinvention.
Why Google Decided to Build Its Own AI Chips
By 2016, Google’s core services, including Search, Translate, Photos, and Ads, were increasingly dependent on deep learning models.
At the time, general-purpose CPUs and GPUs faced several challenges:
- High power consumption
- Limited inference efficiency
- Rising infrastructure costs
- Increasing AI deployment demands
Google responded by designing TPU v1, a custom ASIC optimized specifically for neural network inference.
The results were dramatic:
- Up to 30× better energy efficiency than contemporary GPUs
- Up to 80× better energy efficiency than CPUs
- Significantly lower operational costs for large-scale AI services
The success of TPU v1 inspired a wave of custom AI silicon initiatives across the industry. Companies including Intel, Amazon, Alibaba, and numerous startups soon began investing in dedicated AI accelerators.
However, TPU v1 was only the beginning. Google’s long-term competitive advantage emerged with TPU v2, which introduced large-scale AI training capabilities.
The Remarkable Stability of TPU Architecture
One of the most surprising aspects of the TPU story is how little its core architecture has changed.
In the early years, many hardware experts questioned whether custom AI ASICs could survive in a field evolving as rapidly as machine learning. Chip development cycles often require multiple years, while AI model architectures can change within months.
Google’s experience proved otherwise.
Over eight years and five TPU generations, the foundational TPU architecture has remained largely intact while continuing to support every major wave of AI innovation.
The AI Landscape Changed Completely
When TPU v2 was introduced:
- Multi-Layer Perceptrons (MLPs) and DLRM models dominated workloads
- Recurrent Neural Networks (RNNs) were widely deployed
- Transformers had not yet become mainstream
- Diffusion models did not exist in production systems
Fast forward to 2026:
- RNN workloads have effectively disappeared
- Transformer models account for approximately 74% of Google’s training workloads
- Diffusion models power image and video generation systems
- Multimodal foundation models dominate AI research
Despite these fundamental shifts, TPU’s core hardware design remains relevant.
This architectural longevity is one of the platform’s greatest achievements.
The Core TPU Architecture
The TPU architecture follows a simple philosophy: maximize throughput for matrix operations while minimizing programming complexity.
Across multiple generations, improvements have largely focused on scale, precision, memory capacity, bandwidth, and reliability rather than redesigning the entire architecture.
TensorCore: The Foundation of TPU Compute
Each TPU contains two large TensorCores.
Rather than relying on thousands of small programmable cores, Google chose a small number of extremely powerful compute engines capable of processing massive blocks of data efficiently.
Benefits include:
- Simpler programming models
- Lower scheduling overhead
- Higher computational efficiency
- Easier compiler optimization
Internally, TensorCores utilize Very Long Instruction Word (VLIW) execution, allowing multiple operations to be bundled and executed simultaneously.
A dedicated 128-lane vector processing unit handles operations such as:
- Activation functions
- Layer normalization
- Quantization
- Non-matrix arithmetic
This separation enables matrix and vector workloads to execute concurrently.
MXU: The Heart of TPU Compute
The Matrix Multiplication Unit (MXU) is the engine that powers nearly all modern AI workloads.
Large language models, diffusion models, and recommendation systems all depend heavily on matrix multiplication.
Evolution of the MXU
TPU v2 featured:
- 2 × 128 × 128 systolic arrays
Ironwood expands this significantly:
- 4 × 256 × 256 BF16 arrays
- FP8 support
- Equivalent to four 512 × 512 FP8 matrix multiplications
The systolic array architecture remains one of Google’s most important innovations because it delivers extremely high computational density with predictable data movement patterns.
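The predictable data movement of a systolic array can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a description of the actual MXU: it models an output-stationary array in which each cell owns one output element, whereas real hardware may keep weights stationary instead. The function name `systolic_matmul` and the skewed input schedule are illustrative.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m systolic array.

    Illustrative assumption: output-stationary cells fed by skewed inputs.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # Row i of `a` enters from the left delayed by i steps, and column j of
    # `b` enters from the top delayed by j steps, so cell (i, j) receives
    # the operand pair with inner index s = t - i - j at step t.
    steps = n + m + k - 2
    for t in range(steps):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    c[i][j] += a[i][s] * b[s][j]
    return c, steps


product, steps = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, steps)
```

Every cell performs the same multiply-accumulate at each step and operands only move between neighboring cells, so the whole schedule is known before execution starts. That is the property the paragraph above calls predictable data movement.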
BF16 Changed AI Computing
Google also pioneered the Bfloat16 (BF16) numerical format.
BF16 structure:
- 1-bit sign
- 8-bit exponent
- 7-bit mantissa
Compared with FP32, which has the same 8-bit exponent but a 23-bit mantissa, BF16 sacrifices precision while preserving dynamic range.
This design suits deep learning workloads, where numerical range typically matters more than exact precision.
The industry’s later adoption of FP8 and FP4 formats largely follows the same philosophy.
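The range-versus-precision trade-off can be seen by converting a few values by hand. The sketch below assumes the simplest possible conversion, which keeps the top 16 bits of the FP32 encoding and truncates the rest; real hardware typically rounds instead of truncating. The helper names are illustrative.

```python
import struct


def to_bf16_bits(x):
    """Return the 16-bit BF16 pattern for x by truncating its FP32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16  # keeps the sign, 8 exponent bits, top 7 mantissa bits


def from_bf16_bits(bits16):
    """Expand a BF16 bit pattern back into a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def bf16_round_trip(x):
    """Convert x to BF16 (by truncation) and back."""
    return from_bf16_bits(to_bf16_bits(x))


for sample in (1.0, 1.001, 3.0e38):
    print(sample, "->", bf16_round_trip(sample))
```

The value 1.001 collapses to 1.0 because only 7 mantissa bits survive, while 3.0e38 stays within about 1% of its original magnitude because the 8-bit exponent is unchanged.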
SparseCore: Specialized Acceleration for Sparse Workloads
SparseCore is one of the TPU architecture’s most distinctive features.
Unlike TensorCores, SparseCore focuses on sparse computations such as:
- Recommendation systems
- Embedding lookups
- Transformer communication
- Top-K selection
- Token decoding
Despite consuming only approximately 5% of chip area and power, SparseCore delivers substantial acceleration for workloads that would otherwise be inefficient on dense matrix hardware.
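To see why an embedding lookup is a poor fit for dense matrix hardware, compare the work done by the two formulations. This is a toy sketch in plain Python with hypothetical function names; it says nothing about how SparseCore is implemented and only counts operations.

```python
def lookup_dense(table, ids):
    """Lookup as one-hot x table: vocab * dim multiply-adds per id."""
    vocab, dim = len(table), len(table[0])
    rows, macs = [], 0
    for i in ids:
        one_hot = [1.0 if r == i else 0.0 for r in range(vocab)]
        row = [0.0] * dim
        for r in range(vocab):
            for d in range(dim):
                row[d] += one_hot[r] * table[r][d]
                macs += 1
        rows.append(row)
    return rows, macs


def lookup_sparse(table, ids):
    """Lookup as a gather: dim element reads per id and no arithmetic."""
    rows = [list(table[i]) for i in ids]
    return rows, len(ids) * len(table[0])


table = [[float(r * 10 + d) for d in range(4)] for r in range(1000)]
dense_rows, macs = lookup_dense(table, [3, 7])
sparse_rows, reads = lookup_sparse(table, [3, 7])
print(dense_rows == sparse_rows, macs, reads)
```

Both versions return the same rows, but the dense formulation performs 8,000 multiply-adds where the gather needs 8 element reads. Nearly all of the dense work multiplies by zero.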
SparseCore Evolution
| Generation | SparseCore Count |
|---|---|
| Early TPUs | 2 |
| Ironwood | 4 |
Initially designed for search and advertising systems, SparseCore has evolved into an important communication acceleration engine for large-scale foundation models.
Memory Architecture: Scaling Bandwidth for AI
One of TPU’s defining architectural choices is the elimination of traditional CPU-style cache hierarchies.
Instead, Google relies on software-managed memory movement through DMA scheduling.
TPU Memory Hierarchy
HBM (Global Shared Memory)
↓
DMA Scheduling
↓
VMEM (On-Chip Vector Memory)
↓
Compute Units
This design gives the compiler direct control over data movement, enabling predictable performance at scale.
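One way compiler-controlled data movement can pay off is double buffering: the DMA transfer for the next tile overlaps with compute on the current one. The article does not name this technique, so treat the sketch below as an illustrative assumption with made-up timings, not a description of the TPU compiler.

```python
def serial_time(tiles, dma, compute):
    """Each tile is copied from HBM to VMEM, then computed, with no overlap."""
    return tiles * (dma + compute)


def double_buffered_time(tiles, dma, compute):
    """The copy of tile i+1 overlaps with compute on tile i."""
    if tiles == 0:
        return 0
    # First copy, then (tiles - 1) overlapped steps, then the final compute.
    return dma + (tiles - 1) * max(dma, compute) + compute


# Hypothetical timings in arbitrary units.
print(serial_time(8, dma=2, compute=3))
print(double_buffered_time(8, dma=2, compute=3))
```

Because the schedule is fixed in software, the total time is a closed-form expression of tile count and per-tile costs rather than a function of cache hit rates.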
Growth Across Generations
| Component | TPU v2 | Ironwood |
|---|---|---|
| On-Chip Memory | 32 MB | 128 MB |
| HBM Capacity | 16 GiB | 192 GiB |
| HBM Stacks | 2 | 8 |
| Bandwidth | 700 GB/s | 7,300 GB/s |
Memory bandwidth has increased by roughly 10× and HBM capacity by 12×, helping keep pace with ever-growing model sizes.
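The growth factors follow directly from the table. The short calculation below uses only the figures listed above; the dictionary keys are illustrative names.

```python
tpu_v2 = {"on_chip_mb": 32, "hbm_gib": 16, "hbm_stacks": 2, "bandwidth_gb_per_s": 700}
ironwood = {"on_chip_mb": 128, "hbm_gib": 192, "hbm_stacks": 8, "bandwidth_gb_per_s": 7300}

# Ratio of the Ironwood value to the TPU v2 value for each row of the table.
ratios = {key: ironwood[key] / tpu_v2[key] for key in tpu_v2}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1f}x")
```

On-chip memory and HBM stack count both grow 4×, HBM capacity grows 12×, and bandwidth grows about 10.4×.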
Interconnect Evolution
Scaling AI increasingly depends on connecting thousands of chips into a unified system.
Google’s Inter-Chip Interconnect (ICI) has steadily evolved over successive TPU generations.
| Generation | Configuration (links × per-link bandwidth) |
|---|---|
| TPU v2 | 4 × 62 GB/s |
| TPU v4 | 6 × 50 GB/s |
| TPU v5p | 6 × 100 GB/s |
| Ironwood | 6 × 100 GB/s |
ICI allows TPU clusters to function as a single distributed supercomputer rather than a collection of isolated accelerators.
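The link counts in the table are consistent with a torus layout, in which each chip connects to two neighbors per dimension: 4 links for a 2D torus and 6 for a 3D torus. The article does not state the topology, so the sketch below is an assumption, and `torus_neighbors` is a hypothetical helper.

```python
def torus_neighbors(coord, shape):
    """Return the wrap-around neighbors of `coord` in a torus of size `shape`."""
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            neighbor = list(coord)
            neighbor[axis] = (neighbor[axis] + step) % size  # wrap at the edges
            neighbors.append(tuple(neighbor))
    return neighbors


print(torus_neighbors((0, 0), (4, 4)))        # 2D: four neighbors
print(torus_neighbors((0, 0, 0), (4, 4, 4)))  # 3D: six neighbors
```

Under this assumption every chip has the same number of links regardless of its position, which keeps communication patterns uniform as the system grows.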
Managing Clusters With Tens of Thousands of Chips
Training modern Gemini-scale models requires enormous infrastructure.
At this scale, hardware failures are not exceptional events; they are guaranteed.
Google’s solution combines architectural resilience with optical networking.
Optical Circuit Switches (OCS)
Beginning with TPU v4, Google introduced Optical Circuit Switches.
The basic deployment unit is a 4 × 4 × 4 cube of 64 TPUs.
Each cube connects independently to an optical switch.
When hardware fails:
- Faulty nodes can be bypassed
- Remaining hardware continues operating
- Entire clusters do not require shutdown
This dramatically improves availability and deployment flexibility.
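The bypass behavior can be sketched as a simple allocation over cubes. Everything here except the 64-chip cube size is hypothetical: the `Cube` record, the `allocate_slice` function, and the first-fit policy are illustrative and do not describe the actual scheduler.

```python
from dataclasses import dataclass

CHIPS_PER_CUBE = 4 * 4 * 4  # 64 TPUs per cube, as stated above


@dataclass
class Cube:
    cube_id: int
    healthy: bool = True


def allocate_slice(cubes, chips_needed):
    """Pick enough healthy cubes for a job, skipping any faulty ones."""
    cubes_needed = -(-chips_needed // CHIPS_PER_CUBE)  # ceiling division
    healthy = [cube.cube_id for cube in cubes if cube.healthy]
    if len(healthy) < cubes_needed:
        raise RuntimeError("not enough healthy cubes for this job")
    return healthy[:cubes_needed]


pool = [Cube(i) for i in range(6)]
pool[2].healthy = False  # simulate a hardware failure in one cube
print(allocate_slice(pool, chips_needed=256))
```

Because each cube attaches to the optical switch independently, the job in this sketch simply takes cubes 0, 1, 3, and 4, and the faulty cube can be repaired without stopping anything else.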
Flexible Cluster Scheduling
OCS enables:
- Dynamic cluster composition
- Fault-tolerant scheduling
- Efficient resource utilization
- Rapid hardware replacement
Even partially degraded clusters can continue training large models efficiently.
Ironwood’s Hardware Reliability Innovations
As cluster sizes grow, silent hardware errors become a major concern.
Ironwood introduces dedicated mechanisms to detect and mitigate these issues.
FBIST: Functional Built-In Self-Test
FBIST continuously validates hardware throughout its lifecycle:
- Manufacturing
- Burn-in testing
- Data center deployment
- Production operation
Potential failures can be identified before impacting training jobs.
Vector Unit Hardware Replay
Ironwood introduces hardware-level replay verification.
The mechanism:
- Uses idle execution slots
- Re-executes selected calculations
- Verifies computational correctness
- Identifies defective compute units
Because verification occurs during otherwise unused cycles, performance impact is effectively zero.
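The four steps above can be modeled in software to show the idea. This is a toy simulation under loudly stated assumptions: the lane model, the schedule format, and the choice of the next lane as checker are all hypothetical, and the article gives no detail on how Ironwood implements replay in hardware.

```python
from collections import Counter, deque


def make_lane(faulty=False):
    """Return a toy compute lane; a faulty lane silently flips the low bit."""
    def lane(a, b):
        result = a * b + 1
        return result ^ 1 if faulty else result
    return lane


def replay_mismatches(lanes, schedule):
    """Replay earlier work in idle slots and report disagreeing lane pairs.

    Each slot is (lane_index, a, b) for real work or None for an idle slot.
    """
    pending = deque()
    mismatches = []
    for slot in schedule:
        if slot is not None:
            lane_index, a, b = slot
            pending.append((lane_index, a, b, lanes[lane_index](a, b)))
        elif pending:
            lane_index, a, b, first = pending.popleft()
            checker = (lane_index + 1) % len(lanes)  # re-execute on another lane
            if lanes[checker](a, b) != first:
                mismatches.append((lane_index, checker))
    return mismatches


def suspect_lanes(mismatches):
    """Lanes involved in more than one mismatch are the likeliest culprits."""
    counts = Counter(lane for pair in mismatches for lane in pair)
    return sorted(lane for lane, seen in counts.items() if seen > 1)


lanes = [make_lane(), make_lane(faulty=True), make_lane()]
schedule = [(0, 2, 3), None, (1, 4, 5), None, (2, 6, 7), None]
mismatches = replay_mismatches(lanes, schedule)
print(mismatches, suspect_lanes(mismatches))
```

In this run the faulty lane 1 appears in both mismatching pairs, so it is singled out. The replays consume only the idle slots, and the scheduled work is never delayed.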
Achieving High Effective Throughput
Raw peak performance matters far less than effective throughput.
Google measures effective throughput by accounting for:
- Recovery operations
- Fault handling
- Idle time
- Communication overhead
Results are impressive:
| System | Effective Throughput |
|---|---|
| TPU v4 | 97% |
| TPU v5p (Gemini Training) | 93% |
Maintaining effective throughput above 90% across tens of thousands of chips is a significant engineering achievement.
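A plausible way to compute such a figure is to divide productive time by total wall-clock time, with the four overhead categories listed above making up the remainder. The exact formula Google uses is not given in the article, so the function and the sample hours below are illustrative assumptions.

```python
def effective_throughput(productive, recovery, fault_handling, idle, communication):
    """Fraction of wall-clock time spent on useful training work."""
    total = productive + recovery + fault_handling + idle + communication
    if total <= 0:
        raise ValueError("total time must be positive")
    return productive / total


# Hypothetical breakdown of 100 hours, not measured data.
hours = dict(productive=93.0, recovery=2.0, fault_handling=1.0,
             idle=1.0, communication=3.0)
print(f"{effective_throughput(**hours):.0%}")
```

Framing the metric this way makes clear that every hour lost to recovery or idle time lowers the result directly, regardless of peak chip performance.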
Six Design Principles Behind TPU’s Success
Over five generations, Google has distilled six core principles that continue to define TPU architecture.
1. Systolic Arrays for Matrix Computation
Large matrix multiplications remain the dominant workload in modern AI.
2. Low-Precision, Large-Range Formats
BF16, FP8, and FP4 prioritize dynamic range over unnecessary precision.
3. HBM as Primary External Memory
High-bandwidth memory relieves traditional memory bottlenecks.
4. Proprietary High-Speed Interconnects
Thousands of chips can operate as a unified distributed system.
5. Software-Controlled Memory Management
DMA scheduling replaces hardware cache complexity.
6. Dedicated Vector Processing Units
Matrix and non-matrix workloads execute independently without resource contention.
TPU Innovations Rarely Found Elsewhere
Two TPU innovations remain rare elsewhere in the industry.
Optical Circuit Switches
OCS enables:
- Modular deployments
- Fault isolation
- Incremental cluster expansion
- Simplified maintenance
SparseCore
SparseCore provides specialized acceleration for:
- Embedding operations
- Recommendation systems
- Distributed communication
- Decoding workloads
Few competing AI accelerators implement a dedicated sparse-processing engine at this scale.
Why TPU Continues to Matter
After eight years of continuous evolution, Google’s TPU ecosystem has developed several advantages that extend beyond raw hardware performance.
Simplified Programming Model
Developers work with a small number of large compute engines rather than thousands of independent cores.
Hardware and Software Co-Design
The XLA compiler and JAX ecosystem evolve alongside TPU hardware, reducing migration costs between generations.
Massive Unified Clusters
Optical switching enables training jobs to span tens of thousands of chips while maintaining high utilization.
Predictable Upgrade Path
Each generation improves the following without disrupting the software ecosystem:
- Compute performance
- Memory capacity
- Bandwidth
- Reliability
- Cluster scale
Sustainability and Energy Efficiency
Every TPU generation improves performance per watt, reducing the environmental impact of large-scale AI training.
Conclusion
From TPU v2 to Ironwood, Google’s AI hardware strategy demonstrates that long-term architectural consistency can outperform constant reinvention.
While AI workloads evolved from RNNs to Transformers, diffusion models, and multimodal foundation models, TPU’s core design principles remained remarkably stable. Systolic arrays, BF16 arithmetic, software-managed memory, HBM, and large-scale distributed interconnects have continued to scale successfully across five generations.
Today, TPU serves as the computational backbone behind Gemini and many of Google’s most advanced AI systems. More importantly, it offers a blueprint for future AI infrastructure: build a strong foundation, evolve it systematically, and optimize relentlessly rather than rebuilding from scratch every few years.