CODA Rewrites Transformer Kernels for AI-Generated GPU Speed


Modern large language model training is increasingly constrained not by raw compute throughput, but by memory movement.

As GPUs become dramatically faster at matrix multiplication through:

Tensor Cores
FP8 arithmetic
FP4 acceleration
Specialized AI hardware

a new bottleneck has emerged:

Memory bandwidth overhead between small Transformer operators

This is precisely the problem addressed by:

CODA

a new programming abstraction introduced by researchers from:

MIT
Princeton
Together AI
Meta

CODA fundamentally rethinks how Transformer computations are structured on GPUs by rewriting large portions of the Transformer pipeline into:

GEMM + epilogue programs

The result is both technically elegant and strategically important:

Faster Transformer execution
Reduced memory traffic
Simplified kernel engineering
AI-assisted kernel generation

Perhaps most remarkably, CODA demonstrates that even LLMs can generate near state-of-the-art GPU kernels when given the correct abstraction layer.

🚀 The Bigger Context: Why GPU Kernels Are Becoming Harder

Large-scale Transformer systems contain far more than just matrix multiplication.

While operations like:

Attention
GEMMs
Tensor contractions

consume most raw compute cycles, modern training pipelines also execute enormous numbers of smaller operations:

RMSNorm
SwiGLU
RoPE
Residual additions
Reductions
Cross-entropy calculations

Individually, these operations are lightweight.

Collectively, however, they create a massive hidden tax:

Global memory traffic

This is increasingly problematic because modern GPUs can compute much faster than they can move data.

⚠️ The Real Bottleneck Is Memory Movement

The core issue is not arithmetic throughput.

It is:

VRAM bandwidth

A typical Transformer pipeline repeatedly performs this pattern:

Compute → write tensor → reload tensor → compute again

Each operator boundary often forces intermediate tensors to leave fast on-chip memory and return to global memory.

This becomes extremely expensive at scale.
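To make the round-trip cost concrete, here is a rough back-of-the-envelope count of global memory traffic for one intermediate activation. The tensor shape and the BF16 element size are illustrative assumptions, not figures from the paper, and the count ignores weights and small per-row statistics.

```python
# Hypothetical sizes chosen only for illustration:
# 8192 tokens, hidden size 4096, 2 bytes per element (BF16).
TOKENS, HIDDEN, BYTES_PER_ELEM = 8192, 4096, 2


def tensor_bytes(rows: int, cols: int, bytes_per_elem: int = BYTES_PER_ELEM) -> int:
    """Size of one dense activation tensor in bytes."""
    return rows * cols * bytes_per_elem


act = tensor_bytes(TOKENS, HIDDEN)

# Unfused GEMM -> RMSNorm -> GEMM: the first GEMM writes y, the norm kernel
# reads y and writes norm(y), and the second GEMM reads norm(y).
unfused_traffic = 4 * act

# Fused sketch: the first GEMM writes its output once and the second GEMM
# reads it once; the normalization work rides along in the epilogues.
fused_traffic = 2 * act

print(f"activation size: {act / 2**20:.0f} MiB")
print(f"unfused traffic: {unfused_traffic / 2**20:.0f} MiB")
print(f"fused traffic:   {fused_traffic / 2**20:.0f} MiB")
```

Under these assumptions the intermediate tensor is 64 MiB, and removing the standalone normalization kernel halves the traffic attributed to it.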

The problem worsens as lower precision formats accelerate GEMMs further.

With FP8 and FP4:

Matrix multiplication becomes dramatically faster
Memory transfer costs remain largely unchanged

Consequently, non-GEMM operations consume a growing percentage of total runtime.
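The effect can be sketched with simple arithmetic. The timings below are made-up placeholders, and the 2x and 4x GEMM speedups for FP8 and FP4 are idealized assumptions rather than measured numbers.

```python
def memory_bound_share(gemm_ms: float, other_ms: float, gemm_speedup: float) -> float:
    """Fraction of runtime spent in non-GEMM operators when only the GEMMs get faster."""
    return other_ms / (gemm_ms / gemm_speedup + other_ms)


# Hypothetical layer: 8 ms of GEMM time and 2 ms of memory-bound operators at BF16.
GEMM_MS, OTHER_MS = 8.0, 2.0

for label, speedup in [("BF16", 1.0), ("FP8", 2.0), ("FP4", 4.0)]:
    share = memory_bound_share(GEMM_MS, OTHER_MS, speedup)
    print(f"{label}: non-GEMM share of runtime = {share:.0%}")
```

With these placeholder numbers the non-GEMM share grows from 20% to 50% even though the small operators themselves did not get any slower.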

🧠 Why PyTorch Struggles to Solve This

Frameworks like PyTorch are designed around:

Operator graphs

This abstraction is excellent for:

Flexibility
Autograd
Dynamic execution
Research iteration

But it introduces rigid boundaries between operations.

Those boundaries make aggressive cross-operation fusion difficult.

In practice, this means many unnecessary memory round-trips survive even highly optimized pipelines.

🔬 CODA’s Core Insight: Exploit the GEMM Epilogue

CODA’s central idea comes directly from GPU kernel architecture.

A high-performance GEMM kernel typically contains two phases:

Mainloop

The main compute stage that performs matrix multiplication.

Epilogue

The final stage before results are written back to VRAM.

The epilogue often handles tasks such as:

Bias addition
Scaling
Casting
Lightweight transformations

Critically:

The data is still inside registers during the epilogue

This creates an opportunity.

If additional Transformer computations can execute during the epilogue phase, intermediate tensors never need to touch global memory.
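The mainloop/epilogue split can be mimicked on the CPU with a small NumPy model. This is only a conceptual sketch: `tiled_gemm` and `bias_relu` are hypothetical names, the local `acc` array stands in for the on-chip accumulator, and none of this reflects CODA's or CUTLASS's actual API.

```python
import numpy as np


def tiled_gemm(a, b, epilogue=None, tile=32):
    """Reference tiled GEMM: each output tile is accumulated, passed through
    `epilogue`, and stored exactly once."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.empty((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=np.float32)
            # Mainloop: accumulate partial products over the K dimension.
            for p in range(0, k, tile):
                acc += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
            # Epilogue: transform the tile while it is still "on chip".
            if epilogue is not None:
                acc = epilogue(acc, i, j)
            out[i:i + tile, j:j + tile] = acc  # single store per tile
    return out


rng = np.random.default_rng(0)
a = rng.standard_normal((64, 96)).astype(np.float32)
b = rng.standard_normal((96, 80)).astype(np.float32)
bias = rng.standard_normal(80).astype(np.float32)


def bias_relu(acc, row0, col0):
    """Example epilogue: add the matching bias slice, then apply ReLU."""
    return np.maximum(acc + bias[col0:col0 + acc.shape[1]], 0.0)


fused = tiled_gemm(a, b, epilogue=bias_relu)
reference = np.maximum(a @ b + bias, 0.0)
print("max abs difference:", float(np.max(np.abs(fused - reference))))
```

The point of the sketch is the store count: the bias and activation are applied before the only write of each tile, so no intermediate `a @ b` tensor is ever materialized.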

⚙️ Rewriting Transformers as GEMM-Epilogue Programs

CODA demonstrates that many Transformer operations can be algebraically rewritten to fit inside GEMM epilogues.

This is the paper’s major conceptual breakthrough.

Rather than executing:

GEMM → RMSNorm → GEMM

as three independent operators, CODA restructures the computation so parts of RMSNorm occur inside GEMM epilogues.

The practical effect is enormous:

Fewer kernel launches
Less VRAM traffic
Higher throughput
Better utilization of on-chip memory

📐 RMSNorm Example

One particularly elegant optimization involves RMSNorm scaling.

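The scaling factor:

$$r_i = \frac{1}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_{ij}^2 + \epsilon}}$$

is shared across an entire row (here $x_i$ is row $i$ of the input and $d$ is the hidden dimension).

Because a per-row scalar commutes with matrix multiplication, that is, $(r_i x_i) W = r_i (x_i W)$, CODA shows that this scaling can be delayed and merged into the epilogue of the following GEMM.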

Instead of writing normalized tensors back to memory:

Partial RMS statistics remain local
Reduction work becomes lightweight
Explicit RMSNorm kernels effectively disappear

This dramatically reduces memory traffic.
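The algebra behind the deferred scaling is easy to check numerically. The sketch below uses NumPy with arbitrary illustrative shapes; the variable names and the tile width are assumptions, and the code only verifies the identity, not how CODA schedules it on a GPU.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_model, d_out, eps = 16, 64, 48, 1e-6
x = rng.standard_normal((tokens, d_model))   # input rows (output of a previous GEMM)
g = rng.standard_normal(d_model)             # RMSNorm weight
w = rng.standard_normal((d_model, d_out))    # weight of the following GEMM

# Baseline: materialize the normalized tensor, then run the GEMM.
r = 1.0 / np.sqrt(np.mean(x * x, axis=1) + eps)
baseline = ((x * r[:, None]) * g) @ w

# Deferred: run the GEMM on the un-normalized rows and apply the per-row
# scale afterwards, as an epilogue would.
deferred = r[:, None] * ((x * g) @ w)

# The row statistic itself can be assembled from per-tile partial sums of
# squares, the kind of value a producer GEMM's epilogue could emit.
tile = 16
partials = np.stack(
    [np.sum(x[:, c:c + tile] ** 2, axis=1) for c in range(0, d_model, tile)],
    axis=1,
)
r_from_partials = 1.0 / np.sqrt(partials.sum(axis=1) / d_model + eps)

print("deferred matches baseline:", np.allclose(baseline, deferred))
print("partials match full reduction:", np.allclose(r, r_from_partials))
```

The multiplication by `g` is elementwise, so in a fused pipeline it could plausibly also live in the producer's epilogue; that placement is an assumption here.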

🔄 Other Transformer Operations CODA Optimizes

The same epilogue fusion strategy applies to many common Transformer components.

SwiGLU

Activation and gating operations can be fused into epilogues.
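One way to picture this is to pack the gate and up projections into a single GEMM so that each output tile holds matching pairs. The interleaved column layout below is an assumption made for illustration; the paper's actual layout may differ.

```python
import numpy as np


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))
w_gate = rng.standard_normal((32, 24))
w_up = rng.standard_normal((32, 24))

# Unfused: two GEMMs, then a separate elementwise kernel.
reference = silu(x @ w_gate) * (x @ w_up)

# Fused sketch: interleave gate and up columns so one GEMM produces both.
w_packed = np.empty((32, 48))
w_packed[:, 0::2] = w_gate
w_packed[:, 1::2] = w_up

acc = x @ w_packed                          # stands in for the accumulator tile
fused = silu(acc[:, 0::2]) * acc[:, 1::2]   # epilogue: gate * up, half-width output

print("fused matches reference:", np.allclose(fused, reference))
```

Only the half-width result would be written out, so the full gate and up activations never need to reach global memory in the forward pass.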

RoPE (Rotary Position Embeddings)

Position embedding transformations can execute directly inside fused kernels.
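Because RoPE only rotates pairs of features within a row, it fits the elementwise epilogue pattern. The sketch below assumes the adjacent-pair (even, odd) layout and the common base of 10000; `rope_epilogue` is a hypothetical helper, not a CODA function.

```python
import numpy as np


def rope_epilogue(acc, positions, base=10000.0):
    """Rotate adjacent (even, odd) feature pairs of an accumulator tile
    by position-dependent angles."""
    d = acc.shape[1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]    # shape (rows, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = acc[:, 0::2], acc[:, 1::2]
    out = np.empty_like(acc)
    out[:, 0::2] = even * cos - odd * sin
    out[:, 1::2] = even * sin + odd * cos
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 32))
w_q = rng.standard_normal((32, 16))
positions = np.arange(6)

q = x @ w_q                            # query projection (the GEMM)
rotated = rope_epilogue(q, positions)  # applied before the tile is stored

print("row norms preserved:",
      np.allclose(np.linalg.norm(rotated, axis=1), np.linalg.norm(q, axis=1)))
```

A rotation changes neither the row norms nor the row at position 0, which gives two quick sanity checks for the epilogue.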

Cross-Entropy Loss

Loss computations can leverage stateful epilogue transformations.
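A stateful epilogue for the loss can be illustrated with the standard online log-sum-exp trick: logits are produced one vocabulary tile at a time, and only a running maximum and running sum are kept per token. The function name, shapes, and tile width are illustrative assumptions.

```python
import numpy as np


def streamed_cross_entropy(h, w_vocab, targets, tile=64):
    """Per-token cross-entropy computed vocabulary tile by tile,
    without storing the full logits matrix."""
    n = h.shape[0]
    rows = np.arange(n)
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    target_logit = np.zeros(n)
    for v0 in range(0, w_vocab.shape[1], tile):
        logits = h @ w_vocab[:, v0:v0 + tile]   # accumulator tile
        new_max = np.maximum(running_max, logits.max(axis=1))
        # Rescale the old sum to the new maximum, then add this tile's terms.
        running_sum = (running_sum * np.exp(running_max - new_max)
                       + np.exp(logits - new_max[:, None]).sum(axis=1))
        running_max = new_max
        # Capture the target logit when it falls inside this tile.
        in_tile = (targets >= v0) & (targets < v0 + logits.shape[1])
        target_logit[in_tile] = logits[rows[in_tile], targets[in_tile] - v0]
    return running_max + np.log(running_sum) - target_logit


rng = np.random.default_rng(0)
h = rng.standard_normal((10, 32))
w_vocab = rng.standard_normal((32, 200))
targets = rng.integers(0, 200, size=10)

full = h @ w_vocab
peak = full.max(axis=1)
reference = (peak + np.log(np.exp(full - peak[:, None]).sum(axis=1))
             - full[np.arange(10), targets])

streamed = streamed_cross_entropy(h, w_vocab, targets)
print("streamed matches reference:", np.allclose(streamed, reference))
```

The state carried between tiles is three values per token, which is far smaller than a tokens-by-vocabulary logits tensor.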

Backward Propagation

Importantly, the same principles extend naturally into backward passes.

This matters because the backward pass is often even more memory-intensive than the forward pass.
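As a small illustration of the backward case, the gradient of SwiGLU with respect to its two inputs is purely elementwise, so it can in principle run in the epilogue of the GEMM that produces the upstream gradient. The helper name and shapes below are hypothetical, and the saved activations `a` and `b` are assumed to be reloaded from memory.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def silu(z):
    return z * sigmoid(z)


def swiglu_backward_epilogue(dh, a, b):
    """Given dh = dL/dh for h = silu(a) * b, return (dL/da, dL/db)."""
    s = sigmoid(a)
    dsilu = s * (1.0 + a * (1.0 - s))   # derivative of silu at a
    return dh * b * dsilu, dh * a * s


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 24))        # saved gate pre-activation
b = rng.standard_normal((8, 24))        # saved up projection
d_out = rng.standard_normal((8, 20))    # gradient arriving at the down projection
w_down = rng.standard_normal((24, 20))

dh = d_out @ w_down.T                   # backward GEMM (accumulator tile)
da, db = swiglu_backward_epilogue(dh, a, b)

# Central finite differences as an independent check.
step = 1e-5
num_da = dh * (silu(a + step) - silu(a - step)) * b / (2 * step)
num_db = dh * silu(a) * ((b + step) - (b - step)) / (2 * step)

print("da matches finite differences:", np.allclose(da, num_da, atol=1e-6))
print("db matches finite differences:", np.allclose(db, num_db, atol=1e-6))
```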

🧩 CODA’s Five Primitive Building Blocks

CODA is not just a collection of handcrafted kernels.

It introduces a generalized abstraction built around five primitive categories.

1. Elementwise Transformations

Examples:

Residual additions
Activations
RoPE

2. Vector Loads and Stores

Examples:

RMSNorm weight broadcasting

3. Block Loads and Stores

Examples:

Activation checkpointing
Intermediate caching

4. Block Reductions

Examples:

Partial RMS
Log-sum-exp

5. Stateful Transformations

Examples:

Online normalization statistics
Running maxima

These primitives allow developers to compose highly optimized Transformer kernels systematically.
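To show how the five categories might compose, here is a CPU sketch in which one epilogue uses each of them in turn. It is not CODA's API: the loop structure, names, and shapes are all hypothetical, and on a real GPU the cross-tile update of `sumsq` would need atomics or a small follow-up reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N, TILE = 32, 48, 64, 16
a = rng.standard_normal((M, K))
b = rng.standard_normal((K, N))
residual = rng.standard_normal((M, N))   # read via block loads
gain = rng.standard_normal(N)            # read via vector loads

out = np.empty((M, N))
sumsq = np.zeros(M)                      # state carried across column tiles

for i in range(0, M, TILE):
    for j in range(0, N, TILE):
        acc = a[i:i + TILE] @ b[:, j:j + TILE]           # mainloop (collapsed)
        acc = acc + residual[i:i + TILE, j:j + TILE]     # block load + elementwise add
        partial = np.sum(acc * acc, axis=1)              # block reduction (partial RMS)
        sumsq[i:i + TILE] += partial                     # stateful update
        acc = acc * gain[j:j + TILE]                     # vector load + elementwise scale
        out[i:i + TILE, j:j + TILE] = acc                # block store

pre = a @ b + residual
print("output matches:", np.allclose(out, pre * gain))
print("row statistics match:", np.allclose(sumsq, np.sum(pre * pre, axis=1)))
```

The tile is written once, and the per-row sum of squares is available afterwards for a deferred RMSNorm scale in the next GEMM.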

🤖 The Most Interesting Part: LLMs Can Generate the Kernels

One of the paper’s most fascinating findings is that AI-generated kernels performed close to expert-written ones.

The researchers evaluated two approaches:

Human-Written Kernels

Hand-optimized implementations created by experts.

AI-Generated Kernels

Generated primarily using:

Claude Code

with light human supervision.

The result was remarkable:

AI-generated kernels performed nearly identically to expert-written versions

In some cases, they even marginally exceeded human implementations.

💡 Why This Matters So Much

Historically, GPU kernel engineering has been one of the most specialized and difficult areas of systems programming.

Writing elite CUDA kernels typically requires deep expertise in:

GPU architecture
Memory hierarchies
Warp scheduling
Tensor Core utilization
Occupancy tuning

CODA changes the equation.

By constraining optimization inside a clean abstraction layer, the problem becomes far more tractable for AI systems.

This is a major shift.

📊 Performance Results

CODA was benchmarked against highly optimized baselines including:

cuBLAS
torch.compile
Liger Kernel
FlashInfer

The results were highly competitive.

⚡ Single-Operator Speedups

CODA’s fused kernels for patterns such as:

GEMM-RMSNorm-GEMM

consistently outperformed:

cuBLAS + PyTorch

across model scales ranging from:

1B parameters
7B parameters
70B parameters

🔄 Backward Pass Improvements

Backward propagation saw particularly large gains.

Reported improvements included:

1.6x–1.8x speedups for RMSNorm-related backward kernels
1.4x–1.6x gains for SwiGLU backward passes

This matters because the backward pass performs roughly twice the FLOPs of the forward pass, so it accounts for most of the training cost.

📈 End-to-End Transformer Gains

Across full Transformer layers, CODA achieved:

5%–20% end-to-end forward speedups

with gains increasing at larger model sizes.

For hyperscale training clusters, even single-digit improvements are highly valuable.

🧮 Numerical Accuracy

Despite algebraically restructuring operations, CODA maintained numerical stability comparable to standard PyTorch implementations.

In some cases, numerical precision even improved due to:

Higher precision GEMM accumulators
Reduced intermediate rounding
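The rounding argument can be demonstrated with a small experiment. NumPy has no BF16 type, so float16 is used here as a stand-in for a low-precision storage format; the shapes and scaling are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256)).astype(np.float32)
w = (rng.standard_normal((256, 256)) / 16.0).astype(np.float32)

acc = x @ w   # float32 result, standing in for the GEMM accumulator

# High-precision reference for the per-row mean square used by RMSNorm.
exact = np.mean((x.astype(np.float64) @ w.astype(np.float64)) ** 2, axis=1)

# Unfused: the GEMM output is stored in half precision and reloaded
# by a separate normalization kernel.
stored = acc.astype(np.float16).astype(np.float32)
ms_unfused = np.mean(stored.astype(np.float64) ** 2, axis=1)

# Fused: the statistic is computed directly from the float32 accumulator.
ms_fused = np.mean(acc.astype(np.float64) ** 2, axis=1)

err_unfused = float(np.max(np.abs(ms_unfused - exact) / exact))
err_fused = float(np.max(np.abs(ms_fused - exact) / exact))
print(f"max relative error, unfused: {err_unfused:.2e}")
print(f"max relative error, fused:   {err_fused:.2e}")
```

Skipping the low-precision round trip removes one rounding step from the statistic, which is the mechanism the list above points to.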

🏗️ Built on CuTeDSL and CUTLASS

CODA leverages:

CuTeDSL

the Python DSL built on top of NVIDIA CUTLASS.

This provides:

High-performance GEMM infrastructure
Composable abstractions
Hardware-aware optimization

without requiring developers to manually engineer every CUDA detail.

⚠️ Current Limitations

CODA is still early-stage research infrastructure.

Current limitations include:

Single-GPU focus
No distributed training integration
Limited support outside standard Transformer layouts
No attention kernel replacement yet

Attention itself remains largely handled by systems such as:

FlashAttention

🔥 CODA Fits Into a Larger Industry Shift

CODA is part of a broader movement in AI infrastructure.

Other examples include:

FlashAttention
Triton
ThunderKittens
TileLang

All share a common philosophy:

Keep computations on-chip as long as possible

Modern GPU optimization is increasingly about:

Data locality
Fusion
Memory minimization
Scheduling efficiency

rather than purely arithmetic optimization.

🧠 The Deeper Significance

Perhaps the most important implication is conceptual.

CODA suggests that future AI infrastructure may increasingly be designed around abstractions optimized not only for humans, but also for AI code generation systems.

When abstractions become sufficiently structured:

LLMs can reason about them
AI systems can optimize them
Kernel engineering becomes more accessible

This creates a fascinating recursive loop:

AI systems helping optimize the infrastructure used to train AI systems

🎼 Why the Name “CODA” Is Surprisingly Appropriate

The name intentionally echoes both:

CUDA
Musical codas

In music, a coda is the concluding section that resolves a composition.

Technically, CODA focuses precisely on the:

GEMM epilogue

the “final passage” before data leaves fast GPU memory.

That final stage turns out to be one of the most valuable optimization opportunities in modern AI systems.

🔚 Final Thoughts

CODA is not simply another fused-kernel project.

It represents a broader evolution in how AI systems interact with hardware.

Historically:

High-level frameworks prioritized programmability
Low-level kernels prioritized performance

CODA attempts to bridge those worlds.

By expressing Transformer computations as structured GEMM-epilogue programs, it enables:

Strong performance portability
Reduced memory overhead
AI-assisted optimization
More accessible kernel engineering

Most importantly, CODA reinforces a key reality of modern AI infrastructure:

Performance is increasingly determined by data movement, not raw arithmetic

As GPUs continue scaling compute throughput faster than memory bandwidth, techniques like epilogue fusion may become fundamental to the next generation of AI training systems.

📚 References

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Tri Dao Social Media Commentary
NVIDIA CUTLASS Documentation
CuTeDSL Documentation
FlashAttention Research Papers
PyTorch torch.compile Documentation
Triton Language Documentation
ThunderKittens GPU Kernel Framework
TileLang Compiler Infrastructure
