CODA Rewrites Transformer Kernels for AI-Generated GPU Speed

Modern large language model training is increasingly constrained not by raw compute throughput, but by memory movement.

As GPUs become dramatically faster at matrix multiplication through:

  • Tensor Cores
  • FP8 arithmetic
  • FP4 acceleration
  • Specialized AI hardware

a new bottleneck has emerged:

Memory bandwidth overhead between small Transformer operators

This is precisely the problem addressed by:

CODA

a new programming abstraction introduced by researchers from:

  • MIT
  • Princeton
  • Together AI
  • Meta

CODA fundamentally rethinks how Transformer computations are structured on GPUs by rewriting large portions of the Transformer pipeline into:

GEMM + epilogue programs

The result is both technically elegant and strategically important:

  • Faster Transformer execution
  • Reduced memory traffic
  • Simplified kernel engineering
  • AI-assisted kernel generation

Perhaps most remarkably, CODA demonstrates that even LLMs can generate near state-of-the-art GPU kernels when given the correct abstraction layer.

🚀 The Bigger Context: Why GPU Kernels Are Becoming Harder

Large-scale Transformer systems contain far more than just matrix multiplication.

While operations like:

  • Attention
  • GEMMs
  • Tensor contractions

consume most raw compute cycles, modern training pipelines also execute enormous numbers of smaller operations:

  • RMSNorm
  • SwiGLU
  • RoPE
  • Residual additions
  • Reductions
  • Cross-entropy calculations

Individually, these operations are lightweight.

Collectively, however, they create a massive hidden tax:

Global memory traffic

This is increasingly problematic because modern GPUs can compute much faster than they can move data.

⚠️ The Real Bottleneck Is Memory Movement

The core issue is not arithmetic throughput.

It is:

VRAM bandwidth

A typical Transformer pipeline repeatedly performs this pattern:

Compute → write tensor → reload tensor → compute again

Each operator boundary often forces intermediate tensors to leave fast on-chip memory and return to global memory.

This becomes extremely expensive at scale.
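To put rough numbers on that round trip, here is a back-of-the-envelope sketch. The tensor shape, the 2-byte element size, and the helper names are illustrative assumptions rather than figures from the paper; the only point is that every unfused operator pays one full read and one full write of the activation tensor.

```python
# Back-of-the-envelope global-memory traffic for a chain of elementwise
# operators applied to one [rows, cols] activation tensor.
# Shape and element size are illustrative assumptions.

def unfused_bytes(rows, cols, num_ops, bytes_per_elem=2):
    """Each operator reads its input from, and writes its output to, global memory."""
    tensor_bytes = rows * cols * bytes_per_elem
    return num_ops * 2 * tensor_bytes  # one read + one write per operator


def fused_bytes(rows, cols, bytes_per_elem=2):
    """A single fused kernel reads the input once and writes the result once."""
    tensor_bytes = rows * cols * bytes_per_elem
    return 2 * tensor_bytes


rows, cols = 8192, 4096  # hypothetical tokens x hidden size
unfused = unfused_bytes(rows, cols, num_ops=3)
fused = fused_bytes(rows, cols)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB")
```

With three chained operators the unfused path moves three times as many bytes as the fused one, and the gap grows with every additional operator boundary.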

The problem worsens as lower precision formats accelerate GEMMs further.

With FP8 and FP4:

  • Matrix multiplication becomes dramatically faster
  • Memory transfer costs remain largely unchanged

Consequently, non-GEMM operations consume a growing percentage of total runtime.
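The shift is easy to see with a little arithmetic. The sketch below assumes a hypothetical layer that spends 80 time units in GEMMs and 20 in memory-bound operators; these numbers are made up for illustration and do not come from the paper.

```python
def non_gemm_share(gemm_time, other_time, gemm_speedup):
    """Fraction of runtime spent outside GEMMs once GEMMs run `gemm_speedup`x faster."""
    new_gemm_time = gemm_time / gemm_speedup
    return other_time / (new_gemm_time + other_time)


# Hypothetical layer: 80 time units in GEMMs, 20 in memory-bound operators.
for speedup in (1, 2, 4):
    share = non_gemm_share(80.0, 20.0, speedup)
    print(f"GEMM {speedup}x faster -> non-GEMM share {share:.0%}")
```

Under these assumptions the non-GEMM share climbs from a fifth of the runtime to half of it once GEMMs are four times faster, even though the small operators themselves did not get any slower.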

🧠 Why PyTorch Struggles to Solve This

Frameworks like PyTorch are designed around:

Operator graphs

This abstraction is excellent for:

  • Flexibility
  • Autograd
  • Dynamic execution
  • Research iteration

But it introduces rigid boundaries between operations.

Those boundaries make aggressive cross-operation fusion difficult.

In practice, this means many unnecessary memory round-trips survive even highly optimized pipelines.

🔬 CODA’s Core Insight: Exploit the GEMM Epilogue

CODA’s central idea comes directly from GPU kernel architecture.

A high-performance GEMM kernel typically contains two phases:

Mainloop

The main compute stage that performs matrix multiplication.

Epilogue

The final stage before results are written back to VRAM.

The epilogue often handles tasks such as:

  • Bias addition
  • Scaling
  • Casting
  • Lightweight transformations

Critically:

The data is still inside registers during the epilogue

This creates an opportunity.

If additional Transformer computations can execute during the epilogue phase, intermediate tensors never need to touch global memory.
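The mainloop/epilogue split can be mimicked in plain Python. This is a toy model, not CODA’s or CUTLASS’s API: `gemm_with_epilogue` and `bias_relu` are hypothetical names, and the ReLU is just a stand-in for a lightweight transformation. The `acc` tile plays the role of the accumulator that a real kernel keeps in registers.

```python
def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Tiled C = epilogue(A @ B). The epilogue sees each accumulator tile
    before it is stored, standing in for work done while data is on-chip."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Mainloop: accumulate the tile over the K dimension.
            acc = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols]
                   for i in rows]
            # Epilogue: transform the accumulator, then store once.
            out = epilogue(acc, rows, cols)
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    c[i][j] = out[ti][tj]
    return c


def bias_relu(bias):
    """Example epilogue: add a per-column bias, then apply ReLU."""
    def epilogue(acc, rows, cols):
        return [[max(v + bias[j], 0.0) for v, j in zip(row, cols)] for row in acc]
    return epilogue


a = [[1.0, 2.0], [3.0, 4.0], [-1.0, 0.0]]
b = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
result = gemm_with_epilogue(a, b, bias_relu([0.5, -10.0, 0.0]))
print(result)
```

The bias add and activation never produce a separate tensor: each output element is written exactly once, after the epilogue has finished with it.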

⚙️ Rewriting Transformers as GEMM-Epilogue Programs

CODA demonstrates that many Transformer operations can be algebraically rewritten to fit inside GEMM epilogues.

This is the paper’s major conceptual breakthrough.

Rather than executing:

GEMM → RMSNorm → GEMM

as three independent operators, CODA restructures the computation so parts of RMSNorm occur inside GEMM epilogues.

The practical effect is enormous:

  • Fewer kernel launches
  • Less VRAM traffic
  • Higher throughput
  • Better utilization of on-chip memory

📐 RMSNorm Example

One particularly elegant optimization involves RMSNorm scaling.

The scaling factor:

r

is shared across an entire row.

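Because r is a single scalar per row, it commutes with the matrix multiplication that follows: scaling a row before the GEMM gives the same result as scaling the corresponding output row afterward. CODA uses this to defer the scaling and merge it into the epilogue of the following GEMM.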

Instead of writing normalized tensors back to memory:

  • Partial RMS statistics remain local
  • Reduction work becomes lightweight
  • Explicit RMSNorm kernels effectively disappear

This dramatically reduces memory traffic.
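The identity behind the deferred scaling can be checked numerically. The sketch below is a plain-Python illustration, not CODA’s implementation: the function names are hypothetical, and folding the RMSNorm weight `g` into the rows of `W` is an extra assumption made here so that the GEMM can run on the raw, unnormalized input.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rms_scale(row, eps=1e-6):
    """Per-row factor r = 1 / sqrt(mean(x^2) + eps)."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)


def reference(x, g, w):
    """Unfused: materialize RMSNorm(x) = r * x * g, then multiply by W."""
    normed = [[rms_scale(row) * v * gi for v, gi in zip(row, g)] for row in x]
    return matmul(normed, w)


def deferred(x, g, w):
    """Deferred: fold g into W's rows, run the GEMM on raw x, scale by r afterward."""
    w_folded = [[gi * wij for wij in w_row] for gi, w_row in zip(g, w)]
    acc = matmul(x, w_folded)
    # Epilogue stand-in: one multiply per output element, using the row's r.
    return [[rms_scale(x_row) * v for v in acc_row] for x_row, acc_row in zip(x, acc)]


x = [[1.0, 2.0, 3.0], [-1.0, 0.5, 4.0]]
g = [0.5, 1.0, 2.0]
w = [[1.0, -1.0], [2.0, 0.0], [0.5, 3.0]]
print(reference(x, g, w))
print(deferred(x, g, w))
```

Both paths print the same values up to floating-point rounding, but the deferred path never builds the normalized tensor.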

🔄 Other Transformer Operations CODA Optimizes

The same epilogue fusion strategy applies to many common Transformer components.

SwiGLU

Activation and gating operations can be fused into epilogues.
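As a rough illustration, SwiGLU computes silu(x·W_gate) ⊙ (x·W_up). One way to fuse it, sketched below, is to interleave the columns of the two weight matrices so that each gate value lands next to its up value in the accumulator. The interleaved layout and the function names are assumptions made for this example, not details confirmed by the paper.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def swiglu_unfused(x, w_gate, w_up):
    gate = matmul(x, w_gate)  # intermediate tensor written out
    up = matmul(x, w_up)      # intermediate tensor written out
    return [[silu(g) * u for g, u in zip(gr, ur)] for gr, ur in zip(gate, up)]


def swiglu_fused(x, w_gate, w_up):
    # Interleave columns so each gate value sits next to its up value.
    w_cat = [[v for pair in zip(gr, ur) for v in pair]
             for gr, ur in zip(w_gate, w_up)]
    acc = matmul(x, w_cat)  # stands in for the on-chip accumulator
    # Epilogue: combine adjacent (gate, up) pairs and store a half-width output.
    return [[silu(row[j]) * row[j + 1] for j in range(0, len(row), 2)]
            for row in acc]


x = [[1.0, -2.0], [0.5, 3.0]]
w_gate = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.6]]
w_up = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.25]]
print(swiglu_unfused(x, w_gate, w_up))
print(swiglu_fused(x, w_gate, w_up))
```

The fused version stores only the final half-width result; the separate gate and up tensors never exist outside the accumulator.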

RoPE (Rotary Position Embeddings)

Position embedding transformations can execute directly inside fused kernels.
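A minimal sketch of what such an epilogue step could look like is below. It assumes the interleaved-pair RoPE convention (adjacent even/odd elements rotated together) and a base of 10000; the function name is hypothetical, and real implementations may use a different pairing layout.

```python
import math


def rope_epilogue(acc_row, position, base=10000.0):
    """Rotate adjacent (even, odd) pairs of one output row by position-dependent angles."""
    d = len(acc_row)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)  # frequency for pair index i // 2
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = acc_row[i], acc_row[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


q_row = [1.0, 2.0, 3.0, 4.0]  # hypothetical row of a Q-projection accumulator
print(rope_epilogue(q_row, position=0))
print(rope_epilogue(q_row, position=5))
```

Because the rotation only touches pairs of values that already sit in the same accumulator row, it needs no extra pass over memory. Position 0 leaves the row unchanged, and any position preserves the row’s norm.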

Cross-Entropy Loss

Loss computations can leverage stateful epilogue transformations.
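One plausible reading of a stateful epilogue here is an online log-sum-exp: logits are produced one vocabulary tile at a time, and only a running maximum and a rescaled running sum are carried between tiles. The sketch below illustrates that idea for a single token; the function names and tile size are assumptions, and this is not presented as the paper’s exact algorithm.

```python
import math


def fused_cross_entropy(h, w_vocab, target, tile=2):
    """Loss for one token; logits are produced tile by tile and never stored in full."""
    vocab = len(w_vocab[0])
    running_max, running_sum, target_logit = -math.inf, 0.0, None
    for j0 in range(0, vocab, tile):
        # Mainloop stand-in: logits for one tile of vocabulary columns.
        logits = [sum(hk * w_row[j] for hk, w_row in zip(h, w_vocab))
                  for j in range(j0, min(j0 + tile, vocab))]
        # Stateful epilogue: update the running max and rescale the running sum.
        new_max = max(running_max, max(logits))
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(v - new_max) for v in logits)
        running_max = new_max
        if j0 <= target < j0 + tile:
            target_logit = logits[target - j0]
    return running_max + math.log(running_sum) - target_logit


def reference_cross_entropy(h, w_vocab, target):
    """Unfused reference: materialize every logit, then reduce."""
    logits = [sum(hk * w_row[j] for hk, w_row in zip(h, w_vocab))
              for j in range(len(w_vocab[0]))]
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits)) - logits[target]


h = [1.0, -0.5]
w_vocab = [[0.2, 1.5, -0.3, 0.8, 2.0], [1.0, -1.0, 0.5, 0.0, 0.7]]
print(fused_cross_entropy(h, w_vocab, target=3))
print(reference_cross_entropy(h, w_vocab, target=3))
```

The state carried between tiles is three scalars per token, which is far smaller than a full row of vocabulary logits.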

Backward Propagation

Importantly, the same principles extend naturally into backward passes.

This matters because the backward pass is often even more memory-intensive than the forward pass.

🧩 CODA’s Five Primitive Building Blocks

CODA is not just a collection of handcrafted kernels.

It introduces a generalized abstraction built around five primitive categories.

1. Elementwise Transformations

Examples:

  • Residual additions
  • Activations
  • RoPE

2. Vector Loads and Stores

Examples:

  • RMSNorm weight broadcasting

3. Block Loads and Stores

Examples:

  • Activation checkpointing
  • Intermediate caching

4. Block Reductions

Examples:

  • Partial RMS
  • Log-sum-exp

5. Stateful Transformations

Examples:

  • Online normalization statistics
  • Running maxima

These primitives allow developers to compose highly optimized Transformer kernels systematically.
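To give a feel for how two of these primitives might compose, the toy sketch below combines an elementwise step (residual addition) with a block reduction (partial sum of squares per column tile) inside one GEMM epilogue, followed by a tiny reduction that turns the partials into the per-row RMS factor. All names are hypothetical and the structure is a simplification, not CODA’s actual interface.

```python
import math


def gemm_residual_partial_rms(x, w, residual, tile=2):
    """Return (y, partial_sq): y = x @ w + residual, and partial_sq[i][t] is the
    sum of squares of row i of y within column tile t."""
    m, n = len(x), len(w[0])
    y = [[0.0] * n for _ in range(m)]
    partial_sq = [[] for _ in range(m)]
    for j0 in range(0, n, tile):
        cols = range(j0, min(j0 + tile, n))
        for i in range(m):
            # Mainloop stand-in for one row of the accumulator tile.
            acc = [sum(x[i][p] * w[p][j] for p in range(len(w))) for j in cols]
            # Elementwise primitive: residual addition.
            acc = [v + residual[i][j] for v, j in zip(acc, cols)]
            # Block-reduction primitive: partial sum of squares for this tile.
            partial_sq[i].append(sum(v * v for v in acc))
            for v, j in zip(acc, cols):
                y[i][j] = v
    return y, partial_sq


def finish_rms(partial_sq, n, eps=1e-6):
    """Lightweight follow-up: combine per-tile partials into r for each row."""
    return [1.0 / math.sqrt(sum(parts) / n + eps) for parts in partial_sq]


x = [[1.0, 2.0], [0.5, -1.0]]
w = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]
residual = [[0.1, 0.2, 0.3], [0.0, -0.5, 1.0]]
y, partial_sq = gemm_residual_partial_rms(x, w, residual)
r = finish_rms(partial_sq, n=len(w[0]))
print(y)
print(r)
```

The follow-up reduction only reads a handful of partial sums per row rather than the full output tensor, which is what keeps the remaining reduction work lightweight.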

🤖 The Most Interesting Part: LLMs Can Generate the Kernels

One of the paper’s most fascinating findings is that AI-generated kernels achieved near-human performance.

The researchers evaluated two approaches:

Human-Written Kernels

Hand-optimized implementations created by experts.

AI-Generated Kernels

Generated primarily using:

Claude Code

with light human supervision.

The result was remarkable:

AI-generated kernels performed nearly identically to expert-written versions

In some cases, they even marginally exceeded human implementations.

💡 Why This Matters So Much

Historically, GPU kernel engineering has been one of the most specialized and difficult areas of systems programming.

Writing elite CUDA kernels typically requires deep expertise in:

  • GPU architecture
  • Memory hierarchies
  • Warp scheduling
  • Tensor Core utilization
  • Occupancy tuning

CODA changes the equation.

By constraining optimization inside a clean abstraction layer, the problem becomes far more tractable for AI systems.

This is a major shift.

📊 Performance Results

CODA was benchmarked against highly optimized baselines including:

  • cuBLAS
  • torch.compile
  • Liger Kernel
  • FlashInfer

The results were highly competitive.

⚡ Single-Operator Speedups

CODA kernels for fused patterns such as:

GEMM-RMSNorm-GEMM

consistently outperformed:

cuBLAS + PyTorch

at model scales of:

  • 1B parameters
  • 7B parameters
  • 70B parameters

🔄 Backward Pass Improvements

Backward propagation saw particularly large gains.

Reported improvements included:

  • 1.6x–1.8x speedups for RMSNorm-related backward kernels
  • 1.4x–1.6x gains for SwiGLU backward passes

This matters because the backward pass accounts for the larger share of training cost: for dense layers it performs roughly twice the FLOPs of the forward pass.

📈 End-to-End Transformer Gains

Across full Transformer layers, CODA achieved:

5%–20% end-to-end forward speedups

with gains increasing at larger model sizes.

For hyperscale training clusters, even single-digit improvements are highly valuable.

🧮 Numerical Accuracy

Despite algebraically restructuring operations, CODA maintained numerical stability comparable to standard PyTorch implementations.

In some cases, numerical precision even improved due to:

  • Higher precision GEMM accumulators
  • Reduced intermediate rounding
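The effect of reduced intermediate rounding can be simulated with half-precision values from the standard library. This sketch assumes FP16 storage and a Python float standing in for the higher-precision accumulator; the input values and the scale factor are arbitrary, and the function names are hypothetical.

```python
import struct


def to_fp16(v):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", v))[0]


def unfused_scale(acc, r):
    """Store the GEMM result in FP16, reload it, scale, and store again (two roundings)."""
    return to_fp16(to_fp16(acc) * r)


def fused_scale(acc, r):
    """Scale the high-precision accumulator in the epilogue, then round once."""
    return to_fp16(acc * r)


accs = [0.1 * k + 0.0123 for k in range(1, 200)]  # arbitrary accumulator values
r = 0.731                                         # arbitrary per-row scale
err_unfused = sum(abs(unfused_scale(a, r) - a * r) for a in accs)
err_fused = sum(abs(fused_scale(a, r) - a * r) for a in accs)
print(f"total |error| unfused={err_unfused:.6f} fused={err_fused:.6f}")
```

Because the fused path rounds the exact product once to the nearest representable value, its error on each element can never exceed that of the unfused path, which rounds twice.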

๐Ÿ—๏ธ Built on CuTeDSL and CUTLASS
#

CODA leverages:

CuTeDSL

the Python DSL built on top of NVIDIA CUTLASS.

This provides:

  • High-performance GEMM infrastructure
  • Composable abstractions
  • Hardware-aware optimization

without requiring developers to manually engineer every CUDA detail.

⚠️ Current Limitations

CODA is still early-stage research infrastructure.

Current limitations include:

  • Single-GPU focus
  • No distributed training integration
  • Limited support outside standard Transformer layouts
  • No attention kernel replacement yet

Attention itself remains largely handled by systems such as:

FlashAttention

🔥 CODA Fits Into a Larger Industry Shift

CODA is part of a broader movement in AI infrastructure.

Other examples include:

  • FlashAttention
  • Triton
  • ThunderKittens
  • TileLang

All share a common philosophy:

Keep computations on-chip as long as possible

Modern GPU optimization is increasingly about:

  • Data locality
  • Fusion
  • Memory minimization
  • Scheduling efficiency

rather than purely arithmetic optimization.

🧠 The Deeper Significance

Perhaps the most important implication is conceptual.

CODA suggests that future AI infrastructure may increasingly be designed around abstractions optimized not only for humans, but also for AI code generation systems.

When abstractions become sufficiently structured:

  • LLMs can reason about them
  • AI systems can optimize them
  • Kernel engineering becomes more accessible

This creates a fascinating recursive loop:

AI systems helping optimize the infrastructure used to train AI systems

🎼 Why the Name “CODA” Is Surprisingly Appropriate

The name intentionally echoes both:

  • CUDA
  • Musical codas

In music, a coda is the concluding section that resolves a composition.

Technically, CODA focuses precisely on the:

GEMM epilogue

the “final passage” before data leaves fast GPU memory.

That final stage turns out to be one of the most valuable optimization opportunities in modern AI systems.

🔚 Final Thoughts

CODA is not simply another fused-kernel project.

It represents a broader evolution in how AI systems interact with hardware.

Historically:

  • High-level frameworks prioritized programmability
  • Low-level kernels prioritized performance

CODA attempts to bridge those worlds.

By expressing Transformer computations as structured GEMM-epilogue programs, it enables:

  • Strong performance portability
  • Reduced memory overhead
  • AI-assisted optimization
  • More accessible kernel engineering

Most importantly, CODA reinforces a key reality of modern AI infrastructure:

Performance is increasingly determined by data movement, not raw arithmetic

As GPUs continue scaling compute throughput faster than memory bandwidth, techniques like epilogue fusion may become fundamental to the next generation of AI training systems.

📚 References

  • CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
  • Tri Dao Social Media Commentary
  • NVIDIA CUTLASS Documentation
  • CuTeDSL Documentation
  • FlashAttention Research Papers
  • PyTorch torch.compile Documentation
  • Triton Language Documentation
  • ThunderKittens GPU Kernel Framework
  • TileLang Compiler Infrastructure