CODA Rewrites Transformer Kernels for AI-Generated GPU Speed
Modern large language model training is increasingly constrained not by raw compute throughput, but by memory movement.
As GPUs become dramatically faster at matrix multiplication through:
- Tensor Cores
- FP8 arithmetic
- FP4 acceleration
- Specialized AI hardware
a new bottleneck has emerged:
Memory bandwidth overhead between small Transformer operators
This is precisely the problem addressed by:
CODA
a new programming abstraction introduced by researchers from:
- MIT
- Princeton
- Together AI
- Meta
CODA fundamentally rethinks how Transformer computations are structured on GPUs by rewriting large portions of the Transformer pipeline into:
GEMM + epilogue programs
The result is both technically elegant and strategically important:
- Faster Transformer execution
- Reduced memory traffic
- Simplified kernel engineering
- AI-assisted kernel generation
Perhaps most remarkably, CODA demonstrates that even LLMs can generate near state-of-the-art GPU kernels when given the correct abstraction layer.
The Bigger Context: Why GPU Kernels Are Becoming Harder #
Large-scale Transformer systems contain far more than just matrix multiplication.
While operations like:
- Attention
- GEMMs
- Tensor contractions
consume most raw compute cycles, modern training pipelines also execute enormous numbers of smaller operations:
- RMSNorm
- SwiGLU
- RoPE
- Residual additions
- Reductions
- Cross-entropy calculations
Individually, these operations are lightweight.
Collectively, however, they create a massive hidden tax:
Global memory traffic
This is increasingly problematic because modern GPUs can compute much faster than they can move data.
The Real Bottleneck Is Memory Movement #
The core issue is not arithmetic throughput.
It is:
VRAM bandwidth
A typical Transformer pipeline repeatedly performs this pattern:
Compute → write tensor → reload tensor → compute again
Each operator boundary often forces intermediate tensors to leave fast on-chip memory and return to global memory.
This becomes extremely expensive at scale.
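To make the cost concrete, here is a back-of-the-envelope sketch that counts global-memory bytes for a chain of elementwise operators, with and without fusion. The function name `chain_traffic_bytes`, the tensor shape, and the 2-byte element size are illustrative assumptions, not figures from the paper.

```python
def chain_traffic_bytes(rows, cols, num_ops, bytes_per_elem=2, fused=False):
    """Estimate global-memory bytes moved by a chain of elementwise ops.

    Unfused: every op reads its input from global memory and writes its
    output back, so the tensor crosses the memory bus 2 * num_ops times.
    Fused: the tensor is read once and written once; intermediates stay on-chip.
    """
    tensor_bytes = rows * cols * bytes_per_elem
    passes = 2 if fused else 2 * num_ops
    return passes * tensor_bytes


# Hypothetical activation tensor: 8192 tokens x 4096 hidden, 2 bytes per element.
unfused = chain_traffic_bytes(8192, 4096, num_ops=3)
fused = chain_traffic_bytes(8192, 4096, num_ops=3, fused=True)
print(unfused, fused, unfused / fused)
```

Under this simple model, fusing a chain of three operators cuts traffic by a factor of three, and the saving grows with the length of the chain.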
The problem worsens as lower precision formats accelerate GEMMs further.
With FP8 and FP4:
- Matrix multiplication becomes dramatically faster
- Memory transfer costs remain largely unchanged
Consequently, non-GEMM operations consume a growing percentage of total runtime.
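The arithmetic behind that shift can be sketched in a few lines. The 80/20 starting split below is a made-up example, not a measurement from the paper, and `non_gemm_share` is a hypothetical helper name.

```python
def non_gemm_share(gemm_time, other_time, gemm_speedup):
    """Fraction of runtime spent outside GEMMs after GEMMs get faster.

    Only the GEMM portion shrinks; the memory-bound portion is assumed
    to stay the same, as described above.
    """
    new_gemm_time = gemm_time / gemm_speedup
    return other_time / (new_gemm_time + other_time)


# Hypothetical baseline: 80 time units of GEMM, 20 units of everything else.
for speedup in (1, 2, 4):
    print(speedup, round(non_gemm_share(80.0, 20.0, speedup), 3))
```

In this toy case the non-GEMM share climbs from 20% to 50% once GEMMs run four times faster, even though the non-GEMM work itself has not changed.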
๐ง Why PyTorch Struggles to Solve This #
Frameworks like PyTorch are designed around:
Operator graphs
This abstraction is excellent for:
- Flexibility
- Autograd
- Dynamic execution
- Research iteration
But it introduces rigid boundaries between operations.
Those boundaries make aggressive cross-operation fusion difficult.
In practice, this means many unnecessary memory round-trips survive even highly optimized pipelines.
CODA's Core Insight: Exploit the GEMM Epilogue #
CODA's central idea comes directly from GPU kernel architecture.
A high-performance GEMM kernel typically contains two phases:
Mainloop #
The main compute stage that performs matrix multiplication.
Epilogue #
The final stage before results are written back to VRAM.
The epilogue often handles tasks such as:
- Bias addition
- Scaling
- Casting
- Lightweight transformations
Critically:
The data is still inside registers during the epilogue
This creates an opportunity.
If additional Transformer computations can execute during the epilogue phase, intermediate tensors never need to touch global memory.
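The mainloop/epilogue split can be illustrated with a toy tiled GEMM in plain Python. This is a conceptual sketch only: it is not CuTeDSL or CODA's API, the `epilogue(value, row, col)` callback signature is an assumption made for illustration, and the local `acc` dictionary merely stands in for registers.

```python
def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Tiled matrix multiply that applies `epilogue` before the only store."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Mainloop: accumulate one output tile in a local buffer.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for p in range(k):
                for i in rows:
                    for j in cols:
                        acc[(i, j)] += a[i][p] * b[p][j]
            # Epilogue: transform the accumulator, then write it out once.
            for (i, j), value in acc.items():
                out[i][j] = epilogue(value, i, j)
    return out


# Example epilogue: bias addition followed by ReLU, fused into the store.
bias = [-2.0, 1.0]
a = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
result = gemm_with_epilogue(a, identity, lambda v, i, j: max(0.0, v + bias[j]))
print(result)
```

The intermediate product never appears in `out`: only the transformed values are stored, which is the property CODA builds on.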
Rewriting Transformers as GEMM-Epilogue Programs #
CODA demonstrates that many Transformer operations can be algebraically rewritten to fit inside GEMM epilogues.
This is the paper's major conceptual breakthrough.
Rather than executing:
GEMM → RMSNorm → GEMM
as three independent operators, CODA restructures the computation so parts of RMSNorm occur inside GEMM epilogues.
The practical effect is enormous:
- Fewer kernel launches
- Less VRAM traffic
- Higher throughput
- Better utilization of on-chip memory
RMSNorm Example #
One particularly elegant optimization involves RMSNorm scaling.
The scaling factor:
r
is shared across an entire row.
Because a per-row scalar commutes with matrix multiplication, so that (r · x)W = r · (xW), CODA shows this scaling can be delayed and merged into the epilogue of the following GEMM.
Instead of writing normalized tensors back to memory:
- Partial RMS statistics remain local
- Reduction work becomes lightweight
- Explicit RMSNorm kernels effectively disappear
This dramatically reduces memory traffic.
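The identity behind the delayed scaling can be checked numerically. The sketch below assumes the standard RMSNorm definition (each row multiplied by r = 1/sqrt(mean(x²) + eps) and by a learned gain) and, as a further assumption that may differ from the paper, folds the gain into the weight matrix ahead of time. All function names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_scale(row, eps=1e-6):
    """The per-row RMSNorm factor r = 1 / sqrt(mean(x^2) + eps)."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)


def rmsnorm_then_gemm(x, gain, w):
    """Reference path: materialize the normalized tensor, then multiply."""
    normed = [[v * row_scale(row) * g for v, g in zip(row, gain)] for row in x]
    return matmul(normed, w)


def gemm_then_scale(x, gain, w):
    """Delayed path: multiply un-normalized x, apply r in the 'epilogue'."""
    # Assumption: the gain is folded into the weight rows once, offline.
    w_scaled = [[g * v for v in w_row] for g, w_row in zip(gain, w)]
    raw = matmul(x, w_scaled)
    return [[row_scale(row_x) * v for v in row_y] for row_x, row_y in zip(x, raw)]


x = [[1.0, -2.0, 3.0], [0.5, 4.0, -1.5]]
gain = [0.5, 1.0, 2.0]
w = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
print(rmsnorm_then_gemm(x, gain, w))
print(gemm_then_scale(x, gain, w))
```

Both paths agree up to floating-point rounding, but the second never writes a normalized copy of `x` to memory.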
Other Transformer Operations CODA Optimizes #
The same epilogue fusion strategy applies to many common Transformer components.
SwiGLU #
Activation and gating operations can be fused into epilogues.
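As a rough illustration, SwiGLU computes silu(x · W_gate) * (x · W_up). In the sketch below, the two accumulators for each output element are combined before anything is stored. Producing the gate and up values side by side in one kernel is an assumption about how such a fusion could be laid out, not a description of CODA's implementation, and `swiglu_fused` is a hypothetical name.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def swiglu_fused(x, w_gate, w_up):
    """Compute silu(x @ w_gate) * (x @ w_up) without storing either product."""
    out = []
    for row in x:
        out_row = []
        for gate_col, up_col in zip(zip(*w_gate), zip(*w_up)):
            # Mainloop: two accumulators for the same output position.
            gate_acc = sum(a * b for a, b in zip(row, gate_col))
            up_acc = sum(a * b for a, b in zip(row, up_col))
            # Epilogue: activation and gating happen before the single store.
            out_row.append(silu(gate_acc) * up_acc)
        out.append(out_row)
    return out


x = [[1.0, -2.0], [0.5, 3.0]]
w_gate = [[0.3, -1.0], [0.7, 0.2]]
w_up = [[1.5, 0.0], [-0.5, 2.0]]
print(swiglu_fused(x, w_gate, w_up))
```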
RoPE (Rotary Position Embeddings) #
Position embedding transformations can execute directly inside fused kernels.
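RoPE is a good fit because it is purely elementwise on pairs of values: each pair is rotated by an angle that depends only on the token position and the pair index. The sketch below uses the interleaved-pair convention and a base of 10000, both of which are assumptions (implementations differ on the pairing layout); `rope_epilogue` is a hypothetical name.

```python
import math


def rope_epilogue(row, position, base=10000.0):
    """Rotate adjacent pairs (row[i], row[i+1]) by position * base**(-i/d)."""
    d = len(row)
    out = list(row)
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = row[i] * c - row[i + 1] * s
        out[i + 1] = row[i] * s + row[i + 1] * c
    return out


q = [1.0, 2.0, 3.0, 4.0]
print(rope_epilogue(q, position=0))  # position 0 leaves the vector unchanged
print(rope_epilogue(q, position=5))
```

Because the rotation needs no data beyond the accumulator values and the position, it can run on the GEMM output while that output is still on-chip.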
Cross-Entropy Loss #
Loss computations can leverage stateful epilogue transformations.
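One way to picture a stateful epilogue is the well-known online log-sum-exp trick: as each tile of logits is produced, a running maximum and a rescaled running sum are updated, so the full logit row never has to be stored. The sketch below is a generic illustration of that technique, not CODA's kernel; the tiling and the name `streaming_cross_entropy` are assumptions.

```python
import math


def streaming_cross_entropy(logit_tiles, target_index):
    """Cross-entropy for one row of logits, consumed tile by tile."""
    running_max = float("-inf")
    running_sum = 0.0
    target_logit = None
    offset = 0
    for tile in logit_tiles:
        new_max = max(running_max, max(tile))
        # Rescale the old sum to the new maximum, then add this tile's terms.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(v - new_max) for v in tile)
        running_max = new_max
        if offset <= target_index < offset + len(tile):
            target_logit = tile[target_index - offset]
        offset += len(tile)
    # loss = logsumexp(logits) - logits[target]
    return running_max + math.log(running_sum) - target_logit


logits = [2.0, -1.0, 0.5, 3.0, 0.0]
tiles = [logits[0:2], logits[2:4], logits[4:5]]
print(streaming_cross_entropy(tiles, target_index=3))
```

Only two scalars of state per row survive between tiles, which is what makes the computation cheap to carry inside an epilogue.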
Backward Propagation #
Importantly, the same principles extend naturally into backward passes.
This matters because backward propagation is often even more memory-intensive than the forward pass.
CODA's Five Primitive Building Blocks #
CODA is not just a collection of handcrafted kernels.
It introduces a generalized abstraction built around five primitive categories.
1. Elementwise Transformations #
Examples:
- Residual additions
- Activations
- RoPE
2. Vector Loads and Stores #
Examples:
- RMSNorm weight broadcasting
3. Block Loads and Stores #
Examples:
- Activation checkpointing
- Intermediate caching
4. Block Reductions #
Examples:
- Partial RMS
- Log-sum-exp
5. Stateful Transformations #
Examples:
- Online normalization statistics
- Running maxima
These primitives allow developers to compose highly optimized Transformer kernels systematically.
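As an example of how one of these primitives might look, here is a block reduction for partial RMS: each column tile contributes a per-row sum of squares, and the partial results are combined into the final scaling factor. The tile layout and function names are assumptions for illustration, not the paper's interface.

```python
import math


def partial_sum_squares(tile):
    """Block reduction: per-row sum of squares for one column tile."""
    return [sum(v * v for v in row) for row in tile]


def rms_scale_from_tiles(column_tiles, hidden_size, eps=1e-6):
    """Combine per-tile partials into r = 1 / sqrt(mean(x^2) + eps) per row."""
    totals = [0.0] * len(column_tiles[0])
    for tile in column_tiles:
        for i, partial in enumerate(partial_sum_squares(tile)):
            totals[i] += partial
    return [1.0 / math.sqrt(total / hidden_size + eps) for total in totals]


# One row [3, 4, 0, 0] split into two column tiles of width 2.
tiles = [[[3.0, 4.0]], [[0.0, 0.0]]]
print(rms_scale_from_tiles(tiles, hidden_size=4))
```

Each tile only ever sees its own slice of the row, so the reduction can run alongside the GEMM that produces those values.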
The Most Interesting Part: LLMs Can Generate the Kernels #
One of the paper's most fascinating findings is that AI-generated kernels achieved near-human performance.
The researchers evaluated two approaches:
Human-Written Kernels #
Hand-optimized implementations created by experts.
AI-Generated Kernels #
Generated primarily using:
Claude Code
with light human supervision.
The result was remarkable:
AI-generated kernels performed nearly identically to expert-written versions
In some cases, they even marginally exceeded human implementations.
Why This Matters So Much #
Historically, GPU kernel engineering has been one of the most specialized and difficult areas of systems programming.
Writing elite CUDA kernels typically requires deep expertise in:
- GPU architecture
- Memory hierarchies
- Warp scheduling
- Tensor Core utilization
- Occupancy tuning
CODA changes the equation.
By constraining optimization inside a clean abstraction layer, the problem becomes far more tractable for AI systems.
This is a major shift.
Performance Results #
CODA was benchmarked against highly optimized baselines including:
- cuBLAS
- torch.compile
- Liger Kernel
- FlashInfer
The results were highly competitive.
Single-Operator Speedups #
Patterns such as:
GEMM-RMSNorm-GEMM
consistently outperformed:
cuBLAS + PyTorch
across model scales ranging from:
- 1B parameters
- 7B parameters
- 70B parameters
Backward Pass Improvements #
Backward propagation saw particularly large gains.
Reported improvements included:
- 1.6x-1.8x speedups for RMSNorm-related backward kernels
- 1.4x-1.6x gains for SwiGLU backward passes
This is extremely important because backward passes dominate training cost.
End-to-End Transformer Gains #
Across full Transformer layers, CODA achieved:
5%-20% end-to-end forward speedups
with gains increasing at larger model sizes.
For hyperscale training clusters, even single-digit improvements are highly valuable.
Numerical Accuracy #
Despite algebraically restructuring operations, CODA maintained numerical stability comparable to standard PyTorch implementations.
In some cases, numerical precision even improved due to:
- Higher precision GEMM accumulators
- Reduced intermediate rounding
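The effect of accumulator precision is easy to reproduce in isolation. The sketch below emulates half precision with Python's `struct` format `"e"` and compares a running sum that is rounded to FP16 after every step against one kept in higher precision. It is a generic demonstration of intermediate rounding, not a reproduction of the paper's accuracy experiments.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def accumulate(values, round_each_step):
    total = 0.0
    for v in values:
        total += v
        if round_each_step:
            # Emulates storing the intermediate result in FP16.
            total = to_fp16(total)
    return total


values = [to_fp16(0.1)] * 4096
low_precision = accumulate(values, round_each_step=True)
high_precision = accumulate(values, round_each_step=False)
print(low_precision, high_precision)
```

The FP16 running total stalls once its spacing exceeds the size of each addend, while the higher-precision accumulator keeps the exact sum. Keeping intermediates in the accumulator instead of rounding them out to memory avoids this kind of loss.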
Built on CuTeDSL and CUTLASS #
CODA leverages:
CuTeDSL
the Python DSL built on top of NVIDIA CUTLASS.
This provides:
- High-performance GEMM infrastructure
- Composable abstractions
- Hardware-aware optimization
without requiring developers to manually engineer every CUDA detail.
Current Limitations #
CODA is still early-stage research infrastructure.
Current limitations include:
- Single-GPU focus
- No distributed training integration
- Limited support outside standard Transformer layouts
- No attention kernel replacement yet
Attention itself remains largely handled by systems such as:
FlashAttention
CODA Fits Into a Larger Industry Shift #
CODA is part of a broader movement in AI infrastructure.
Other examples include:
- FlashAttention
- Triton
- ThunderKittens
- TileLang
All share a common philosophy:
Keep computations on-chip as long as possible
Modern GPU optimization is increasingly about:
- Data locality
- Fusion
- Memory minimization
- Scheduling efficiency
rather than purely arithmetic optimization.
The Deeper Significance #
Perhaps the most important implication is conceptual.
CODA suggests that future AI infrastructure may increasingly be designed around abstractions optimized not only for humans, but also for AI code generation systems.
When abstractions become sufficiently structured:
- LLMs can reason about them
- AI systems can optimize them
- Kernel engineering becomes more accessible
This creates a fascinating recursive loop:
AI systems helping optimize the infrastructure used to train AI systems
Why the Name "CODA" Is Surprisingly Appropriate #
The name intentionally echoes both:
- CUDA
- Musical codas
In music, a coda is the concluding section that resolves a composition.
Technically, CODA focuses precisely on the:
GEMM epilogue
the "final passage" before data leaves fast GPU memory.
That final stage turns out to be one of the most valuable optimization opportunities in modern AI systems.
Final Thoughts #
CODA is not simply another fused-kernel project.
It represents a broader evolution in how AI systems interact with hardware.
Historically:
- High-level frameworks prioritized programmability
- Low-level kernels prioritized performance
CODA attempts to bridge those worlds.
By expressing Transformer computations as structured GEMM-epilogue programs, it enables:
- Strong performance portability
- Reduced memory overhead
- AI-assisted optimization
- More accessible kernel engineering
Most importantly, CODA reinforces a key reality of modern AI infrastructure:
Performance is increasingly determined by data movement, not raw arithmetic
As GPUs continue scaling compute throughput faster than memory bandwidth, techniques like epilogue fusion may become fundamental to the next generation of AI training systems.
References #
- CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
- Tri Dao Social Media Commentary
- NVIDIA CUTLASS Documentation
- CuTeDSL Documentation
- FlashAttention Research Papers
- PyTorch torch.compile Documentation
- Triton Language Documentation
- ThunderKittens GPU Kernel Framework
- TileLang Compiler Infrastructure