
D-Matrix Targets Fast AI Tokens With 3D Memory and Ultra-Low-Latency NICs

AI Accelerators LLM Inference Data Centers Memory Architecture Networking

⚡ The Push for Faster AI Tokens

As demand for faster large language model (LLM) token generation accelerates, D-Matrix is positioning itself around one core idea: latency is becoming just as important as raw throughput.

Speaking with EE Times, Sree Ganesan, Vice President of Product at D-Matrix, explained that the rise of reasoning models, chain-of-thought techniques, and agentic AI is dramatically increasing token volume and tightening latency requirements. Models are increasingly communicating with one another, no longer constrained by human reading speeds.

Even small language models (SLMs), those under 1 billion parameters, are contributing to the pressure. More agents mean more tokens, and as token counts climb, memory bandwidth rather than compute becomes the dominant bottleneck.
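
A rough back-of-envelope calculation makes the point. During decode, generating each token requires streaming roughly the entire set of model weights from memory, so bandwidth caps the achievable token rate regardless of how much compute is available. The sketch below is illustrative only; the numbers are assumptions, not figures from the article.

```python
# Rough illustration (not from the article): why decode is bandwidth-bound.
# Each new token requires reading (roughly) every weight once, so the ceiling
# on single-stream tokens/second is about memory_bandwidth / model_size_bytes.

def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_tb_s: float) -> float:
    """Upper bound on decode rate, ignoring compute and KV-cache traffic."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

# Illustrative case: an 8B-parameter model at 8-bit precision.
for bw in (0.5, 3.0, 100.0):  # TB/s: DDR-class, HBM-class, on-die SRAM-class
    print(f"{bw:6.1f} TB/s -> ~{max_tokens_per_second(8, 1, bw):,.0f} tokens/s ceiling")
```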


🧠 Compute Keeps Scaling, Memory Does Not

According to Ganesan, the industry is running into a familiar problem.

Compute performance continues to scale at a healthy pace, but memory bandwidth is lagging, widening the gap between processing capability and data delivery. This “memory wall” is especially problematic for inference workloads that demand ultra-low latency.

D-Matrix’s answer is its Corsair inference accelerator, which uses a proprietary compute-in-memory architecture. In Corsair, multiplication is performed directly inside custom SRAM cells, with results aggregated through a digital adder tree. This approach delivers enormous bandwidth—hundreds of terabytes per second—by eliminating much of the data movement that dominates conventional architectures.

However, SRAM introduces a different limitation.


🧱 From Bandwidth to Capacity: Going Vertical

While compute-in-SRAM delivers exceptional bandwidth, SRAM does not scale well in capacity, especially on advanced process nodes. To address this second barrier, D-Matrix is moving into 3D memory stacking.

The company is developing custom stacked DRAM dies that sit beneath the logic and SRAM compute layers, connected vertically through the interposer. Future D-Matrix accelerators will continue to use two types of memory:

  • Performance memory: Modified SRAM that performs computation
  • Capacity memory: Off-die DRAM used to store larger datasets

By stacking DRAM vertically, D-Matrix significantly increases memory capacity without sacrificing bandwidth. The full surface area of the die remains available for communication, preserving the bandwidth advantages of the compute-in-memory approach.
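
A minimal placement sketch, assuming a simple heuristic rather than D-Matrix's actual runtime: latency-critical tensors are pinned to the on-die performance tier, and everything else spills to the stacked capacity tier. All names and sizes here are hypothetical.

```python
# Illustrative two-tier placement heuristic (not D-Matrix's software):
# hot, latency-critical tensors go to the "performance" SRAM tier,
# the rest spills to the stacked "capacity" DRAM tier.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_mb: float
    latency_critical: bool

def place(tensors, perf_capacity_mb):
    placement, used = {}, 0.0
    # Favor latency-critical tensors, largest first, for the fast tier.
    for t in sorted(tensors, key=lambda t: (not t.latency_critical, -t.size_mb)):
        if t.latency_critical and used + t.size_mb <= perf_capacity_mb:
            placement[t.name], used = "performance (SRAM)", used + t.size_mb
        else:
            placement[t.name] = "capacity (3D DRAM)"
    return placement

demo = [Tensor("decode_weights", 800, True),
        Tensor("kv_cache", 300, True),
        Tensor("prefill_weights", 4000, False)]
print(place(demo, perf_capacity_mb=1024))
```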


🧪 3D DRAM Is Already Running

D-Matrix’s 3D custom DRAM test chip, Pavehawk, is already operational in the company’s lab. The next-generation product, Raptor, will integrate this technology and targets:

  • 10× higher memory bandwidth
  • 10× better energy efficiency

D-Matrix sees its 3D approach as more scalable and more sustainable than simply moving to HBM4, which is increasingly expensive and supply-constrained.

While 3D stacking raises yield and thermal challenges, those risks are mitigated by using small dies—well below reticle size—and by minimizing energy per bit to keep heat under control.


🔌 I/O Becomes the Next Bottleneck

Memory is not the only constraint. As inference workloads scale across racks, I/O latency becomes a critical limiter.

To address this, D-Matrix developed Jetstream, a custom PCIe Gen5 NIC now in production. Jetstream delivers:

  • 400 Gbps bandwidth
  • ~2 µs latency
  • 150 W TDP

Jetstream is designed specifically to support Corsair’s communication patterns, enabling device-initiated, asynchronous communication without involving the host CPU. This separation of the data plane and control plane allows communication to keep pace with compute.
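
The listed figures can be folded into a simple transfer-time model: small, frequent messages of the kind device-initiated communication produces are dominated by the ~2 µs fixed latency, while large transfers are dominated by the 400 Gbps line rate. The sketch below is a first-order estimate that ignores protocol overheads.

```python
# First-order transfer-time model for a 400 Gbps link with ~2 us fixed
# latency, as reported for Jetstream. Ignores protocol and software overheads.
LINK_GBPS = 400
LATENCY_US = 2.0

def transfer_time_us(message_bytes: float) -> float:
    serialization_us = message_bytes * 8 / (LINK_GBPS * 1e9) * 1e6
    return LATENCY_US + serialization_us

for size in (4 * 1024, 1024**2, 64 * 1024**2):  # 4 KiB, 1 MiB, 64 MiB
    print(f"{size / 1024:>10,.0f} KiB -> {transfer_time_us(size):>10,.1f} us")
```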


🌐 Scaling Performance Memory Across Racks

On a single server, an eight-card Corsair node can hold an 8–10B parameter (8-bit) model entirely in performance memory. With fast enough interconnects, that concept scales much further.

Using Jetstream, a single rack could support 100B-parameter models fully resident in performance memory, delivering ultra-low latency inference across distributed systems. Standard PCIe and Ethernet were not fast enough to make this practical; Jetstream closes that gap.
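
A quick sanity check of those figures, assuming 8-bit weights at one byte per parameter; the implied per-card SRAM capacity is an inference from the numbers quoted above, not a published specification.

```python
# Back-of-envelope check of the capacity figures above, assuming 8-bit
# weights (1 byte per parameter). The per-card figure is inferred from
# the article's numbers, not a stated spec.

def model_gb(params_billion: float, bytes_per_param: float = 1.0) -> float:
    return params_billion * bytes_per_param  # 1e9 params * 1 B = 1 GB per billion

node_cards = 8
per_card_gb = model_gb(10) / node_cards      # largest model quoted per node
print(f"~{per_card_gb:.2f} GB of performance memory implied per card")

cards_needed = model_gb(100) / per_card_gb   # 100B-parameter rack-scale target
print(f"~{cards_needed:.0f} cards to hold a 100B 8-bit model in performance memory")
```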

The NIC combines optimized portions of the PCIe stack with selected Ethernet features, reducing software overhead while maintaining industry compatibility. Cards plug into standard NIC slots and connect to top-of-rack switches, enabling clusters of 500–1,000 Corsair cards—a configuration D-Matrix believes matches near-term market demand.


🧬 An I/O Roadmap, Not a One-Off

Jetstream is only the first step. D-Matrix now treats I/O as a first-class design dimension.

While Jetstream supports the current Corsair generation, the Raptor platform will require a different integration strategy. The company plans to develop electrical I/O chiplets aligned with industry standards and integrate them directly into future accelerators.

Looking further ahead, the third-generation compute-in-memory architecture, Lightning, is expected to incorporate some form of optical I/O, reflecting the long-term direction of large-scale AI systems.


🧩 Inference Is Becoming Heterogeneous

Another major trend shaping D-Matrix’s strategy is hardware heterogeneity.

Inference workloads are increasingly split into prefill and decode stages, each with different compute and memory characteristics. While D-Matrix uses the same hardware for both, Corsair can be configured differently depending on workload needs.

Compute-heavy prefill phases can rely on capacity memory, while latency-critical decode stages can be shifted into performance memory. Beyond prefill and decode, customers are identifying additional workload segments that are extremely latency-sensitive and require small batch sizes—an area where Corsair is gaining interest.
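
The difference between the two stages can be made concrete with a standard roofline-style estimate (not from the article): prefill amortizes each weight read across the whole prompt, while decode re-reads the weights for every single token, which is why decode benefits most from the performance tier.

```python
# Rough illustration of why prefill and decode stress hardware differently.
# Arithmetic intensity ~ FLOPs per byte of weights read: prefill amortizes
# each weight over the whole prompt, decode re-reads weights for one token.

def arithmetic_intensity(tokens_per_step: int, bytes_per_param: float = 1.0) -> float:
    # ~2 FLOPs (multiply + add) per parameter per token.
    return 2 * tokens_per_step / bytes_per_param

prompt_len = 2048
print(f"prefill intensity: ~{arithmetic_intensity(prompt_len):,.0f} FLOPs/byte (compute-bound)")
print(f"decode  intensity: ~{arithmetic_intensity(1):,.0f} FLOPs/byte (bandwidth-bound)")
```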


🤝 Coexisting With GPUs

Heterogeneity also extends beyond workload stages. In practice, D-Matrix accelerators may be deployed alongside NVIDIA GPUs, offloading latency-critical inference components while GPUs handle throughput-heavy tasks.

This complementary deployment model reflects a broader industry shift away from monolithic architectures toward specialized hardware pools optimized for different phases of inference.


🧭 A Bet on Low-Latency Inference

D-Matrix reports growing interest from hyperscalers and neocloud providers, with multiple Corsair trials already underway. The company’s roadmap—spanning compute-in-memory, 3D stacked DRAM, and custom low-latency networking—reflects a clear thesis.

As token counts rise and AI systems become more distributed and agent-driven, latency, bandwidth, and heterogeneity will define the next phase of inference infrastructure. D-Matrix is betting that solving those problems together, rather than in isolation, is the only way forward.
