
NVIDIA Open-Sources Ultra-Low Latency ASR for Real-Time Voice Agents

·652 words·4 mins
NVIDIA ASR Speech Recognition Open Source AI Infrastructure

NVIDIA has open-sourced its latest Nemotron Speech ASR model, a speech recognition system purpose-built for low-latency, real-time streaming applications.

Anyone who has worked on AI voice systems knows that ASR (Automatic Speech Recognition) has long been one of the hardest components to get right in interactive scenarios. Streaming ASR in particular has struggled with a persistent trade-off between accuracy, latency, and computational cost.

Traditional streaming architectures often suffer from cumulative latency. As audio length increases, the model repeatedly reprocesses historical context, causing recognition to slow down over time. NVIDIA's Nemotron Speech ASR breaks this limitation with a distinctly engineering-driven approach.


⚡ 24ms Ultra-Fast Locking Time

Single-utterance transcription locking completes in just 24 milliseconds. In practical terms, the moment a user finishes speaking, the system has already finalized the transcription and is ready to respond.

This response time is faster than typical human reaction time (on the order of 150-250 milliseconds), making it well-suited for real-time voice agents and conversational AI.


🧠 Core Capabilities

🚀 Cache-Aware Streaming Architecture

The key to Nemotron's ultra-low latency lies in its cache-aware design, optimized specifically for continuous audio streams.

Instead of re-encoding previously processed speech, intermediate features are cached directly in GPU memory (VRAM). When new audio frames arrive, the model performs incremental computation, processing only the newly arrived frames rather than the entire history.

Conceptually, this works like a bookmark while reading: previously processed content is remembered, and only the next page is read. This approach eliminates the primary bottleneck in long-form streaming speech recognition.
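
To make this concrete, here is a minimal, runnable sketch of cached incremental encoding. It uses a plain PyTorch GRU as a stand-in encoder; the real model is a cache-aware Conformer, and none of the names below are NeMo APIs.

import torch

# Stand-in encoder: any model whose state can be carried between calls.
encoder = torch.nn.GRU(input_size=80, hidden_size=256, batch_first=True)
cache = None  # intermediate state kept resident in (GPU) memory

def process_chunk(new_frames):
    """Encode only the newly arrived frames; `cache` supplies the history."""
    global cache
    features, cache = encoder(new_frames, cache)
    return features

# Feed the stream chunk by chunk; past audio is never re-encoded,
# so per-chunk cost stays constant as the utterance grows.
for _ in range(5):
    chunk = torch.randn(1, 10, 80)  # (batch, new frames, feature dim)
    out = process_chunk(chunk)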


📈 Improved Throughput and Cost Efficiency

Compared with traditional buffered streaming methods, Nemotron Speech ASR achieves significantly higher throughput under the same GPU memory constraints. This enables:

  • More concurrent audio streams per GPU
  • Lower operational cost at scale
  • Stable end-to-end latency within 500ms

For production voice systems, this translates directly into improved scalability and reduced infrastructure expense.
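
Extending the sketch above, the concurrency gain follows from batching: each live stream contributes only its new frames plus a small per-stream cache, so one forward pass can serve many streams at once (again illustrative, not NeMo's API).

import torch

encoder = torch.nn.GRU(input_size=80, hidden_size=256, batch_first=True)
num_streams = 32
caches = torch.zeros(1, num_streams, 256)  # one cache slot per live stream

# One decoding step: 10 new frames from each of 32 streams, one GPU call.
new_frames = torch.randn(num_streams, 10, 80)
features, caches = encoder(new_frames, caches)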


🧩 Flexible Dynamic Runtime Modes

Nemotron Speech ASR supports multiple runtime latency configurations without retraining:

  • 80ms / 160ms: Ultra-low latency modes for interactive use cases such as in-game voice chat or live translation
  • 560ms / 1.12s: Higher-accuracy modes suitable for meeting transcription and documentation

Latency mode selection is controlled entirely via inference-time parameters. A single model adapts to multiple application scenarios, simplifying deployment and maintenance. The model also provides native punctuation and capitalization, reducing post-processing overhead.
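
The four modes line up with simple right-context arithmetic, assuming the ~80ms encoder frame stride typical of FastConformer-style streaming models. This mapping is inferred from the numbers above and the att_context_size parameter shown later, not taken from NVIDIA's documentation:

FRAME_MS = 80  # assumed encoder frame stride

# Right-context frames = second entry of att_context_size (see the
# streaming inference script below).
for right_context in (0, 1, 6, 13):
    print(f"att_context_size=[70,{right_context}] -> ~{(1 + right_context) * FRAME_MS} ms")

# Prints ~80 ms, ~160 ms, ~560 ms, ~1120 ms (the four advertised modes).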


🧪 Voice Agent Stack Integration

NVIDIA positions Nemotron Speech ASR as part of a complete, runnable voice agent stack rather than a standalone model.

Component    Model
---------    -------------------
ASR          Nemotron Speech ASR
LLM          Nemotron 3 Nano 30B
TTS          Magpie

This integrated approach lowers the barrier to building production-ready voice agents.


๐Ÿ› ๏ธ Quick Start with NVIDIA NeMo
#

The initial open-source release includes the 0.6B-parameter Nemotron Speech ASR model. Training, fine-tuning, and inference are all handled via NVIDIA NeMo.

Installation

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]

Loading the Pretrained Model

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/nemotron-speech-streaming-en-0.6b"
)
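
Before wiring up streaming, the model can be sanity-checked with NeMo's standard offline transcribe() call. The audio path is a placeholder; depending on the NeMo version, results are plain strings or hypothesis objects:

# Offline sanity check on a local (typically 16 kHz mono) WAV file.
results = asr_model.transcribe(["/path/to/audio.wav"])
print(results[0])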

Streaming Inference (Script-Based)

NVIDIA provides a ready-to-run inference script within the NeMo repository:

cd NeMo
python examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py \
    model_path=<model_path> \
    dataset_manifest=<dataset_manifest> \
    batch_size=<batch_size> \
    att_context_size="[70,13]" \
    output_path=<output_folder>

The second value in att_context_size controls right-context latency and can be adjusted dynamically.
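
The dataset_manifest argument expects a standard NeMo manifest: a JSON-lines file with one entry per audio file. A minimal way to create one (paths and duration are placeholders):

import json

# One JSON object per line; "text" may be left empty when only transcribing.
entry = {"audio_filepath": "/path/to/audio.wav", "duration": 12.3, "text": ""}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")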


Streaming Inference (Pipeline API)

Streaming inference can also be executed via NeMo's pipeline interface:

from nemo.collections.asr.inference.factory.pipeline_builder import PipelineBuilder
from omegaconf import OmegaConf

# Pipeline settings (model, latency mode, etc.) live in a YAML config.
cfg_path = "cache_aware_rnnt.yaml"
cfg = OmegaConf.load(cfg_path)

# Paths to the audio files to transcribe.
audios = ["/path/to/your/audio.wav"]

# Build the cache-aware streaming pipeline and run it over the files.
pipeline = PipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audios)

for entry in output:
    print(entry["text"])

🔮 Final Thoughts

By open-sourcing Nemotron Speech ASR, NVIDIA is signaling a broader shift in real-time voice systems. With recognition latency effectively solved and long-form speech no longer degrading performance, the primary bottleneck is moving away from algorithms and toward application design.

This release enables developers to focus on building richer, more capable voice agents, where responsiveness is no longer a limiting factor but a given.
