Intel has officially launched its next-generation Gaudi 3 AI accelerator, first announced in April, positioning it directly against NVIDIA’s H100 GPU in the high-performance AI compute market. With NVIDIA’s Blackwell series also approaching production, competition in AI silicon has never been fiercer.
According to industry forecasts, the global semiconductor market could reach $1 trillion by 2030, driven primarily by AI workloads. Yet, as of 2023, only 10% of companies had successfully commercialized their AIGC (AI-Generated Content) projects—highlighting both the opportunity and the challenge ahead.
⚙️ Gaudi 2: The Foundation of Intel’s AI Play #
Intel’s Gaudi 2, launched in 2022 (and introduced to the Chinese market in 2023), set a strong baseline, pairing solid deep-learning performance with competitive price-performance.
Fabricated on TSMC’s 7nm process, Gaudi 2 integrates:
- 24 Tensor Processor Cores (TPCs)
- 48MB SRAM cache
- 24× 100GbE RDMA interfaces (RoCE v2)
- 96GB HBM2E memory (2.4TB/s bandwidth)
- PCIe 4.0 x16 interface
- 600W TDP (OAM form factor)
The design targets large-scale AI training and inference workloads—particularly LLMs and generative AI.
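For orientation, here is a minimal sketch of what targeting a Gaudi card from PyTorch looks like through the `habana_frameworks` bridge in Intel’s SynapseAI software stack. The shapes are arbitrary, and the snippet assumes the Gaudi stack and drivers are installed:

```python
# Minimal sketch: dispatching a PyTorch matmul to a Gaudi device ("hpu").
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

device = torch.device("hpu")

x = torch.randn(1024, 1024, dtype=torch.bfloat16, device=device)
w = torch.randn(1024, 1024, dtype=torch.bfloat16, device=device)

y = x @ w           # matmul executed on the accelerator's matrix engines
htcore.mark_step()  # flush the accumulated graph in lazy-execution mode

print(y.shape)  # torch.Size([1024, 1024])
```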
🚀 Gaudi 3: A Massive Architectural Leap #
The new Gaudi 3 brings dramatic generational upgrades across compute, memory, and networking.
- Process: TSMC 5nm
- TPCs: 64 (up from 24)
- MMEs (Matrix Multiplication Engines): 8 (up from 2)
- Media decoders: 14 (up from 8)
- SRAM cache: 96MB (2× increase)
- SRAM bandwidth: 12.8TB/s (2× increase)
Core Performance #
- MME BF16/FP8: 1,835 TFLOPS (1.835 PFLOPS)
- Vector BF16: 28.8 TFLOPS
These work out to roughly 3.2×, 1.1×, and 1.6× the corresponding Gaudi 2 figures (MME BF16, MME FP8, and vector BF16, respectively).
Memory and I/O #
- HBM2E: 128GB (8 stacks, up from 96GB)
- Memory bandwidth: 3.7TB/s
- RDMA interfaces: 24× 200Gb Ethernet
- Bidirectional interconnect: 1.2TB/s
- Host interface bandwidth: 128GB/s
- System bus: PCIe 5.0 x16
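Some back-of-the-envelope arithmetic ties the compute and memory figures together. The numbers below are worked directly from the specs above, not vendor benchmarks:

```python
# Worked arithmetic from the Gaudi 3 spec figures above.
mme_peak = 1835e12   # 1,835 TFLOPS peak MME throughput (BF16/FP8)
hbm_bw   = 3.7e12    # 3.7 TB/s HBM2E bandwidth
hbm_cap  = 128e9     # 128 GB HBM2E capacity

# Arithmetic intensity a kernel needs before it is compute-bound
# rather than bandwidth-bound on this part:
print(f"break-even intensity: {mme_peak / hbm_bw:.0f} FLOPs/byte")  # ~496

# Parameter capacity, weights only (KV-cache and activations are extra):
for dtype, nbytes in (("BF16", 2), ("FP8", 1)):
    print(f"{dtype}: ~{hbm_cap / nbytes / 1e9:.0f}B parameters")  # 64B / 128B
```

In other words, a 70B-parameter model fits on one card in FP8 (70GB of weights) but not in BF16 (140GB), which is useful context for the inference claims below.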
🧠 Performance vs NVIDIA H100 #
Intel claims that Gaudi 3 delivers:
- 50% faster inference on large language models (LLMs)
- 40% faster training times
- 2× better price-performance ratio versus NVIDIA’s H100
It integrates with PyTorch, Hugging Face Transformers, and diffusion-model pipelines.
Training times for Llama 2 (7B/13B) and GPT-3 (175B) models are significantly reduced, with strong inference throughput for Llama 70B and Falcon 180B as well.
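As a concrete illustration of that framework integration, Hugging Face’s `optimum-habana` library provides drop-in replacements for `Trainer` and `TrainingArguments`. The sketch below assumes that library is installed; the model, dataset, and Gaudi-config names are illustrative choices, not taken from Intel’s announcement:

```python
# Hedged sketch: fine-tuning a small causal LM on Gaudi via optimum-habana.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 family has no pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

args = GaudiTrainingArguments(
    output_dir="gaudi-ft",
    use_habana=True,                  # run on HPU devices
    use_lazy_mode=True,               # lazy (graph-accumulation) execution
    gaudi_config_name="Habana/gpt2",  # kernel/precision config from the Hub
    per_device_train_batch_size=4,
    bf16=True,
)

trainer = GaudiTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```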
🌐 Scalable, Open Architecture #
Gaudi 3 embraces an open, Ethernet-based networking design, enabling flexible scaling from single-node to supercluster deployments. It supports large-scale training, fine-tuning, and inference—all without proprietary interconnects.
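On the software side, that scale-out path shows up as a standard `torch.distributed` backend (HCCL) rather than a proprietary API. A minimal sketch, assuming one process per accelerator launched with `torchrun` or `mpirun`:

```python
# Hedged sketch: all-reduce across Gaudi cards/nodes over the Ethernet fabric.
import torch
import torch.distributed as dist
import habana_frameworks.torch.core as htcore
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

dist.init_process_group(backend="hccl")  # rendezvous via the usual env vars
rank, world = dist.get_rank(), dist.get_world_size()

t = torch.ones(1024, device=torch.device("hpu")) * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # RDMA all-reduce over RoCE
htcore.mark_step()

print(f"rank {rank}/{world}: t[0] = {t[0].item()}")  # sum of all ranks
dist.destroy_process_group()
```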
🧩 Deployment Options #
Intel offers three form factors for Gaudi 3 to fit different infrastructure needs:
- OAM 2.0 Mezzanine Card
  - Passive-cooled: 900W | Liquid-cooled: 1200W
  - 48× 112Gb PAM4 SerDes links
- HLB-325 Universal Baseboard
  - Supports up to 8 Gaudi 3 accelerators
- HL-338 PCIe 5.0 x16 Expansion Card
  - 600W peak (passive cooling)
  - Supports quad-card interconnect configurations
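A quick sanity check ties these form-factor numbers back to the interconnect figures quoted earlier. The arithmetic is worked from the specs; the lane-to-port mapping is my reading, not an Intel statement:

```python
# 24x 200GbE RDMA ports vs. the 1.2TB/s bidirectional figure quoted above.
ports, port_gbps = 24, 200
unidir_GBps = ports * port_gbps / 8            # 600 GB/s in each direction
print(f"{2 * unidir_GBps / 1000:.1f} TB/s bidirectional")  # 1.2 -- matches

# The OAM card's 48x 112Gb PAM4 SerDes plausibly serve two lanes per
# 200GbE port (48 / 2 = 24 ports); the 112-vs-100 Gb/s per-lane gap is
# line-coding and FEC overhead.
print(48 * 112, "Gb/s raw SerDes vs", ports * port_gbps, "Gb/s of ports")
```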
🤝 Ecosystem and Partners #
Intel’s Gaudi accelerators are already deployed or being adopted by:
NAVER, Bosch, IBM, Ola/Krutrim, NielsenIQ, Seekr, IFF, CtrlS Group, Bharti Airtel, Landing AI, Roboflow, and Infosys.
Notably, IBM plans to integrate Gaudi 3 into its cloud AI services.
A China-specific variant reportedly exists, capped at 450W (for both OAM and PCIe modules) to meet export and regulatory limits. Performance will likely be reduced, but exact specifications remain undisclosed.
✅ Conclusion #
With 5nm fabrication, 8× matrix engines, and 128GB of HBM2E, Intel’s Gaudi 3 marks a significant step forward in open, scalable AI compute infrastructure.
While NVIDIA’s H100 remains dominant in ecosystem maturity, Gaudi 3 delivers a compelling alternative—faster performance, better efficiency, and lower cost—especially for enterprises building large-scale LLM or generative AI infrastructure.
Intel’s focus on Ethernet-based interconnects and open software support could make Gaudi 3 a serious contender in the global AI accelerator race.