
NVIDIA GH200: Big-Memory Superchip for the Desktop


Researchers who study microprocessors may recall that the original Intel 8086/8088 systems lacked a Floating-Point Unit (FPU). Motherboards typically included an extra socket for the optional 8087 math coprocessor. Today, FPUs are fully integrated into CPUs—yet in another sense, we are now experiencing a new “coprocessor moment.”

Modern systems still rely on an external SIMD processor—the GPU—for massive parallel math workloads. With the arrival of the NVIDIA GH200 Grace Hopper Superchip and AMD Instinct MI300A APU, the industry is witnessing a similar evolutionary shift: integrating CPU and GPU into a tightly coupled package with a unified memory system, dramatically boosting performance for HPC and GenAI workloads.

Farewell to PCIe Bottlenecks

Traditionally, GPUs communicate with CPUs over the PCIe bus. CPU and GPU memory remain separate, requiring data to move back and forth across the PCIe interface.

Even with PCIe Gen5 x16, usable bandwidth tops out at ~63 GB/s per direction, limiting data movement for memory-intensive workloads.

The NVIDIA GH200 eliminates this bottleneck using NVLink-C2C, which delivers 900 GB/s of total bidirectional bandwidth, roughly 7× that of PCIe Gen5 x16. More importantly, GH200 provides a single coherent CPU–GPU memory domain. The Grace CPU includes up to 480 GB of LPDDR5X (with ECC), while the Hopper GPU includes 96 GB of HBM3 or 144 GB of HBM3e. Together, the system delivers 576 GB to 624 GB of unified memory.

Logical Overview of the NVIDIA GH200 Grace Hopper Superchip
Figure 1. Logical Overview of the NVIDIA GH200 Grace Hopper Superchip
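To make the programming-model difference concrete, here is a minimal CUDA sketch (my illustration, not from the article) contrasting the two paths: the explicit staging that a discrete PCIe-attached GPU requires, and the in-place access that a coherent GH200-class system permits, assuming a driver that exposes system-allocated memory to the GPU.

```cuda
// Minimal sketch (CUDA C++). Error checking omitted for brevity.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 20;

    // Conventional discrete-GPU path: stage data across PCIe in both directions.
    float *h = (float *)malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    // Coherent-superchip path: with NVLink-C2C hardware coherence (and driver
    // support for system-allocated memory, as on GH200), the kernel can work
    // on the malloc'd buffer in place -- no staging copies at all.
    scale<<<(n + 255) / 256, 256>>>(h, 2.0f, n);  // valid only on coherent platforms
    cudaDeviceSynchronize();

    printf("h[0] = %f\n", h[0]);
    free(h);
    return 0;
}
```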

AMD’s MI300A also uses a unified memory design, offering 128 GB of HBM3 (with about 5.3 TB/s of peak bandwidth) shared coherently between CPU and GPU over Infinity Fabric. While MI300A currently lacks DDR expansion, future CXL-based memory growth is expected.

The key breakthrough of both GH200 and MI300A is the presentation of a single large memory domain, which is critical for HPC and GenAI, where model size and memory locality dominate performance. Traditional GPU memory limits often force a move to distributed computing, but GH200 nodes can scale unified memory further via NVLink fabrics (e.g., the rack-scale GH200 NVL32 used by AWS, which reaches about 20 TB).
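A quick back-of-the-envelope calculation (my own illustration, using a hypothetical 70B-parameter model) shows why the single large domain matters:

```cpp
// Illustrative sizing only; the 70B-parameter model is a hypothetical example.
#include <cstdio>

int main() {
    const double params          = 70e9;  // parameters in a mid-size LLM
    const double bytes_per_param = 2.0;   // FP16 weights
    const double weights_gb = params * bytes_per_param / 1e9;  // decimal GB

    printf("FP16 weights alone: %.0f GB\n", weights_gb);  // prints 140 GB
    // 140 GB already exceeds the 96-144 GB of HBM on a single Hopper GPU,
    // yet fits comfortably inside GH200's 576-624 GB unified memory,
    // avoiding the need to shard the model across multiple GPUs.
    return 0;
}
```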

A Superchip Workstation on Your Desk

A major trend in computing is the transition of high-end technologies into commodity systems. Just as multi-core CPUs and high-bandwidth memory migrated from supercomputers to consumer devices, unified CPU–GPU memory architectures will eventually follow.

Recently, Phoronix tester Michael Larabel benchmarked a GH200 workstation provided by GPTshop.ai in Germany. This workstation places a full GH200 node—with 576 GB unified memory—directly beside a desk.

The tower system (Figure 2) includes:

  • One GH200 Grace Hopper Superchip
  • Dual 2000W+ power supplies
  • QCT motherboard
  • Options for SSDs and Bluefield/ConnectX adapters
  • Programmable TDP from 450W to 1000W
  • 25 dB air cooling (liquid cooling optional)

The starting price for the 576 GB model is €47,500 (~$41,000 excluding VAT). While expensive, consider that a single NVIDIA H100 PCIe (80 GB) costs $30k–$35k, and still requires a host system—and lacks unified memory.

GPTshop NVIDIA GH200 Workstation
Figure 2. Internal view of the GPTshop NVIDIA GH200 Workstation. (Source: GPTshop.ai)

For HPC and GenAI developers, having half a terabyte of fully coherent CPU–GPU memory on a workstation is groundbreaking.

Preliminary Benchmarks

GPTshop granted Phoronix remote access to the GH200 machine for early testing. These results are preliminary and do not use the Hopper GPU—only the Grace CPU. Full GPU-accelerated benchmarks are planned.

The test environment included:

  • Ubuntu 23.10
  • Linux kernel 6.5
  • GCC 13
  • Comparative testing against Intel Xeon Scalable, AMD EPYC, and Ampere Altra Max

Power metrics were unavailable, as current GH200 systems do not expose standard Linux power interfaces (e.g., RAPL, HWMON).
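For context (my addition, not from the Phoronix testing), benchmark harnesses typically sample energy through the standard Linux powercap sysfs nodes, which is exactly the interface missing here. A minimal probe looks like this:

```cpp
// Sketch: how tools usually read package energy via the Linux RAPL powercap
// sysfs interface on x86 hosts. On current GH200 systems this node is absent.
#include <fstream>
#include <iostream>

int main() {
    // Cumulative energy counter (microjoules) for CPU package 0.
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    long long uj;
    if (f >> uj) {
        std::cout << "package-0 energy: " << uj << " uJ\n";
    } else {
        // This is the situation Phoronix hit on GH200: nothing to read.
        std::cout << "RAPL powercap interface not exposed on this platform\n";
    }
    return 0;
}
```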

Still, this marks the first public independent benchmark data for GH200 outside NVIDIA.

HPCG: Strong Memory Bandwidth Performance

The first major benchmark is HPCG, heavily limited by memory bandwidth.
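HPCG stresses sparse matrix–vector products and similar low-arithmetic-intensity kernels, so scores track sustained memory bandwidth rather than peak FLOPS. As a rough illustration (my sketch, not HPCG code), a STREAM-style triad loop shows the pattern:

```cpp
// Bandwidth-bound "triad" loop in the STREAM style: 2 FLOPs per 24 bytes of
// traffic (~0.08 FLOP/byte), so runtime is set by memory bandwidth, much
// like HPCG's sparse kernels.
#include <cstdio>
#include <vector>

int main() {
    const long long n = 1LL << 26;            // ~64M doubles per array (~512 MB each)
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    const double s = 3.0;

    #pragma omp parallel for                  // build with -fopenmp to span cores
    for (long long i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];               // 2 FLOPs, 3 * 8 bytes moved

    printf("a[0] = %f\n", a[0]);              // expect 7.0
    return 0;
}
```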

NVIDIA GH200 running the HPCG benchmark
Figure 3. Results of the NVIDIA GH200 running the HPCG benchmark (Source: Phoronix)

The 72-core Grace CPU reached 42 GFLOPS, comparable to:

  • Xeon Platinum 8380 (2P): 40 GFLOPS
  • EPYC 9654 Genoa (2P): 44 GFLOPS

Notably, Grace outperformed the 128-core Ampere Altra Max by a wide margin.

Other results were similarly impressive. In the NWChem (C240 Buckyball) benchmark:

NVIDIA GH200 running the NWChem benchmark
Figure 4. Results of the NVIDIA GH200 running the NWChem benchmark (Source: Phoronix)

The GH200 completed the test in 1404 seconds, second only to the dual-socket EPYC 9554 at 1323 seconds.

Future Outlook

The GH200 and MI300A represent a transformative architecture shift. Just as the 8087 coprocessor ultimately merged into mainstream CPU designs, high-end GPU and SIMD acceleration is now being drawn into an integrated CPU–GPU package.

While these systems remain expensive today, demand from GenAI, HPC, and scientific computing will likely push these architectures down into more affordable markets over time.

Having a personal workstation capable of running large LLMs—or memory-intensive GPU-optimized HPC codes—marks a significant milestone. Cloud and data centers will continue to dominate scale-out computing, but the ability to have a powerful local system “with a reset button” is invaluable for researchers and developers alike.
