Nvidia DGX GH200: 100 Terabyte GPU Memory System by MacsHeadroom

Share This Article

Sed ut perspiciatis unde.

At COMPUTEX 2023, NVIDIA announced NVIDIA DGX GH200, which marks another breakthrough in GPU-accelerated computing to power the most demanding giant AI workloads. In addition to describing critical aspects of the NVIDIA DGX GH200 architecture, this post discusses how NVIDIA Base Command enables rapid deployment, accelerates the onboarding of users, and simplifies system management.

The unified memory programming model of GPUs has been the cornerstone of various breakthroughs in complex accelerated computing applications over the last 7 years. In 2016, NVIDIA introduced NVLink technology and the Unified Memory Programming model with CUDA-6, designed to increase the memory available to GPU-accelerated workloads.

Since then, the core of every DGX system is a GPU complex on a baseboard interconnected with NVLink in which each GPU can access the other’s memory at NVLink speed. Many such DGX with GPU complexes are interconnected with high-speed networking to form larger supercomputers such as the NVIDIA Selene supercomputer. Yet an emerging class of giant, trillion-parameter AI models will require either several months to train or cannot be solved even on today’s best supercomputers.

To empower the scientists in need of an advanced platform that can solve these extraordinary challenges, NVIDIA paired NVIDIA Grace Hopper Superchip with the NVLink Switch System, uniting up to 256 GPUs in an NVIDIA DGX GH200 system. In the DGX GH200 system, 144 terabytes of memory will be accessible to the GPU shared memory programming model at high speed over NVLink.

Compared to a single NVIDIA DGX A100 320 GB system, NVIDIA DGX GH200 provides nearly 500x more memory to the GPU shared memory programming model over NVLink, forming a giant data center-sized GPU. NVIDIA DGX GH200 is the first supercomputer to break the 100-terabyte barrier for memory accessible to GPUs over NVLink.

NVIDIA DGX GH200 system architecture

NVIDIA Grace Hopper Superchip and NVLink Switch System are the building blocks of NVIDIA DGX GH200 architecture. NVIDIA Grace Hopper Superchip combines the Grace and Hopper architectures using NVIDIA NVLink-C2C to deliver a CPU + GPU coherent memory model. The NVLink Switch System, powered by the fourth generation of NVLink technology, extends NVLink connection across superchips to create a seamless, high-bandwidth, multi-GPU system.

Each NVIDIA Grace Hopper Superchip in NVIDIA DGX GH200 has 480 GB LPDDR5 CPU memory, at eighth of the power per GB, compared with DDR5 and 96 GB of fast HBM3. NVIDIA Grace CPU and Hopper GPU are interconnected with NVLink-C2C, providing 7x more bandwidth than PCIe Gen5 at one-fifth the power.

NVLink Switch System forms a two-level, non-blocking, fat-tree NVLink fabric to fully connect 256 Grace Hopper Superchips in a DGX GH

Nvidia DGX GH200: 100 Terabyte GPU Memory System by MacsHeadroom

Nvidia DGX GH200: 100 Terabyte GPU Memory System by MacsHeadroom

Share This Article

Newsletter

NVIDIA DGX GH200 system architecture

HackTech

Leave a comment Cancel reply

Editor's Choice

Nvidia DGX GH200: 100 Terabyte GPU Memory System by MacsHeadroom

Nvidia DGX GH200: 100 Terabyte GPU Memory System by MacsHeadroom

Share This Article

Newsletter

NVIDIA DGX GH200 system architecture

HackTech

Leave a comment Cancel reply

Editor's Choice

Sign Up to Our Newsletter