GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity —Panmnesia's CXL IP claims double-digit nanosecond latency (2024)

GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity —Panmnesia's CXL IP claims double-digit nanosecond latency (1)

Modern GPUs for AI and HPC applications come with a finite amount of high-bandwidth memory (HBM) built into the device, limiting their performance in AI and other workloads. However, new tech will allow companies to expand GPU memory capacity by slotting in more memory with devices connected to the PCIe bus instead of being limited to the memory built into the GPU — it even allows using SSDs for memory capacity expansion, too. Panmnesia, a company backed by South Korea's renowned KAIST research institute, has developed alow-latency CXL IPthat could be used to expand GPU memory using CXL memory expanders.

The memory requirements of more advanced datasets for AI training are growing rapidly, which means that AI companies either have to buy new GPUs, use less sophisticated datasets, or use CPU memory at the cost of performance. Although CXL is a protocol that formally works on top of a PCIe link, thus enabling users to connect more memory to a system via the PCIe bus, the technology has to be recognized by an ASIC and its subsystem, so just adding a CXL controller is not enough to make the technology work, especially on a GPU.

Panmnesia faced challenges integrating CXL for GPU memory expansion due to the absence of a CXL logic fabric and subsystems that support DRAM and/or SSD endpoints in GPUs. In addition, GPU cache and memory subsystems do not recognize any expansions except unified virtual memory (UVM), which tends to be slow.

GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity —Panmnesia's CXL IP claims double-digit nanosecond latency (2)

To address this, Panmnesia developed a CXL 3.1-compliant root complex (RC) equipped with multiple root ports (RPs) that support external memory over PCIe) and a host bridge with a host-managed device memory (HDM) decoder that connects to the GPU's system bus. The HDM decoder, responsible for managing the address ranges of system memory, essentially makes the GPU's memory subsystem 'think' that it is dealing with system memory, but in reality, the subsystem uses PCIe-connected DRAM or NAND.That means either DDR5 or SSDs can be used to expand the GPU memory pool.

GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity —Panmnesia's CXL IP claims double-digit nanosecond latency (3)

The solution (based on a custom GPU and marked as CXL-Opt) underwent extensive testing, showing a two-digit nanosecond round-trip latency (compared to 250ns in the case of prototypes developed by Samsung and Meta, which is marked as CXL-Proto in the graphs below), including the time needed for protocol conversion between standard memory operations and CXL flit transmissions, according to Panmnesia. It has been successfully integrated into both memory expanders and GPU/CPU prototypes at the hardware RTL, demonstrating its compatibility with various computing hardware.

GPUs can now use PCIe-attached memory or SSDs to boost VRAM capacity —Panmnesia's CXL IP claims double-digit nanosecond latency (4)

As tested by Panmnesia, UVM performs the worst among all tested GPU kernels due to overhead from host runtime intervention during page faults and transferring data at the page level, which often exceeds the GPU's needs. In contrast, CXL allows direct access to expanded storage via load/store instructions, eliminating these issues.

Consequently, CXL-Proto's execution time is 1.94 times shorter than UVM. Panmnesia's CXL-Opt further reduces execution time by 1.66 times, with an optimized controller achieving two-digit nanosecond latency and minimizing read/write latency. This pattern is also evident in another figure, which displays IPC values recorded during GPU kernel execution. It reveals that Panmnesia's CXL-Opt achieves performance speeds 3.22 times and 1.65 times faster than UVM and CXL-Proto, respectively.

In general, CXL support can do a lot for AI/HPC GPUs, but performance is a big question. Additionally, whether companies like AMD and Nvidia will add CXL support to their GPUs remains to be seen. If the approach of using PCIe-attached memory for GPUs does gather steam, only time will tell if the industry heavyweights will use IP blocks from companies like Panmnesia or simply develop their own tech.

Stay On the Cutting Edge: Get the Tom's Hardware Newsletter

Get Tom's Hardware's best news and in-depth reviews, straight to your inbox.

Anton Shilov

Contributing Writer

Most Popular

AMD EPYC CPU hacked onto B650 motherboard, hits 6.6 GHz with liquid nitrogen — $159 EPYC 4124P shows immense overclocking potential

AMD’s laptop OEMs decry poor support, chip supply, and communication — OEM complains the company has "left billions of US dollars lying around" due to poor execution: Reports

Intel cleared to get $3.5 billion to make advanced chips for Pentagon — Secure Enclave program ushers leading-edge CPUs to the military

Linux dev swatted and handcuffed live during a development video stream — perps remain unidentified

AMD hides Taiwan branding on Ryzen CPU packaging as it preps new chips for China market release — company uses black sticker to erase origin information

71-TiB NAS with twenty-four 4TB drives hasn't had a single drive failure for ten years — owner outlines key approaches to assure HDD longevity

Intel preps Xeon R1S CPUs with 136 PCIe 5.0 lanes — Granite Rapids rumored with up to 80 cores for single-socket platform

Elon Musk and Larry Ellison begged Nvidia CEO Jensen Huang for AI GPUs at dinner

Crucial MX500 SSD firmware susceptible to buffer overflow security vulnerability

HighPoint's blazing-fast 8-slot NVMe Gen 4 RAID card is now compliant with immersion-cooled server environments to boost efficiency and reliability

Intel Core Ultra 200 CPU specs allegedly leaked — Arrow Lake tops out at 24 cores and 5.7 GHz boost clock at 250W