Nvidia is employing software emulation to boost double-precision floating-point (FP64) performance on its AI accelerators for high-performance computing (HPC) and scientific applications, according to The Register. The strategy comes as the company unveils its Rubin GPUs, which deliver 33 teraFLOPS of peak FP64 performance, 1 teraFLOP less than the H100 GPU.
Nvidia’s CUDA libraries can reach up to 200 teraFLOPS of FP64 matrix performance through software emulation, a 4.4x increase over the Blackwell accelerators’ native hardware capability. Dan Ernst, Nvidia’s senior director of supercomputing products, stated that the accuracy of emulation matches or exceeds that of the tensor core hardware. However, Nicholas Malaya, an AMD fellow, questioned whether emulated FP64 holds up as well in real physical simulations as it does in benchmarks.
FP64 remains critical for scientific computing because of its precision and dynamic range: a 64-bit float can encode more than 18.44 quintillion (2^64) distinct values, compared with the 256 values of the FP8 format used in AI models. Unlike AI workloads, HPC simulations require that precision to prevent error propagation that can destabilize a run, according to Malaya.
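A toy numpy sketch (not vendor code) of the error-propagation point: repeatedly adding a small increment in half precision stalls once the increment drops below half the spacing between representable values, while the FP64 result stays accurate.

```python
import numpy as np

# Toy illustration of low-precision error accumulation: add a small
# increment 20,000 times in FP16. Once the running sum grows large enough,
# the increment falls below half the gap between adjacent representable
# FP16 values and the accumulation silently stalls.
increment = np.float16(1e-4)
steps = 20_000

low = np.float16(0.0)
for _ in range(steps):
    low = np.float16(low + increment)   # result is rounded after every add

high = np.float64(1e-4) * steps         # the FP64 answer, effectively exact

print(float(low), float(high))          # the FP16 sum stalls far below 2.0
```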
The concept of using lower-precision data types to emulate FP64 dates back to the mid-20th century. In early 2024, researchers from the Tokyo and Shibaura institutes of technology published a paper demonstrating that FP64 matrix operations could be decomposed into multiple INT8 operations on Nvidia’s tensor cores, achieving higher-than-native performance. This method, known as the Ozaki scheme, forms the basis for Nvidia’s FP64 emulation libraries, released late last year. Ernst confirmed the emulated computation maintains FP64 precision, differing only in its hardware execution method.
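The decomposition idea can be sketched in a few lines of numpy. This is a simplified toy version of the Ozaki-style approach, not Nvidia's library code: each FP64 matrix is converted to fixed point, peeled into 8-bit slices whose pairwise products are exact in integer arithmetic, and the scaled partial products are recombined in floating point. The slice and bit counts below are illustrative choices.

```python
import numpy as np

# Simplified Ozaki-style sketch: split fixed-point copies of FP64 matrices
# into 8-bit digits, multiply the digits exactly with integer matmuls, and
# recombine the weighted partial products in floating point.
rng = np.random.default_rng(0)
n = 4
A = rng.uniform(-1, 1, (n, n))
B = rng.uniform(-1, 1, (n, n))

BITS, SLICES = 8, 6                      # keep 48 bits of each mantissa
SCALE = 2 ** (BITS * SLICES)

def to_slices(M):
    """Fixed-point encode M, then peel off 8-bit digits, low to high."""
    rem = np.round(M * SCALE).astype(np.int64)
    digits = []
    for _ in range(SLICES):
        digits.append(rem % (1 << BITS))  # digit in [0, 255]
        rem = rem // (1 << BITS)          # exact floor division
    digits.append(rem)                    # signed carry digit (0 or -1 here)
    return digits

sA, sB = to_slices(A), to_slices(B)

C = np.zeros((n, n))
for i, a in enumerate(sA):
    for j, b in enumerate(sB):
        partial = a @ b                   # exact in int64 at these sizes
        C += partial.astype(np.float64) * (2.0 ** (BITS * (i + j)) / SCALE ** 2)

print(np.max(np.abs(C - A @ B)))          # deviation from the native FP64 product
```

The real scheme targets tensor-core INT8 matrix units and adds error-free transformations the toy version omits, but the structure is the same: many low-precision exact products, recombined with the right power-of-two weights.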
Modern GPUs feature low-precision tensor cores, such as those in Rubin, which offer 35 petaFLOPS of dense FP4 compute. These cores are over 1,000x faster than FP64-specific components. Ernst explained that the efficiency of these low-precision cores led to exploring their use for FP64 emulation, aligning with the historical trend in supercomputing of leveraging available hardware.
AMD has expressed reservations about the accuracy of FP64 emulation. Malaya noted that the approach performs well for well-conditioned numerical systems, such as the High Performance Linpack (HPL) benchmark, but can falter on the ill-conditioned systems found in materials science or combustion codes. He also pointed out that Nvidia’s FP64 emulation algorithms are not fully IEEE compliant, failing to account for nuances such as positive versus negative zeros or “not a number” (NaN) values. Such discrepancies can allow small errors to propagate and affect final results. Malaya added that the Ozaki scheme roughly doubles memory consumption for FP64 matrices. AMD’s upcoming MI430X will instead strengthen native double- and single-precision hardware performance using its chiplet architecture.
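The IEEE-754 corner cases at issue are easy to demonstrate (a generic Python illustration, unrelated to either vendor's code): signed zeros compare equal yet remain distinguishable and can change downstream results, and NaN never compares equal to itself while propagating through arithmetic.

```python
import math

# Signed zeros: equal under comparison, but the sign bit survives and
# can change the result of sign-sensitive functions.
pos, neg = 0.0, -0.0
print(pos == neg)                                   # True
print(math.copysign(1.0, neg))                      # -1.0
print(math.atan2(0.0, pos), math.atan2(0.0, neg))   # 0.0 vs pi

# NaN: never equal to itself, and it poisons any arithmetic it touches.
nan = float("nan")
print(nan == nan)                                   # False
print(nan + 1.0)                                    # nan
```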
Ernst acknowledged some limitations but contended that issues like positive and negative zeros are not critical for most HPC practitioners. Nvidia has developed supplemental algorithms to detect and mitigate problems such as NaN and infinite values. He said the increased memory overhead applies to the operation, not the entire application, with typical matrices occupying only a few gigabytes. Ernst also argued that IEEE compliance issues rarely arise in matrix multiplication, especially in DGEMM operations.
Emulation primarily benefits the subset of HPC applications that rely on dense general matrix multiply (DGEMM) operations. Malaya estimated that 60% to 70% of HPC workloads, particularly those built on vector fused multiply-add (FMA) operations, see little to no benefit from emulation. For vector-heavy workloads such as computational fluid dynamics, Nvidia’s Rubin GPUs must fall back on the slower FP64 vector units in their CUDA cores. Ernst countered that theoretical FLOPS do not always translate into usable performance, particularly when memory bandwidth is the bottleneck; Rubin, with 22 TB/s of HBM4 bandwidth, is expected to deliver higher real-world performance on these workloads despite its slower vector FP64 rate.
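Ernst's bandwidth argument can be checked with a back-of-envelope roofline model. The peak and bandwidth figures below are the ones cited in the article (33 teraFLOPS FP64, 22 TB/s HBM4); the kernel arithmetic intensities are generic textbook values, not vendor measurements.

```python
# Roofline model: attainable performance is capped either by peak compute
# or by (arithmetic intensity) x (memory bandwidth), whichever is lower.
PEAK_FP64 = 33e12        # FLOP/s, Rubin peak FP64 as cited in the article
BANDWIDTH = 22e12        # bytes/s, 22 TB/s HBM4 as cited in the article

def attainable(flops_per_byte):
    return min(PEAK_FP64, flops_per_byte * BANDWIDTH)

# DAXPY (y = a*x + y): 2 FLOPs per 24 bytes of FP64 traffic (read x, y; write y)
daxpy_ai = 2 / 24
# Large DGEMM reuses data heavily; ~50 FLOP/byte is an illustrative value
dgemm_ai = 50.0

print(f"machine balance:  {PEAK_FP64 / BANDWIDTH:.2f} FLOP/byte")
print(f"DAXPY attainable: {attainable(daxpy_ai) / 1e12:.2f} TFLOPS (bandwidth-bound)")
print(f"DGEMM attainable: {attainable(dgemm_ai) / 1e12:.2f} TFLOPS (compute-bound)")
```

The low-intensity vector kernel is limited to under 2 TFLOPS by memory traffic alone, far below peak, which is why raw vector FP64 throughput matters less for such workloads than bandwidth does.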
The viability of FP64 emulation will be tested as new supercomputers incorporating Nvidia’s Blackwell and Rubin GPUs become operational. The algorithms can improve over time given their software-based nature. Malaya indicated that AMD is also exploring FP64 emulation on chips like the MI355X via software flags. He emphasized that IEEE compliance would validate the approach by guaranteeing result consistency with dedicated silicon. Malaya suggested that the community should establish a suite of applications to evaluate the reliability of emulation across different use cases.