MLPerf finds NVIDIA partners train AI models the fastest with GPU-based systems
Tue, 6th Jul 2021

According to the latest MLPerf results, NVIDIA's partners deliver the fastest GPU-accelerated systems to train AI models.

Seven companies submitted at least a dozen commercially available systems, the majority of them NVIDIA-Certified, for testing in the industry benchmarks.

Dell, Fujitsu, GIGABYTE, Inspur, Lenovo, Nettrix, and Supermicro joined NVIDIA to demonstrate the results of neural network training with NVIDIA A100 Tensor Core GPUs.

Compared to last year's scores, they delivered up to 3.5x more performance. For the larger jobs requiring more resources, a record 4,096 GPUs were used, more than in any other submission.

The benchmarks are based on popular AI workloads and scenarios, covering computer vision, natural-language processing, recommendation systems, reinforcement learning, and more. The training benchmarks measure how long it takes to train a model to a target quality.
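For a sense of what "time to train" means here, the toy sketch below trains a small classifier until it reaches a quality target and reports the elapsed wall-clock time, which is the shape of the metric the training benchmarks report. It is an illustrative assumption-laden example, not MLPerf code: the logistic-regression model, synthetic data, and 95% accuracy target are all placeholders.

```python
# Minimal sketch of the "time-to-train" idea: train until a quality target is
# reached, then report the wall-clock time. Not MLPerf code; the model, data,
# and target below are illustrative placeholders.
import time
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data (hypothetical stand-in for a real workload).
X = rng.normal(size=(2000, 20))
true_w = rng.normal(size=20)
y = (X @ true_w + rng.normal(scale=0.5, size=2000) > 0).astype(float)

w = np.zeros(20)
TARGET_ACCURACY = 0.95  # assumed target, analogous to MLPerf's per-task quality targets
LEARNING_RATE = 0.1

start = time.perf_counter()
for epoch in range(1000):
    # One full-batch logistic-regression gradient step.
    preds = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (preds - y) / len(y)
    w -= LEARNING_RATE * grad

    # MLPerf checks the target on held-out data; training accuracy is used
    # here purely for brevity.
    accuracy = float(((preds > 0.5) == y).mean())
    if accuracy >= TARGET_ACCURACY:
        break

elapsed = time.perf_counter() - start
print(f"Reached {accuracy:.3f} accuracy in {elapsed:.3f}s ({epoch + 1} epochs)")
```

In the real benchmarks the same idea is applied to full-scale workloads such as image classification or language modelling, so the reported score reflects the whole system: GPUs, interconnect, and software stack.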

NVIDIA's Selene supercomputer, based on an NVIDIA DGX SuperPOD, set all eight records among commercially available systems.

“We ran the at-scale tests on Selene, the fastest commercial AI supercomputer in the world, according to the latest TOP500 rankings,” says NVIDIA.

“It's based on the same NVIDIA DGX SuperPOD architecture that powers a dozen other systems on the list. The ability to scale to large clusters is the toughest challenge in AI and one of our core strengths.”

The MLPerf results demonstrate performance across a variety of NVIDIA-based AI platforms with new systems. They span entry-level edge servers to AI supercomputers that accommodate thousands of GPUs.

The seven partners participating in the latest benchmarks are among nearly two dozen cloud-service providers and OEMs offering, or planning, online instances, servers and PCIe cards that use NVIDIA A100 GPUs, including almost 40 NVIDIA-Certified Systems.

“Our ecosystem offers customers choices in a wide range of deployment models, from instances that are rentable by the minute to on-prem servers and managed services,” says NVIDIA.

“Results across all the MLPerf tests show our performance keeps rising over time. That comes from a platform with software that's mature and constantly improving, so teams can get started fast with systems that keep getting better.”

The benchmarks can help users find AI products that meet the requirements of some of the world's largest and most advanced factories. For example, TSMC, the world's largest contract chipmaker, uses machine learning to improve optical proximity correction and etch simulation.

“To fully realise the potential of machine learning in model training and inference, we're working with the NVIDIA engineering team to port our Maxwell simulation and inverse lithography technology engine to GPUs, and see very significant speedups,” says TSMC director of OPC department, Danping Peng.

“The MLPerf benchmark is an important factor in our decision making.”