Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v5.0 benchmark suite, which delivers machine learning (ML) system performance benchmarking in an architecture-neutral, representative, and reproducible manner. The results highlight that the AI community is focusing much of its attention on generative AI scenarios, and that the combination of recent hardware and software advances optimized for generative AI has led to dramatic performance improvements over the past year.
The MLPerf Inference benchmark suite, which encompasses both datacenter and edge systems, is designed to measure how quickly systems can run AI and ML models across a variety of workloads. The open-source and peer-reviewed benchmark suite creates a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI systems. This round of MLPerf Inference results also includes tests for four new benchmarks: Llama 3.1 405B, Llama 2 70B Interactive for low-latency applications, RGAT, and Automotive PointPainting for 3D object detection.
Llama 2 70B generative AI test takes center stage
The Inference v5.0 results show that generative AI scenarios have gained momentum. Over the last year, submissions to the Llama 2 70B benchmark test, which implements a large generative AI inference workload based on a widely referenced open-source model, have increased 2.5x. With the v5.0 release, Llama 2 70B has supplanted ResNet-50 as the test with the highest submission rate.

The performance results for Llama 2 70B have also skyrocketed over the past year: the median submitted score has doubled, and the best score is 3.3 times faster than in Inference v4.0.

“It’s clear now that much of the ecosystem is focused squarely on deploying generative AI, and that the performance benchmarking feedback loop is working,” said David Kanter, head of MLPerf at MLCommons. “We’re seeing an unprecedented flood of new generations of accelerators. The hardware is paired with new software techniques, including aligned support across hardware and software for the FP4 data format. With these advances, the community is setting new records for generative AI inference performance.”
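The FP4 data format mentioned above packs each value into just four bits. As a hedged illustration of why that is attractive for inference, the sketch below decodes the commonly used E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1); the function name is ours, and this is not any submitter's actual kernel code.

```python
# Illustrative decoder for the 4-bit E2M1 floating-point layout
# (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1).
# This is a sketch for explanation, not vendor kernel code.

def fp4_e2m1_decode(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) to its float value."""
    assert 0 <= code <= 15
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11   # 2 exponent bits
    man = code & 0b1           # 1 mantissa bit
    if exp == 0:               # subnormal: no implicit leading 1
        return sign * man * 0.5
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - 1)

# Only eight non-negative magnitudes are representable:
# 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0
values = sorted({fp4_e2m1_decode(c) for c in range(8)})
```

With so few representable values, weights and activations must be carefully scaled before quantization, which is why coordinated hardware and software support for the format matters.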
This round of benchmark results includes submissions on six newly available or soon-to-be-shipped processors:
- AMD Instinct MI325X
- Intel Xeon 6980P “Granite Rapids”
- Google TPU Trillium (TPU v6e)
- NVIDIA B200
- NVIDIA Jetson AGX Thor 128
- NVIDIA GB200