Nvidia’s GPUs dominated MLPerf’s AI benchmark, showcasing top performance in generative AI tasks, but AMD and MangoBoost managed to snag a win in one category. The tests, organized by MLCommons, assessed AI inference speeds, including new benchmarks for large language models (LLMs) and graph neural networks.
SuperMicro, Hewlett Packard Enterprise, and Lenovo, among others, used systems with up to eight Nvidia chips to secure top positions. The MLPerf benchmark suite now incorporates tests for common generative AI workloads, including Meta's Llama 3.1 405B and an interactive version of Llama 2 70B that simulates chatbot responsiveness by measuring time to first token.
The benchmark also introduced tests for graph neural networks, which are relevant to programs that use generative AI, and for processing LiDAR sensor data used in automotive mapping. Nvidia's GPUs generally led in the closed division, which enforces strict rules on software setup.
AMD's MI300X GPU, however, outperformed Nvidia in two Llama 2 70B tests, achieving 103,182 tokens per second. The AMD system was built by MangoBoost, a first-time MLPerf participant that specializes in GPU data-transfer technology and generative AI serving software such as LLMboost.
Nvidia contested the AMD comparison, arguing for score normalization based on the number of chips and computer nodes used. Dave Salvator, Nvidia’s director of accelerated computing products, stated that MangoBoost used 32 MI300X GPUs compared to Nvidia’s 8 B200s, yet achieved only a 3.83% higher result. He added, “NVIDIA’s 8x B200 submission actually outperformed MangoBoost’s x32 AMD MI300X GPUs in the Llama 2 70B server submission.”
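Nvidia's normalization argument can be checked against the figures reported above. A minimal sketch, assuming the 3.83% gap refers to total Llama 2 70B server-scenario throughput (the article does not state which metric it covers), of what per-GPU throughput the numbers imply:

```python
# Per-GPU throughput implied by the reported figures.
# Assumption: the 3.83% gap is on total tokens-per-second throughput.

amd_total_tps = 103_182   # MangoBoost/AMD result, tokens per second
amd_gpus = 32             # MI300X GPUs in the MangoBoost system
nvidia_gpus = 8           # B200 GPUs in Nvidia's submission

# Nvidia's implied total, given AMD's result was 3.83% higher:
nvidia_total_tps = amd_total_tps / 1.0383

amd_per_gpu = amd_total_tps / amd_gpus
nvidia_per_gpu = nvidia_total_tps / nvidia_gpus

print(f"AMD per GPU:    {amd_per_gpu:,.0f} tok/s")
print(f"Nvidia per GPU: {nvidia_per_gpu:,.0f} tok/s")
```

Under that assumption, each B200 delivers roughly four times the per-chip throughput of an MI300X in this test, which is the substance of Nvidia's objection to the headline comparison.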
Google’s Trillium chip, the sixth TPU iteration, was also tested but lagged behind Nvidia’s Blackwell in Stable Diffusion image-generation query speed. The recent MLPerf benchmarks saw fewer Nvidia competitors than previous rounds; Intel’s Habana and Qualcomm had no submissions.
Intel did see success in the datacenter closed division, where its Xeon microprocessor served as the host processor in seven of the top 11 systems, compared with three for AMD's EPYC. Nvidia, however, built the top-performing system for Meta's Llama 3.1 405B using its GB200 Grace Blackwell chip, which combines the Blackwell GPU with Nvidia's Grace microprocessor.