Nvidia has released benchmark data showing its GB300 NVL72 systems with Blackwell Ultra GPUs deliver significant performance gains for low-latency AI workloads, targeting the growing market for agentic AI applications and coding assistants.
The company’s Blackwell Ultra Tensor Cores deliver 1.5x more compute than standard Blackwell GPUs, and attention-layer throughput has doubled thanks to accelerated softmax execution, easing a key bottleneck for reasoning models with large context windows. Nvidia’s TensorRT-LLM inference library has also improved, with SemiAnalysis benchmarks showing throughput per GPU has doubled at some interactivity levels since October 2025. Together, these hardware and software advances deliver a 10x gain in tokens per second per user and a 5x gain in tokens per second per megawatt versus the previous Hopper platform, which Nvidia says translates into a reported 50x increase in AI factory output and 35x lower cost per token.
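As a rough illustration of how those headline numbers relate: if the two claimed gains compose multiplicatively, the 10x per-user figure and the 5x per-megawatt figure yield the 50x output figure. The sketch below assumes that multiplicative framing; only the 10x, 5x, and 50x ratios come from Nvidia’s figures.

```python
# Illustrative composition of Nvidia's claimed multipliers.
# Only the 10x, 5x, and 50x ratios come from the article; treating
# "factory output" as their product is an assumption for illustration.

speedup_per_user = 10   # claimed gain in tokens/s per user vs. Hopper
speedup_per_mw = 5      # claimed gain in tokens/s per megawatt vs. Hopper

# If output per fixed power budget scales with both gains multiplicatively:
factory_output_gain = speedup_per_user * speedup_per_mw
print(factory_output_gain)  # 50 -> matches the reported 50x figure
```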
“As inference moves to the center of AI production, long-context performance and token efficiency become critical,” said Chen Goldberg, senior vice president of engineering at CoreWeave. “Grace Blackwell NVL72 addresses that challenge directly.”
Major cloud providers are already deploying GB300 NVL72 infrastructure. CoreWeave announced in 2025 that it was the first AI cloud provider to put the systems into production, integrating them with its Kubernetes-based cloud stack. Microsoft deployed what it called the world’s first large-scale GB300 NVL72 supercomputing cluster, achieving over 1.1 million tokens per second on a single rack in testing validated by Signal65. Oracle is deploying GB300 NVL72 systems on its OCI platform, with plans to scale its Superclusters beyond 100,000 Blackwell GPUs to meet inference workload demand.
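For scale, a back-of-the-envelope division of Microsoft’s rack-level figure by the 72 GPUs in an NVL72 rack implies roughly 15,000 tokens per second per GPU. The sketch below assumes all 72 GPUs contributed equally, which real utilization may not match.

```python
# Back-of-the-envelope per-GPU throughput from Microsoft's rack-level figure.
# Assumes all 72 GPUs in the NVL72 rack contributed equally; actual
# utilization and batching behavior may differ.
rack_tokens_per_second = 1_100_000   # reported rack throughput (tokens/s)
gpus_per_rack = 72                   # a GB300 NVL72 rack holds 72 GPUs
print(f"{rack_tokens_per_second / gpus_per_rack:,.0f} tokens/s per GPU")  # ~15,278
```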
Cost reductions are reshaping AI deployment economics, with leading inference providers including Baseten, DeepInfra, Fireworks AI, and Together AI reporting cost reductions of up to 10x on the standard Blackwell platform. The Blackwell Ultra platform extends those gains to low-latency workloads, where the reported 35x lower cost per token makes deploying AI agents and coding assistants at scale far more economically viable.
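To make the cost claim concrete, consider a hypothetical per-seat comparison. Only the 35x ratio comes from the figures above; the baseline price and the daily token volume in the sketch below are invented for illustration.

```python
# Hypothetical per-seat cost comparison; only the 35x ratio is from the
# article. The baseline price and daily token volume are invented.
hopper_cost_per_m_tokens = 7.00   # assumed baseline, USD per 1M tokens
ultra_cost_per_m_tokens = hopper_cost_per_m_tokens / 35

daily_tokens = 2_000_000  # assumed daily usage of one coding-assistant seat

for label, price in [("Hopper-era", hopper_cost_per_m_tokens),
                     ("Blackwell Ultra", ultra_cost_per_m_tokens)]:
    print(f"{label}: ${price * daily_tokens / 1_000_000:.2f}/day")
# Hopper-era: $14.00/day
# Blackwell Ultra: $0.40/day
```

At invented prices like these, per-seat inference cost drops from dollars per day to cents per day, which is the kind of shift that changes whether always-on agents are economical.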
Nvidia previewed its next-generation Rubin platform, claiming it will deliver another 10x performance improvement over Blackwell.