From retail to gaming, from code generation to customer service, more and more organizations are running applications built on large language models (LLMs): 78% of organizations report already being in the development or production phase. As the number and user base of generative AI applications continue to grow, demand for high-performance, scalable, and easy-to-use inference technology has become critical. Google Cloud is paving the way for the next rapid evolution of AI with the AI Hypercomputer.
At the Google Cloud Next ’25 conference, Google shared numerous updates on the AI Hypercomputer’s inference capabilities, showcasing Ironwood, its latest Tensor Processing Unit (TPU) designed specifically for inference. The hardware update is complemented by software enhancements such as simple, efficient inference with vLLM on TPU and the latest GKE inference features: GKE Inference Gateway and GKE Inference Quickstart. Across the AI Hypercomputer, optimized software continues to raise performance, backed by strong benchmark results:
- Google’s JetStream inference engine incorporates new performance optimizations and integrates Pathways for ultra-low-latency multi-host and disaggregated serving.
- MaxDiffusion, Google’s reference implementation for latent diffusion models, delivers exceptional performance on TPUs for computationally intensive image generation workloads, and now supports Flux, one of the largest text-to-image generation models to date.
- The latest performance results from MLPerf™ Inference v5.0 demonstrate the powerful capabilities and versatility of Google Cloud’s A3 Ultra (NVIDIA H200) and A4 (NVIDIA HGX B200) VMs in inference.
Optimizing JetStream Performance: Google’s JAX Inference Engine
To maximize performance and reduce inference costs, Google now offers more options for serving LLMs on TPUs: it has further enhanced JetStream and brought vLLM, a widely adopted and efficient LLM serving library, to TPUs. With vLLM on TPU and JetStream, backed by open-source contributions and community support from Google AI experts, customers get low-latency, high-throughput inference at excellent cost-effectiveness.
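For a sense of what this looks like in practice, here is a minimal sketch using vLLM’s offline Python API; the model name, parallelism degree, and sampling settings are illustrative assumptions, and TPU-specific installation and configuration are not shown.

```python
# Minimal vLLM offline-inference sketch (illustrative settings only).
from vllm import LLM, SamplingParams

# Load an open-weight model; tensor_parallel_size splits it across the
# accelerator chips available to this host (value shown is an assumption).
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=8)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

# Batch of prompts in, generated completions out.
outputs = llm.generate(["Explain what an inference engine does."], sampling)
for request_output in outputs:
    print(request_output.outputs[0].text)
```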
JetStream is Google’s open-source, throughput- and memory-optimized inference engine, built for TPUs and based on the same inference stack used to serve Gemini models. Since announcing JetStream last April, Google has invested significant resources to further improve its performance across popular open-source models. Using JetStream, Google’s sixth-generation Trillium TPU now delivers 2.9 times higher throughput on Llama 2 70B and 2.8 times higher throughput on Mixtral 8x7B than TPU v5e (both measured with Google’s reference implementation, MaxText).
Figure 1: JetStream throughput (output tokens per second). Google internal data. Measured using Llama2-70B (MaxText) on Cloud TPU v5e-8 and Trillium 8 chips, and Mixtral 8x7B (MaxText) on Cloud TPU v5e-4 and Trillium 4 chips. Max input length: 1024, max output length: 1024. As of April 2025.
Google’s Pathways runtime, now open to Google Cloud customers for the first time, has been integrated into JetStream, enabling multi-host inference and disaggregated serving—two important features that are becoming increasingly critical as model scale grows exponentially and generative AI demands evolve.
Multi-host inference with Pathways distributes a model across multiple accelerator hosts during serving, enabling inference for large models that cannot fit on a single host. With multi-host inference, JetStream achieves 1703 tokens/s on Llama 3.1 405B on Trillium, which translates to three times more inference per dollar than TPU v5e.
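Conceptually, multi-host serving rests on sharding model weights across many accelerator chips so that no single host needs to hold the whole model. The single-process JAX sketch below illustrates only that sharding idea using standard JAX APIs; it is not the Pathways or JetStream API, and the shapes are arbitrary.

```python
# Conceptual sketch: shard a weight matrix across accelerator chips so each
# chip holds only a slice (standard JAX sharding, not Pathways itself).
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh over all visible chips.
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("model",))

# Shard the weights along their output dimension; inputs stay replicated.
weights = jnp.ones((4096, 4096))
w_sharded = jax.device_put(weights, NamedSharding(mesh, P(None, "model")))
x = jnp.ones((1, 4096))

@jax.jit
def layer(x, w):
    # XLA compiles a program whose matmul runs on every chip's weight slice.
    return x @ w

y = layer(x, w_sharded)
print(y.sharding)  # the output itself is laid out across the mesh
```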
Furthermore, Pathways’ disaggregated serving feature lets workloads scale the prefill and decode stages of LLM inference independently and dynamically. This makes better use of resources and can improve performance and efficiency, especially for large models. For Llama 2 70B on Trillium, disaggregated serving across multiple hosts improved prefill performance (time to first token, TTFT) by seven times and token generation performance (time per output token, TPOT) by almost three times, compared with interleaving prefill and decode for LLM requests on the same server.
Figure 2: Measured using Llama2-70B (MaxText) on Cloud TPU Trillium 16 chips (8 chips allocated to prefill server, 8 chips allocated to decode server). Measured using OpenOrca dataset. Max input length: 1024, max output length: 1024. As of April 2025.
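The toy sketch below illustrates the idea behind disaggregated serving: prefill and decode run as separate services that can be provisioned and scaled independently. The function names and dummy “model” logic are hypothetical stand-ins, not the JetStream or Pathways API.

```python
# Toy illustration of disaggregated serving: separate prefill and decode
# stages with a hand-off of the KV cache between them (all logic is a dummy
# stand-in for a real model).

def prefill_worker(prompt_tokens):
    """Compute-bound stage: process the whole prompt in one pass and return
    the KV cache plus the first generated token (this drives TTFT)."""
    kv_cache = list(prompt_tokens)           # stand-in for attention KV state
    first_token = sum(prompt_tokens) % 100   # stand-in for a real forward pass
    return kv_cache, first_token

def decode_worker(kv_cache, first_token, max_new_tokens):
    """Memory-bandwidth-bound stage: generate one token at a time from the
    transferred KV cache (this drives TPOT)."""
    tokens = [first_token]
    for _ in range(max_new_tokens - 1):
        next_token = (tokens[-1] + len(kv_cache)) % 100
        kv_cache.append(next_token)
        tokens.append(next_token)
    return tokens

# Because the stages are separate pools, the prefill fleet can be sized for
# prompt length and TTFT targets, and the decode fleet for concurrency and TPOT.
kv, first = prefill_worker([11, 23, 42])
print(decode_worker(kv, first, max_new_tokens=8))
```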
MaxDiffusion: High-Performance Diffusion Model Inference
In addition to large language models (LLMs), Trillium also demonstrates excellent performance in computationally intensive workloads such as image generation. MaxDiffusion provides a series of reference implementations for latent diffusion models. Beyond Stable Diffusion inference, Google has extended MaxDiffusion to now support Flux, which has 12 billion parameters and is one of the largest open-source text-to-image models to date.
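As a rough illustration of what an SDXL-style text-to-image workload involves, the sketch below uses the Hugging Face diffusers library rather than MaxDiffusion itself; the model ID, device, and prompt are illustrative, and MaxDiffusion’s JAX/TPU configuration is not shown.

```python
# Illustrative SDXL generation loop with Hugging Face diffusers (GPU-oriented
# sketch of the same kind of latent-diffusion workload MaxDiffusion targets).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # MaxDiffusion runs the equivalent pipeline in JAX on TPUs

image = pipe(
    prompt="A watercolor painting of a data center in the mountains",
    num_inference_steps=20,   # comparable to the 20 decoding steps in Figure 3
    height=1024,
    width=1024,
).images[0]
image.save("sdxl_sample.png")
```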
As demonstrated in MLPerf 5.0, Trillium now achieves 3.5 times the queries-per-second throughput on Stable Diffusion XL (SDXL) that its predecessor, TPU v5e, delivered in the previous round, and a further 12% throughput increase over Trillium’s own MLPerf 4.1 submission.
Figure 3: MaxDiffusion throughput (images per second). Google internal data. Measured using SDXL model on Cloud TPU v5e-4 and Trillium 4 chips. Resolution: 1024×1024, batch size per device: 16, decoding steps: 20. As of April 2025.
With this throughput, MaxDiffusion provides a cost-effective solution. The cost of generating 1000 images on Trillium is as low as 22 cents, a 35% reduction compared to TPU v5e.
Figure 4: Diffusion cost to generate 1000 images. Google internal data. Measured using SDXL model on Cloud TPU v5e-4 and Cloud TPU Trillium 4 chips. Resolution: 1024×1024, batch size per device: 2, decoding steps: 4. Cost based on 3-year CUD prices for Cloud TPU v5e-4 and Cloud TPU Trillium 4 chips in the US. As of April 2025.
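For readers estimating the same metric for their own workloads, the back-of-the-envelope arithmetic is simple; the throughput and hourly price below are hypothetical placeholders chosen only to show how the calculation works, not the measured values behind Figure 4.

```python
# Cost per 1000 images = time to generate 1000 images * hourly instance price.
# Both inputs below are hypothetical placeholders, not measured values.
images_per_second = 5.0    # assumed sustained throughput of one instance
hourly_price_usd = 4.00    # assumed 3-year CUD price of that instance

seconds_per_1000_images = 1000 / images_per_second
cost_per_1000_images = (seconds_per_1000_images / 3600) * hourly_price_usd
print(f"${cost_per_1000_images:.2f} per 1000 images")  # $0.22 with these inputs
```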
A3 Ultra and A4 VM MLPerf 5.0 Inference Results
For MLPerf™ Inference v5.0, Google Cloud submitted 15 results, including the first submissions for A3 Ultra (NVIDIA H200) and A4 (NVIDIA HGX B200) VMs. The A3 Ultra VM is powered by eight NVIDIA H200 Tensor Core GPUs, offering 3.2 Tbps of GPU-to-GPU non-blocking network bandwidth, and twice the High Bandwidth Memory (HBM) compared to the A3 Mega with NVIDIA H100 GPUs. Google Cloud’s A3 Ultra demonstrated highly competitive performance, achieving results comparable to NVIDIA’s peak GPU submissions for LLM, MoE, image, and recommendation models.
Google Cloud is the only cloud provider to submit results on NVIDIA HGX B200 GPUs, showcasing the excellent performance of A4 VMs in serving LLMs (including Llama 3.1 405B, a new benchmark introduced in MLPerf 5.0). Both A3 Ultra and A4 VMs provide powerful inference performance, demonstrating Google’s deep collaboration with NVIDIA to provide infrastructure for the most demanding AI workloads.
AI Hypercomputer is Driving the AI Inference Era
Google’s innovations in AI inference, spanning hardware advances in Google Cloud TPUs and NVIDIA GPUs as well as software such as JetStream, MaxText, and MaxDiffusion, are driving breakthroughs in AI through tightly integrated software frameworks and hardware accelerators.
Google’s launch of the AI Hypercomputer significantly boosts cloud computing power, and as a Google Cloud Premier Partner, Microfusion Technology is honored to share these advances. We look forward to bringing you more first-hand AI news in the future. Stay tuned for upcoming Google Cloud events; we look forward to seeing you there. If you have any questions or needs, please feel free to contact Microfusion Technology.