FLUX on B200 vs H100: Real-Time Image Inference with WaveSpeedAI

WaveSpeedAI has teamed up with European GPU cloud provider DataCrunch to achieve a breakthrough in generative image and video model deployment. By optimizing the open-weight FLUX-dev model on DataCrunch’s cutting-edge NVIDIA B200 GPU, our collaboration delivers up to 6× faster image inference compared to industry-standard baselines.

In this post, we provide a technical overview of the FLUX-dev model and the B200 GPU, discuss the challenges of scaling FLUX-dev with standard inference stacks, and share benchmark results demonstrating how WaveSpeedAI’s proprietary framework significantly improves latency and cost-efficiency. Enterprise ML teams will learn how this WaveSpeedAI + DataCrunch solution translates to faster API responses and significantly reduced cost per image – empowering real-world AI applications. (WaveSpeedAI was founded by Zeyi Cheng, who leads our mission to accelerate generative AI inference.)

This post is cross-posted on the WaveSpeedAI blog.

FLUX-dev: a SOTA image generation model

FLUX-dev is a state-of-the-art (SOTA) open-weight image generation model capable of text-to-image and image-to-image generation. Its strengths include good world knowledge and prompt adherence (thanks to the T5 text encoder), style diversity, and an understanding of complex scene semantics and composition. Its output quality is comparable to, and in some cases surpasses, popular closed-source models such as Midjourney v6.0, DALL·E 3 (HD), and SD3-Ultra. FLUX-dev has rapidly become the most popular image generation model in the open-source community, setting a new benchmark for quality, versatility, and prompt alignment.

FLUX-dev uses flow matching, and its architecture is a hybrid of multimodal and parallel diffusion transformer blocks with 12B parameters, roughly 33 GB of weights in fp16/bf16. FLUX-dev is therefore computationally demanding: a large parameter count combined with an iterative denoising process. Efficient inference is essential for large-scale serving scenarios where user experience is crucial.
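
For intuition, a generic rectified-flow / flow-matching formulation (a sketch of the family FLUX-dev belongs to, not the exact, unpublished training recipe) looks like this: the model learns a velocity field along a straight path between data and noise, and sampling integrates that field over a fixed number of steps, which is why per-step cost multiplied by step count dominates latency.

$$x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad t \in [0, 1]$$

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0,\,\epsilon,\,t}\,\big\lVert v_\theta(x_t, t, c) - (\epsilon - x_0) \big\rVert^2$$

Here $x_0$ is an image latent, $\epsilon$ is Gaussian noise, $c$ is the text conditioning, and $v_\theta$ is the diffusion transformer.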

NVIDIA’s Blackwell GPU architecture: B200

The Blackwell architecture introduces new features such as 5th-generation Tensor Cores (with FP8 and FP4 support), Tensor Memory (TMEM), and CTA pairs (2-CTA cooperative execution).
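
As a quick, illustrative sanity check (our own sketch, not part of the original benchmarks; it assumes a recent CUDA build of PyTorch that recognizes Blackwell and exposes FP8 dtypes), you can confirm the reported compute capability and the storage savings of FP8 directly from Python:

```python
import torch

# Illustrative check on a Blackwell node (assumption: PyTorch >= 2.4 with CUDA).
# B200-class data-center Blackwell parts report compute capability 10.x.
props = torch.cuda.get_device_properties(0)
print(props.name, f"sm_{props.major}{props.minor}", f"{props.total_memory / 1e9:.0f} GB")

# FP8 storage is half of bf16 (1 byte vs. 2 bytes per element); real FP8/FP4
# matmuls additionally need per-tensor or per-block scaling to preserve accuracy.
x_bf16 = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
x_fp8 = x_bf16.to(torch.float8_e4m3fn)
print(x_bf16.element_size(), x_fp8.element_size())  # 2, 1
```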

The table below compares key specifications (performance numbers are dense, without sparsity):

Technical specifications     H100 SXM         HGX B200
FP16/BF16                    0.989 PFLOPS     2.25 PFLOPS
INT8                         1.979 POPS       4.5 POPS
FP8                          1.979 PFLOPS     4.5 PFLOPS
FP4                          not supported    9 PFLOPS
GPU memory                   80 GB HBM3       180 GB HBM3e
GPU memory bandwidth         3.35 TB/s        7.7 TB/s
NVLink bandwidth per GPU     900 GB/s         1,800 GB/s

We also ran operator-level micro-benchmarks of GEMM and attention kernels on both GPUs.

The benchmarking results suggest that the B200 is exceptionally well-suited for large-scale AI workloads, especially generative models requiring high memory throughput and dense computing.
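
As a rough illustration of what such a micro-benchmark looks like (a minimal sketch of our own, not the harness behind the results above; the matrix shape, dtype, and iteration count are arbitrary examples):

```python
import torch

def bench_gemm(m=8192, n=8192, k=8192, dtype=torch.bfloat16, iters=50):
    """Time a dense matmul with CUDA events and report achieved TFLOPS."""
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)
    # Warm-up so cuBLAS heuristics and lazy initialization do not pollute the timing.
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    tflops = 2 * m * n * k / (ms * 1e-3) / 1e12
    print(f"{dtype}: {ms:.2f} ms/iter, {tflops:.1f} TFLOPS")

bench_gemm()
```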

Challenges with Standard Inference Stacks

Running FLUX-dev on typical inference pipelines (e.g., PyTorch + Hugging Face Diffusers), even on high-end GPUs such as the H100, presents several challenges.
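
For reference, the kind of baseline we compare against looks roughly like this (a minimal sketch using standard Hugging Face Diffusers; the prompt, resolution, step count, and guidance scale are illustrative values rather than our exact benchmark configuration):

```python
import torch
from diffusers import FluxPipeline

# Baseline: stock Diffusers pipeline for FLUX-dev in bf16 on a single GPU.
# The weights are gated on the Hugging Face Hub and require accepting the license.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "photograph of a bohemian apartment interior, vogue magazine style",
    height=1024,
    width=1024,
    num_inference_steps=28,   # illustrative default; fewer steps trade quality for speed
    guidance_scale=3.5,
).images[0]
image.save("flux_dev_baseline.png")
```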

The optimization objectives for serving such a model at scale and at low cost are higher throughput and lower latency, which together reduce the cost per generated image.

WaveSpeedAI’s Proprietary Inference Framework

WaveSpeedAI addresses these bottlenecks with a proprietary framework purpose-built for generative inference. Developed by founder Zeyi Cheng, this framework is our in-house high-performance inference engine, optimized specifically for state-of-the-art diffusion transformer models such as FLUX-dev and Wan 2.1. Its key innovations span quantization, compiler- and kernel-level optimization, distributed inference, and algorithmic caching.

Our inference engine follows a hardware-software co-design approach, fully utilizing the B200's compute and memory capacity. It represents a significant leap forward in AI model serving, allowing us to deliver ultra-low-latency, high-efficiency inference at production scale.

We decompose the problem of efficient AI inference into multiple dimensions: quantization, compiler-level and kernel optimizations, distributed inference (see ParaAttention), and algorithmic efficiency via DiT caching, exploiting the fact that some images require fewer denoising steps than others to reach acceptable quality (e.g., AdaCache).
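
To make the caching idea concrete, here is a toy sketch of a residual-caching heuristic in the spirit of AdaCache-style DiT caching (purely illustrative; this is not the heuristic used in our engine, and names such as `forward_with_cache` and `threshold` are hypothetical):

```python
import torch

def forward_with_cache(blocks, hidden, cache, threshold=0.05):
    """Toy DiT-caching heuristic: if the input to the block stack has barely
    changed since the previous denoising step, reuse the cached residual
    instead of re-running the (expensive) transformer blocks."""
    if cache.get("prev_input") is not None and cache.get("prev_residual") is not None:
        change = (hidden - cache["prev_input"]).abs().mean() / (hidden.abs().mean() + 1e-6)
        if change < threshold:
            return hidden + cache["prev_residual"]  # cheap path: skip all blocks

    out = hidden
    for block in blocks:
        out = block(out)

    cache["prev_input"] = hidden.detach()
    cache["prev_residual"] = (out - hidden).detach()
    return out
```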

We evaluate how these optimizations affect output quality, prioritizing lossless over lossy techniques. That is, we do not apply optimizations that would significantly reduce model capabilities or visibly degrade output quality, for example text rendering or scene semantics.

Benchmark: WaveSpeedAI on B200 vs. H100 Baseline

FLUX-dev optimization settings for B200 vs. H100 benchmarking

Model outputs using different optimization settings:

FLUX-dev image output with WaveSpeedAI's inference framework and DataCrunch's B200 GPUs

Prompt: photograph of an alternative female with an orange bandana, light brown long hair, clear frame glasses, septum piecing [sic], beige overalls hanging off one shoulder, a white tube top underneath, she is sitting in her apartment on a bohemian rug, in the style of a vogue magazine shoot

Implications

The performance improvements translate into the following practical gains.

We have achieved a cost-performance ratio on the B200 equal to that of the H100, but at half the generation latency. The cost per generation therefore does not increase, while new real-time possibilities open up without sacrificing model capabilities. Sometimes more isn't simply more but different: this new level of performance enables a new quality of user experience for image generation with SOTA models.
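
In back-of-the-envelope terms, with no specific prices implied:

$$\text{cost per image} \;=\; \text{GPU price per second} \times \text{latency per image}$$

If the per-image latency on the B200 is roughly half of the H100 baseline while the cost per image stays flat, the effective per-second price of the B200 setup is roughly twice that of the H100 one; the gain shows up as responsiveness rather than as a lower bill.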

This enables responsive creative tools, scalable content platforms, and sustainable cost structures for generative AI at scale.

Conclusion and Next Steps

"The FLUX-dev using B200 deployment demonstrates what is possible when world-class hardware meets best-in-class software. We are pushing the frontiers of inference speed and efficiency at WaveSpeedAI, founded by Zeyi Cheng — creator of stable-fast, ParaAttention, and our in-house inference engine. In the next releases, we will focus on efficient video generation inference and how to achieve close to real-time inference. Our partnership with DataCrunch represents an opportunity to access cutting-edge GPUs like B200 and the upcoming NVIDIA GB200 NVL72 (Pre-order NVL72 GB200 clusters from DataCrunch) while co-developing a critical inference infrastructure stack."

Get Started Today:

Join us as we build the world's fastest generative inference infrastructure.