NVIDIA L40S Cloud GPU Performance and H100 / A100 Comparison

With so many high-performance GPUs launched by NVIDIA lately, it’s difficult to keep track of the unique benefits of each model.  

The NVIDIA L40S has received less attention than many other GPUs, but it has found a unique position in high performance computing and specific deep learning use-cases. 

Let’s go through what you need to know about the specs and performance of the L40S compared to two more popular models, the A100 and H100.

What is the L40S 

The L40S is built on NVIDIA’s Ada Lovelace GPU architecture. You can consider it an upgraded version of the L40 and a distant relative of the RTX 4090 high-end gaming graphics card. The L40 was originally designed for data center graphics and simulation workloads. It found a new lease of life as the L40S because of the huge demand for GPUs in machine learning training and inference.

The L40S was announced in August 2023 and billed by NVIDIA as “the most powerful universal GPU.” On paper it is powerful indeed. It includes 4th Generation Tensor Cores, 142 RT Cores and 48GB of GDDR6 memory optimized for graphics performance. It also supports NVIDIA’s Transformer Engine, the technology found in the Hopper architecture.

What’s the Difference Between the NVIDIA L40 and L40S?

The NVIDIA L40 and L40S share a common lineage, but the L40S is an upgrade tailored for AI and enterprise workloads. Both GPUs use the Ada Lovelace architecture and carry 48GB of GDDR6 memory; the L40S differs mainly in its higher compute throughput.

Key differences include:

  1. Performance Boost: The L40S delivers higher TFLOPS, particularly for mixed-precision Tensor Core operations, making it better suited for AI training and inference tasks.

  2. Memory: Both cards have the same 48GB of GDDR6 memory, so the advantage of the L40S with larger AI models comes from compute throughput rather than memory capacity.

  3. Target Use Cases: The L40 is positioned as a versatile GPU for general-purpose workloads, including graphics and light AI applications. The L40S, on the other hand, is purpose-built for high-performance AI and enterprise deployments.

These upgrades make the L40S a compelling option for AI workloads in cloud environments, offering a good balance of power and efficiency.

The L40S became popular when both the A100 and H100 were hard to obtain. These two GPUs are also the most useful points of comparison in terms of specs and performance.

L40S vs A100 vs H100 Specs Comparison 

| GPU Features | NVIDIA A100 | NVIDIA L40S | NVIDIA H100 SXM5 |
|---|---|---|---|
| GPU Architecture | Ampere | Ada Lovelace | Hopper |
| GPU Board Form Factor | SXM4 | Dual Slot PCIe | SXM5 |
| GPU Memory | 40 or 80GB | 48GB | 80GB |
| Memory Bandwidth | 1.6 to 2 TB/sec | 864 GB/sec | 3.35 TB/sec |
| CUDA Cores | 6912 | 18176 | 14592 |
| FP64 TFLOPS | 9.7 | N/A | 33.5 |
| FP32 TFLOPS | 19.5 | 91.6 | 67 |
| TF32 Tensor Core TFLOPS* | 156 / 312 | 183 / 366 | 378 / 756 |
| FP16 Tensor Core TFLOPS* | 312 / 624 | 362 / 733 | 756 / 1513 |
| FP8 Tensor Core TFLOPS | N/A | 733 / 1446 | 3958 |
| Peak INT8 TOPS* | 624 / 1248 | 733 / 1446 | 1513 / 3026 |
| L2 Cache | 40MB | 96MB | 50MB |
| Max thermal design power (TDP) | 400 Watts | 350 Watts | 700 Watts |

*Without and with structured sparsity. 

Looking for more details on your options? Explore A100 specs and H100 specs in more detail. 

Performance Comparison 

There are clear differences in performance between the L40S, A100, and H100 in FP64 (double-precision), FP32 (single-precision), and FP16 (half-precision) computations.  

FP64 (Double-Precision)

NVIDIA does not publish an FP64 figure for the L40S, and the Ada Lovelace architecture runs double-precision math at only a small fraction of its FP32 rate. In applications that require high precision, the L40S will not perform as well as the A100 and H100. The H100, with its significantly higher FP64 performance, is particularly well-suited for these demanding tasks in today’s GPU landscape.

FP32 (Single-Precision)

In FP32 performance the L40S substantially outshines the A100, and on paper its peak of 91.6 TFLOPS is also higher than the H100’s 67 TFLOPS. However, in memory-intensive ML workloads this advantage is likely to be offset by the GPU’s lower memory bandwidth compared to both the A100 and the H100.
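One way to make this trade-off concrete is a simple roofline estimate, where attainable throughput is the smaller of peak compute and arithmetic intensity multiplied by memory bandwidth. The sketch below is illustrative only: it uses the FP32 and bandwidth figures from the specs table, assumes the 2 TB/sec A100 80GB variant, and the function names are made up for this example. Real kernels will land below these ceilings.

```python
# Roofline sketch: attainable TFLOPS = min(peak, intensity * bandwidth).
# Peak FP32 TFLOPS and memory bandwidth (TB/sec) come from the specs table;
# the A100 entry assumes the 80GB variant at 2 TB/sec.
SPECS = {
    "A100 80GB": {"fp32_tflops": 19.5, "bandwidth_tbs": 2.0},
    "L40S": {"fp32_tflops": 91.6, "bandwidth_tbs": 0.864},
    "H100 SXM5": {"fp32_tflops": 67.0, "bandwidth_tbs": 3.35},
}


def attainable_tflops(peak_tflops, bandwidth_tbs, flops_per_byte):
    """Upper bound on throughput for a kernel with the given arithmetic intensity."""
    # TB/sec * FLOP/byte = TFLOP/sec, so the units line up directly.
    return min(peak_tflops, flops_per_byte * bandwidth_tbs)


def ridge_point(peak_tflops, bandwidth_tbs):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_tflops / bandwidth_tbs


# Compare a low-intensity kernel (10 FLOP/byte) across the three GPUs.
for name, spec in SPECS.items():
    peak, bw = spec["fp32_tflops"], spec["bandwidth_tbs"]
    print(
        f"{name:10s} ridge at {ridge_point(peak, bw):6.1f} FLOP/byte, "
        f"attainable at 10 FLOP/byte: {attainable_tflops(peak, bw, 10):5.2f} TFLOPS"
    )
```

Under these assumptions the L40S needs far more arithmetic per byte moved than the other two GPUs before its FP32 peak becomes reachable, which is why its paper advantage shrinks in memory-bound work.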

FP16 (Half-Precision)

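The L40S, although capable, may not be the optimal choice for the most demanding AI/ML workloads. On paper its FP16 Tensor Core throughput is similar to that of the A100 (362 vs 312 TFLOPS without sparsity), but it is clearly outperformed by the H100, and the higher memory bandwidth of the A100 80GB gives that card the edge in memory-bound workloads.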

Lower memory bandwidth in the L40S 

Theoretical peak FLOPS do not give you the full picture. For machine learning use cases, memory bandwidth plays a major role in both training and inference. The L40S uses GDDR6 SGRAM memory, a common type of graphics random-access memory known for its balance of cost and performance. However, GDDR6 inherently has lower bandwidth capabilities compared to HBM (High Bandwidth Memory) solutions.

The A100 and H100, on the other hand, leverage HBM2e and HBM3, respectively. These HBM technologies offer significantly higher bandwidth due to their stacked architecture and wider data interfaces. This allows for a much faster data transfer rate between the GPU and its memory, which is crucial for high-performance computing tasks where large datasets are involved. 

The L40S's GDDR6 memory, while suitable for general-purpose workloads, becomes a bottleneck when handling massive data transfers required for high-precision calculations and complex AI/ML models. The HBM implementations in the A100 and H100 address this bottleneck, enabling them to achieve significantly higher performance in those compute-intensive scenarios. 
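As a rough illustration of the bandwidth bottleneck in inference: when a language model generates text one token at a time at batch size 1, roughly all of its weights have to be read from GPU memory for every token, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below applies that simplifying assumption to a hypothetical 7-billion-parameter model stored in FP16, using the bandwidth figures from the specs table. It ignores KV-cache traffic, compute time and kernel overheads, so real numbers will be lower.

```python
# Memory bandwidth in GB/sec, from the specs table above.
BANDWIDTH_GBS = {
    "A100 40GB": 1600,
    "A100 80GB": 2000,
    "L40S": 864,
    "H100 SXM5": 3350,
}


def max_decode_tokens_per_sec(bandwidth_gbs, params_billions, bytes_per_param=2):
    """Bandwidth-bound ceiling on single-stream decoding speed.

    Assumes every weight is read from GPU memory once per generated token.
    """
    model_gb = params_billions * bytes_per_param  # e.g. 7B params * 2 bytes = 14 GB
    return bandwidth_gbs / model_gb


# Hypothetical 7-billion-parameter model stored in FP16 (2 bytes per parameter).
for gpu, bandwidth in BANDWIDTH_GBS.items():
    ceiling = max_decode_tokens_per_sec(bandwidth, 7)
    print(f"{gpu:10s} at most about {ceiling:.0f} tokens/sec")
```

The ranking follows memory bandwidth rather than peak FLOPS, which is the point of this section: for this kind of workload the L40S sits behind both A100 variants despite its higher FP32 figure.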

Power efficiency comparison 

The L40S has a maximum thermal design power (TDP) of 350W, which is lower than both the A100 SXM4 (400W) and the H100 (700W). Lower power consumption is only an advantage if performance holds up, however, and in memory-bound workloads the L40S delivers less than the A100 and H100.

The H100, despite having the highest TDP, also offers the highest FP16 Tensor Core and FP64 performance. On the FP16 Tensor Core figures in the specs table it comes out ahead of both the A100 and the L40S in performance per watt, although the L40S leads on peak FP32 per watt.
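Performance per watt can be checked directly against the specs table by dividing peak throughput by TDP. The following is a minimal sketch under two assumptions: the FP16 values are the Tensor Core figures without sparsity, and TDP is used as a stand-in for actual power draw, which it only bounds from above. The dictionary and function names are illustrative.

```python
# Peak throughput (TFLOPS) and TDP (watts) as listed in the specs table.
# FP16 values are the dense (no sparsity) Tensor Core figures.
GPUS = {
    "A100": {"fp32": 19.5, "fp16_tensor": 312, "tdp_w": 400},
    "L40S": {"fp32": 91.6, "fp16_tensor": 362, "tdp_w": 350},
    "H100 SXM5": {"fp32": 67.0, "fp16_tensor": 756, "tdp_w": 700},
}


def tflops_per_watt(gpu, metric):
    """Peak TFLOPS divided by TDP for one GPU and one precision."""
    spec = GPUS[gpu]
    return spec[metric] / spec["tdp_w"]


def best_gpu(metric):
    """GPU with the highest paper performance per watt for the given precision."""
    return max(GPUS, key=lambda gpu: tflops_per_watt(gpu, metric))


for metric in ("fp32", "fp16_tensor"):
    for gpu in GPUS:
        print(f"{gpu:10s} {metric:12s} {tflops_per_watt(gpu, metric):.3f} TFLOPS/W")
    print(f"best for {metric}: {best_gpu(metric)}")
```

Because these are peak figures rather than measurements, the ranking is best read as a rough guide and not as a benchmark result.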

L40S Price Comparison with A100 and H100 

While demand for high-performance GPUs remains high, the availability of L40S on cloud GPU platforms like DataCrunch is improving. Here is how it compares in cost per hour with the A100 and H100. 

 

| | A100 40GB Cost | L40S Cost | A100 80GB Cost | H100 SXM5 Cost |
|---|---|---|---|---|
| On-demand instance | $1.29/hour | $1.10/hour | $1.75/hour | $2.65/hour |
| ↳ 2-year price | $0.97/hour | $0.83/hour | $1.31/hour | $1.99/hour |
| 8GPU On-demand instance | $10.32/hour | $8.80/hour | $14.00/hour | $21.20/hour |
| ↳ 2-year price | $7.74/hour | $6.60/hour | $10.50/hour | $15.90/hour |

Key point about costs: the hourly price of the L40S is slightly below that of the A100 40GB and less than half that of the H100, both on demand and on a 2-year contract.
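To relate price to compute, the hourly rates can be divided by peak throughput. The sketch below uses the single-GPU prices from the pricing table and the FP16 Tensor Core figures (without sparsity) from the specs table. It assumes the $1.75 column refers to the A100 80GB, and it treats the prices as a snapshot, since cloud pricing changes over time. All names in the code are illustrative.

```python
# Hourly prices in USD from the pricing table (a snapshot; cloud prices change).
# Assumption: the $1.75/hour column is the A100 80GB.
# FP16 values are dense Tensor Core TFLOPS from the specs table.
OFFERS = {
    "A100 40GB": {"on_demand": 1.29, "two_year": 0.97, "fp16_tflops": 312},
    "L40S": {"on_demand": 1.10, "two_year": 0.83, "fp16_tflops": 362},
    "A100 80GB": {"on_demand": 1.75, "two_year": 1.31, "fp16_tflops": 312},
    "H100 SXM5": {"on_demand": 2.65, "two_year": 1.99, "fp16_tflops": 756},
}


def two_year_discount(offer):
    """Fraction saved per hour by committing to the 2-year price."""
    return 1 - offer["two_year"] / offer["on_demand"]


def usd_per_tflop_hour(offer, plan="on_demand"):
    """Hourly price divided by peak FP16 Tensor Core TFLOPS."""
    return offer[plan] / offer["fp16_tflops"]


for name, offer in OFFERS.items():
    print(
        f"{name:10s} 2-year discount {two_year_discount(offer):.1%}, "
        f"${usd_per_tflop_hour(offer):.4f} per peak FP16 TFLOP-hour on demand"
    )
```

Peak FLOPS per dollar is only one lens. For memory-bound workloads, dividing price by memory bandwidth instead would be the more relevant comparison.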

L40S Cloud GPU Pricing

With DataCrunch you can also deploy the L40S using dynamic pricing, where the hourly price is often a lot lower.

Bottom line on the L40S 

You can consider the NVIDIA L40S an outlier in today’s competitive field of computing accelerators. While it doesn’t have the raw performance capability of the H100 or newer models, it has many areas where it compares favourably to the A100 and earlier GPUs.

L40S strengths 

L40S limitations 

In today’s market you shouldn’t dismiss the L40S. You can expect lower cost in the long term and better availability than the A100 80GB or the H100. It is a versatile GPU for machine learning projects where absolute compute speed is not your most important decision factor.