With so many high-performance GPUs launched by NVIDIA lately, it’s difficult to keep track of the unique benefits of each model.
The NVIDIA L40S has received less attention than many other GPUs, but it has found a unique position in high-performance computing and specific deep learning use cases.
Let’s go through what you need to know about the specs and performance of the L40S compared to two more popular models, the A100 and H100.
What is the L40S?
The L40S is built on NVIDIA’s Ada Lovelace GPU architecture. You can think of it as an upgraded version of the L40 and a distant relative of the RTX 4090 high-end gaming graphics card. The L40 was originally designed for data center graphics and simulation workloads; it found a second life as the L40S because of the huge demand for GPUs in machine learning training and inference.
The L40S was announced in 2023 and billed by NVIDIA as “the most powerful universal GPU.” On paper it is powerful indeed. It includes 4th-generation Tensor Cores, 142 RT Cores and 48GB of GDDR6 memory optimized for graphics performance. It also supports NVIDIA’s Transformer Engine technology, introduced with the Hopper architecture.
The L40S gained popularity largely due to the limited availability of both the A100 and the H100. These two are also its closest points of comparison in terms of specs and performance.
L40S vs A100 vs H100 Specs Comparison
GPU Features | NVIDIA A100 | NVIDIA L40S | NVIDIA H100 SXM5 |
---|---|---|---|
GPU Architecture | Ampere | Ada Lovelace | Hopper |
GPU Board Form Factor | SXM4 | Dual-slot PCIe | SXM5 |
GPU Memory | 40 or 80 GB | 48 GB | 80 GB |
Memory Bandwidth | 1.6 to 2 TB/s | 864 GB/s | 3.35 TB/s |
CUDA Cores | 6,912 | 18,176 | 14,592 |
FP64 TFLOPS | 9.7 | N/A | 33.5 |
FP32 TFLOPS | 19.5 | 91.6 | 67 |
TF32 Tensor Core TFLOPS* | 156 / 312 | 183 / 366 | 378 / 756 |
FP16 Tensor Core TFLOPS* | 312 / 624 | 362 / 733 | 756 / 1,513 |
FP8 Tensor Core TFLOPS* | N/A | 733 / 1,466 | 1,979 / 3,958 |
Peak INT8 TOPS* | 624 / 1,248 | 733 / 1,466 | 1,513 / 3,026 |
L2 Cache | 40 MB | 96 MB | 50 MB |
Max thermal design power (TDP) | 400 W | 350 W | 700 W |

*Without / with structured sparsity.
Looking for more details on your options? Explore the A100 specs and H100 specs in more detail.
Performance Comparison
There are clear differences in performance between the L40S, A100, and H100 in FP64 (double-precision), FP32 (single-precision), and FP16 (half-precision) computations.
FP64 (Double-Precision)
The L40S does not natively support FP64. In applications that require double precision, it cannot keep up with the A100 and H100. The H100, with its significantly higher FP64 throughput, is particularly well suited for these demanding workloads in today’s GPU landscape.
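If you want to see how steep the double-precision penalty is on a given card, a quick micro-benchmark is enough. The sketch below is illustrative only and assumes PyTorch with a CUDA-capable GPU; the matrix size and iteration count are arbitrary choices.

```python
# Rough FP32 vs FP64 matmul throughput check (illustrative, assumes PyTorch + CUDA).
import time
import torch

def matmul_tflops(dtype, n=4096, iters=10):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * n**3 * iters / elapsed / 1e12  # ~2*n^3 FLOPs per n x n matmul

if torch.cuda.is_available():
    print(f"FP32: {matmul_tflops(torch.float32):.1f} TFLOPS")
    print(f"FP64: {matmul_tflops(torch.float64):.1f} TFLOPS")
```

On a card without dedicated FP64 hardware the second number will drop off a cliff relative to the first, which is the gap described above.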
FP32 (Single-Precision)
In standard FP32 performance the L40S (91.6 TFLOPS) substantially outshines the A100 (19.5 TFLOPS), and on paper it also holds up well against the H100 (67 TFLOPS). However, in memory-intensive ML workloads this headline advantage is likely to be offset by the L40S’s lower memory bandwidth compared to both the A100 80GB and the H100.
FP16 (Half-Precision)
The L40S, although capable, may not be the optimal choice for the most demanding AI/ML workloads. Its FP16 Tensor Core throughput is similar to the A100’s, but it is clearly outperformed by the A100 80GB and the H100, which pair comparable or higher compute with far greater memory bandwidth.
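Those FP16 Tensor Core figures are what mixed-precision training actually taps into. Below is a minimal, hedged sketch of how that typically looks in PyTorch; the model, tensor sizes, and loss are placeholders, not a recommended setup.

```python
# Minimal mixed-precision training step (assumes PyTorch and a CUDA GPU).
# autocast runs eligible ops in FP16 so the matmuls land on the Tensor Cores.
import torch

model = torch.nn.Linear(4096, 4096).cuda()     # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()           # guards FP16 gradients against underflow

x = torch.randn(64, 4096, device="cuda")
target = torch.randn(64, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```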
Lower memory bandwidth in the L40S
Theoretical peak FLOPS does not give you the full picture. For machine learning workloads, memory bandwidth plays a major role in both training and inference. The L40S uses GDDR6 SGRAM memory, a common type of graphics random-access memory known for its balance of cost and performance. However, GDDR6 inherently has lower bandwidth capabilities compared to HBM (High Bandwidth Memory) solutions.
The A100 and H100, on the other hand, leverage HBM2e and HBM3, respectively. These HBM technologies offer significantly higher bandwidth due to their stacked architecture and wider data interfaces. This allows for a much faster data transfer rate between the GPU and its memory, which is crucial for high-performance computing tasks where large datasets are involved.
The L40S's GDDR6 memory, while suitable for general-purpose workloads, becomes a bottleneck when handling massive data transfers required for high-precision calculations and complex AI/ML models. The HBM implementations in the A100 and H100 address this bottleneck, enabling them to achieve significantly higher performance in those compute-intensive scenarios.
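A rough roofline calculation makes the bandwidth argument concrete: a kernel can never run faster than the lower of peak compute and memory bandwidth multiplied by its arithmetic intensity. The sketch below plugs in the dense FP16 Tensor figures and bandwidths from the spec table; the 100 FLOPs-per-byte intensity is an arbitrary example, not a measured workload.

```python
# Roofline-style bound: achievable TFLOPS = min(peak compute, bandwidth * arithmetic intensity).
# Peak dense FP16 Tensor TFLOPS and bandwidth (TB/s) are taken from the spec table above.
GPUS = {
    "A100 80GB": (312, 2.0),
    "L40S":      (362, 0.864),
    "H100 SXM5": (756, 3.35),
}

def achievable_tflops(peak_tflops, bandwidth_tbs, flops_per_byte):
    # bandwidth (TB/s) * FLOPs-per-byte gives the bandwidth-limited TFLOPS ceiling
    return min(peak_tflops, bandwidth_tbs * flops_per_byte)

for name, (peak, bw) in GPUS.items():
    print(f"{name}: ~{achievable_tflops(peak, bw, flops_per_byte=100):.0f} TFLOPS achievable")
```

With these illustrative numbers the L40S lands far below its paper peak (~86 TFLOPS), while the HBM-equipped A100 and H100 stay much closer to theirs, which is exactly the bottleneck described above.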
Power efficiency comparison
The L40S has a maximum thermal design power (TDP) of 350W, which is lower than both the A100 SXM4 (400W) and the H100 (700W). A lower TDP is not an advantage by itself in high-performance computing, though: what matters is performance per watt, and the L40S also delivers lower peak performance than the A100 and H100.
The H100, despite having the highest TDP, also offers the highest performance across all categories (FP16, FP32, and FP64). As a result, the H100 has better performance-per-watt than the A100 and L40S.
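As a back-of-the-envelope check, you can divide the dense FP16 Tensor throughput from the spec table by TDP. This ignores real-world clocks and utilization, so treat it as a rough indicator only.

```python
# TFLOPS-per-watt estimate from the spec table (dense FP16 Tensor TFLOPS, TDP in watts).
SPECS = {
    "A100": (312, 400),
    "L40S": (362, 350),
    "H100 SXM5": (756, 700),
}

for name, (tflops, tdp) in SPECS.items():
    print(f"{name}: {tflops / tdp:.2f} TFLOPS per watt")
```

On this naive metric the H100 comes out ahead, consistent with the point above; actual efficiency will depend on the workload and how close you run to peak.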
L40S Price Comparison with A100 and H100
While demand for high-performance GPUs remains high, the availability of the L40S on cloud GPU platforms like DataCrunch is improving. Here is how it compares in cost per hour with the A100 and H100.
| ||||
---|---|---|---|---|
On-demand instance | $1.29/hour | $1.36/hour | $1.75/hour | $3.17/hour |
↳ 2-year price | $0.97/hour | $1.02/hour | $1.31/hour | $2.38/hour |
8-GPU on-demand instance | $10.32/hour | $10.86/hour | $14.00/hour | $25.36/hour |
↳ 2-year price | $7.74/hour | $8.14/hour | $10.50/hour | $19.02/hour |
Key point about costs: the price per hour of the L40S is comparable to the A100 40GB and is substantially lower than the H100 on a 2-year contract.
With DataCrunch you can also deploy the L40S using dynamic pricing, where the hourly price is often a lot lower.
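To translate hourly rates into a project budget, multiply the rate by the GPU-hours you expect to consume. The sketch below uses illustrative per-GPU rates in the same range as the table above (swap in the actual prices for the instance you pick) and compares on-demand against a 2-year commitment.

```python
# Simple GPU budget estimate: hourly rate x GPU-hours consumed.
# Rates below are illustrative placeholders in the range of the table above.
def run_cost(hourly_rate_per_gpu, gpu_hours):
    return hourly_rate_per_gpu * gpu_hours

on_demand = 1.36   # $/GPU-hour (illustrative)
two_year  = 1.02   # $/GPU-hour (illustrative, roughly 25% lower, as in the table)

gpu_hours = 8 * 24 * 30   # an 8-GPU node running around the clock for a month
print(f"On-demand: ${run_cost(on_demand, gpu_hours):,.0f}")
print(f"2-year commitment: ${run_cost(two_year, gpu_hours):,.0f}")
```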
Bottom line on the L40S
You can consider the NVIDIA L40S an outlier in today’s competitive field of computing accelerators. While it doesn’t have the raw performance of the H100 or newer models, there are many areas where it compares favourably to the A100 and earlier GPUs.
L40S strengths
Versatile performance: The L40S combines a high CUDA core count with 142 RT Cores, making it suitable for a broad range of workloads, including graphics-intensive ones.
Large memory capacity: Its 48GB of GDDR6 memory gives it a small numerical edge over the A100 40GB, allowing it to handle somewhat larger models, albeit at lower memory bandwidth (see the sizing sketch below).
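As a rule of thumb for what “larger models” means in practice, you can estimate the parameter count whose weights fit in a given amount of GPU memory. The sketch below counts weights only and reserves 20% headroom; activations, KV cache, and optimizer states need extra room on top, so treat the results as upper bounds.

```python
# Rough weights-only capacity estimate per GPU memory size.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def max_params_billions(memory_gb, dtype="fp16", headroom=0.8):
    usable_bytes = memory_gb * 1e9 * headroom   # keep ~20% free for overhead
    return usable_bytes / BYTES_PER_PARAM[dtype] / 1e9

for mem_gb in (40, 48, 80):   # A100 40GB, L40S 48GB, A100/H100 80GB
    print(f"{mem_gb} GB -> ~{max_params_billions(mem_gb):.0f}B params in FP16 (weights only)")
```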
L40S limitations
Raw computing power: For the absolute highest training throughput on massive models, the A100 or H100 hold an advantage due to their higher raw compute capabilities and high HBM memory bandwidth.
Lack of double precision: The L40S cannot natively handle FP64, which can be an issue for high-precision scientific and HPC workloads.
In today’s market you shouldn’t dismiss the L40S. You can expect lower cost in the long term and better availability than the A100 80GB or the H100. It is a versatile GPU for machine learning projects where absolute compute speed is not your most important decision factor.