While NVIDIA has released more powerful GPUs, both the A100 and V100 remain high-performance accelerators for various machine learning training and inference projects.
Compared to newer GPUs, the A100 and V100 both have better availability on cloud GPU platforms like DataCrunch, and you’ll also often see lower total costs per hour for on-demand access.
You don’t need to assume that a newer GPU instance or cluster is automatically better. Here is a detailed outline of the specs, performance factors and pricing that may make you consider the A100 or the V100.
V100 vs A100 vs H100 Datasheet Comparison
GPU Features | NVIDIA V100 | NVIDIA A100 | NVIDIA H100 |
---|---|---|---|
SMs | 80 | 108 | 132 |
TPCs | 40 | 54 | 66 |
FP32 Cores / SM | 64 | 64 | 128 |
FP32 Cores / GPU | 5120 | 6912 | 16896 |
FP64 Cores / SM (excl. Tensor) | 32 | 32 | 64 |
FP64 Cores / GPU (excl. Tensor) | 2560 | 3456 | 8448 |
INT32 Cores / SM | 64 | 64 | 64 |
INT32 Cores / GPU | 5120 | 6912 | 8448 |
Tensor Cores / SM | 8 | 4 | 4 |
Tensor Cores / GPU | 640 | 432 | 528 |
Texture Units | 320 | 432 | 528 |
Memory Interface | 4096-bit HBM2 | 5120-bit HBM2 | 5120-bit HBM3 |
Memory Bandwidth | 900 GB/sec | 1555 GB/sec | 3000 GB/sec |
Transistors | 21.1 billion | 54.2 billion | 80 billion |
Max thermal design power (TDP) | 300 Watts | 400 Watts | 700 Watts |
* see more detailed comparisons of A100 vs H100.
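If you already have an instance running, you can sanity-check a few of these datasheet figures directly. Here is a minimal PyTorch sketch (assuming a CUDA-enabled PyTorch install) that reports the SM count, memory size and compute capability of whichever GPU you are on:

```python
import torch

# Minimal sketch: print a few of the datasheet figures above for device 0.
props = torch.cuda.get_device_properties(0)
print(f"Device:             {props.name}")
print(f"SM count:           {props.multi_processor_count}")
print(f"Memory:             {props.total_memory / 1024**3:.1f} GiB")
print(f"Compute capability: {props.major}.{props.minor}")
```

On a V100 you should see 80 SMs and compute capability 7.0; on an A100, 108 SMs and compute capability 8.0.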
Overview of the NVIDIA V100 GPU
The NVIDIA V100, launched in 2017, marked a significant leap in GPU technology with the introduction of Tensor Cores. These cores were designed to accelerate matrix operations, which are fundamental to deep learning and AI workloads. Here are some key features and capabilities of the V100:
Tensor Cores: The V100 was the first GPU to incorporate Tensor Cores, providing up to 12x performance improvement for deep learning training compared to its predecessors.
Memory: It features 16 GB of HBM2 memory, with a memory bandwidth of 900 GB/s, enabling it to handle large datasets efficiently.
Performance: With 640 Tensor Cores and 5,120 CUDA Cores, the V100 delivers 125 teraflops of deep learning performance.
The V100 has been widely adopted in AI research, autonomous driving, medical imaging, and other AI-heavy industries. Famously, OpenAI used over 10,000 V100s to train the GPT-3 large language model.
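The matrix multiplications those Tensor Cores accelerate are the same ones you hit in everyday framework code. Here is a minimal PyTorch sketch of a half-precision matmul, which cuBLAS routes through the Tensor Cores on a V100 or newer GPU:

```python
import torch

# Minimal sketch: a half-precision (FP16) matrix multiply. On a V100 and
# later GPUs this runs on the Tensor Cores; on pre-Volta hardware it falls
# back to the ordinary CUDA cores.
a = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)

c = a @ b
torch.cuda.synchronize()
print(c.shape, c.dtype)  # torch.Size([8192, 8192]) torch.float16
```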
Overview of the NVIDIA A100 GPU
Building on the V100's foundation, the NVIDIA A100, introduced in 2020, represented another major advancement in GPU technology for AI and HPC. It included several new advances designed to meet the growing demands of AI workloads:
Enhanced Tensor Cores: The A100 features third-generation Tensor Cores that support a new data type, TensorFloat-32 (TF32), which NVIDIA rates at up to 20x the V100's FP32 training throughput.
Memory: The A100 comes with either 40 GB or 80 GB of HBM2 memory (HBM2e on the 80 GB variant) and a significantly larger 40 MB L2 cache, increasing its ability to handle even larger datasets and more complex models.
Performance: With 6,912 CUDA Cores and 432 Tensor Cores, the A100 offers 312 teraflops of deep learning performance, making it a powerhouse for AI applications.
Multi-Instance GPU (MIG): One of the standout features of the A100 is its ability to partition itself into up to seven independent instances, allowing multiple networks to be trained or inferred simultaneously on a single GPU.
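MIG itself is configured by an administrator (typically with nvidia-smi), but from a workload's perspective the main question is whether it is enabled on the GPU you were given. Below is a minimal sketch using the nvidia-ml-py bindings (assuming the package is installed and importable as pynvml); on a V100 the query simply reports MIG as unsupported:

```python
import pynvml  # pip install nvidia-ml-py

# Minimal sketch: check whether MIG mode is enabled on device 0.
# Partitioning into up to seven instances is done separately by an
# administrator (e.g. via nvidia-smi); this only reads the current state.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)
if isinstance(name, bytes):  # older bindings return bytes
    name = name.decode()

try:
    current, pending = pynvml.nvmlDeviceGetMigMode(handle)
    state = "enabled" if current == pynvml.NVML_DEVICE_MIG_ENABLE else "disabled"
    print(f"{name}: MIG {state}")
except pynvml.NVMLError:
    print(f"{name}: MIG not supported (e.g. a V100)")

pynvml.nvmlShutdown()
```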
V100 and A100 architecture compared
The architectural improvements in the A100's Streaming Multiprocessors (SMs) play an important role in its performance gains over the V100. While the V100's SMs were already highly efficient, the A100's SMs have been significantly optimized:
V100 SM Architecture: The V100's SM architecture includes 64 CUDA Cores per SM, with a total of 5,120 CUDA Cores across the GPU. Each SM also includes eight Tensor Cores, designed to accelerate matrix multiplications.
A100 SM Architecture: The A100's SM architecture includes 64 FP32 CUDA Cores per SM across 108 SMs, resulting in a total of 6,912 CUDA Cores. Each SM also features four third-generation Tensor Cores, which support TF32 and fine-grained structured sparsity, further boosting AI performance.
Difference in SXM socket solutions
Both the V100 and A100 come with NVIDIA's proprietary SXM (Server PCI Express Module) high-bandwidth socket solutions.
The V100 comes with either a SXM2 or SXM3 socket, while the A100 utilizes the more advanced SXM4. See a comparison of the A100 PCIe and SXM4 options.
Shift from 1st to 3rd generation Tensor Cores
There is a major shift from the first-generation Tensor Cores found in the V100 to the third-generation Tensor Cores in the A100:
V100 Tensor Cores: The V100's first-generation Tensor Cores operate on FP16 inputs with FP16 or FP32 accumulation, which is what mixed-precision training targets.
A100 Tensor Cores: The A100 introduces third-generation Tensor Cores that support TF32, a precision format that keeps FP32's numeric range while using FP16-level mantissa precision, so existing FP32 code can run on the Tensor Cores without changes.
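In PyTorch, TF32 is controlled by a pair of backend flags, so you can see the difference without touching model code. A minimal sketch (assuming a recent PyTorch build with CUDA); on a V100 the flags are simply ignored and the matmul runs in ordinary FP32:

```python
import torch

# Minimal sketch: opt FP32 matmuls and convolutions into TF32 mode.
# On an A100 these operations then run on the Tensor Cores; on a V100
# the flags have no effect.
torch.backends.cuda.matmul.allow_tf32 = True  # matmuls (default varies by PyTorch version)
torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
y = x @ w  # TF32 on Ampere and newer, plain FP32 on Volta
```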
A100 and V100 Performance Benchmarks
Both the V100 and A100 were designed with high-performance workloads in mind.
ML training performance:
V100: The V100 was the first GPU to break the 100 teraflops barrier for deep learning performance, clocking an impressive 120 teraflops, which NVIDIA compared to the throughput of 100 CPUs.
A100: The A100, with its 312 teraflops of deep learning performance using TF32 precision (with structured sparsity), provides up to a 20x speedup over the V100's FP32 training performance.
Inference performance:
V100: The V100 is highly effective for inference tasks, with optimized support for FP16 and INT8 precision, allowing for efficient deployment of trained models.
A100: The A100 further enhances inference performance with its support for TF32 and mixed-precision capabilities. The GPU's ability to handle multiple precision formats and its increased compute power enable faster and more efficient inference, crucial for real-time AI applications.
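Here is a minimal sketch of what low-precision inference looks like in practice, assuming a torchvision ResNet-50 as a stand-in for a trained model (INT8 deployment typically goes through a separate toolchain such as TensorRT and is not shown here):

```python
import torch
import torchvision.models as models

# Minimal sketch: FP16 autocast inference. This maps onto the Tensor Cores
# of both the V100 and the A100; on an A100 you could instead keep the model
# in FP32 and rely on TF32 (see the flags above).
model = models.resnet50(weights=None).cuda().eval()
batch = torch.randn(32, 3, 224, 224, device="cuda")

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = model(batch)
print(logits.shape)  # torch.Size([32, 1000])
```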
Real-world application benchmarks
In addition to the theoretical benchmarks, it’s valuable to see how the V100 and A100 compare when used with common frameworks like PyTorch and TensorFlow. According to real-world benchmarks developed by NVIDIA:
1 x A100 is around 60-70% faster than 1 x V100, when training a convolutional neural network (ConvNet) on PyTorch, with mixed precision.
1 x A100 is around 100-120% faster than 1 x V100, when training a ConvNet on TensorFlow, with mixed precision.
8 x A100 is around 70-80% faster than 8 x V100, when training a ConvNet on PyTorch, with mixed precision.
8 x A100 is around 70-80% faster than 8 x V100, when training a ConvNet on TensorFlow, with mixed precision.
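For reference, the kind of mixed-precision ConvNet training those benchmarks describe boils down to a few lines of framework code. A minimal sketch using PyTorch's automatic mixed precision, with a torchvision ResNet-50 and synthetic data standing in for a real dataset; the same script runs on a V100 or an A100, only the hardware changes:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Minimal sketch of a mixed-precision ConvNet training loop.
model = models.resnet50(weights=None).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    images = torch.randn(64, 3, 224, 224, device="cuda")   # synthetic batch
    labels = torch.randint(0, 1000, (64,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(images), labels)

    scaler.scale(loss).backward()  # FP16 forward/backward on the Tensor Cores
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()
```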
V100 and A100 Pricing
Both the V100 and A100 are now widely available as on-demand instances or GPU clusters. Current on-demand prices for instances at DataCrunch:
80 GB A100 SXM4: $1.89/hour
40 GB A100 SXM4: $1.29/hour
16 GB V100: $0.62/hour
*a detailed summary of all cloud GPU instance prices can be found here.
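To turn hourly prices into something comparable, it helps to factor in the speedups from the benchmarks above. A rough sketch, assuming a hypothetical job that takes 10 hours on a single V100 and the ~70% single-GPU PyTorch speedup quoted earlier (both figures are illustrative assumptions, not measurements):

```python
# Rough cost comparison using the DataCrunch on-demand prices above.
# The 10-hour V100 job and the 1.7x speedup are illustrative assumptions.
v100_price_per_hour = 0.62   # $/h, 16 GB V100
a100_price_per_hour = 1.29   # $/h, 40 GB A100 SXM4

v100_hours = 10.0
a100_hours = v100_hours / 1.7  # ~70% faster on one A100

print(f"V100: {v100_hours:.1f} h, ${v100_hours * v100_price_per_hour:.2f}")  # 10.0 h, $6.20
print(f"A100: {a100_hours:.1f} h, ${a100_hours * a100_price_per_hour:.2f}")  # ~5.9 h, ~$7.59
```

In this example the A100 finishes in roughly 60% of the time for about 20% more total spend, so the roughly 2x gap in hourly price shrinks considerably once throughput is taken into account.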
Bottom line on the V100 and A100
While both the NVIDIA V100 and A100 are no longer top-of-the-range GPUs, they are still extremely powerful options to consider for AI training and inference.
The NVIDIA A100 Tensor Core GPU represents a significant leap forward from its predecessor, the V100, in terms of performance, efficiency, and versatility. With its 3rd Generation Tensor Cores, increased memory capacity, and new features like Multi-Instance GPU (MIG) technology, the A100 is well-suited for many AI and HPC workloads.
Even so, the wide availability (and lower cost per hour) of the V100 makes it a perfectly viable option for many projects that require less memory bandwidth and speed. The V100 remains one of the most commonly used chips in AI research today, and can be a solid option for inference and fine-tuning.
Now that you have a better understanding of the V100 and A100, why not get some practical experience with either GPU? Spin up an on-demand instance on DataCrunch and compare performance yourself.