Deploy DeepSeek-R1 671B on 8x NVIDIA H200 with SGLang

In this guide, we show how to deploy DeepSeek-R1 with SGLang on 8x NVIDIA H200 GPUs, which are available on-demand through the DataCrunch Cloud Platform, and run a performance benchmark.

Inference engine: SGLang

SGLang is the recommended inference engine for deploying DeepSeek models, in particular DeepSeek-V3/R1. SGLang currently supports MLA optimizations, DP Attention, FP8 (W8A8), FP8 KV Cache, and Torch Compile, delivering state-of-the-art latency and throughput among open-source inference frameworks.
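
Most of these optimizations map directly to launch flags. The sketch below is illustrative rather than exhaustive (MLA and the model's native FP8 weights are applied automatically for DeepSeek checkpoints), and flag names can differ between SGLang releases, so check python3 -m sglang.launch_server --help for your version:

# Illustrative flags only: DP Attention, Torch Compile, and an FP8 KV cache
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 \
    --enable-dp-attention --enable-torch-compile --kv-cache-dtype fp8_e5m2 \
    --trust-remote-code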

Notably, SGLang v0.4.1 fully supports running DeepSeek-V3 on both NVIDIA and AMD GPUs, making it a highly versatile and robust solution. SGLang also supports multi-node tensor parallelism, enabling you to run this model on multiple network-connected machines.
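
As a rough sketch, a two-node deployment with 8 GPUs per node could be launched as follows; the rendezvous address 10.0.0.1:5000 is a placeholder, and the SGLang multi-node documentation describes the exact invocation supported by your version:

# Node 0 (head node); 10.0.0.1:5000 is a placeholder rendezvous address
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 16 \
    --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# Node 1
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 16 \
    --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code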

Multi-Token Prediction (MTP) support is still in development; progress can be tracked in the SGLang optimization plan (e.g. FusedMoE H200-aware tuning) and in the ongoing custom kernel development.

We have been providing the SGLang team with GPU infrastructure for H200-aware tuning to achieve optimal performance (see the H200 DeepSeek-V3/R1 benchmarking results).

Deploying DeepSeek-R1

1. Pull the official SGLang Docker image, as recommended:
docker pull lmsysorg/sglang:latest
2. The following command creates a Docker container to host DeepSeek-R1:
docker run --gpus all \
    --shm-size 32g \
    --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name deepseek_r1 \
    -it \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    lmsysorg/sglang:latest \
    bash
3. This opens an interactive session inside the container, where the following command launches the server with DeepSeek-R1:
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --enable-dp-attention
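
Once the server reports it is ready (it listens on port 30000 by default), you can send a quick test request to its OpenAI-compatible endpoint from the host, since the container runs with --network=host. A minimal example with an arbitrary prompt:

curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "deepseek-ai/DeepSeek-R1",
          "messages": [{"role": "user", "content": "What is the capital of Finland?"}],
          "max_tokens": 64
        }'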

Benchmarking DeepSeek-R1

1. The following command runs a benchmark workload with batch size 1, 128 input tokens, and 256 output tokens:
python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-R1  --trust-remote-code --tp 8 --enable-torch-compile --torch-compile-max-bs 1
Prefill. latency: 1.91032 s, throughput:     67.00 token/s
Decode.  latency: 1.04900 s, throughput:      0.95 token/s
Decode.  latency: 0.02175 s, throughput:     45.99 token/s
Decode.  latency: 0.02097 s, throughput:     47.69 token/s
Decode.  latency: 0.02097 s, throughput:     47.68 token/s
Decode.  latency: 0.02080 s, throughput:     48.07 token/s
Decode.  median latency: 0.02097 s, median throughput:     47.68 token/s
Total. latency:  3.086 s, throughput:     44.07 token/s
Benchmark ...
Prefill. latency: 0.19635 s, throughput:    651.90 token/s
Decode.  latency: 0.02100 s, throughput:     47.62 token/s
Decode.  latency: 0.02078 s, throughput:     48.13 token/s
Decode.  latency: 0.02092 s, throughput:     47.80 token/s
Decode.  latency: 0.02086 s, throughput:     47.93 token/s
Decode.  latency: 0.02085 s, throughput:     47.97 token/s
Decode.  median latency: 0.02098 s, median throughput:     47.67 token/s
Total. latency:  5.537 s, throughput:     69.35 token/s

The workload is run twice as a sanity check: the first pass includes warm-up overhead (e.g. Torch compilation on the first decode step), so the second pass reflects steady-state performance. The end user will perceive the decode latency shown below; at batch size 1, this corresponds to roughly 1 token / 0.02098 s ≈ 47.67 token/s:

Decode.  median latency: 0.02098 s, median throughput:     47.67 token/s
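
bench_one_batch measures a single static batch. For a more realistic view of user-perceived latency and throughput under concurrent requests, you can also point SGLang's online serving benchmark at the running server; a minimal sketch is shown below (see python3 -m sglang.bench_serving --help for the dataset, request-rate, and sequence-length options available in your version):

# Online serving benchmark against the already-running server (default port 30000)
python3 -m sglang.bench_serving --backend sglang --host 127.0.0.1 --port 30000 --num-prompts 100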

Next steps

To reproduce the steps above, you can access the required 8x NVIDIA H200 GPUs through the DataCrunch Cloud Platform. DataCrunch is trusted by leading AI researchers and engineers for training and inference workloads across a wide range of scales and use cases.

Get started now →

