This technical guide covers the deployment flow of containerized workloads using Enroot and Pyxis within DataCrunch Instant Clusters. Throughout this guide, we detail the setup process, from initial configuration to incremental testing of multi-node distributed workloads using TorchTitan. All artifacts used can be found in our GitHub repository.
Introduction
Enroot is a lightweight, unprivileged container runtime optimized for HPC environments, designed specifically to execute containerized applications with minimal overhead. Enroot seamlessly converts Docker or OCI images into SquashFS files, enabling rapid deployment across HPC nodes and ensuring efficient parallel workload execution.
Pyxis is a Slurm plugin that provides native integration of container runtimes like Enroot within the Slurm resource manager. By extending Slurm’s job submission commands (i.e., sbatch and srun), Pyxis allows users to specify container images directly in job scripts. Pyxis automates the container lifecycle, including image pulling, caching, and execution, providing containerized environments and ensuring reproducible experiments.
Job Submission Workflow
- Submitting Slurm jobs with Pyxis-specific options
- Pulling and converting Docker images into Enroot bundles with Pyxis
- Launching containerized jobs over the worker nodes
Testing Environment
- A cluster of 16x H200 GPUs split across two worker nodes: 8x on WorkerNode1 (WN1) and 8x on WorkerNode2 (WN2)
- Both WNs are accessible from the HeadNode (HN or jump host), which is a CPU-only node
- WNs and HN share an NFS (Network File System) mounted at /home, so all users and their home directories are shared across the cluster
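A quick way to confirm this layout on a WN (assuming the node-local NVMe disk is mounted at /mnt/local_disk, as used throughout this guide) is:
# /home should show an NFS filesystem; /mnt/local_disk should be node-local storage
df -hT /home
df -hT /mnt/local_disk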
Prerequisites
We want each WN to store its own artifacts (Docker and Enroot images, intermediate data, caches, runtimes) on local storage to avoid race conditions from concurrent access, resulting in a clean and isolated setup.
Docker Configuration
Note: Docker configuration is only needed if we use Docker Hub to download images and then convert them to the Enroot .sqsh format.
- Modify Docker’s root directory on each WN to point to local NVMe storage:
sudo mkdir -p /mnt/local_disk/docker
sudo vim /etc/docker/daemon.json
- Add the following configuration:
{
"data-root": "/mnt/local_disk/docker",
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
}
- Restart Docker
sudo systemctl restart docker
- Verify setup
docker info | grep "Docker Root Dir"
Enroot Configuration
Enroot configuration resides in /etc/enroot/enroot.conf
. The default config points to:
ENROOT_LIBRARY_PATH /usr/lib/enroot # Path to library sources
ENROOT_SYSCONF_PATH /etc/enroot # Path to system configuration files
ENROOT_RUNTIME_PATH ${XDG_RUNTIME_DIR}/enroot # Path to the runtime working directory
ENROOT_CONFIG_PATH ${XDG_CONFIG_HOME}/enroot # Path to user configuration files
ENROOT_CACHE_PATH ${XDG_CACHE_HOME}/enroot # Path to user image/credentials cache
ENROOT_DATA_PATH ${XDG_DATA_HOME}/enroot # Path to user container storage
ENROOT_TEMP_PATH ${TMPDIR} # Path to temporary directory
The XDG variables point to paths under /home/user, which resides on the NFS (Network File System; the shared storage between the HN and WNs).
To avoid concurrent Enroot file conflicts and to support a multi-user, multi-job setup, we set per-node local directories in /etc/enroot/enroot.conf:
ENROOT_LIBRARY_PATH /usr/lib/enroot
ENROOT_SYSCONF_PATH /etc/enroot
ENROOT_RUNTIME_PATH /mnt/local_disk/enroot/runtime/$UID
ENROOT_CONFIG_PATH /mnt/local_disk/enroot/config
ENROOT_CACHE_PATH /mnt/local_disk/enroot/cache/$UID
ENROOT_DATA_PATH /mnt/local_disk/enroot/data/$UID
ENROOT_TEMP_PATH ${TMPDIR:-/tmp}
With this configuration, we encounter errors such as:
slurmstepd: error: pyxis: mkdir: cannot create directory ‘/mnt/local_disk/enroot/cache/.tokens.1000’: Permission denied
This is caused by permission issues when different users write to the same directories: Slurm has trouble propagating Linux group memberships into the job script environment.
As a solution, we propose to make the Enroot folder world-writable:
sudo chmod -R a+rwx /mnt/local_disk/enroot
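For reference, a minimal per-node setup sketch, run once on each WN (assuming the directories from the configuration above do not already exist):
# Create the per-node Enroot directories and make them writable by all users
sudo mkdir -p /mnt/local_disk/enroot/{runtime,config,cache,data}
sudo chmod -R a+rwx /mnt/local_disk/enroot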
We found another permission issue caused by a bug in version 1.17.7 of the libnvidia-container-tools, libnvidia-container1, nvidia-container-toolkit, and nvidia-container-toolkit-base packages:
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: nvidia-container-cli: container error: file lookup failed: /proc/55900/root/mnt/local_disk/enroot/data/pyxis_torchtitan_singlenode/etc/debian_version: permission denied
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: 1x-instant-cluster-testing-1: task 0: Exited with exit code 1
The proposed solution is to downgrade libnvidia-container-tools, libnvidia-container1, nvidia-container-toolkit, and nvidia-container-toolkit-base to 1.17.6-1:
sudo apt-get install -y \
libnvidia-container1=1.17.6-1 \
libnvidia-container-tools=1.17.6-1 \
nvidia-container-toolkit-base=1.17.6-1 \
nvidia-container-toolkit=1.17.6-1 \
--allow-downgrades --allow-change-held-packages
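To keep the downgrade in place across routine package upgrades, the packages can optionally be held and the installed version verified (an extra step beyond the fix above):
# Pin the downgraded packages so upgrades do not pull 1.17.7 back in
sudo apt-mark hold libnvidia-container1 libnvidia-container-tools \
    nvidia-container-toolkit-base nvidia-container-toolkit
# Verify the installed version
nvidia-container-cli --version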
With the proper configuration set up, we proceed to dissect, check, and test each incremental step towards a complete workload.
Enroot Health Check
This script verifies that Enroot is correctly installed and operational on the host machine. It performs a systematic validation by checking Enroot's presence, confirming the version, importing a minimal Docker image (ubuntu), creating and launching a test container, and retrieving basic runtime information such as the container's PID and operating system details. It also ensures proper cleanup after execution, removing any residual test artifacts.
#!/bin/bash
set -e
echo "Checking Enroot installation..."
command -v enroot || { echo "Enroot not installed."; exit 1; }
echo "Running Enroot version check..."
enroot version || exit 1
TEST_IMG="ubuntu"
TEST_CONT="enroot_healthcheck"
# Remove any leftover image from a previous run
[ -f ${TEST_IMG}.sqsh ] && rm -f ${TEST_IMG}.sqsh
# Import a minimal Docker image and create a test container from it
enroot import docker://${TEST_IMG}
enroot create -n ${TEST_CONT} ${TEST_IMG}.sqsh
# Retrieve basic runtime information from inside the container
PID_OUTPUT=$(enroot start ${TEST_CONT} sh -c 'echo $$')
OS_OUTPUT=$(enroot start ${TEST_CONT} sh -c 'grep PRETTY /etc/os-release')
echo "PID: $PID_OUTPUT"
echo "OS: $OS_OUTPUT"
# Clean up test artifacts
enroot remove -f ${TEST_CONT}
rm -f ${TEST_IMG}.sqsh
echo "Enroot health check PASSED."
Pyxis Health Check
There are two alternatives for performing this health check.
- Run the srun command with a Pyxis flag:
srun --container-image=ubuntu grep PRETTY /etc/os-release
- Execute a Slurm script with the Pyxis-extended SBATCH flags:
#!/bin/bash
#SBATCH --job-name=pyxis_test
#SBATCH --output=pyxis_test.out
#SBATCH --container-name=pyxis_test
#SBATCH --container-image=docker://ubuntu
echo "Running inside container:"
grep PRETTY /etc/os-release
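Assuming the script above is saved as pyxis_test.sbatch, it can be submitted and checked with:
sbatch pyxis_test.sbatch
# Once the job has completed:
cat pyxis_test.out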
Now that we have tested and checked the incremental steps required, we can start an end-to-end training workload. First, we need to ensure that the framework performs correctly in a single-node setup before moving to a multi-node one.
TorchTitan Single-node Testing
Next, we perform a test training run of Llama 8B using the c4_test dataset from tests/assets/c4_test included in the TorchTitan repository. The container performs a 10-step training run, which will be reflected in the /home/ubuntu/slurm_logging/headnode/%x_%j_headnode.err file.
Prerequisites: TorchTitan custom image
torchtitan.dockerfile:
Note: HF_TOKEN must be configured as an environment variable
# Using the official PyTorch 2.7 + CUDA 12.8 base image
FROM pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime
#FROM ubuntu:22.04
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}
# makes sure the shell used for subsequent RUN commands is exactly Bash, as located in /bin.
SHELL ["/bin/bash", "-c"]
# Install dependencies
# llamacpp gcc compilation tools
RUN apt-get update && apt-get install -y \
    build-essential \
    fzf \
    ripgrep \
    nvtop \
    sudo \
    kmod \
    wget \
    vim \
    git \
    curl \
    bzip2 \
    ca-certificates \
    libglib2.0-0 \
    libxext6 \
    libsm6 \
    libxrender1 \
    libssl-dev \
    libibverbs1 \
    ibverbs-utils \
    libmlx5-1 \
    infiniband-diags
# Optional cleanup to remove the apt cache and reduce the image size.
# IMPORTANT: if enabled, apt-get update must be re-run inside the container.
#&& rm -rf /var/lib/apt/lists/*
# Cloning the repo
RUN git clone https://github.com/pytorch/torchtitan
# Change to the repo directory using WORKDIR
WORKDIR /workspace/torchtitan
RUN mkdir -p /root/.cache/huggingface
RUN pip install -r requirements.txt
# For CUDA 12.8 on worker nodes
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Download the tokenizer
RUN python3 scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --tokenizer_path "original"
# docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
# docker run --gpus all --shm-size 32g --network=host -v /home/ubuntu/.cache/huggingface:/root/.cache/huggingface --name torchtitan_workload -it --rm --ipc=host torchtitan_cuda128_torch27 bash -c 'CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh'
Build the Docker image:
docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
Import the image to Enroot:
The enroot import command can fetch images using the following schemes:
docker://[USER@][REGISTRY#]IMAGE[:TAG] # Import a Docker image from a registry
dockerd://IMAGE[:TAG] # Import a Docker image from the Docker daemon
podman://IMAGE[:TAG] # Import a Docker image from a local podman repository
To import the previously built image from the local Docker daemon, we use:
enroot import -o /home/ubuntu/torchtitan_cuda128_torch27.sqsh dockerd://torchtitan_cuda128_torch27
Slurm script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=50
#SBATCH --partition=gpus
#SBATCH --job-name=torchtitan_singlenode
#SBATCH -o /home/ubuntu/slurm_logging/headnode/%x_%j.out
#SBATCH -e /home/ubuntu/slurm_logging/headnode/%x_%j.err
CONFIG_FILE=${CONFIG_FILE:-"torchtitan/models/llama3/train_configs/debug_model.toml"}
srun --container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh \
--container-name=torchtitan_singlenode \
--container-mounts=/home/ubuntu/.cache/huggingface:/root/.cache/huggingface \
--container-writable \
--no-container-mount-home \
bash run_train.sh --job.config_file ${CONFIG_FILE}
The Pyxis flags included are as follows:
--container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh
Specifies the SquashFS file (or Enroot URI) to use as the container filesystem. In this case, we’re pointing at the previously created torchtitan_cuda128_torch27.sqsh image.
--container-name=torchtitan_singlenode
Name of the container. It will be cached in the Enroot data directory (in our config: /mnt/local_disk/enroot/data/$UID) as pyxis_torchtitan_singlenode.
--container-mounts=/home/ubuntu/.cache/huggingface:/root/.cache/huggingface
Bind-mounts the WN's Hugging Face cache directory into the container at /root/.cache/huggingface.
--no-container-mount-home
Prevents Pyxis from automatically bind-mounting our home directory into the container. This is recommended to isolate the container's view of our home directory and avoid conflicts with NFS permissions.
--container-writable
Makes the container filesystem writable (by default, SquashFS images are mounted read-only). This allows in-container writes (e.g., installing packages and writing checkpoints) without additional mounts.
Resolve possible "Read-only file system" errors with:
#SBATCH --container-writable
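Assuming the script above is saved as torchtitan_singlenode.sbatch, the job can be submitted and monitored from the HN:
sbatch torchtitan_singlenode.sbatch
# Check the queue and follow the training log (substitute the job ID printed by sbatch)
squeue -u $USER
tail -f /home/ubuntu/slurm_logging/headnode/torchtitan_singlenode_<JOBID>.err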
We propose the following command for interactive use and debugging of the deployed container:
srun \
--partition=gpus --gres=gpu:1 --ntasks=1 --cpus-per-task=4 \
--container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh \
--container-name=interactive_torchtitan --pty bash
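Once inside the interactive shell, a couple of quick checks (suggestions, not prescribed by the workflow above) confirm that the GPUs and the PyTorch installation are visible:
# Inside the container shell
nvidia-smi
python -c "import torch; print(torch.__version__, torch.cuda.device_count())"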
TorchTitan Multi-node Testing
Note: The previously created torchtitan_cuda128_torch27.sqsh image is required.
We now reproduce a production-ready workload for the distributed training of the Llama 70B model with the c4 dataset.
Slurm job configuration with Pyxis:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=50
#SBATCH --partition=gpus
#SBATCH --job-name=torchtitan_multinode
#SBATCH -o /home/ubuntu/slurm_logging/headnode/%x_%j.out
#SBATCH -e /home/ubuntu/slurm_logging/headnode/%x_%j.err
# === Compute these HOST-side ===
HEADNODE_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
MASTER_ADDR=$(getent hosts "$HEADNODE_HOST" | grep -Eo '10\.[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
MASTER_PORT=$((5000 + SLURM_JOB_ID % 10000))
CONFIG_FILE=${CONFIG_FILE:-"torchtitan/models/llama3/train_configs/llama3_70b.toml"}
echo "======== Distributed Config ========"
echo "HEADNODE_HOST: $HEADNODE_HOST"
echo "Resolved MASTER_ADDR: $MASTER_ADDR"
echo "Assigned MASTER_PORT: $MASTER_PORT"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "All Hosts:"
scontrol show hostnames "$SLURM_JOB_NODELIST"
echo "===================================="
# === Launch the container job ===
srun \
--container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh \
--container-mounts=/home/ubuntu/.cache/huggingface:/root/.cache/huggingface \
--container-writable \
--export=ALL,HEADNODE_HOST=$HEADNODE_HOST,MASTER_ADDR=$MASTER_ADDR,MASTER_PORT=$MASTER_PORT,NCCL_DEBUG=INFO,NCCL_DEBUG_SUBSYS=ALL \
torchrun --nnodes=2 --nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m torchtitan.train --job.config_file ${CONFIG_FILE}
As some computation is required outside the container to resolve the WN hostnames and configure the master address, we move the Pyxis options from the #SBATCH prelude to the srun command itself.
For the Slurm script, two torchrun tasks are required: one task with 8 GPUs on each node. We assign output (-o) and error (-e) files for the HN. As the computation is performed inside the container (which is not aware of Slurm), no per-WN log files are produced in the form of:
srun --output=/home/ubuntu/slurm_logging/workernodes/multinode_torch_test_%j_node%N.out --error=/home/ubuntu/slurm_logging/workernodes/multinode_torch_test_%j_node%N.err
Instead, the WN logs will be printed to the HN log files defined above.
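Assuming the script above is saved as torchtitan_multinode.sbatch, the multi-node job is submitted and monitored the same way as the single-node one:
sbatch torchtitan_multinode.sbatch
# Follow rendezvous, NCCL initialization, and training progress in the HN error log
tail -f /home/ubuntu/slurm_logging/headnode/torchtitan_multinode_<JOBID>.err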
Conclusion
This guide provides a structured approach for integrating Pyxis and Enroot into distributed workloads in the DataCrunch Instant Clusters environment, and it is also applicable to other HPC environments. This facilitates scalable and reproducible machine learning workloads using TorchTitan.
References
GitHub issues:
- run via slurm: permission denied
- slurmstepd: error: pyxis: mkdir: cannot create directory '/run/enroot': Permission denied
- mkdir permission denied pyxis ($uid)
- NVIDIA libs downgrade
Slurm docs and scripts:
- SLURM Quick Start User Guide
- SLURM simple guide
- torchtitan /multinode_trainer.slurm
- simple-gpt/gpt/run_multi_node.sh
- Mochi-Full-Finetuner/ ... /train_multi_nodes.sh
- Multi-Node Deployment – SGLang
- modal multinode training guide github
- Crusoe slurm cloud solution
Cluster health checks:
- cupy distributed test comms
- Imbue 70B infrastructure
- Machine Learning Engineering Open Book
- Host-level health checks
- Linux Performance Analysis in 60,000 Milliseconds, Netflix technical blogs
- Pytorch Crusoe torchtitan
Pyxis + Enroot: