This technical guide covers the deployment flow of containerized workloads using Enroot and Pyxis within DataCrunch Instant Clusters. Throughout this guide, we detail the setup process, from initial configuration to incremental testing of multi-node distributed workloads using TorchTitan. All artifacts used can be found in our GitHub repository.
Introduction
Enroot is a lightweight, unprivileged container runtime optimized for HPC environments, designed specifically to execute containerized applications with minimal overhead. Enroot seamlessly converts Docker or OCI images into SquashFS files, enabling rapid deployment across HPC nodes and ensuring efficient parallel workload execution.
Pyxis is a Slurm plugin that provides native integration of container runtimes like Enroot within the Slurm resource manager. By extending Slurm’s job submission commands (i.e., sbatch and srun), Pyxis allows users to specify container images directly in job scripts. Pyxis automates the container lifecycle, including image pulling, caching, and execution, providing containerized environments and ensuring reproducible experiments.
Job Submission Workflow
- Submitting Slurm jobs with Pyxis-specific options
- Pulling and converting Docker images into Enroot bundles with Pyxis
- Launching containerized jobs over the worker nodes
Testing Environment
- A cluster of 16x H200 GPUs split across two worker nodes: 8x on WorkerNode1 (WN1) and 8x on WorkerNode2 (WN2)
- Both WNs are accessible from the HeadNode (HN or jump host), which is a CPU-only node
- WNs and HN share an NFS (Network File System) mounted at /home, so all users and their home directories are shared across the cluster
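A quick way to confirm this layout on a WN (assuming the node-local NVMe disk is mounted at /mnt/local_disk, as used throughout this guide) is:
# /home should show an NFS filesystem; /mnt/local_disk should be node-local storage
df -hT /home
df -hT /mnt/local_disk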
Prerequisites
We want each WN to store its own artifacts (Docker and Enroot images, intermediate data, caches, runtimes) on local storage to avoid race conditions from concurrent access, resulting in a clean and isolated setup.
Docker Configuration
Note: Docker configuration is only needed if we use Docker Hub to download images and then convert them to the Enroot .sqsh format.
- Modify Docker’s root directory on each WN to point to local NVMe storage:
sudo mkdir -p /mnt/local_disk/docker
sudo vim /etc/docker/daemon.json
- Add the following configuration:
{
"data-root": "/mnt/local_disk/docker",
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
}
}
- Restart Docker
sudo systemctl restart docker
- Verify setup
docker info | grep "Docker Root Dir"
Enroot Configuration
Enroot configuration resides in /etc/enroot/enroot.conf
. The default config points to:
ENROOT_LIBRARY_PATH /usr/lib/enroot # Path to library sources
ENROOT_SYSCONF_PATH /etc/enroot # Path to system configuration files
ENROOT_RUNTIME_PATH ${XDG_RUNTIME_DIR}/enroot # Path to the runtime working directory
ENROOT_CONFIG_PATH ${XDG_CONFIG_HOME}/enroot # Path to user configuration files
ENROOT_CACHE_PATH ${XDG_CACHE_HOME}/enroot # Path to user image/credentials cache
ENROOT_DATA_PATH ${XDG_DATA_HOME}/enroot # Path to user container storage
ENROOT_TEMP_PATH ${TMPDIR} # Path to temporary directory
The XDG variables point to paths under /home/user, which resides on the NFS (Network File System; the shared storage between the HN and WNs).
To avoid concurrent Enroot file conflicts and to support a multi-user, multi-job setup, we set per-node local directories in /etc/enroot/enroot.conf:
ENROOT_LIBRARY_PATH /usr/lib/enroot
ENROOT_SYSCONF_PATH /etc/enroot
ENROOT_RUNTIME_PATH /mnt/local_disk/enroot/runtime/$UID
ENROOT_CONFIG_PATH /mnt/local_disk/enroot/config
ENROOT_CACHE_PATH /mnt/local_disk/enroot/cache/$UID
ENROOT_DATA_PATH /mnt/local_disk/enroot/data/$UID
ENROOT_TEMP_PATH ${TMPDIR:-/tmp}
With this configuration, we encounter errors such as:
slurmstepd: error: pyxis: mkdir: cannot create directory ‘/mnt/local_disk/enroot/cache/.tokens.1000’: Permission denied
This is caused by permission issues when different users write to the same directories: Slurm has trouble propagating Linux group memberships into the job script environment.
As a solution, we propose to make the Enroot folder world-writable:
sudo chmod -R a+rwx /mnt/local_disk/enroot
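For reference, a minimal per-node setup sketch, run once on each WN (assuming the directories from the configuration above do not already exist):
# Create the per-node Enroot directories and make them writable by all users
sudo mkdir -p /mnt/local_disk/enroot/{runtime,config,cache,data}
sudo chmod -R a+rwx /mnt/local_disk/enroot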
We found another permission issue caused by a bug in version 1.17.7 of the libnvidia-container-tools, libnvidia-container1, nvidia-container-toolkit, and nvidia-container-toolkit-base packages:
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: nvidia-container-cli: container error: file lookup failed: /proc/55900/root/mnt/local_disk/enroot/data/pyxis_torchtitan_singlenode/etc/debian_version: permission denied
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: 1x-instant-cluster-testing-1: task 0: Exited with exit code 1
The proposed solution is to downgrade libnvidia-container-tools, libnvidia-container1, nvidia-container-toolkit, and nvidia-container-toolkit-base to 1.17.6-1:
sudo apt-get install -y \
libnvidia-container1=1.17.6-1 \
libnvidia-container-tools=1.17.6-1 \
nvidia-container-toolkit-base=1.17.6-1 \
nvidia-container-toolkit=1.17.6-1 \
--allow-downgrades --allow-change-held-packages
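To keep the downgrade in place across routine package upgrades, the packages can optionally be held and the installed version verified (an extra step beyond the fix above):
# Pin the downgraded packages so upgrades do not pull 1.17.7 back in
sudo apt-mark hold libnvidia-container1 libnvidia-container-tools \
    nvidia-container-toolkit-base nvidia-container-toolkit
# Verify the installed version
nvidia-container-cli --version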
With the proper configuration set up, we proceed to dissect, check, and test each incremental step towards a complete workload.
Enroot Health Check
This script verifies that Enroot is correctly installed and operational on the host machine. It performs a systematic validation by checking Enroot's presence, confirming the version, importing a minimal Docker image (ubuntu), creating and launching a test container, and retrieving basic runtime information such as the container's PID and operating system details. It also ensures proper cleanup after execution, removing any residual test artifacts.
#!/bin/bash
set -e
echo "Checking Enroot installation..."
command -v enroot || { echo "Enroot not installed."; exit 1; }
echo "Running Enroot version check..."
enroot version || exit 1
TEST_IMG="ubuntu"
TEST_CONT="enroot_healthcheck"
# Remove any leftover image from a previous run
[ -f ${TEST_IMG}.sqsh ] && rm -f ${TEST_IMG}.sqsh
# Import a minimal Docker image and create a test container from it
enroot import docker://${TEST_IMG}
enroot create -n ${TEST_CONT} ${TEST_IMG}.sqsh
# Retrieve basic runtime information from inside the container
PID_OUTPUT=$(enroot start ${TEST_CONT} sh -c 'echo $$')
OS_OUTPUT=$(enroot start ${TEST_CONT} sh -c 'grep PRETTY /etc/os-release')
echo "PID: $PID_OUTPUT"
echo "OS: $OS_OUTPUT"
# Clean up test artifacts
enroot remove -f ${TEST_CONT}
rm -f ${TEST_IMG}.sqsh
echo "Enroot health check PASSED."
Pyxis Health Check
There are two alternatives for performing this health check.
- Run the srun command with a Pyxis flag:
srun --container-image=ubuntu grep PRETTY /etc/os-release
- Execute a Slurm script with the Pyxis-extended SBATCH flags:
#!/bin/bash
#SBATCH --job-name=pyxis_test
#SBATCH --output=pyxis_test.out
#SBATCH --container-name=pyxis_test
#SBATCH --container-image=docker://ubuntu
echo "Running inside container:"
grep PRETTY /etc/os-release
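Assuming the script above is saved as pyxis_test.sbatch, it can be submitted and checked with:
sbatch pyxis_test.sbatch
# Once the job has completed:
cat pyxis_test.out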
Now that we have tested and checked the incremental steps required, we can start an end-to-end training workload. First, we need to ensure that the framework performs correctly in a single-node setup before moving to a multi-node one.
TorchTitan Single-node Testing
Next, we perform a test training run of Llama 8B using the c4_test dataset from tests/assets/c4_test included in the TorchTitan repository. The container performs a 10-step training run, which will be reflected in the /home/ubuntu/slurm_logging/headnode/%x_%j_headnode.err file.
Prerequisites: TorchTitan custom image
torchtitan.dockerfile:
Note: HF_TOKEN must be configured as an environment variable
# Using the official PyTorch 2.7 + CUDA 12.8 base image
FROM pytorch/pytorch:2.7.0-cuda12.8-cudnn9-runtime
#FROM ubuntu:22.04
ARG HF_TOKEN
ENV HF_TOKEN=${HF_TOKEN}
# makes sure the shell used for subsequent RUN commands is exactly Bash, as located in /bin.
SHELL ["/bin/bash", "-c"]
# Install dependencies
# llamacpp gcc compilation tools
RUN apt-get update && apt-get install -y \
    build-essential \
    fzf \
    ripgrep \
    nvtop \
    sudo \
    kmod \
    wget \
    vim \
    git \
    curl \
    bzip2 \
    ca-certificates \
    libglib2.0-0 \
    libxext6 \
    libsm6 \
    libxrender1 \
    libssl-dev \
    libibverbs1 \
    ibverbs-utils \
    libmlx5-1 \
    infiniband-diags
# Optional cleanup to remove the apt cache and reduce the image size.
# IMPORTANT: if enabled, apt-get update must be re-run inside the container.
#&& rm -rf /var/lib/apt/lists/*
# Cloning the repo
RUN git clone https://github.com/pytorch/torchtitan
# Change to the repo directory using WORKDIR
WORKDIR /workspace/torchtitan
RUN mkdir -p /root/.cache/huggingface
RUN pip install -r requirements.txt
# For CUDA 12.8 on worker nodes
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Download the tokenizer
RUN python3 scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --tokenizer_path "original"
# docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
# docker run --gpus all --shm-size 32g --network=host -v /home/ubuntu/.cache/huggingface:/root/.cache/huggingface --name torchtitan_workload -it --rm --ipc=host torchtitan_cuda128_torch27 bash -c 'CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh'
Build the Docker image:
docker build -f torchtitan.dockerfile --build-arg HF_TOKEN="$HF_TOKEN" -t torchtitan_cuda128_torch27 .
Import the image to Enroot:
The enroot import command can fetch images using the following schemes:
docker://[USER@][REGISTRY#]IMAGE[:TAG] # Import a Docker image from a registry
dockerd://IMAGE[:TAG] # Import a Docker image from the Docker daemon
podman://IMAGE[:TAG] # Import a Docker image from a local podman repository
To import the previously built image from the local Docker daemon, we use:
enroot import -o /home/ubuntu/torchtitan_cuda128_torch27.sqsh dockerd://torchtitan_cuda128_torch27
Slurm script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=50
#SBATCH --partition=gpus
#SBATCH --job-name=torchtitan_singlenode
#SBATCH -o /home/ubuntu/slurm_logging/headnode/%x_%j.out
#SBATCH -e /home/ubuntu/slurm_logging/headnode/%x_%j.err
CONFIG_FILE=${CONFIG_FILE:-"torchtitan/models/llama3/train_configs/debug_model.toml"}
srun --container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh \
--container-name=torchtitan_singlenode \
--container-mounts=/home/ubuntu/.cache/huggingface:/root/.cache/huggingface \
--container-writable \
--no-container-mount-home \
bash run_train.sh --job.config_file ${CONFIG_FILE}
The Pyxis flags included are as follows:
--container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh
Specifies the SquashFS file (or Enroot URI) to use as the container filesystem. In this case, we’re pointing at the previously created torchtitan_cuda128_torch27.sqsh image.
--container-name=torchtitan_singlenode
Name of the container. It will be cached in the Enroot data directory (in our config: /mnt/local_disk/enroot/data/$UID) as pyxis_torchtitan_singlenode.
--container-mounts=/home/ubuntu/.cache/huggingface:/root/.cache/huggingface
Bind-mounts the WN's Hugging Face cache directory into the container at /root/.cache/huggingface.
--no-container-mount-home
Prevents Pyxis from automatically bind-mounting our home directory into the container. This is recommended to isolate the container's view of our home directory and avoid conflicts with NFS permissions.
--container-writable
Makes the container filesystem writable (by default, SquashFS images are mounted read-only). This allows in-container writes (e.g., installing packages and writing checkpoints) without additional mounts.
Resolve possible "Read-only file system" errors with:
#SBATCH --container-writable
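Assuming the script above is saved as torchtitan_singlenode.sbatch, the job can be submitted and monitored from the HN:
sbatch torchtitan_singlenode.sbatch
# Check the queue and follow the training log (substitute the job ID printed by sbatch)
squeue -u $USER
tail -f /home/ubuntu/slurm_logging/headnode/torchtitan_singlenode_<JOBID>.err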
We propose the following command for interactive use and debugging of the deployed container:
srun \
--partition=gpus --gres=gpu:1 --ntasks=1 --cpus-per-task=4 \
--container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh \
--container-name=interactive_torchtitan --pty bash
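Once inside the interactive shell, a couple of quick checks (suggestions, not prescribed by the workflow above) confirm that the GPUs and the PyTorch installation are visible:
# Inside the container shell
nvidia-smi
python -c "import torch; print(torch.__version__, torch.cuda.device_count())"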
TorchTitan Multi-node Testing
Note: The previously created torchtitan_cuda128_torch27.sqsh image is required.
We now reproduce a production-ready workload for the distributed training of the Llama 70B model with the c4 dataset.
Slurm job configuration with Pyxis:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=50
#SBATCH --partition=gpus
#SBATCH --job-name=torchtitan_multinode
#SBATCH -o /home/ubuntu/slurm_logging/headnode/%x_%j.out
#SBATCH -e /home/ubuntu/slurm_logging/headnode/%x_%j.err
# === Compute these HOST-side ===
HEADNODE_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
MASTER_ADDR=$(getent hosts "$HEADNODE_HOST" | grep -Eo '10\.[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
MASTER_PORT=$((5000 + SLURM_JOB_ID % 10000))
CONFIG_FILE=${CONFIG_FILE:-"torchtitan/models/llama3/train_configs/llama3_70b.toml"}
echo "======== Distributed Config ========"
echo "HEADNODE_HOST: $HEADNODE_HOST"
echo "Resolved MASTER_ADDR: $MASTER_ADDR"
echo "Assigned MASTER_PORT: $MASTER_PORT"
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "All Hosts:"
scontrol show hostnames "$SLURM_JOB_NODELIST"
echo "===================================="
# === Launch the container job ===
srun \
--container-image=/home/ubuntu/torchtitan_cuda128_torch27.sqsh \
--container-mounts=/home/ubuntu/.cache/huggingface:/root/.cache/huggingface \
--container-writable \
--export=ALL,HEADNODE_HOST=$HEADNODE_HOST,MASTER_ADDR=$MASTER_ADDR,MASTER_PORT=$MASTER_PORT,NCCL_DEBUG=INFO,NCCL_DEBUG_SUBSYS=ALL \
torchrun --nnodes=2 --nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m torchtitan.train --job.config_file ${CONFIG_FILE}
As some computation is required outside the container to resolve the WN hostnames and configure the master address, we move the Pyxis options from the #SBATCH prelude to the srun command itself.
For the Slurm script, two torchrun tasks are required: one task with 8 GPUs on each node. We assign output (-o) and error (-e) files for the HN. As the computation is performed inside the container (which is not aware of Slurm), no per-WN log files are produced in the form of:
srun --output=/home/ubuntu/slurm_logging/workernodes/multinode_torch_test_%j_node%N.out --error=/home/ubuntu/slurm_logging/workernodes/multinode_torch_test_%j_node%N.err
Instead, the WN logs will be printed to the HN log files defined above.
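Assuming the script above is saved as torchtitan_multinode.sbatch, the multi-node job is submitted and monitored the same way as the single-node one:
sbatch torchtitan_multinode.sbatch
# Follow rendezvous, NCCL initialization, and training progress in the HN error log
tail -f /home/ubuntu/slurm_logging/headnode/torchtitan_multinode_<JOBID>.err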
Conclusion
This guide provides a structured approach for integrating Pyxis and Enroot into distributed workloads in the DataCrunch Instant Clusters environment, and it is also applicable to other HPC environments. This facilitates scalable and reproducible machine learning workloads using TorchTitan.
References
GitHub issues:
- run via slurm: permission denied
- slurmstepd: error: pyxis: mkdir: cannot create directory '/run/enroot': Permission denied
- mkdir permission denied pyxis ($uid)
- NVIDIA libs downgrade
Slurm docs and scripts:
- SLURM Quick Start User Guide
- SLURM simple guide
- torchtitan /multinode_trainer.slurm
- simple-gpt/gpt/run_multi_node.sh
- Mochi-Full-Finetuner/ ... /train_multi_nodes.sh
- Multi-Node Deployment – SGLang
- modal multinode training guide github
- Crusoe slurm cloud solution
Cluster health checks:
- cupy distributed test comms
- Imbue 70B infrastructure
- Machine Learning Engineering Open Book
- Host-level health checks
- Linux Performance Analysis in 60,000 Milliseconds, Netflix technical blogs
- Pytorch Crusoe torchtitan
Pyxis + Enroot: