This post describes how I use Slurm to compile and execute programs within a cluster environment.

If you have any suggestions, please let me know.

Summary

About Slurm
My Approach

About Slurm

Slurm is an open-source workload manager and job scheduler for Linux clusters: it allocates compute nodes, then queues, launches, and monitors the jobs submitted to them. For more details, please visit the official Slurm website.

My Approach

My workflow relies on four scripts:

build.sh: Compiles the program.
jobs_generator.py: Generates a file with all the commands for execution.
submit_to_slurm.sh: Submits a list of jobs to the Slurm scheduler.
run_job.py: Executes an individual job (invoked by submit_to_slurm.sh).

Plus a few command lines.

Project Organization

The project is organized as follows:

my_project/
├── build # created by build.sh, contains the compiled program
│   └── my_program
├── outputs # contains the outputs of the program
│   ├── slurm_output # contains the outputs of slurm
│   │   ├── slurm-test_123-1-123456.out
│   │   ├── slurm-test_123-2-123457.out
│   │   ├── slurm-test_123-3-123458.out
│   │   └── ...
│   └── test_123 # contains the outputs of the program
│       ├── my_method_a
│       │   ├── instance_1_00.csv
│       │   ├── instance_1_01.csv
│       │   ├── instance_1_02.csv
│       │   ├── instance_2_00.csv
│       │   ├── instance_2_01.csv
│       │   ├── instance_3_02.csv
│       │   └── ...
│       └── my_method_b
│           ├── instance_1_00.csv
│           ├── instance_1_01.csv
│           ├── instance_1_02.csv
│           ├── instance_2_00.csv
│           ├── instance_2_01.csv
│           ├── instance_3_02.csv
│           └── ...
├── scripts # contains the scripts used to compile and run the program
│   ├── build.sh
│   ├── jobs_generator.py
│   ├── run_job.py
│   └── submit_to_slurm.sh
├── src # contains the source code of the program
│   └── main.cpp
├── CMakeLists.txt # cmake file
├── README.md
├── to_eval # contains the list of commands to run (1200 lines)
├── to_eval00 # contains the list of commands to run (1000 lines)
└── to_eval01 # contains the list of commands to run (200 lines)

Build the Program

First, compile the program using the scripts/build.sh script:

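#!/bin/bash
# Start from a clean build directory
rm -rf build
mkdir build
cd build || exit

# Configure in Release mode, then compile in parallel
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
cd ..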

This script creates a build folder and compiles the program my_program within it.

To compile with Slurm on the same nodes used for program execution, use the following command:

srun --partition=PART_NAME --time=00:10:00 --mem=2G ./scripts/build.sh

Replace PART_NAME with the name of the desired partition. To find partition names, use the sinfo command (check the first column) and refer to your cluster documentation for information about the CPU types available on each partition.

This command runs directly on the specified partition, using the terminal to display the compilation output.

Generating Job Commands

I use the scripts/jobs_generator.py script to create a to_eval file listing all the command lines for execution. This Python script incorporates specific parameters, instances, and random seeds to construct commands. Each line in the to_eval file represents a unique command to be executed by the program.

Example of the `jobs_generator.py` script:

"""
Generate to_eval file which list all command lines to execute
"""

import os

# prepare the parameters
parameters = [
    ("my_method_a", "parameters/my_method_a.json"),
    ("my_method_b", "parameters/my_method_b.json"),
    # ("my_method_c", "parameters/my_method_c.json"),
]

param_1 = {
    "instance_1": 50,
    "instance_2": 75,
    "instance_3": 20,
    "instance_4": 30,
}

param_2 = "true" # "true" "false"

# prepare the instances
with open("instances_hard.txt", "r", encoding="UTF8") as file:
    # strip() also handles a last line without a trailing newline; blank lines are skipped
    instances = [line.strip() for line in file if line.strip()]

rand_seeds = list(range(20))

time_limit = 3600 * 1

# prepare the output directory (one sub-directory per method)
output_directory = "PATH_TO_PROJECT/outputs/test_123"
for method_name, _ in parameters:
    # makedirs also creates missing parent folders and accepts existing ones
    os.makedirs(f"{output_directory}/{method_name}", exist_ok=True)

# write the commands in the to_eval file
with open("to_eval", "w", encoding="UTF8") as file:
    for parameter in parameters:
        for instance in instances:
            for rand_seed in rand_seeds:
                file.write(
                    f"./my_program"
                    f" --instance {instance}"
                    f" --param_1 {param_1[instance]}"
                    f" --param_2 {param_2}"
                    f" --rand_seed {rand_seed}"
                    f" --time_limit {time_limit}"
                    f" --parameters ../{parameter[1]}"
                    f" --output_directory {output_directory}/{parameter[0]}"
                    "\n"
                )

After running this script, each line in the to_eval file represents a command to execute:

./my_program --instance instance_1 --param_1 50 --param_2 true --rand_seed 0 --time_limit 3600 --parameters ../parameters/my_method_a.json --output_directory PATH_TO_PROJECT/outputs/test_123/my_method_a
./my_program --instance instance_1 --param_1 50 --param_2 true --rand_seed 1 --time_limit 3600 --parameters ../parameters/my_method_a.json --output_directory PATH_TO_PROJECT/outputs/test_123/my_method_a
./my_program --instance instance_1 --param_1 50 --param_2 true --rand_seed 2 --time_limit 3600 --parameters ../parameters/my_method_a.json --output_directory PATH_TO_PROJECT/outputs/test_123/my_method_a
./my_program --instance instance_1 --param_1 50 --param_2 true --rand_seed 3 --time_limit 3600 --parameters ../parameters/my_method_a.json --output_directory PATH_TO_PROJECT/outputs/test_123/my_method_a
...
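As a quick sanity check, the number of lines in to_eval should equal the number of methods times the number of instances times the number of seeds. The following is a minimal sketch, assuming one command per non-empty line; the function names are hypothetical and not part of the scripts above.

```python
def expected_command_count(n_methods: int, n_instances: int, n_seeds: int) -> int:
    """One command is generated per (method, instance, seed) combination."""
    return n_methods * n_instances * n_seeds


def count_commands(to_eval_file: str) -> int:
    """Count the non-empty lines (one command each) in a to_eval file."""
    with open(to_eval_file, encoding="utf8") as file:
        return sum(1 for line in file if line.strip())
```

For instance, with hypothetical sizes of 2 methods, 30 instances, and 20 seeds, expected_command_count(2, 30, 20) gives 1200, which can be compared with count_commands("to_eval").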

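Slurm limits the size of job arrays (the MaxArraySize setting, 1001 by default, which allows task indices up to 1000; your cluster may use a different value). If the to_eval file contains more than 1000 commands, split it using the following command: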

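split -l 1000 -d to_eval to_eval

The -d option (GNU split) produces numeric suffixes instead of the default alphabetic ones (to_evalaa, to_evalab, etc.). This command therefore creates several files, such as to_eval00, to_eval01, to_eval02, etc. If the original file (to_eval) contains 1200 lines, to_eval00 will contain 1000 lines, and to_eval01 will contain 200 lines.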
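The same split can also be done from Python, for example at the end of the generator script. This is only a sketch of the idea: split_commands is a hypothetical helper, and it assumes the whole file fits in memory.

```python
from pathlib import Path


def split_commands(to_eval_file: str, chunk_size: int = 1000) -> list[Path]:
    """Split a to_eval file into numbered chunks of at most chunk_size lines.

    Chunks are written next to the input file as to_eval00, to_eval01, ...
    and the list of created paths is returned.
    """
    source = Path(to_eval_file)
    lines = source.read_text(encoding="utf8").splitlines(keepends=True)
    chunks = []
    for index, start in enumerate(range(0, len(lines), chunk_size)):
        chunk = source.with_name(f"{source.name}{index:02d}")
        chunk.write_text("".join(lines[start:start + chunk_size]), encoding="utf8")
        chunks.append(chunk)
    return chunks
```

Calling split_commands("to_eval") on a 1200-line file would write to_eval00 (1000 lines) and to_eval01 (200 lines).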

Submit the Jobs to Slurm

To submit jobs, I use the scripts/submit_to_slurm.sh script:

#!/bin/bash

#SBATCH --job-name=test_123
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --array=1-1000
#SBATCH --time=01:10:00
#SBATCH --mem=4G
#SBATCH --partition=PART_NAME
#SBATCH --output=/PATH_TO_OUTPUTS/slurm_output/slurm-%x-%a-%j.out

python3 scripts/run_job.py "${SLURM_ARRAY_TASK_ID}" "$1"

I call this script with the following command:

# edit the script to change the array size to --array=1-1000
sbatch scripts/submit_to_slurm.sh to_eval00
# edit the script to change the array size to --array=1-200
sbatch scripts/submit_to_slurm.sh to_eval01
...

This script submits a job array to Slurm with 1000 jobs, one per line of the to_eval00 file. For each job, it executes the command python3 scripts/run_job.py "${SLURM_ARRAY_TASK_ID}" "$1". Here, SLURM_ARRAY_TASK_ID is the index of the job within the array (ranging from 1 to 1000), and $1 is the first argument given on the command line (to_eval00 or to_eval01). Since to_eval01 contains only 200 lines, use --array=1-200 instead of --array=1-1000 for it.

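The #SBATCH options define the resources allocated to each job of the array (here 1 CPU, 4 GB of memory, and a time limit of 1 h 10 min, slightly above the 3600 s time limit given to the program) and the file where Slurm writes the job's output. The directory given to --output must exist before submission, since Slurm does not create it.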
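Editing the --array line by hand for each chunk is easy to forget. Because options passed on the sbatch command line take precedence over the #SBATCH directives in the script, the range can instead be computed from the line count of the file. The sketch below illustrates this; sbatch_command and submit are hypothetical helper names, and submit only works on a machine where sbatch is installed.

```python
import subprocess


def count_lines(path: str) -> int:
    """Number of lines in the file, i.e. number of jobs in the array."""
    with open(path, encoding="utf8") as file:
        return sum(1 for _ in file)


def sbatch_command(to_eval_file: str,
                   script: str = "scripts/submit_to_slurm.sh") -> list[str]:
    """Build the sbatch call with --array sized to the file's line count."""
    n_jobs = count_lines(to_eval_file)
    if n_jobs == 0:
        raise ValueError(f"{to_eval_file} is empty")
    # sbatch options must come before the script name
    return ["sbatch", f"--array=1-{n_jobs}", script, to_eval_file]


def submit(to_eval_file: str) -> None:
    """Submit one chunk of commands as a job array (requires Slurm)."""
    subprocess.run(sbatch_command(to_eval_file), check=True)
```

With this, submit("to_eval00") and submit("to_eval01") would request 1000 and 200 array tasks respectively without touching the script.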

After submitting the jobs, you can check their status using the following command:

squeue -u USER_NAME

To simplify the call to squeue, you can create an alias in your .bashrc file to display only the relevant information:

alias squ='squeue -u USER_NAME -o "%.20i %.5j %.2t %.5M %.3C %.3m %.3D %R" -S "P-t-pj"'

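The squ command then lists all your jobs with their ID, name, state, elapsed time, number of CPUs, requested memory, number of nodes, and node list (or the reason why the job is still pending).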
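When an array contains hundreds of jobs, a count per state is often easier to read than the full list. The following sketch assumes squeue is available on the machine; it uses the -h (no header), -r (one line per array task), and -o "%T" (full state name) options, and the function names are hypothetical.

```python
import getpass
import subprocess
from collections import Counter


def summarize_states(squeue_output: str) -> Counter:
    """Count jobs per state from squeue output with one state per line."""
    return Counter(
        line.strip() for line in squeue_output.splitlines() if line.strip()
    )


def job_states(user: str = "") -> Counter:
    """Query squeue for a user's jobs (defaults to the current user)."""
    result = subprocess.run(
        ["squeue", "-h", "-r", "-u", user or getpass.getuser(), "-o", "%T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return summarize_states(result.stdout)
```

Printing job_states() then gives a mapping from each state, such as RUNNING or PENDING, to the number of jobs in that state.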

Run a Job

When a node becomes available, Slurm will execute the command python3 scripts/run_job.py "${SLURM_ARRAY_TASK_ID}" "$1" for each job, running the corresponding command line from the to_eval file.

You’ll need to customize the scripts/run_job.py script to suit your requirements; it serves as an example of what you can achieve.

"""
This script runs one job specified by the line_number in the given to_eval_file.

The syntax is:
    python3 run_job.py line_number to_eval_file

    line_number: the line number to run in the to_eval_file
    to_eval_file: the file to read to get the line to run
"""
import shlex
import subprocess
import sys
import os

def get_line(line_number: int, to_eval_file: str) -> str:
    """Returns the line at the given line_number in the given file.

    Args:
        line_number (int): the line number to return (starting at 1)
        to_eval_file (str): the file to read

    Returns:
        str: the line without surrounding whitespace, or an empty string
            if the file has fewer than line_number lines
    """
    with open(to_eval_file, encoding="utf8") as file:
        for i, line in enumerate(file, start=1):
            if i == line_number:
                return line.strip()
    return ""

def run_job(command: str) -> str:
    """Runs the given command line and returns the program's output.

    Args:
        command (str): the command line to run

    Returns:
        str: the output of the program
    """
    process = subprocess.run(
        shlex.split(command),  # respects quoted arguments
        capture_output=True,
        text=True,
        check=False,
    )
    # A non-zero exit code or anything written to stderr counts as a failure
    if process.returncode != 0 or process.stderr:
        print("Error while running the job:", process.stderr)
        print("Command:", command)
        sys.exit(1)
    return process.stdout.strip()

def get_solution(file: str) -> dict:
    """Retrieves the solution from the given file."""
    return {}

def check_solution(solution: dict, line: str):
    """Checks the validity of the solution."""
    valid_solution = True

    # Check if the solution is valid

    if valid_solution:
        return

    print("Error: Invalid solution...")
    print(line)

def main():
    """Main function of the script."""
    if len(sys.argv) != 3:
        print("Error: Syntax should be python run_job.py line_number to_eval_file")
        sys.exit(1)

    line_number = int(sys.argv[1])
    to_eval_file = sys.argv[2]

    line = get_line(line_number, to_eval_file)
    if not line:
        # If the line is empty
        sys.exit(0)

    os.chdir("build")
    # In this example, the only thing sent to the stdout by the program
    # is the name of the output file
    output_file = run_job(line)
    os.chdir("..")
    if not output_file:
        print("Error: No output file returned")
        print(line)
        sys.exit(1)

    results = get_solution(output_file)
    check_solution(results, line)

if __name__ == "__main__":
    main()

It retrieves the command line from the to_eval file, runs the specified command, and checks the solution.
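Since check_solution receives the command line, one possible use of it is to recover the flags and verify that the expected output file exists. The sketch below is an illustration only: it assumes every flag is followed by a value, and the file naming (instance name, underscore, two-digit seed, .csv) is inferred from the project tree above rather than confirmed by the program. Both function names are hypothetical.

```python
import shlex
from pathlib import Path


def parse_flags(command: str) -> dict[str, str]:
    """Map each --flag of a to_eval line to the value that follows it."""
    tokens = shlex.split(command)
    flags = {}
    # Walk over consecutive token pairs, skipping the program name
    for name, value in zip(tokens[1:], tokens[2:]):
        if name.startswith("--"):
            flags[name[2:]] = value
    return flags


def expected_output(command: str) -> Path:
    """Path of the CSV the command should write (naming is an assumption)."""
    flags = parse_flags(command)
    seed = int(flags["rand_seed"])
    return Path(flags["output_directory"]) / f"{flags['instance']}_{seed:02d}.csv"
```

A check such as expected_output(line).exists() could then be added to check_solution.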

Check the Results

If everything is fine, the script will finish without printing anything, as all program output is ideally directed to a designated file. In case of an error, the script will print the error message (either from the program or during solution verification) along with the command line used. Consequently, the slurm-%x-%a-%j.out file will remain empty in the absence of issues, but will contain error details and the corresponding command line in case of errors.

To identify errors, you can use the following command to list non-empty files:

find outputs/slurm_output -type f -name 'slurm*test_123*' -not -size 0

If no output is listed, everything is in order. If there are outputs, you can inspect their contents using the following command:

find outputs/slurm_output -type f -name 'slurm*test_123*' -not -size 0 -ls -exec cat {} \;

To list only the command lines of the jobs that failed (the last line of each non-empty file), you can use the following command:

find outputs/slurm_output -type f -name 'slurm*test_123*' -not -size 0 -exec tail -n 1 {} \;

These commands help in identifying any potential issues and provide insights into the specific errors encountered.
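The failed command lines can also be collected from Python, for example to write a new to_eval file and resubmit only those jobs. This is a sketch under the assumption that the last non-empty line of each non-empty Slurm output file is the command, possibly prefixed with "Command:" as printed by run_job.py. The function name and the default file pattern are hypothetical.

```python
from pathlib import Path


def failed_commands(slurm_output_dir: str,
                    pattern: str = "slurm*test_123*") -> list[str]:
    """Return the failed command found in each non-empty Slurm output file."""
    commands = []
    for path in sorted(Path(slurm_output_dir).glob(pattern)):
        if not path.is_file() or path.stat().st_size == 0:
            continue  # an empty file means the job finished without error
        lines = path.read_text(encoding="utf8", errors="replace").splitlines()
        last = next((line for line in reversed(lines) if line.strip()), "")
        # One of the error messages in run_job.py prefixes the command
        last = last.strip().removeprefix("Command:").strip()
        if last:
            commands.append(last)
    return commands
```

Writing "\n".join(failed_commands("outputs/slurm_output")) to a file such as to_eval_retry would give a list that can be submitted like any other to_eval file.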

Useful Commands

These commands are useful when working with Slurm:

# list all partitions
sinfo
# list all jobs (see alias `squ` in Submit the Jobs to Slurm)
squeue -u USER_NAME
# cancel all your jobs
scancel -u USER_NAME
# cancel a specific job
scancel JOB_ID