UNC Cluster Computing Workshop - Slurm

1. What is SLURM?

The Simple Linux Utility for Resource Management, or Slurm for short, is job scheduling software used on HPC clusters. Slurm's primary job is to manage and allocate resources among many users. It decides when jobs start, allocates the correct resources to each job, and tracks those resources to make sure every job uses distinct GPUs and CPUs.

2. What is a job?

A job is a shell script, e.g. job.sh, that contains lines of code specifying:

  1. How many resources you need for your job

  2. What exactly you want to run

  3. Where to send the results of the job

Slurm job scripts start with a shebang, a line of code that tells the computer what language your script is written in and what interpreter it should use to read the script. There are several shell scripting languages, but the most common is bash. The bash shebang is #!/bin/bash. Below is an example job script, test.job:

#!/bin/bash
#SBATCH -J testjob          
#SBATCH -p LocalQ          
#SBATCH -t 00:10:00         
#SBATCH --ntasks=1         

echo "Hello from Slurm!"

This job script starts with the shebang #!/bin/bash, which tells the computer we are working in bash. Every job script should start with the shebang. The next four lines look like comments, but they are actually directives and are part of Slurm. Normally, any line in a bash script that starts with # is interpreted as a comment and not run, but a line starting with #SBATCH is still seen by Slurm and interpreted as a configuration command. Let’s break these down (a slightly expanded example script follows the list):

  • #SBATCH -J testjob This tells Slurm to name the job testjob

  • #SBATCH -p LocalQ This tells Slurm which partition to run on

  • #SBATCH -t 00:10:00 This gives the job a time limit of 10 minutes

  • #SBATCH --ntasks=1 This requests a single task (by default, one CPU core)
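The third item from Section 2, where to send the results of the job, can also be handled with directives. Below is a minimal sketch of an expanded test.job; the --output and --error file names are just illustrations, and the LocalQ partition is reused from the example above, so adjust both for your own cluster.

#!/bin/bash
#SBATCH -J testjob
#SBATCH -p LocalQ
#SBATCH -t 00:10:00
#SBATCH --ntasks=1
#SBATCH --output=testjob_%j.out
#SBATCH --error=testjob_%j.err

# %j in the file names above is replaced by the job ID, so each run
# writes its output to its own files instead of overwriting older ones.
echo "Hello from Slurm!"

With --output and --error set, anything the job prints to standard output or standard error is written to those files rather than to the default slurm-<jobid>.out file.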

2.1 Submitting a job

Let’s see how we would actually submit a job, and what the output from Slurm looks like:

jldechow@river:~$ sbatch test.job
Submitted batch job 28
jldechow@river:~$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                28    LocalQ  testjob  jldechow  PD       0:00      1 river.emes.unc.edu

Now, let’s break down what has happened line by line:

  • sbatch test.job We use sbatch to submit job test.job

  • Submitted batch job 28 Slurm reports our job was submitted with JOBID = 28

Next, we ran squeue to see our job in the queue, and Slurm reported back several pieces of information. Let’s look at those now:

  • JOBID: 28 Slurm has given our job the JOBID of 28

  • PARTITION: LocalQ Slurm will run our job on the LocalQ partition

  • NAME: testjob Our job is named testjob

  • USER: jldechow This job was submitted by user jldechow

  • ST: PD The job’s current status is PD, aka Pending

  • TIME: 0:00 This job has run for 0:00 since it is still pending

  • NODES: 1 This job has been allocated 1 node (River only has 1 node)

  • NODELIST: river.emes.unc.edu The allocated node is named river.emes.unc.edu
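Once the job runs and finishes, its printed output has to land somewhere. Because test.job did not set an --output file, Slurm falls back to its default and writes standard output to a file named slurm-<jobid>.out in the directory where sbatch was run. A quick way to check it, assuming the job ID 28 from above, is sketched here:

# The default output file is slurm-<jobid>.out, so for job 28:
cat slurm-28.out

If the job completed successfully, this file should contain the line printed by the echo command in the script.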

2.2 Job Status Codes:

Here is a list of common job status codes:

  • PD (Pending): The job is awaiting resource allocation and has not yet started execution.

  • R (Running): The job currently has an allocation and is actively executing on the assigned nodes.

  • CD (Completed): The job has successfully terminated all processes on all allocated nodes and completed its execution.

  • F (Failed): The job terminated with a non-zero exit code or encountered another failure condition during execution.

  • CG (Completing): The job is in the process of completing, but some processes on some nodes may still be active.

  • CA (Cancelled): The job was explicitly cancelled by the user or a system administrator.

  • CF (Configuring): The job has been allocated resources, but is waiting for them to become ready for use (e.g., booting up).

  • S (Suspended): The job has an allocation, but its execution has been temporarily suspended.

  • ST (Stopped): The job has been stopped, but its cores are retained, unlike a suspended job which releases its cores.

  • TO (Timeout): The job reached its allocated time limit and was terminated.

  • OOM (Out of Memory): The job terminated due to an out-of-memory error.

  • NF (Node Fail): The job terminated due to the failure of one or more allocated nodes.

  • PR (Preempted): The job was terminated by another job, typically due to a higher priority or resource requirement.

3. Basic Slurm Commands

Much like the generic command-line utilities discussed in P1, Slurm has its own commands that you can use to start, check on, or cancel jobs. Below we show the most common and useful ones.

  • sbatch: Submit a job script to the queue. Example: sbatch myjob.sh. Sends your job script to Slurm to run when resources are available.

  • squeue: View the job queue. Example: squeue -u $USER. Shows which jobs are running or waiting (for your user or everyone).

  • scancel: Cancel a running or pending job. Example: scancel 12345. Stops the job with ID 12345.

  • sinfo: View node and partition (queue) status. Example: sinfo. Displays what parts of the cluster are available, busy, or down.

  • srun: Run a command interactively or in parallel. Example: srun hostname. Starts a command directly through Slurm (often used inside job scripts).

  • sacct: View job history and resource usage. Example: sacct -j 12345. Shows runtime, CPU, and memory use for a finished job.

  • module: Load software environments. Example: module load julia. Loads software packages or environments available on the cluster.

  • ssh: Connect to remote servers or nodes. Example: ssh user@cluster.edu. Connects securely to the cluster login node.
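Several of these commands are often combined inside a single job script: module sets up the software environment and srun launches the actual work through Slurm. The sketch below is only an illustration; it assumes a julia module exists on your cluster (check with module avail) and reuses the LocalQ partition from the earlier example, so adjust both for your own system.

#!/bin/bash
#SBATCH -J julia_test
#SBATCH -p LocalQ
#SBATCH -t 00:05:00
#SBATCH --ntasks=1

# Load the software environment provided by the cluster
module load julia

# Launch the command through Slurm on the allocated resources
srun julia --version

After submitting this with sbatch, you could follow its progress with squeue and, once it finishes, review its resource usage with sacct.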

Many cluster commands become much more useful when paired with flags, which are options that modify their behavior. For example, running squeue by itself will show every job on the entire cluster, which often is not very useful. To see only your own jobs, you can add the -u flag followed by your username: squeue -u jldechow. Here, -u means user, so this command lists all jobs currently running or waiting that belong to user jldechow. Similarly, if you accidentally submit several jobs or lose track of your job IDs, you can cancel all of your jobs at once with scancel -u jldechow. This tells Slurm to stop every job associated with user jldechow, saving you from hunting down each individual $JOBID.
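For quick reference, here are those two flag-based commands as you would type them at the prompt, substituting your own username for jldechow:

# List only the jobs belonging to user jldechow
squeue -u jldechow

# Cancel every job belonging to user jldechow
scancel -u jldechow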