UNC Cluster Computing Workshop - Slurm¶
1. What is SLURM?¶
The Simple Linux Utility for Resource Management or Slurm for short, is a job scheduling software used on HPC clusters. The primary job of Slurm is to manage and allocate resources between many users. It decides when jobs start, allocates the correct resources for a job, and tracks resources to make sure every job uses distinct GPUs and CPUs.
2. What is a job?¶
A job is any shell script i.e. job.sh that contains lines of code that specify:
How many resources you need for your job
What exactly you want to run
Where to send the results of the job
Slurm jobs start with a shebang, a line of code that tells the computer what language your script is written in, and what interpreter the computer should use to read the script. There are several shell scripting languages, but the most common is bash. The bash shebang is: #!/bin/bash. Below is an example job script test.job
#!/bin/bash
#SBATCH -J testjob
#SBATCH -p LocalQ
#SBATCH -t 00:10:00
#SBATCH --ntasks=1
echo "Hello from Slurm!"
This job script starts with the shebang #!/bin/bash that tells the computer we are working in bash. Every job script script should start with the shebang. The next four lines contain lines of code that look like comments, but are actually called directives and are part of slurm. Normally, any line in a bash script that starts with # would be interpeted as a comment and not run. But with the #SBATCH directive, it is still seen by slurm and interpeted as a config command. Let’s break these down
#SBATCH -J testjobThis tellsslurmto name the jobtestjob#SBATCH -p LocalQThis tellsslurmwith partition to run on#SBATCH -t 00:10:00This gives a the job a time limit of 10 minutes#SBATCH --ntasks=1This requests a single CPU core
2.1 Submitting a job¶
Let’s see how we would actually submit a job, and what the output from slurm would look like:
jldechow@river:~$ sbatch test.job
Submitted batch job 28
jldechow@river:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
28 LocalQ TestJob jldechow PD 0:00 1 river.emes.unc.edu
Now, let’s break down what has happened line by line:
sbatch test.jobWe usesbatchto submit jobtest.jobSubmitted batch job 28Slurm reports our job was submitted withJOBID=28
Next, we ran squeue to see our job in the queue and slurm reports back a variety of pieces of information. Let’s look at those now:
JOBID: 28Slurm has given our job theJOBIDof28PARITITION: LocalQSlurm is running our job on theLocalQpartitionNAME: TestJobOur job is namedTestJobUSER: jldechowThis job was submitted by userjldechowST: PDThe jobs current status isPDakaPendingTIME: 0:00This job has ran for0:00since it is still pendingNODES: 1This job is running on node 1 (River only has 1 node)NODELIST: river.emes.unc.eduNode 1 is namedriver.emes.unc.edu
2.2 Job Status Codes:¶
Here is a list of job common job status codes:
Code |
Status |
Description |
|---|---|---|
PD |
Pending |
The job is awaiting resource allocation and has not yet started execution. |
R |
Running |
The job currently has an allocation and is actively executing on the assigned nodes. |
CD |
Completed |
The job has successfully terminated all processes on all allocated nodes and completed its execution. |
F |
Failed |
The job terminated with a non-zero exit code or encountered another failure condition during execution. |
CG |
Completing |
The job is in the process of completing, but some processes on some nodes may still be active. |
CA |
Cancelled |
The job was explicitly cancelled by the user or a system administrator. |
CF |
Configuring |
The job has been allocated resources, but is waiting for them to become ready for use (e.g., booting up). |
S |
Suspended |
The job has an allocation, but its execution has been temporarily suspended. |
ST |
Stopped |
The job has been stopped, but its cores are retained, unlike a suspended job which releases its cores. |
TO |
Timeout |
The job reached its allocated time limit and was terminated. |
OOM |
Out of Memory |
The job terminated due to an out-of-memory error. |
NF |
Node Fail |
The job terminated due to the failure of one or more allocated nodes. |
PR |
Preempted |
The job was terminated by another job, typically due to a higher priority or resource requirement. |
3. Basic Slurm Commands¶
Much like the generic command line utilities discussed in P1, slurm has its own commands that you can use to start, check on, or cancel jobs. Below we will show the most common and useful ones
Command |
Purpose |
Example Usage |
Explanation |
|---|---|---|---|
|
Submit a job script to the queue |
|
Sends your job script to Slurm to run when resources are available. |
|
View the job queue |
|
Shows which jobs are running or waiting (for your user or everyone). |
|
Cancel a running or pending job |
|
Stops the job with ID |
|
View node and partition (queue) status |
|
Displays what parts of the cluster are available, busy, or down. |
|
Run a command interactively or in parallel |
|
Starts a command directly through Slurm (often used inside job scripts). |
|
View job history and resource usage |
|
Shows runtime, CPU, and memory use for a finished job. |
|
Load software environments |
|
Loads software packages or environments available on the cluster. |
|
Connect to remote servers or nodes |
|
Connects securely to the cluster login node |
Many cluster commands become much more useful when paired with flags, which are options that modify behavior. For example, running squeuewill show every job on the entire cluster, which often is not very useful.To see only your own jobs, you can add the -u flag followed by your username:squeue -u jldechow. Here, -u means user, so this command lists all jobs currently running or waiting that belong to user jldechow. Similarly, if you accidentally submit several jobs or lose track of your job IDs, you can cancel all your jobs at once with scancel -u jldechow. This tells Slurm to stop every job associated with user jldechow, saving you from hunting down each individual $JOBID.