# UNC Cluster Computing Workshop - Slurm

## 1. Introduction

The examples shown in `P2_Slurm` are simple jobs that only use basic UNIX commands. However, cluster computing is really about utilizing a large amount of compute resources to run complex or computationally expensive programs. In this section, we will go through how to run more complex jobs.

### 1.1 Running software on the command line

When you are in the command line environment, you can't just call a file name and have it run. For example, assume we have a `python` script named `test_add.py`. This file is available in the main repo, and is a simple python script that takes two inputs `a` and `b`, adds them, and prints the result to a file `result.txt`. Below is the code for `test_add.py`:

```python
#!/usr/bin/env python3
import sys
import os

# Read input arguments
try:
    a = int(sys.argv[1])
    b = int(sys.argv[2])
except (IndexError, ValueError):
    print("Error: provide two integer arguments.")
    sys.exit(1)

result = a + b
print(f"Adding {a} + {b} = {result}")

# Write results to OUT/result.txt, creating OUT/ if it doesn't exist
os.makedirs("OUT", exist_ok=True)
with open("OUT/result.txt", "w") as f:
    f.write(f"{a} + {b} = {result}\n")
```

Hopefully, the comments in the script are self-explanatory. Now, let's try to run it. If I run `test_add.py` on the command line, I get the following error:

```bash
jldechow@river:~/scratch$ test_add.py 5 8
test_add.py: command not found
```

The error `test_add.py: command not found` occurs because the shell treats `test_add.py` as the name of a command and searches the directories in `$PATH` for it; since the current directory is not on `$PATH`, nothing is found. To fix this, we instead run:

```bash
jldechow@river:~/scratch$ python3 test_add.py 5 8
Adding 5 + 8 = 13
```

Here the `python3` command on the `bash` command line plays a role similar to the `shebang` discussed earlier. In spoken language, it goes like this: given the command `python3`, the `bash` terminal expects the next input, `test_add.py`, to be a `python` script. If the script didn't require any more arguments, it would run as is. Since `test_add.py` expects two more arguments (the integers we will add together), we give those two arguments after the script name.

The `python` script printed its output to the command line, and also wrote to a file `result.txt` in the folder `OUT/`. Let's check that out:

```bash
jldechow@river:~/scratch$ python3 test_add.py 5 8
Adding 5 + 8 = 13
jldechow@river:~/scratch$ ls
OUT  test_add.py
jldechow@river:~/scratch$ ls OUT/
result.txt
jldechow@river:~/scratch$ cat OUT/result.txt
5 + 8 = 13
jldechow@river:~/scratch$
```

Since we ran the script in my scratch directory, it made the `OUT/` directory inside the scratch directory. If we wanted our script to write to a different output folder, e.g. `~/OUT/`, we would need to specify that explicitly. When running in the command line environment, folder creation and output writing are relative to the folder you are CURRENTLY in when you execute the script.

### 1.2 Running software on the head node

In the previous example, when we ran `test_add.py` from the command line, we didn't use `slurm` to request any resources. This is called running on the `head node`. Normally, you would only want to do this for a very small/simple job that doesn't require many resources. The `head node` is the node that handles the `overhead` of running the entire server/cluster, and it has only a (usually) small amount of resources for that purpose. So the `head node` has very little to spare, and using it directly impacts the performance of the cluster for all users.
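As an aside on section 1.1: because `test_add.py` begins with the `shebang` line `#!/usr/bin/env python3`, you can also mark it executable once and then run it directly, without typing `python3`. A minimal sketch:

```bash
chmod +x test_add.py   # mark the script executable (only needed once)
./test_add.py 5 8      # the leading ./ tells bash the file is in the current directory
```

The leading `./` matters: it tells `bash` exactly where the file is, sidestepping the `$PATH` lookup that caused the `command not found` error above.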
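That said, both approaches above still run on the head node. For a quick test that needs real resources, `slurm`'s `srun` command can run a single command inside a small allocation instead. A hedged sketch, assuming interactive `srun` is allowed on `River`, and reusing the `LocalQ` partition that appears in the job script below:

```bash
# Request 1 task, 1 GB of memory, and 5 minutes, then run the script in that allocation
srun -p LocalQ -t 00:05:00 --ntasks=1 --mem=1G python3 test_add.py 5 8
```

Unlike `sbatch`, `srun` blocks until the command finishes and prints its output straight to your terminal.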
### 1.3 Running software in job scripts

To avoid running on the head node, we will instead run software through job scripts. Normally, you would call a script/program directly from a job script. This is currently not possible on the `River` cluster due to its file permission setup. Instead, we are going to do it in two stages: first with a job submission script, and second with the actual job script. Let's start with the submission script. You can find this in `ExampleScripts/submit_testjob.sh`:

```bash
#!/bin/bash
# Helper script to deal with file permission issues
# Must run from $HOME directory (or adjust HOME paths below)

# Absolute scratch path (either the real path, or use the workdir symlink)
tmpdir="/not_backed_up/jldechow"   # real path behind the symlink
src_file="$HOME/ClusterComputingWorkshop/ExampleScripts/test_add.py"
job_file="$HOME/ClusterComputingWorkshop/ExampleJobs/test_add.job"

# Make sure target dirs exist
mkdir -p "$tmpdir"
mkdir -p "$HOME/OUT"
mkdir -p "$tmpdir/OUT"             # Slurm writes OUT/%x_%j.out relative to the submit dir

# Stage the python file to scratch and show contents
cd "$tmpdir"                       # Move to work dir
pwd                                # Print current dir
ls -l                              # List workdir contents
cp -f "$src_file" "$tmpdir/"       # Copy script to workdir
cp -f "$job_file" "$tmpdir/"       # Copy job file to workdir
ls -l                              # List again

# Submit the job FROM scratch so $SLURM_SUBMIT_DIR == $tmpdir
# Pass the script we just staged (path is relative to the submit dir)
sbatch test_add.job test_add.py 5 8

sleep 30                           # Crude wait for the job to finish

pwd
cat result.txt
cp result.txt "$HOME/"
rm result.txt
rm test_add.job
rm test_add.py
```

Again, hopefully the comments in this script are self-explanatory. But in short, we run this script to avoid the permission issue with `slurm`. All of the moving and copying happens in the `shell` script as opposed to the job script. First, we copy all the files to our working directory `/not_backed_up/jldechow`. Then, we run the actual job script with `sbatch test_add.job test_add.py 5 8`. Here, the first argument `test_add.job` is the name of the job script. The next three arguments, `test_add.py`, `5`, and `8`, are inputs to the job script. Finally, we copy the output of the job script to our home directory, and remove the files we copied to the working directory.

Let's take a look at the job script now:

```bash
#!/bin/bash
#SBATCH -J test_add_job
#SBATCH -p LocalQ
#SBATCH -t 00:05:00
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH -o OUT/%x_%j.out
#SBATCH -e OUT/%x_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=jldechow@unc.edu

set -x

# Args:
#   $1 = path to test_add.py (relative to the submit dir)
#   $2 = first integer
#   $3 = second integer

python3 "$1" "$2" "$3" > result.txt

echo "Job done: $SLURM_JOB_NAME ($SLURM_JOB_ID)"
```

The first 10 lines of this script are the `shebang` followed by the `slurm directives` (in the output file names, `%x` expands to the job name and `%j` to the job ID). The next line is `set -x`, a `bash` debugging flag that echoes each command, after expansion, to standard error, which `slurm` captures in `OUT/%x_%j.err`. Next, we run our python script with `python3 "$1" "$2" "$3" > result.txt`. This does the following:

- `python3` : run this code with `python3`
- `"$1" "$2" "$3"` : use the first three arguments passed to the job script
- `> result.txt` : write the output to the file `result.txt`

Given that our three inputs are `test_add.py`, `5`, and `8`, this line effectively expands to:

`python3 test_add.py 5 8 > result.txt`

which tells the cluster to run `test_add.py` with inputs `5` and `8`, adding them together.
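One last note: the `sleep 30` in `submit_testjob.sh` is a crude wait, and a busy queue can easily take longer than 30 seconds to start the job. A hedged sketch of a more patient pattern, using standard `slurm` tools (`sacct` assumes job accounting is enabled on the cluster):

```bash
# Submit and capture the job ID (--parsable makes sbatch print just the ID)
jobid=$(sbatch --parsable test_add.job test_add.py 5 8)

# Poll the queue every 5 seconds until the job leaves it
while squeue -h -j "$jobid" 2>/dev/null | grep -q .; do
    sleep 5
done

# Check the final state and resource usage
sacct -j "$jobid" --format=JobID,JobName,State,Elapsed,MaxRSS
```

Because `--parsable` gives us the job ID, we can poll `squeue` for exactly that job and only read `result.txt` once the job has actually finished.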