Query and Cancel Jobs
Querying the State of a Job
You can find out about the state of your job and all other jobs in the queue using the command squeue
.
>>> squeue --user=yourusername
This will give you the status of all your running and submitted jobs. You can also neglect the --user=yourusername
to view the entire job queue for all users.
If you are using OzSTAR then the output of squeue --user=yourusername
should look similar to the figure below.
The columns of the output are as follows:
JOBID
: The JOBID that is given to the job. This ID is unique amongst all jobs past, present and future.PARTITION
: The type of ‘queue’ that the job is in. This is usually given by the name of the type of CPU that will be running the job.NAME
: The name of the job.USER
: The username of the person that submitted the job.ST
: The status of the job.R
: Currently RunningPD
: Waiting for Resources (Pending)
TIME
: The length of time the job has been running. If the job is pending (ST = PD
) it will say0:00
.NODES
: The number of ‘nodes’ that the job has requested. A ‘node’ is a collection of many CPUs. OzSTAR has a few different types of nodes with different amounts of CPUs on each. For example thejohn
(PARTITION = skylake
) nodes have 32 CPUs each.NODELIST(REASON)
: If the job is currently running (ST = R
) this is the list of nodes that the job is using. If the job is pending (ST = PD
) this is why the job is pending.
You can also use the OzSTAR Job Monitor Website for a graphical view of all the jobs that are running and in the queue.
Canceling a Job
Sometimes you will have a job that you need to cancel for some reason. You can cancel a running or submitted job at any time with scancel jobid
.
>>> scancel 99999999
You can also cancel all of your jobs with scancel --user=yourusername
or you can only cancel your “Pending” jobs with scancel -t PD
.
Examples
Example 0
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=100MB
#SBATCH --time=00:30:00
module purge
module load anaconda3/5.0.1
source activate py3
python example_python_job.py
This is the same example as shown throughout this tutorial.
Example 1
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=100MB
#SBATCH --time=00:30:00
#SBATCH --job-name=Calculate_Mean
#SBATCH --output=slurm_output.txt
#SBATCH --mail-user=name@swin.edu.au
#SBATCH --mail-type=ALL
#SBATCH --account=oz999
module purge
module load anaconda3/5.0.1
source activate py3
python example_python_job.py
This is essentially the same as Example 0 but with a few additional parameters.
--job-name
is being used to give a more meaningful name to the job. This is the name that will show up asNAME
when usingsqueue
.--output
is defining the file that all of the Slurm output (i.e. print statements) will be directed.--mail-user
and--main-type
makes Slurm send an email when your job starts and completes.--account
is setting which group account this job belongs is associated with.
Summary
This tutorial is not meant as a comprehensive article covering all there is to know about using Slurm. Still, hopefully, by now you feel confident enough to be able to write your own bash scripts and get jobs running on OzSTAR.