Query and Cancel Jobs
Querying the State of a Job
You can find out about the state of your job and all other jobs in the queue using the command
>>> squeue --user=yourusername
This will give you the status of all your running and submitted jobs. You can also omit
--user=yourusername to view the entire job queue for all users.
If you are using OzSTAR then the output of
squeue --user=yourusername should look similar to the figure below.
The columns of the output are as follows:
JOBID: The JOBID that is given to the job. This ID is unique amongst all jobs past, present and future.
PARTITION: The type of ‘queue’ that the job is in. This is usually given by the name of the type of CPU that will be running the job.
NAME: The name of the job.
USER: The username of the person that submitted the job.
ST: The status of the job.
R: Currently Running
PD: Waiting for Resources (Pending)
TIME: The length of time the job has been running. If the job is pending (
ST = PD) it will say
NODES: The number of ‘nodes’ that the job has requested. A ‘node’ is a collection of many CPUs. OzSTAR has a few different types of nodes with different numbers of CPUs on each. For example, the skylake (PARTITION = skylake) nodes have 32 CPUs each.
NODELIST(REASON): If the job is currently running (ST = R) this is the list of nodes that the job is using. If the job is pending (ST = PD) this is why the job is pending.
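To make the column descriptions concrete, the sketch below counts jobs per state from squeue-style text. It assumes the default squeue column order, in which ST is the fifth column. The helper name count_states and every row of the sample output (job IDs, job name, node name, reasons) are invented for illustration and do not come from a real queue.

```shell
#!/bin/bash
# Count jobs per state (the ST column) from squeue-style output on stdin.
# Assumption: default squeue column order, where ST is the fifth column.
count_states() {
    awk 'NR > 1 { counts[$5]++ }   # skip the header row
         END { for (s in counts) print s, counts[s] }' | sort
}

# Invented sample output for illustration only; not from a real queue.
sample_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000001 skylake my_job yourusername R 5:12 1 node001
1000002 skylake my_job yourusername PD 0:00 1 (Priority)
1000003 skylake my_job yourusername PD 0:00 1 (Resources)'

printf '%s\n' "$sample_output" | count_states
```

On a cluster you would feed the function real output instead, for example with squeue --user=yourusername | count_states.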
You can also use the OzSTAR Job Monitor Website for a graphical view of all the jobs that are running and in the queue.
Canceling a Job
Sometimes you will need to cancel a job. You can cancel a running or submitted job at any time by passing its JOBID (here 99999999) to scancel:
>>> scancel 99999999
You can also cancel all of your jobs with scancel --user=yourusername, or cancel only your “Pending” jobs with scancel -t PD.
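Since scancel takes effect as soon as it is run, it can help to preview a bulk cancel before committing to it. The following is a minimal sketch, not an OzSTAR-provided tool: cancel_pending and the DRY_RUN variable are hypothetical names introduced here, and the function prints the scancel command by default instead of running it.

```shell
#!/bin/bash
# Hypothetical helper: cancel only the pending jobs of one user.
# With DRY_RUN=1 (the default here) it prints the command instead of running it.
cancel_pending() {
    user="$1"
    cmd="scancel --state=PENDING --user=$user"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $cmd"
    else
        $cmd
    fi
}

cancel_pending yourusername
```

Setting DRY_RUN=0 before the call runs the command for real. To see which jobs the filter would match first, squeue --user=yourusername --states=PD lists the same set.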
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=100MB
#SBATCH --time=00:30:00

module purge
module load anaconda3/5.0.1
source activate py3
python example_python_job.py
This is the same example as shown throughout this tutorial.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=100MB
#SBATCH --time=00:30:00
#SBATCH --job-name=Calculate_Mean
#SBATCH --output=slurm_output.txt
#SBATCH --mail-user=firstname.lastname@example.org
#SBATCH --mail-type=ALL
#SBATCH --account=oz999

module purge
module load anaconda3/5.0.1
source activate py3
python example_python_job.py
This is essentially the same as Example 0 but with a few additional parameters.
--job-name is being used to give a more meaningful name to the job. This is the name that will show up as
--output defines the file to which all of the job's output (e.g. print statements) will be directed.
--mail-type makes Slurm send an email when your job starts and completes.
--account sets which group account this job is associated with.
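A mistyped option name in an #SBATCH line is easy to miss, so it can be useful to list what a script actually requests before submitting it. The sketch below is one possible way to do that; list_sbatch_options is a hypothetical helper, and the temporary file holds a shortened copy of the example script above.

```shell
#!/bin/bash
# Hypothetical helper: print every option set via an #SBATCH line in a job script.
list_sbatch_options() {
    sed -n 's/^#SBATCH[[:space:]]*//p' "$1"
}

# Write a shortened copy of the example script to a temporary file.
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:30:00
#SBATCH --job-name=Calculate_Mean
#SBATCH --output=slurm_output.txt
python example_python_job.py
EOF

list_sbatch_options "$job_script"
rm -f "$job_script"
```

The function only extracts the text after #SBATCH; it does not check that the options are valid Slurm options.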
This tutorial is not meant as a comprehensive article covering all there is to know about using Slurm. Still, hopefully, by now you feel confident enough to be able to write your own bash scripts and get jobs running on OzSTAR.