A Beginner's Guide to OzSTAR

Query and Cancel Jobs

Querying the State of a Job

You can find out about the state of your job and all other jobs in the queue using the command squeue.

>>> squeue --user=yourusername

This will give you the status of all your running and submitted jobs. You can also omit --user=yourusername to view the job queue for all users.

If you are using OzSTAR then the output of squeue --user=yourusername should look similar to the figure below.

"An image of the output of squeue after using sbatch on my_slurm_job.sh"

The columns of the output are as follows:

  • JOBID: The JOBID that is given to the job. This ID is unique amongst all jobs past, present and future.
  • PARTITION: The type of ‘queue’ that the job is in. This is usually named after the type of CPU that will run the job.
  • NAME: The name of the job.
  • USER: The username of the person that submitted the job.
  • ST: The status of the job.
    • R: Currently Running
    • PD: Waiting for Resources (Pending)
  • TIME: The length of time the job has been running. If the job is pending (ST = PD) this shows 0:00.
  • NODES: The number of ‘nodes’ that the job has requested. A ‘node’ is a collection of many CPUs. OzSTAR has a few different types of nodes, each with a different number of CPUs. For example, the john (PARTITION = skylake) nodes have 32 CPUs each.
  • NODELIST(REASON): If the job is currently running (ST = R) this is the list of nodes that the job is using. If the job is pending (ST = PD) this is why the job is pending.
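
As a small illustration of the ST column, the sketch below tallies how many of your jobs are in each state. It assumes the standard squeue options --noheader and --format with the %t field (the compact state code). count_job_states is a hypothetical helper name, not part of Slurm, and the squeue call only runs where the command is installed.

```shell
# Hypothetical helper: count jobs per state code (R, PD, ...).
# Expects one state code per line on standard input.
count_job_states() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Only query Slurm when squeue is actually available (i.e. on the cluster).
if command -v squeue >/dev/null 2>&1; then
    # --noheader drops the column titles; %t prints only the ST column.
    squeue --user="$USER" --noheader --format="%t" | count_job_states
fi
```

For three jobs with states R, PD and PD, the helper prints PD: 2 followed by R: 1.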

You can also use the OzSTAR Job Monitor Website for a graphical view of all the jobs that are running and in the queue.

Canceling a Job

Sometimes you will need to cancel a job. You can cancel a running or submitted job at any time with scancel jobid.

>>> scancel 99999999

You can also cancel all of your jobs with scancel --user=yourusername, or cancel only your “Pending” jobs with scancel -t PD.
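
Because a cancelled job cannot be recovered, it can help to preview a bulk cancel before running it. The following is a minimal sketch, assuming the long-form scancel options --user and --state. cancel_pending_jobs and the DRY_RUN variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: cancel one user's pending jobs.
# With DRY_RUN=1 (the default here) the command is only printed, not executed.
cancel_pending_jobs() {
    target_user="$1"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "scancel --user=${target_user} --state=PENDING"
    else
        scancel --user="${target_user}" --state=PENDING
    fi
}

# Preview the command for the current user.
cancel_pending_jobs "${USER:-yourusername}"
```

Setting DRY_RUN=0 before calling the helper would run the scancel command instead of printing it.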

Examples

Example 0

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --mem=100MB
#SBATCH --time=00:30:00

module purge
module load anaconda3/5.0.1

source activate py3

python example_python_job.py

This is the same example as shown throughout this tutorial.
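
To connect this script with the querying commands above, the sketch below submits it and then looks up only that job. It assumes the script is saved as my_slurm_job.sh and relies on the sbatch option --parsable, which prints the job ID (optionally followed by a semicolon and the cluster name) instead of the usual message. parse_job_id is a hypothetical helper, and the submission only happens where sbatch is installed.

```shell
# Hypothetical helper: extract the job ID from `sbatch --parsable` output,
# which is either "jobid" or "jobid;clustername".
parse_job_id() {
    echo "${1%%;*}"
}

# Only submit when sbatch is actually available (i.e. on the cluster).
if command -v sbatch >/dev/null 2>&1; then
    job_id=$(parse_job_id "$(sbatch --parsable my_slurm_job.sh)")
    # Show the queue entry for this job only.
    squeue --jobs="$job_id"
fi
```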

Example 1

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --mem=100MB
#SBATCH --time=00:30:00

#SBATCH --job-name=Calculate_Mean
#SBATCH --output=slurm_output.txt

#SBATCH --mail-user=name@swin.edu.au
#SBATCH --mail-type=ALL
#SBATCH --account=oz999

module purge
module load anaconda3/5.0.1

source activate py3

python example_python_job.py

This is essentially the same as Example 0 but with a few additional parameters.

  • --job-name is being used to give a more meaningful name to the job. This is the name that will show up as NAME when using squeue.
  • --output defines the file that all of the Slurm output (i.e. print statements) will be directed to.
  • --mail-user and --mail-type make Slurm send an email to the given address when your job starts and completes.
  • --account sets the group account that this job is associated with.
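
Inside a running job, Slurm exposes some of these settings as environment variables. The sketch below is a cut-down variant of Example 1 that prints the job ID and name. SLURM_JOB_ID and SLURM_JOB_NAME are standard Slurm variables, while describe_job is a hypothetical helper added for illustration. Run outside Slurm, the #SBATCH lines are treated as ordinary comments and placeholders are printed.

```shell
#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --mem=100MB
#SBATCH --time=00:30:00
#SBATCH --job-name=Calculate_Mean
#SBATCH --output=slurm_output.txt

# Hypothetical helper: report the job ID and name that Slurm sets for a running job.
# Outside Slurm the variables are unset, so placeholders are printed instead.
describe_job() {
    echo "Job ${SLURM_JOB_ID:-unknown} (${SLURM_JOB_NAME:-unnamed})"
}

# When run through sbatch, this line ends up in slurm_output.txt.
describe_job
```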

Summary

This tutorial is not meant to be a comprehensive article covering all there is to know about using Slurm. Still, by now you should hopefully feel confident enough to write your own bash scripts and get jobs running on OzSTAR.

Copyright © Astronomy Data and Compute Services

ADACS is delivered jointly by Swinburne University of Technology and Curtin University. ADACS is funded under Astronomy National Collaborative Research Infrastructure Strategy (NCRIS) Program via Astronomy Australia Ltd (AAL).