Conrad Chan undertook an ADACS internship from January to March 2018.
My internship at ADACS coincided with the launch of the OzSTAR supercomputer, so I had the opportunity to contribute to the development and setup of software for the users of OzSTAR in the months leading up to the launch.
The first project I contributed to was the setup of the Environment Module system. Researchers using the OzSTAR supercomputer often have highly specialised software configurations to meet their specific needs. For some, this means a particular software package. Those who develop their own software, however, require specific compilers and libraries. The components needed to meet every researcher’s needs cannot all be active at the same time without conflicting.
An Environment Module system enables software on the supercomputer to be selectively enabled according to each user’s requirements. On OzSTAR, we deployed Lmod, an implementation of Environment Modules. This allows system administrators to use scripts that make installing new software easy. Using Lmod, we implemented a method for filtering the module list based on the currently loaded modules, which helps users load the correct version of each module. For example, even if a software package has been compiled several times against different MPI libraries, only the build matching the currently loaded MPI library is displayed. With the Module system set up, I then installed commonly used software packages, as well as specific packages requested by users.
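As an illustration only, the filtering idea described above can be sketched in a few lines of Python. This is not Lmod's own mechanism or the actual OzSTAR configuration. The `ModuleBuild` record, the `visible_modules` function, and the package names and versions are all hypothetical and chosen just for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModuleBuild:
    """One installed build of a package (hypothetical record for illustration)."""
    name: str             # package name
    version: str          # package version
    mpi: Optional[str]    # MPI library the build was compiled against, or None


def visible_modules(builds: List[ModuleBuild],
                    loaded_mpi: Optional[str]) -> List[ModuleBuild]:
    """Keep builds that do not depend on MPI, plus builds that match
    the MPI library the user currently has loaded."""
    return [b for b in builds if b.mpi is None or b.mpi == loaded_mpi]


# Hypothetical installation: one package built twice against different MPI
# libraries, and one package that does not depend on MPI at all.
builds = [
    ModuleBuild("hdf5", "1.10.1", "openmpi/3.0.0"),
    ModuleBuild("hdf5", "1.10.1", "mvapich2/2.3"),
    ModuleBuild("git", "2.16.0", None),
]

for build in visible_modules(builds, loaded_mpi="openmpi/3.0.0"):
    print(build.name, build.version, build.mpi)
```

With one MPI library loaded, the user sees a single build of the MPI-dependent package instead of every variant, which is the behaviour described above.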
The second project I contributed to was the development of a job monitoring system for OzSTAR. Dashboards were already in use for monitoring OzSTAR, but these were designed with engineers in mind, with a focus on providing useful metrics for hardware maintenance. There was a need for a job monitor to help users understand how their tasks were running on the machine and to highlight areas for potential optimisation.
I developed the job monitor (now available on the OzSTAR website) using the React library for the frontend and Python for the backend. The monitor provides a list of running jobs and allows users to inspect statistics such as CPU usage, memory usage, GPU usage, network transfer rates, and disk access rates. To guide users towards potential issues, it highlights jobs that are underutilising their requested resources. Often these problems can be easily corrected by adjusting the amount of resources requested, but underutilisation can also indicate a potential point of inefficiency in the user's code. This tool has helped users optimise their usage of OzSTAR, and has helped administrators assist users in getting the most benefit out of their applications.
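The underutilisation check can be pictured with a minimal Python sketch. It is not the job monitor's actual backend code. The `JobStats` fields, the `underutilisation_warnings` function, and the 50% thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobStats:
    """Summary statistics for one running job (hypothetical fields)."""
    job_id: str
    cpus_requested: int
    mean_cpu_utilisation: float   # fraction of requested cores in use, 0.0 to 1.0
    mem_requested_gb: float
    peak_mem_used_gb: float


def underutilisation_warnings(job: JobStats,
                              cpu_threshold: float = 0.5,
                              mem_threshold: float = 0.5) -> List[str]:
    """Return the names of resources the job uses well below its request.

    The thresholds are illustrative assumptions, not values from OzSTAR.
    """
    warnings = []
    if job.mean_cpu_utilisation < cpu_threshold:
        warnings.append("cpu")
    # Guard against a zero memory request before dividing.
    if job.mem_requested_gb > 0:
        mem_fraction = job.peak_mem_used_gb / job.mem_requested_gb
        if mem_fraction < mem_threshold:
            warnings.append("memory")
    return warnings


# Hypothetical job: 16 cores requested but only about 12% busy on average.
job = JobStats("1001", cpus_requested=16, mean_cpu_utilisation=0.12,
               mem_requested_gb=64.0, peak_mem_used_gb=60.0)
print(job.job_id, underutilisation_warnings(job))
```

A job flagged for "cpu" in this way might simply need a smaller core request, or it might point to a serial bottleneck in the code, which matches the two cases described above.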
By contributing to the launch of OzSTAR, I gained an understanding of the teams, projects, and workflows required to provide supercomputing resources to researchers. I gained new skills in system administration and web application development. Along the way, I also gained hands-on experience by assisting other team members with their roles, and contributed to the production of the OzSTAR promotional launch video.