Optimizing parallel Bilby (pBilby)

CI: Dr. R. Smith

The proposal aims to diagnose and remove communication bottlenecks in the gravitational-wave inference code “parallel bilby” (pBilby). pBilby is a highly parallelized, scalable variant of bilby and one of the LIGO Scientific Collaboration’s flagship codes for performing astrophysical inference on gravitational-wave sources. Since its adoption by the collaboration in early 2020, it has been used to analyze numerous exceptional candidate events, such as the recently announced observation of GW190412 (arXiv:2005.06544), and it is currently being used to analyze at least three further exceptional candidate events to be published later this year. pBilby enables inference with physical models of gravitational waves that would otherwise be too expensive to use, and is hence crucial for precision measurement.

pBilby is routinely deployed at scale (over 500 CPUs) on SSTAR. However, Robin Humble, an OzSTAR HPC consultant, recently discovered that the code likely has communication issues which make it up to a factor of two less efficient than it could be. A large fraction (sometimes around 50%) of the total CPU time is system time rather than user time, meaning that much of the run time is spent executing code in kernel space. From an algorithmic standpoint, there is no reason the run time should not be dominated by user time; the aim is therefore to understand why the code spends so much time in kernel space. Successfully addressing this issue will lead to much better utilization of OzSTAR and external resources for a mission-critical code of the LIGO collaboration.

pBilby is a Python code that uses MPI to parallelize tasks across worker processes, whose results are then pooled and analyzed by the head-node process. We would require a few to a few tens of compute nodes on OzSTAR, SSTAR, or GSTAR in order to test and benchmark the performance of the code.
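The user-versus-system-time symptom can be checked from within Python itself. The sketch below (illustrative only, not part of pBilby) burns CPU in a compute-bound loop and then reads the process's cumulative user and system CPU time via the standard-library `os.times()` call. In a healthy compute-bound run the system-time fraction should be near zero; a large fraction, as observed for pBilby, points to time spent in kernel space, for example on MPI busy-wait polling or I/O system calls.

```python
import os

# A purely compute-bound workload: all of this work should be
# accounted as user time, with negligible system time.
acc = 0
for i in range(2_000_000):
    acc += i * i

# os.times() reports the process's cumulative user and system
# CPU seconds (POSIX getrusage-style accounting).
t = os.times()
user, system = t.user, t.system

total = user + system
frac_system = system / total if total > 0 else 0.0
print(f"user={user:.3f}s  system={system:.3f}s  system fraction={frac_system:.2%}")
```

Instrumenting representative sections of a pBilby run (or its worker loop) in this way, alongside node-level tools, could help localize where the roughly 50% system time is being spent.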