Data Intensive Astronomy Workshop – Abstracts

Invited Talks

Ada Nebot (Observatoire Astronomique de Strasbourg)

The CDS – an astronomy data centre for reference data, preparing for the era of Big Data

The Strasbourg astronomy data centre (CDS) develops and maintains services that help the astronomical community use and share reference data: the astronomical database SIMBAD contains identifications, measurements and bibliography; the VizieR database publishes catalogues and data associated with journal publications as well as large survey catalogues, and offers an associated fast positional cross-match service and a photometric viewer; the sky atlas Aladin is a visualisation tool that allows users to easily access and manipulate images and cube surveys, catalogues, and spectra collections within the Virtual Observatory. These services aim to make heterogeneous data accessible and interoperable via the use of standardised metadata, and considerable effort is being made to ensure their scalability in order to handle future large surveys. The recent developments of the Hierarchical Progressive Survey (HiPS) scheme and the Multi-Order Coverage maps (MOC) are part of this approach. HiPS and MOC have proved to be practical solutions for managing all-sky data sets (images, cubes and catalogues) and open the door to new scientific usage, facilitating visualisation, access and interoperability in the challenging era of Big Data in astronomy. We will highlight the features of HiPS and MOC and their potential application to large data sets.
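
As an illustration of the idea behind MOC, the following is a minimal sketch and not the CDS implementation. It assumes a coverage map can be modelled as a set of (order, ipix) HEALPix cells in the NESTED numbering scheme, where the parent of cell ipix is ipix >> 2. The function names covers and normalise are hypothetical.

```python
# Toy multi-order coverage map: a set of (order, ipix) HEALPix cells in the
# NESTED scheme, where a cell at order k splits into the four cells
# 4*ipix .. 4*ipix+3 at order k+1.


def covers(moc, order, ipix):
    """Return True if the cell (order, ipix) is fully contained in `moc`."""
    for k in range(order, -1, -1):
        # Index of the ancestor of `ipix` at the coarser order k.
        ancestor = ipix >> (2 * (order - k))
        if (k, ancestor) in moc:
            return True
    return False


def normalise(moc):
    """Merge complete groups of four sibling cells into their parent cell."""
    cells = set(moc)
    changed = True
    while changed:
        changed = False
        for order, ipix in sorted(cells):
            # Skip order-0 cells and cells already merged in this pass.
            if order == 0 or (order, ipix) not in cells:
                continue
            first = (ipix >> 2) << 2  # first of the four siblings
            siblings = {(order, first + i) for i in range(4)}
            if siblings <= cells:
                cells -= siblings
                cells.add((order - 1, ipix >> 2))
                changed = True
    return cells


# Four order-1 siblings (4..7) plus one small order-2 cell.
example = {(1, 4), (1, 5), (1, 6), (1, 7), (2, 0)}
compact = normalise(example)
```

Because coverage is stored at the coarsest order that describes each region, a membership test only has to walk up the hierarchy from the queried cell, which is what keeps such maps compact for all-sky data sets.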

Andreas Wicenec (UWA / ICRAR)

A brief history, status and outlook of Data Intensive Astronomy

In this talk I will provide evidence that astronomy has always been data intensive and even more labour intensive. The capabilities of the instruments our engineers design and build, and astronomers use, have always been cutting edge and often beyond anything built before. With the ability to observe more objects and more details of individual objects, the amount of work required to extract the knowledge goes up as well. When combining the two, i.e. collecting fine details of many objects, the amount of work required to derive the information and finally extract the anticipated new knowledge from such a data collection goes up exponentially. This holds completely independently of the kind and form of the original data, as is easily illustrated by the enormous data collections in the form of photographic plates, collected by observatories all around the world for more than a century. In the beginning, the digital (r)evolution just made collecting the data a lot more straightforward and less labour intensive, but with the enormous boost of a completely digital society, our ability to collect data from the universe has gone through the roof. The claim here is that astronomy, and even more so astronomers, are by now seriously out of balance in their ability to collect data versus deriving knowledge. In addition, thanks to the IVOA, every single person on the planet has access to absolutely stunning amounts of astronomical data. Thus, rather than just dealing with your own observations, it is straightforward, and indeed standard practice, to use more data from other sources and combine them. Within seconds on a network, it is possible to start downloading terabytes of data for any given point in the sky from multiple archives in parallel.
Stunningly, most astronomers are still using single desktop or laptop machines, or single institutional servers, in much the same fashion and very often with the same software as in the 1990s. In my view the gap between individual astronomers' capabilities for gathering data and for analysing it is increasing extremely fast and might well already be the single most fundamental bottleneck to scientific advances in the field.
How should we as a community try to remove this bottleneck? The first step is to identify and acknowledge the problem both on the individual and the community level: Tick! That’s why we have this workshop! The second step is to acquire the required expertise and design and implement solutions. The third step is to deploy and offer these solutions back to the community and in fact to individual astronomers. The term ‘solutions’ here should be understood in the broadest way possible, as in software and tools, but also expertise and training and dedicated personnel. It is virtually impossible, and also undesirable, for an individual astronomer or even a team of astronomers to cover expertise and excellence spanning from computer hardware and system administration to design and implementation of highly efficient distributed code and algorithms, massive data management and then all the way to the actual astronomy questions to be tackled. With the amount of data we have at hand, in order to be really efficient, we need to use the best possible solutions to these computing-related challenges. In the most extreme cases, like ASKAP and the SKA, or the largest simulations with a trillion particles, good is not good enough anymore; we will need the best minds in those fields to think about and develop potential solutions.

Lister Staveley-Smith (ASTRO3D / UWA)

The ASTRO-3D Data Intensive Astronomy Program

The ASTRO-3D ARC Centre of Excellence started operations in July. Its Data Intensive Astronomy Program will provide researchers with expertise in advanced dataflow and process management techniques, easy dissemination of final data products via the Virtual Observatory, and provision of detailed simulations. Along with the ASTRO-3D simulations thread, the program will be the interface to institutional expertise and national providers represented at this workshop. ASTRO-3D will work with its international partner institutes in Europe, North America and China, many of whom also have expertise in state-of-the-art HPC systems, data archives, and visualisation techniques.

Elaina Hyde (AAO)

Clustering and Grouping in Spectroscopic Data Sets

Clustering and grouping algorithms behave very differently in data sets with varying density and parameter spaces. I present a small case study of the Leo IV and Leo V system which illustrates some of the advantages and disadvantages of these routines.

Julie Banfield (ANU)

Citizen Science and Machine Learning

I will present the current status of the citizen science project Radio Galaxy Zoo. I will examine how citizen science and machine learning can benefit from each other and highlight the state of this research area for the upcoming pre-SKA era.

Andrew Treloar (ANDS)

Services and Skills in support of FAIR Data

FAIR (Findable, Accessible, Interoperable and Reusable) is increasingly being used to refer to desirable characteristics of research data across a range of domains. ANDS, Nectar and RDS are aligning their business plans for 2017-18 to deliver on a range of outcomes, one of which is support for a FAIR agenda. This presentation will provide an overview of how the three organisations are working to provide a range of services that support FAIR data, and a skills agenda that supports engagement with these services and the data.

Eric Thrane (Monash)

Big Data Challenges in Gravitational-Wave Astronomy

Gravitational-wave astronomy is transforming our understanding of the Universe, allowing us to observe the previously hidden world of black holes. The scientific breakthroughs of this new field rely on carefully tuned algorithms, running on large clusters, to sift through large volumes of noise to find needle-in-the-haystack signals. Important analyses remain computationally limited. In this talk, I describe some of the key computational problems in gravitational-wave astronomy.

Contributed Talks

Christopher Fluke (Swinburne University of Technology)

Astronomy Accelerated

Modern astronomy is a petascale enterprise. High performance computing applications are enabling complex simulations with many billions of particles. The new generation of telescopes collect data at rates far in excess of terabytes per day. The immensity of the data demands new approaches and techniques to ensure that standard analysis tasks can be accomplished at all, let alone in reasonable time. Indeed, many of the basic techniques used to analyse, interpret, and explore data will be pushed beyond their breaking points. I will discuss the crucial role that graphics processing units (GPUs) can play in accelerating astronomy, with examples drawn from recent work on: interactive visualisation and analysis of spectral data cubes (GraphTIVA and Shwirl); kinematic model fitting for large-scale galaxy surveys (GBKFIT); and quasar microlensing parameter surveys (GERLUMPH).

Sarah Hegarty (Swinburne University of Technology)

A New eResearch Platform for Transient Classification

Deeper, Wider, Faster (DWF) is a coordinated, simultaneous multi-wavelength observing program, which aims to detect and identify fast transient events. Even with a highly optimised pipeline for data processing and pre-screening of possible transients, DWF still identifies thousands of potentially interesting transient candidates per night. Team members must perform manual classification of these detections in real time, so that other observatories can rapidly be triggered to follow up the most promising candidates. I will present a new eResearch platform developed to streamline this process. Deployed as a browser-based dashboard, this tool allows for interactive visualisation and classification of individual transient candidates, as well as exploration of the larger candidate database. Additionally, we have used this platform to conduct a meta-study of the classification workflow: I will discuss lessons learned about how human classifiers make decisions, with implications for optimising future discovery pipelines.

Tim Dykes (University of Portsmouth)

Comparative, Quantitative, and Interactive Web-based 3D Visualization

Observational and computational astronomy frequently produce results representing real-world phenomena, from small planetary disk formation to large-scale universe structure and massive galaxy catalogues. Scientific visualization can be an effective tool for analysis and exploration of such results, which are often datasets including spatial dimensions and properties inherently suitable for visualization via e.g. mock imaging in 2D or volume rendering in 3D.
The presentation will introduce work-in-progress on a remote web-based visualization tool as a new science module for the virtual observatory TAO (Theoretical Astronomical Observatory). This new science module will provide a real-time 3D visualization that allows the user to interact, select, and filter datasets for quantitative visualization on a remote supercomputing system simply through their web browser. This is combined with a 2D observational view, allowing the user to perform real-time comparative visualization. The combination of quantitative and comparative visualization can assist the user in identifying and extracting interesting subsets of the data to download for further analysis. Common workflows can then be saved and reloaded, to retrieve past visualizations or repeat filtering on multiple datasets with immediate visual results.
The visualization tool is based upon a new client-server implementation of the Splotch high performance rendering software, and is designed to deliver HPC powered interactive data exploration to a user via the web. The aim is to allow a user to integrate interactive visualization of large and complex data to their existing workflow, allowing quantitative and comparative 3D visualization to augment the scientific process without interrupting their workflow or installing cumbersome software packages.

Colin Jacobs (Swinburne University of Technology)

Searching for strong lenses in large image surveys using neural networks

Strong gravitational lenses are direct probes of dark matter and a rich source of information for cosmologists, but the trick is finding them. Since the discovery of the first lensed quasar in 1979, a few hundred strong lenses have been confirmed, but many thousands lie waiting to be discovered in current and next-generation surveys. Sifting thousands of morphologically-diverse lenses from among hundreds of millions of sources in astronomical surveys is a challenge worthy of the latest techniques in computer vision and data mining. I will describe how we applied convolutional neural networks, a key deep learning algorithm, to finding strong lenses in the Canada-France-Hawaii Telescope Legacy Survey and the Dark Energy Survey, and the challenges to be overcome in applying the method to large survey imaging while conserving precious astronomer-hours in sorting through candidates.

Andy Casey (Monash University)

The Cannon: Enabling unprecedented stellar science through Bayesian machine learning

The Cannon is a data-driven method to estimate stellar labels (temperatures, surface gravities and chemical abundances). The approach falls into the category of Bayesian machine learning, and has huge potential for revolutionising our understanding of stellar astrophysics. Named after Annie Jump Cannon, who systematically categorised stellar spectra without the need for physical models, The Cannon relies on similarity between spectra, and a small subset of well-studied stars. The Cannon is seven orders of magnitude faster than classical analysis techniques, and delivers precise chemical abundances at S/N ~ 20, whereas existing methods achieve comparable precision at S/N > 200. This implies the same scientific inferences can be made with ~1/9th the observing time. Here I will discuss how the method works, and present new scientific inferences and discoveries that are only achievable given our increased precision.

JT Malarecki (ICRAR/UWA)

Organisation and Exploration of Very Large Imagery Data in the SKA

The volume of data that modern and upcoming radio telescopes, such as the Australian Square Kilometre Array Pathfinder (ASKAP) and the Square Kilometre Array (SKA), are expected to produce introduces technical and methodological challenges that may hinder the effective exploration of such data. Visualisation tools are very important in helping researchers to understand their data. Technology should enable interactive data exploration; however, this is commonly achieved by reducing the data to manageable volumes. Data exploration strategies will need to evolve alongside the advances in measurement instruments in order to best utilise the wealth of new information.
Compression is a well explored technique that can make large volumes of data more manageable. Although lossy compression algorithms introduce some degree of uncertainty, they can greatly reduce the volume of data. Furthermore, it has been demonstrated that the loss introduced can be acceptable, but this depends on the scientific analysis that will be performed on the data. There is a difference between the high precision data found in datasets and the information required to produce an effective visualisation. Hence, lossy compression can be utilised to benefit visualisation applications, provided that an acceptable degree of loss, or a visually lossless threshold, is defined.
The aim of this research is to aid effective exploration of large-scale datasets to complement analysis. There are two common methodologies that existing solutions employ to achieve this. They may store multiple representations of a dataset and serve the most appropriate to meet a user’s visualisation needs. Alternatively, they may progressively transfer information to provide an approximate representation of the data that can then be refined within the regions the user deems to be interesting. Both of these methods greatly reduce the amount of data that needs to be sent to the user and processed locally. JPEG2000 is a technology designed to handle large datasets and offers both of these functionalities. However, it builds detail by fidelity rather than resolution, which allows the visualisation to quickly reach a visually lossless level and thereby greatly improves efficiency when visually exploring data. I will discuss progress towards visualising and working with gigabyte to terabyte scale datasets using JPEG2000.
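
The first of the two methodologies described above (storing multiple representations and serving the most appropriate one) can be sketched as follows. This is a simplified illustration and not JPEG2000: it assumes a plain 2x2 block-averaging pyramid over a small image held as nested lists, and the names downsample, build_pyramid and pick_level are hypothetical.

```python
def downsample(image):
    """Halve the resolution by averaging each 2x2 block of pixels."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def build_pyramid(image):
    """Return [full, half, quarter, ...] while both dimensions stay even."""
    levels = [image]
    while True:
        rows, cols = len(levels[-1]), len(levels[-1][0])
        if rows < 2 or cols < 2 or rows % 2 or cols % 2:
            return levels
        levels.append(downsample(levels[-1]))


def pick_level(levels, max_pixels):
    """Serve the finest stored level that fits the client's pixel budget."""
    for level in levels:  # ordered from finest to coarsest
        if len(level) * len(level[0]) <= max_pixels:
            return level
    return levels[-1]  # nothing fits: fall back to the coarsest level


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
levels = build_pyramid(image)
preview = pick_level(levels, max_pixels=4)
```

A progressive scheme would instead send the coarsest level first and then refine only the regions the user zooms into. The fidelity-ordered refinement attributed to JPEG2000 above is a different ordering of the same basic trade-off and is not reproduced by this sketch.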

Amr Hassan (Swinburne University of Technology)

Real-time Data Analysis and Visualization on Commercial Cloud Infrastructures

Interactive visualisation and real-time data analysis are essential requirements for knowledge discovery. While achievable on the desktop for small datasets, access to High-Performance Computing (HPC) solutions is required for the Terabyte scale (and beyond) datasets. During this talk, I will discuss our approach for deploying GraphTIVA (Hassan et al. 2013), a tightly coupled distributed GPU-based framework for efficient analysis and visualisation of Terabyte-scale three-dimensional data sets, on Amazon Web Services (AWS).
My talk will discuss our approach to automating the deployment of an HPC-like environment on the cloud, compare the different file I/O solutions offered by AWS and the impact of each on performance and cost, and discuss the impact of latencies introduced by virtualization on the performance of tightly coupled distributed solutions.

Wasim Raja (CSIRO)

Reducing data from the Australian Square Kilometre Array Pathfinder

The Australian Square Kilometre Array Pathfinder (ASKAP) is a test-bed for several new technologies for future generation radio telescopes, including those needed to achieve a wide instantaneous field of view. Each ASKAP dish is equipped with a Phased Array Feed (PAF) of 188 receiver elements at its focal plane, providing enormous flexibility in reliably measuring wide regions of the sky. While conventional telescopes scan the sky with just a single instantaneous beam (single field-of-view), the ASKAP PAFs can form up to 36 simultaneous beams on the sky. The resulting high volume of data presents new challenges in data processing, and requires the algorithms and the processing pipelines to adopt high performance solutions to ensure that the processing is done in near-real-time.
ASKAPsoft is the custom-written package for processing ASKAP data and has been designed from the start to work in a distributed supercomputing environment. In this talk, I will discuss the processing done in a pipeline that runs on the Galaxy supercomputer at the Pawsey Supercomputing Centre, and launches all calibration, imaging and source-finding jobs in a semi-automated fashion. The data products resulting from this pipeline are ingested to the CSIRO ASKAP Data Archive (CASDA) where, after quality control, they are made publicly available.

Simon O’Toole (AAO)

ASVO-AAO Data Central Launch

We are launching Data Central, the AAO’s astronomical data centre, at the ADACS workshop. It is designed to meet the current and future needs of the Australian astronomical community. We have developed an archive infrastructure and interface that is extensible and scalable. During development we deployed two exemplar datasets: the first public data release of the SAMI galaxy survey and the 2nd GAMA survey data release. These surveys require a large range of database capabilities.

In this presentation we will give an overview of the project and demonstrate the various features of the system. There will also be balloons!

Liz Mannering (AAO)

Building a user interface for AAO Data Central

The Australian Astronomical Observatory’s flagship astronomy data archive, Data Central (the AAO node of the All-Sky Virtual Observatory (ASVO) project), is due to launch later this year, connecting researchers to a wealth of theoretical and observational data from telescopes across the globe. Data Central will host all past, present and future surveys carried out using the Anglo-Australian and UK Schmidt Telescopes, including legacy (2dFGRS, 6dFGS, WiggleZ, RAVE, GAMA), ongoing (SAMI, GALAH, OzDES, 2dFLenS) and future (TAIPAN, FunnelWeb, DEVILS) surveys, as well as serving raw observational data from both telescopes.
To access this wealth of information, Data Central has developed a modern, responsive, extendable and light-weight front-end web application, which currently provides a Single Object Viewer, Query Builder, SQL access, search functionality, interactive data visualization, and a CMS-like documentation system. The web-app is coupled with a dedicated web-based API, which provides a simple RESTful interface with lightweight JSON-formatted responses to power the ADC website’s many features (queries, data retrieval, feature requests, profile management, authentication, group management, support etc.).
In this talk I briefly discuss the importance of a good user interface in providing an intuitive means for users to connect with the data hosted at Data Central, the issues we faced in developing the data-access and web layer serving this array of heterogeneous datasets, and the many benefits of adopting a decoupled architecture.

Patrick Clearwater (The University of Melbourne)

A virtual laboratory for gravitational wave data analysis

The discovery of gravitational waves by the Advanced Laser Interferometer Gravitational-wave Observatory (LIGO) has ushered in the era of gravitational wave astronomy. With the new opportunities that LIGO presents for observing the Universe comes the tangible problem of making the big data LIGO produces, and the tools to analyse those data, accessible to the astronomical community. We present progress on development of an interactive, web-based virtual laboratory to present the existing LIGO data processing tools and mature pipelines in an integrated and user-friendly manner.

Christian Reichardt (University of Melbourne)

Using grid computing for astronomy: the South Pole Telescope example

Modern cosmic microwave background experiments are producing petabyte data volumes, which can pose scalability challenges to the traditional analysis pipelines. A queue-based analysis pipeline has been developed for the SPT-3G camera on the South Pole Telescope. This pipeline is designed to operate on the Open Science Grid, an alternative to traditional supercomputers for scientific processing. I will present the pipeline, and discuss both the constraints grid computing places on the design and the advantages of grid computing for handling large data volumes.

BOF

Arna Karick

Tech Savvy Astronomy: Useful tools for research & for transitioning into tech careers

The aim of this BoF is to kickstart discussions about a more tech-minded way of approaching modern-day, data-intensive astronomy research. It’s about challenging the status quo and creating a new culture of tech-savvy astronomers, equipped with industry-standard tech skills to complement scientific computing and domain-specific data analysis expertise. Most importantly, it’s about letting go of preconceived ideas and thinking creatively. The technology sector is a very attractive career path, particularly for researchers with strong scientific computing and Python programming skills. What useful tools can astronomers start using now to help them transition into the technology sector? What tools is the tech industry most excited about, and what should researchers start using now? What tools are most useful for building up project portfolios, while maintaining their everyday research?

ASVO Talks

Simon Murphy and Marc White

SkyMapper in the international Virtual Observatory

Since 2014, the Australian National University’s SkyMapper telescope at Siding Spring Observatory, near Coonabarabran NSW, has been conducting a multi-epoch, multi-colour digital survey of the entire southern sky. With its first all-sky data release (DR1) in June of this year, Australian astronomers now have access to some 36 TB of calibrated survey images. From 2.3 billion object detections within these images, the SkyMapper team has derived position, brightness and shape parameters for 318 million stars and galaxies, matched against existing all-sky, multi-wavelength catalogues. The range of high-impact science enabled by SkyMapper data is broad and includes identifying the oldest, most pristine stars in the Galaxy, discovering new satellite galaxies orbiting the Milky Way, finding high-redshift quasars at the edge of the visible universe, and searching for supernovae and other transient objects. SkyMapper’s images and catalogues are accessible through web services developed against standards proposed by the International Virtual Observatory Alliance, which promotes the access, interoperability and provenance of astronomical data. In this talk we will briefly introduce SkyMapper and its data releases, showcase the different ways in which clients (both astronomer and machine) can interact with our data, and discuss how SkyMapper sits in the Australian and global landscape of increasingly ‘big data’ astronomy, including plans to increase interoperability between Australian VO nodes to simplify data discovery and maximise scientific return.

Andrew Williams

MWA-ASVO

The Murchison Widefield Array (MWA) is a low-frequency radio telescope operating between 80 and 300 MHz. It is located at the Murchison Radio-astronomy Observatory (MRO) in Western Australia, the planned site of the future Square Kilometre Array (SKA) low-band telescope, and is one of three telescopes designated as a Precursor for the SKA. Science operations began in mid-2013 and since then, over 16 PB of raw visibility data has been collected and archived (of which over 9 PB is available to the public). The MWA data archive is not easily accessible to researchers in the general public, as they require access to proprietary tools and knowledge to download, convert and process the data, which is only obtained through project membership.

The All-Sky Virtual Observatory (ASVO) aims to provide access to data from various astronomical facilities via a common data access mechanism using International Virtual Observatory Alliance (IVOA) compliant services. The MWA-ASVO Pilot project is being developed and will form the fourth ASVO node (alongside SkyMapper, TAO and AAO). In this presentation, we discuss the rationale, status, design and development of the project, including ways in which this project will reduce the barriers faced by astronomers when accessing MWA data. We will also discuss the design goals for the node in potential future projects beyond the pilot.

James Dempsey and Minh Huynh

CSIRO ASKAP Science Data Archive

ASKAP is an array of 36 radio antennas, located at the Murchison Radio-astronomy Observatory in Western Australia. Equipped with innovative phased array feed receivers, ASKAP has an extremely wide field-of-view and will carry out sensitive large-scale surveys of the southern sky.

The CSIRO ASKAP Science Data Archive (CASDA) will be the primary means by which astronomers interact with ASKAP data products. In full operations, CASDA will archive and manage around 5 PB of data each year.

In this talk I will show how to get quick access to specific data products via the user interface. I will also show how to use the Virtual Observatory (VO) services for access to larger volumes of data and for regular queries such as the latest survey data.
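
To give a flavour of what a regular query through VO services can look like, here is a minimal sketch that only builds an ADQL cone-search string of the kind accepted by IVOA Table Access Protocol (TAP) services. It does not contact any archive. The table name survey.catalogue, the column names ra and dec, and the function name cone_search_adql are assumptions for illustration, not CASDA's actual schema.

```python
def cone_search_adql(table, ra_deg, dec_deg, radius_deg,
                     ra_col="ra", dec_col="dec", limit=100):
    """Build an ADQL query selecting rows within a circle on the sky.

    All table and column names are supplied by the caller; the defaults
    here are placeholders rather than any real archive's schema.
    """
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("right ascension must be in [0, 360) degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("declination must be in [-90, 90] degrees")
    if radius_deg <= 0.0:
        raise ValueError("search radius must be positive")
    return (
        f"SELECT TOP {int(limit)} * FROM {table} "
        f"WHERE 1 = CONTAINS(POINT('ICRS', {ra_col}, {dec_col}), "
        f"CIRCLE('ICRS', {ra_deg}, {dec_deg}, {radius_deg}))"
    )


# Hypothetical table name; a real service publishes its own table list.
query = cone_search_adql("survey.catalogue", 187.5, -45.0, 0.1)
```

The resulting string could then be submitted with any TAP client. Keeping the query construction separate from the network call makes it easy to rerun the same search on a schedule, for example to pick up the latest survey data.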