AXS: A framework for fast astronomical data processing based on Apache Spark

05/22/2019
by Petar Zečević, et al.

We introduce AXS (Astronomy eXtensions for Spark), a scalable open-source astronomical data analysis framework built on Apache Spark, a widely used industry-standard engine for big data processing. Building on capabilities present in Spark, AXS aims to enable querying and analyzing almost arbitrarily large astronomical catalogs using familiar Python/AstroPy concepts, DataFrame APIs, and SQL statements. We achieve this by i) adding support to Spark for efficient on-line positional cross-matching and ii) supplying a Python library supporting commonly used operations for astronomical data analysis. To support scalable cross-matching, we developed a variant of the ZONES algorithm (Gray et al. 2004) capable of operating in a distributed, shared-nothing architecture. We couple this to a data partitioning scheme that enables fast catalog cross-matching and handles the data skew often present in deep all-sky data sets. The cross-match and other often-used functionalities are exposed to end users through an easy-to-use Python API. We demonstrate AXS' technical and scientific performance on SDSS, ZTF, Gaia DR2, and AllWise catalogs. Using AXS we were able to perform an on-the-fly cross-match of the Gaia DR2 (1.8 billion rows) and AllWise (900 million rows) data sets in 30 seconds. We discuss how cloud-ready distributed systems like AXS provide a natural way to enable comprehensive end-user analyses of large data sets such as those from LSST.
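To make the zone-based idea concrete, the following is a minimal single-machine sketch of a cross-match in the spirit of ZONES. It is not the AXS implementation or its API: the function names (`zone_of`, `angular_sep_deg`, `crossmatch`), the tuple layout `(id, ra_deg, dec_deg)`, and the one-arcminute default zone height are all assumptions chosen for illustration. The sketch buckets one catalog into horizontal declination stripes, then compares each object only against candidates in its own and neighboring stripes. The right-ascension pruning inside each zone and the distribution of zones across Spark partitions are deliberately omitted.

```python
import math
from collections import defaultdict


def zone_of(dec_deg, zone_height_deg):
    """Index of the declination stripe (zone) that contains dec_deg."""
    return int(math.floor((dec_deg + 90.0) / zone_height_deg))


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def crossmatch(cat_a, cat_b, radius_deg, zone_height_deg=1 / 60):
    """Return (id_a, id_b) pairs separated by at most radius_deg.

    Each catalog is an iterable of (id, ra_deg, dec_deg) tuples
    (a hypothetical layout, assumed for this sketch).
    """
    # Bucket catalog B by zone so that only nearby stripes are scanned.
    zones = defaultdict(list)
    for obj in cat_b:
        zones[zone_of(obj[2], zone_height_deg)].append(obj)

    # Number of neighboring zones on each side that a match can fall into.
    reach = int(math.ceil(radius_deg / zone_height_deg))

    matches = []
    for id_a, ra_a, dec_a in cat_a:
        z = zone_of(dec_a, zone_height_deg)
        for dz in range(-reach, reach + 1):
            for id_b, ra_b, dec_b in zones.get(z + dz, ()):
                if angular_sep_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                    matches.append((id_a, id_b))
    return matches


if __name__ == "__main__":
    cat_a = [("a1", 10.0, 20.0), ("a2", 50.0, -30.0)]
    cat_b = [("b1", 10.0001, 20.0001), ("b2", 120.0, 5.0), ("b3", 50.0, -30.01)]
    # Match within one arcsecond.
    print(crossmatch(cat_a, cat_b, radius_deg=1 / 3600))
```

Restricting the comparison to a few stripes is what keeps the work far below the full pairwise product; in a distributed setting, the same zone index is a natural key for partitioning the data.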
