Abstract: As a new generation of large-scale telescopes is expected to produce single data products ranging from hundreds of GB to multiple TB, different approaches to I/O-efficient data interaction and extraction need to be investigated and made available to researchers. This will become increasingly important as downloading and distributing TB-scale data products becomes unsustainable and researchers have to take their processing and analysis to the data. We present a methodology to extract three-dimensional spatial-spectral data from dimensionally modelled tables in Parquet format on a Hadoop system. The data is loaded into the Parquet tables from FITS cube files using a dedicated process. We compare the performance of extracting data using the Apache Spark parallel compute framework on top of the Parquet-Hadoop ecosystem with data extraction from the original source files on a shared file system. We find that the Spark-Parquet-Hadoop solution provides significant performance benefits, particularly in a multi-user environment. We present a detailed analysis of the single-user and multi-user experiments conducted and discuss the benefits and limitations of the platform used for this study.
Credit: Duniam, Geoff; Kitaeff, Vyacheslav V.; Wicenec, Andreas
Site: https://github.com/GeoffDuniam/FITS-Cod ... master/ETL