Notes on Parquet as a released data format

(Cf. RFC-662)

From the main agenda page:

We should be clear on our overall strategy for Parquet data products, including:
- Are we committed to supporting Parquet (or, more generally, a columnar data format) as a user-facing format for LSST catalog data products?
- If so, how do we slice/tile the data within the files?
- How do we make these available? Bulk download? By sky region?
- What is the strategy on using catalog data in Parquet files for backup or disaster recovery?
- Who controls the schema for Parquet data products?
- Who validates the generated data against the schema?
We should also decide which documents need to be updated, and how, to reflect the decisions taken above.

gpdf's understanding is that the answer to the first question, "Are we committed to supporting Parquet (or, more generally, a columnar data format) as a user-facing format for LSST catalog data products?", is:

Yes, but so far we have only seriously committed to it as a Rubin/LSST-hosted collection of data over which we provide the next-to-data analysis system.
- We have not thought through how we will manage the resulting files as an exportable data product.

It would be very useful to decide, promptly, whether there is a reason to organize the files somehow other than spatially.

A spatial organization would greatly facilitate providing an IVOA-friendly query-and-download interface for external access to the actual files.

A spatial organization is easy to describe with ObsCore/CAOM2 metadata, and therefore to make accessible via ObsTAP (with dataproduct_type="measurements").
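
As a rough illustration of what such access could look like, the sketch below builds an ADQL query that asks an ObsTAP service for the catalog files whose footprint contains a given sky position. It assumes each Parquet file is published as one row in the standard ivoa.ObsCore table with s_region and access_url filled in; the function name tile_query is hypothetical, and nothing here describes an actual Rubin service.

```python
def tile_query(ra_deg: float, dec_deg: float, table: str = "ivoa.ObsCore") -> str:
    """Build an ADQL query for catalog files whose footprint contains a point.

    Assumption: each released Parquet file appears as one ObsCore row with
    dataproduct_type = 'measurements' and a populated s_region.
    """
    if not (0.0 <= ra_deg < 360.0 and -90.0 <= dec_deg <= 90.0):
        raise ValueError("ra_deg must be in [0, 360) and dec_deg in [-90, 90]")
    return (
        "SELECT obs_id, access_url, access_format, s_region "
        f"FROM {table} "
        "WHERE dataproduct_type = 'measurements' "
        f"AND CONTAINS(POINT('ICRS', {ra_deg:.6f}, {dec_deg:.6f}), s_region) = 1"
    )


# Example: files covering a position at RA = 150.1 deg, Dec = 2.2 deg.
print(tile_query(150.1, 2.2))
```

The resulting string could be submitted to any TAP client; a region search would swap the POINT for a CIRCLE or POLYGON and use INTERSECTS instead of CONTAINS.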
The files could be organized spatially around one of:
- the coadd tiling (tracts and patches),
- the Qserv sharding scheme, or
- HEALPix tiles.
Considerations:
- An organization around coadd tiles would be somewhat simpler for us to generate from the underlying pipeline outputs.
- A Qserv organization facilitates Qserv loading, replication, and disaster recovery.
- A HEALPix organization would allow the very straightforward generation of HiPS-catalog-formatted data from the Object catalog.

Added after the meeting (06 Mar 2020):

Recent HSC SDM standardization is producing Parquet files of ~1170 bytes/row, which translates into tens of TB for tens of billions of Objects (about 12 TB per 10 billion rows).

Ignoring (rashly!) the non-uniform density of the sky, this means that if we used HiPS order 5 tiles to define the "released" Parquet tiling (12,288 tiles over the full sky, each roughly 1.8 deg on a side) we would end up with files of order a few GB, depending on the final size of the Object table. (The number of tiles changes by a factor of four with each HiPS order, of course.) This is comparable to the per-tract size of current HSC output Parquet files. It may be a bit small for contemporary data-center-to-data-center wide-area file transfers.

(NB, this size estimate does not include any allocation of space for a border of duplicates around the fiducial region of the tiles, an idea that was mooted but not adopted at the vF2F.)
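
The estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes uniform sky density, takes the ~1170 bytes/row figure from the HSC files, and uses an illustrative Object count of 20 billion; that count, the sky_fraction parameter, and the function names are assumptions made for the example, not project numbers.

```python
import math

# Area of the full sky in square degrees (about 41,253).
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2


def healpix_tile_count(order: int) -> int:
    """Number of HEALPix tiles covering the full sky at the given order."""
    return 12 * 4 ** order


def tile_side_deg(order: int) -> float:
    """Side of a square with the same area as one tile, in degrees."""
    return math.sqrt(FULL_SKY_SQ_DEG / healpix_tile_count(order))


def mean_file_size_gb(order: int, n_objects: float,
                      bytes_per_row: float = 1170.0,
                      sky_fraction: float = 1.0) -> float:
    """Mean Parquet file size per occupied tile, in GB (10^9 bytes).

    sky_fraction is the assumed fraction of tiles that contain data;
    1.0 spreads the rows over the whole sky, as in the text.
    """
    occupied_tiles = healpix_tile_count(order) * sky_fraction
    return n_objects * bytes_per_row / occupied_tiles / 1e9


# Illustrative Object count only: 20 billion rows.
for order in (4, 5, 6):
    print(order, healpix_tile_count(order),
          round(tile_side_deg(order), 2),
          round(mean_file_size_gb(order, 2e10), 2))
```

Setting sky_fraction below 1.0 shows how much the per-file size grows if the same number of rows falls in only part of the sky.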

Including O(10,000) records - fewer if we decide to go for larger individual files - in our ObsTAP service for these tiles seems perfectly reasonable.
