Metric Storage: "Tidy Data"

The ability to drill down requires that metric values be *stored as scalars* at the maximum granularity desired.  In the extreme case, metrics are stored per source; these are simply measurement columns in the relevant source catalogs.

The principles of "tidy data" (which are a restatement of database normal forms in the language of data analysis) provide a guide for how to structure this storage.  In brief:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit [e.g., (ccd, visit, dataset) or (tract, patch, sky)] forms a table.

With data stored in this form, it becomes straightforward to aggregate at any level desired and to join against other relevant metadata (night, airmass, focal plane position, moon phase, ...).  It is also a workflow that is particularly suited to the pandas package, as encouraged by RFC-465.
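As a minimal sketch of this workflow in pandas: the table layout, the metric name `psf_fwhm`, the metadata columns, and all values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical tidy table: one row per (visit, ccd) observation,
# one column per variable. Metric name and values are made up.
metrics = pd.DataFrame(
    {
        "visit": [100, 100, 100, 101, 101, 101],
        "ccd": [0, 1, 2, 0, 1, 2],
        "psf_fwhm": [0.70, 0.72, 0.74, 0.90, 0.95, 0.91],
    }
)

# Hypothetical per-visit metadata. It lives in its own table because it
# describes a different observational unit (the visit, not the visit/ccd pair).
visits = pd.DataFrame(
    {
        "visit": [100, 101],
        "night": ["2022-01-01", "2022-01-02"],
        "airmass": [1.05, 1.40],
    }
)

# Aggregate the per-CCD scalars up to the visit level.
per_visit = (
    metrics.groupby("visit")["psf_fwhm"]
    .agg(["mean", "std", "count"])
    .reset_index()
)

# Join the visit-level aggregates against the visit metadata.
per_visit = per_visit.merge(visits, on="visit", how="left", validate="one_to_one")

# Drill down: recover the per-CCD values behind one visit-level number.
visit_101 = metrics[metrics["visit"] == 101]

print(per_visit)
print(visit_101)
```

Because the per-CCD scalars are kept, the same table supports aggregation by any other key (for example by `ccd` across visits) without reprocessing.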


For scaling, we consider one year of LSST operations as the largest dataset relevant for this working group.  At 275k visits/year and 189 CCDs, that implies ~50M rows in a database if we store each metric measurement at the (visit, CCD) level.  Allowing for tens of reprocessings, this still fits comfortably in standard SQL databases.  We could use a centralized QA database or persist QA results in sqlite or parquet along with the dataset.
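The row-count estimate can be reproduced with a few lines of arithmetic. The visit and CCD counts come from the text; the number of reprocessings (20) is an assumption standing in for "tens".

```python
# Figures quoted in the text.
VISITS_PER_YEAR = 275_000
CCDS = 189

# Assumption for illustration only: "tens of reprocessings" taken as 20.
N_REPROCESSINGS = 20

# One row per (visit, CCD) pair for a single processing of one year of data.
rows_per_processing = VISITS_PER_YEAR * CCDS

# Total rows if every reprocessing is kept at the same granularity.
total_rows = rows_per_processing * N_REPROCESSINGS

print(f"rows per processing: {rows_per_processing:,}")
print(f"rows across {N_REPROCESSINGS} reprocessings: {total_rows:,}")
```

Under these assumptions the total is on the order of a billion rows, so the choice between a centralized database and per-dataset files depends mainly on how many reprocessings are retained.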

We suggest the following: the QA datastore records all values of a metric at the lowest reasonable granularity level (e.g., CCD, patch), keyed by dataID.  These values can then be joined to the EFD or other metadata using the dataID.  Aggregations (mean, standard deviation, etc.) are then performed in user code or in the database, depending on the application.  Web displays like SQuASH, which use the highest level of aggregation, could run afterburner code that computes standard dataset-level aggregates (storing them as needed for performance reasons).  Drill-down then simply requires retrieving from the QA store the metric values that were aggregated.
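The suggestion above can be sketched with the standard-library `sqlite3` module. This is an illustration under stated assumptions, not a proposed schema: the table name `ccd_metrics`, the dataID columns, the metric column `psf_fwhm`, the dataset label `run_a`, and all values are hypothetical.

```python
import sqlite3

# In-memory database for the example; a file path would persist the
# QA results alongside the dataset instead.
conn = sqlite3.connect(":memory:")

# Hypothetical QA table: one row per (dataset, visit, ccd) dataID,
# one column per metric.
conn.execute(
    """
    CREATE TABLE ccd_metrics (
        dataset  TEXT    NOT NULL,
        visit    INTEGER NOT NULL,
        ccd      INTEGER NOT NULL,
        psf_fwhm REAL    NOT NULL,
        PRIMARY KEY (dataset, visit, ccd)
    )
    """
)

# Made-up scalar measurements at the lowest granularity.
rows = [
    ("run_a", 100, 0, 0.70),
    ("run_a", 100, 1, 0.72),
    ("run_a", 101, 0, 0.90),
    ("run_a", 101, 1, 0.96),
]
conn.executemany("INSERT INTO ccd_metrics VALUES (?, ?, ?, ?)", rows)

# "Afterburner" step: a dataset-level aggregate computed in the database.
dataset_mean, n_rows = conn.execute(
    "SELECT AVG(psf_fwhm), COUNT(*) FROM ccd_metrics WHERE dataset = ?",
    ("run_a",),
).fetchone()

# First level of drill-down: the same metric aggregated per visit.
per_visit = conn.execute(
    "SELECT visit, AVG(psf_fwhm) FROM ccd_metrics "
    "WHERE dataset = ? GROUP BY visit ORDER BY visit",
    ("run_a",),
).fetchall()

# Second level: the stored per-CCD scalars for the visit with the largest mean.
worst_visit = max(per_visit, key=lambda row: row[1])[0]
per_ccd = conn.execute(
    "SELECT ccd, psf_fwhm FROM ccd_metrics "
    "WHERE dataset = ? AND visit = ? ORDER BY ccd",
    ("run_a", worst_visit),
).fetchall()

print(dataset_mean, n_rows)
print(per_visit)
print(worst_visit, per_ccd)
```

Each level of the display is a query over the same stored scalars, so no aggregate needs to be persisted unless performance requires it.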
