Introduction

[NB: I say “will” in various places, but this should not be construed as reflecting a plan already agreed to by T/CAMs or properly resourced. Those decisions and associated compromises now need to be made. It was simply easier to avoid phrasing the entire document in the conditional.]

The “QA Prototype” refers to the development and demonstration of a coordinated set of capabilities and tools by LSST DM during 2017. Coordinated work will begin in the S17 cycle, building on a variety of efforts already in progress, and should produce a minimally useful environment by the end of the cycle, enough for the DM leadership to be able to assess whether the work is progressing reasonably and to support Science Pipelines code development and assessment early in F17. The scope of the prototype is limited to the “stellar locus” photometric QA method outlined by Robert Lupton, applied to the 2017 public HSC data release, but throughout the work its extensibility to a wider range of data and analyses should be kept in mind.

(The term “QA” is used here, without judgement, in the sense that it has recently been used in DM, effectively standing for “quality analysis”, and not in the other common sense of “quality assurance”.)

QA Prototype development will continue during F17 in partnership with Science Pipelines to provide additional capabilities that support their efforts, and to permit a full assessment of the QA architecture and tools in the Fall of 2017. This assessment should be aimed at understanding whether the QA Prototype has been a success, within the resource constraints of DM in 2017, and whether with continued work it appears likely to provide a satisfactory platform for the verification of DM scientific performance requirements and for the commissioning of LSST in the ComCam era and beyond.

The QA Prototype is intended to provide an environment for the analysis and exploration of data emerging from runs of pipeline code, allowing both for the execution of pre-defined post-processing of data and for on-demand execution of user-determined analyses. It depends on a combination of capabilities from the infrastructure, middleware, database and data access, SQuaRE, and science user interface and tools groups, and it will require advances in all of these areas.

Operational concept

The principal initial use case for the QA Prototype (QAP) arises from the application of Science Pipelines code to a substantial dataset, using the DM task and workflow framework, followed by the analysis of the resulting data products.

Initial pipeline processing

Initially, because of the size of the dataset and the amount of computing required, such a pipeline processing “campaign” (a DM workflow group term) will be launched from time to time at human discretion. It is to be expected that several days of process execution will be required to carry out the campaigns desired in mid 2017 on the resources expected to be available then. See DM-8143 for some size and processing estimates.

For the QAP reference use case, the stellar locus analysis, the campaign’s processing will be a subset of DRP, and its image and catalog outputs will largely be a subset of the Level 2 Data Products produced by DRP. However, it may include additional intermediate output data products that would not be persisted in the final production DRP system, to facilitate both the QA analysis itself and further debugging.

All of the data products of the quasi-DRP campaign should be stored in a manner accessible to all DM users. Catalog data products should be ingested into databases and accessible through the DAX interfaces. Image data products should be accessible through DAX-imgserv, and image metadata should be queryable through DAX-dbserv and/or DAX-metaserv (there are still some architectural discussions under way on that point).

For reasons that will become apparent below, downstream analysis will be facilitated if a HEALPix ID is computed and persisted for each source or object and used as an index in the resulting catalog database.
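As a concrete illustration (not a design decision), the per-source HEALPix ID could be derived from the source coordinates with healpy; the NSIDE value, the NESTED ordering, and the function name below are placeholders:

    # Minimal sketch: computing a HEALPix ID per source/object with healpy.
    # NSIDE, the NESTED ordering, and the function name are illustrative
    # assumptions, not agreed design choices.
    import healpy as hp
    import numpy as np

    NSIDE = 256  # placeholder resolution

    def healpix_id(ra_deg, dec_deg, nside=NSIDE):
        """Return NESTED HEALPix pixel indices for arrays of RA/Dec in degrees."""
        theta = np.radians(90.0 - np.asarray(dec_deg))  # colatitude
        phi = np.radians(np.asarray(ra_deg))
        return hp.ang2pix(nside, theta, phi, nest=True)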

Access to pipeline processing inputs and outputs

All the input and output data products should be available through a Butler interface to both interactive and batch Python processes run by users. It is anticipated that much of the interactive work will take place in a Jupyter/IPython notebook environment, but “plain” Python processes must also be supported.
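A minimal sketch of the kind of Butler access envisioned, using the existing (Gen2) Butler API; the repository path, dataset types, and dataId keys shown are illustrative, not prescriptive:

    # Sketch of Butler-based access to pipeline inputs and outputs.
    # The repository path, dataset types, and dataId values are illustrative.
    from lsst.daf.persistence import Butler

    butler = Butler("/datasets/hsc/repo/rerun/qap-2017")  # hypothetical rerun path
    calexp = butler.get("calexp", visit=1228, ccd=49)     # a processed single-CCD image
    srcCat = butler.get("src", visit=1228, ccd=49)        # the corresponding source catalog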

A specific requirement of the QAP during the implementation period is the ability to retrieve catalog data from the databases via a Python API. The means that will be offered for this are still to be designed in detail. It is highly desirable that a Butler API for this be delivered on the QAP time scale, with the ability to return the results of relational queries. It is unlikely that this could provide “ORM-ish” recreation of the original Python-domain object model of the catalog data products on this time scale, though that would be a valuable longer-term goal. However, for the purposes of the QAP, it should be sufficient to allow the retrieval of catalog data in a simple tabular data model that reflects the database query performed.
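While that API is being designed, the following sketch shows one way to land the results of a relational query in a simple tabular data model (here a pandas DataFrame); the connection URL, table, and column names are assumptions, and the eventual Butler interface would replace this:

    # Illustrative stop-gap only: retrieving catalog rows for one HEALPixel
    # into a simple tabular structure. The connection URL, table name, and
    # column names are assumptions, not the designed interface.
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://user:pass@qadb.example/qa_run1")
    query = text(
        "SELECT objectId, ra, decl, healpix_id, g_mag, r_mag, i_mag "
        "FROM Object WHERE healpix_id = :hpx"
    )
    objects = pd.read_sql(query, engine, params={"hpx": 123456})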

All of the data products of the quasi-DRP campaign should also be accessible to an instance of the SUIT, derived from the one being delivered for PDAC. The SUIT currently provides the ability to visualize image data and to query and visualize tabular data with common column data types. (Some aspects of the latter are still awaiting full availability of the DAX-metaserv API and its integration into the SUIT.) The ability to plot catalog data over image data is native to the SUIT. The SUIT currently does not provide specific visualizations of a range of LSST-specific data objects of obvious interest from a Science Pipelines perspective (e.g., PSF models, Footprints, HeavyFootprints, galaxy models). It is intended that these will be provided in the course of DM construction; however, we need to consult with Science Pipelines to understand which of these would be most useful or even required for the QAP. (Note that one LSST-specific data object which SUIT does already support, based on work in 2016, is the image mask model. This visualization will be available in the QAP both via Python API - afw.display - and as a “built-in” capability when viewing LSST-stack-persisted images obtained via the SUIT UI.)

Bulk quality analysis processing

The QAP’s reference use case then proceeds to a bulk analysis of the catalog data produced by the campaign. This analysis is capable of producing both summary results assessing the data quality obtained from the campaign as a whole and detailed results allowing QA assessment across the sky, across the focal plane, or across other to-be-determined axes. The request from the Science Pipelines group is that the summary results (e.g., a measurement of a KPM for the whole campaign) in general be derived hierarchically from the detailed results (e.g., by averaging or the application of other aggregation functions).

Initially it is likely that the original quasi-DRP campaign will be run by itself, with the QA processing run separately once the campaign is complete, at the discretion of a responsible team member. Later on, when a well-established "canned" QA processing is available, it will likely be desirable to run it automatically upon completion of a quasi-DRP campaign. This will become a point of contact between the QA and QC systems.

In this spirit, it is anticipated that, with an ongoing increase in computing and storage resources available, it will become possible to perform such campaigns more often and to place them into an “automated QC” framework of, say, weekly runs. The implementation should be amenable to this, but it is not a required functionality during the S17-F17 period.

The bulk analysis will produce data products of its own representing the analysis results mentioned above. These are primarily expected to be catalog data products, where rows in the catalog represent the “detailed results” mentioned above. Such data products must be ingested to a database, ideally together with the “parent” pipeline processing campaign’s outputs. The Princeton discussion took this to be a relational database with the DAX services and associated Butler interfaces in front of it.  It is not clear that a complex relational data model is needed for the QAP, however, so during detailed design we can consider alternatives.  (In particular it is not part of the baseline design for the system to retain relational connections between the results table and the individual sources/objects used to create them; the plan is to be able to recreate this connection on the fly by repeating the original selection.)  Here as elsewhere, any design changes would have to be thought through with an eye on later use cases, not just the QAP.  

It is to be expected that the bulk analysis will be run more than once on the same set of pipeline outputs, so a strategy is required to handle the life-cycle of the resulting two levels of databases (pipeline outputs and QA outputs).  The summary results mentioned above may be persisted via the metrics framework developed for the QC system.

It is possible that, for the purposes of the QAP, the operational aspects of the design can be simplified by permitting the pipeline outputs and QA outputs to reside on separate database servers (thereby giving up the opportunity for JOIN operations between the two levels). The basic functions of the QAP and its reference analysis do appear to permit this, but a closer analysis is required.

Bulk analysis workflow

In the case of the stellar locus analysis, the principal subdivision of the data will be a division of the sky into spatial bins (notionally defined in the HEALPix scheme) of the minimum size permitting a statistically meaningful stellar locus analysis. Results of analyses in multiple color-color planes for each bin are expected to be computed. The results for a single such plane (e.g., g-r-i) can be thought of as an image in HEALPix space. The full results of the bulk analysis will contain multiple scalars for every HEALPixel (more than one parameter from the stellar locus analysis on each of several color-color planes), implying the ability to create multiple images of this kind from the data. The reference design envisions generating these images as visualizations, on the fly (though potentially with caching), from the underlying catalog of results by HEALPixel.
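A sketch of how one such per-HEALPixel scalar could be turned into a HEALPix-space image; the column names and NSIDE are assumptions:

    # Sketch: rendering one per-HEALPixel scalar (e.g., a stellar locus width)
    # as a HEALPix-space image. Column names and NSIDE are assumptions.
    import healpy as hp
    import numpy as np

    def healpix_image(pixel_ids, values, nside):
        """Scatter per-pixel scalars into a full HEALPix map (NESTED ordering)."""
        npix = hp.nside2npix(nside)
        image = np.full(npix, hp.UNSEEN)  # UNSEEN marks pixels with no data
        image[np.asarray(pixel_ids)] = np.asarray(values)
        return image

    # e.g., with `results` the table of bulk-analysis outputs:
    # img = healpix_image(results["healpix_id"], results["gri_locus_width"], nside=256)
    # hp.mollview(img, nest=True, title="g-r-i locus width")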

The analysis for each HEALPixel will begin from the object catalog produced by the pipeline processing, selecting the objects (or sources) in each pixel and then applying additional cuts to define a high-purity stellar dataset suitable for the analysis. Some of this selection will be done at the database level (with the expectation that this will be expressed as a Butler call), with further refinement in Python at the task level.
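To make the two-stage selection concrete, the sketch below assumes the per-pixel objects have already been retrieved (e.g., as above) and applies the Python-level purity cuts; the column names and thresholds are placeholders, not adopted values:

    # Illustration of the Python-level refinement step applied after the
    # database-level HEALPix selection. Column names and thresholds are
    # placeholders, not adopted values.
    def select_stars(objects):
        """Apply illustrative cuts to isolate a high-purity stellar sample."""
        sel = (
            (objects["extendedness"] < 0.5)    # point-source classification
            & (objects["r_mag_err"] < 0.05)    # well-measured photometry
            & (~objects["flags_bad"])          # no bad-measurement flags set
        )
        return objects[sel]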

The design chosen for the QAP will need to be extensible to other axes of subdividing the parameter space for the stellar locus analysis (or future QA analyses), e.g., to be capable of carrying it out in focal plane space, or as a function of other state variables of the observatory.

The stellar locus analysis requires the ability to define photometric colors for detected objects across several bands. It therefore requires an association of objects across the photometric bands of the input dataset. This can be accomplished in more than one way, with various associated tradeoffs, but in the long run the availability of an N-way matching tool will be an essential element of LSST's QA capabilities. Both symmetric matching and asymmetric matching (e.g., to an external reference catalog) will be needed. For the purposes of the QAP, however, the principal functions of the QAP can be decoupled from the development of an N-way matcher, as long as some other means of defining cross-band objects is part of the QAP processing (e.g., the use of forced photometry may be an option). This is a tradeoff that can be made to prioritize the development and demonstration of the infrastructure/middleware/DB/DAX/SUIT/SQuaRE aspects of the QAP.
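As one interim stand-in for a full N-way matcher, a simple asymmetric match to an external reference catalog could be performed with astropy; the column names and match radius below are assumptions:

    # Hedged sketch of an asymmetric match to an external reference catalog,
    # standing in for the eventual N-way matcher. Column names and the match
    # radius are assumptions.
    import astropy.units as u
    from astropy.coordinates import SkyCoord

    def match_to_reference(objects, reference, radius=0.5 * u.arcsec):
        obj = SkyCoord(ra=objects["ra"], dec=objects["decl"], unit="deg")
        ref = SkyCoord(ra=reference["ra"], dec=reference["decl"], unit="deg")
        idx, sep, _ = obj.match_to_catalog_sky(ref)
        return idx, sep < radius  # reference indices and a "good match" mask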

The bulk analysis will be organized as much as possible using the same task, configuration, and workflow structures used for the principal science pipelines. This means, in particular, that it should be coded as Tasks and SuperTasks. The Task level will perform the actual stellar locus statistical analysis on a bin (HEALPix pixel, in the reference case), with its inputs and outputs as in-memory LSST-stack Python objects. The SuperTask level will be concerned with data retrieval and persistence, ideally using the Butler interface to query results described above. (If that interface is not available at first, work-arounds at a lower level can be used but will be fragile to future migrations.)
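A minimal sketch (not the agreed design) of how the per-pixel analysis might be packaged as a Task; the class, config field, and return-value names are placeholders, and the "fit" here is just a simple principal-axis estimate in a single color-color plane:

    # Sketch of the Task-level stellar locus analysis for one HEALPixel.
    # Class, config field, and return-value names are placeholders; the fit
    # shown is a simple principal-axis estimate, not the real algorithm.
    import numpy as np
    import lsst.pex.config as pexConfig
    import lsst.pipe.base as pipeBase

    class StellarLocusConfig(pexConfig.Config):
        colorPlane = pexConfig.Field(dtype=str, default="gri",
                                     doc="Color-color plane to analyze")

    class StellarLocusTask(pipeBase.Task):
        ConfigClass = StellarLocusConfig
        _DefaultName = "stellarLocus"

        def run(self, color1, color2):
            """Fit the stellar locus in one color-color plane for one HEALPixel."""
            xy = np.vstack([color1 - np.mean(color1), color2 - np.mean(color2)])
            eigvals, eigvecs = np.linalg.eigh(np.cov(xy))
            return pipeBase.Struct(
                width=np.sqrt(eigvals[0]),   # scatter perpendicular to the locus
                direction=eigvecs[:, 1],     # along-locus (principal) axis
            )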

The workflow for the QA analysis will have to be capable of taking the sky area covered by the input dataset (i.e., for the QAP, the early-2017 HSC public data release), computing its coverage at a specified level of the HEALPix hierarchy, assigning the fitting of individual pixels as units of work, executing the work, and monitoring completeness. An estimate for the total processing required is not yet available, but it is anticipated that obtaining results in a timely manner will require parallelization of the processing across both cores and hosts.
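A sketch of how the covered pixels (i.e., the units of work) might be enumerated from object positions in the input catalog; NSIDE and the use of object positions (rather than image footprints) are illustrative assumptions:

    # Sketch: enumerating the HEALPix pixels covered by the input dataset,
    # which become the units of work. NSIDE and the use of object positions
    # rather than image footprints are illustrative assumptions.
    import healpy as hp
    import numpy as np

    def covered_pixels(ra_deg, dec_deg, nside=256):
        theta = np.radians(90.0 - np.asarray(dec_deg))
        phi = np.radians(np.asarray(ra_deg))
        return np.unique(hp.ang2pix(nside, theta, phi, nest=True))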

The design will require a means for aggregating the per-HEALPixel results obtained from the analysis into tables. At the per-job level this can clearly be done using the same ingest of afw.table-format FITS files that is presently used for catalog data products, but some thought needs to go into how the single-row-level results that the analysis will produce should be handled in the Task-SuperTask-Butler structure. (This is not expected to be difficult - it just needs to be decided clearly.)
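For illustration, per-pixel results could be collected into a single table per job along these lines (astropy is used here for brevity; the baseline would use afw.table-format FITS files as for other catalog data products, and the column names are assumptions):

    # Sketch: collecting per-pixel fit results into one table for ingest.
    # astropy is used for brevity; column names are assumptions.
    from astropy.table import Table

    def results_table(per_pixel_results):
        """per_pixel_results: iterable of (healpix_id, fit-result struct) pairs."""
        rows = [(pid, res.width, res.direction[0], res.direction[1])
                for pid, res in per_pixel_results]
        return Table(rows=rows,
                     names=("healpix_id", "gri_locus_width",
                            "gri_locus_dir_x", "gri_locus_dir_y"))

    # results_table(results).write("stellar_locus_gri.fits", overwrite=True)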

Once the bulk analysis is complete, an initial set of aggregations of the results (e.g., to measure KPMs) may be performed and persisted.

Interactive analysis and visualization

Following this, interactive analysis will be carried out. Interactive analysis can begin either from a Python prompt, using Butler and DAX APIs to retrieve data from the bulk analysis (and, typically for "drill-down" purposes, the pipeline processing outputs), or from a QA-specific SUIT-based portal providing access to the data. For example, we envision providing portal-level support for requesting the display of specified quantities from the results by HEALPixel as images in sky coordinates. It should be possible to overplot these images on single-epoch or coadded image data products from the underlying pipeline run (adjustable color and transparency will be very useful here), as well as to overplot HEALPixel images with catalog data.

The HEALPixel image display capability must be available both in the portal environment and in the Jupyter/IPython environment. (E.g., a Firefly component that can display these images should be usable both in a Firefly web application and as a Jupyter widget.) We expect to provide an afw.display-like way to drive an external display of a HEALPix image from a (potentially non-Jupyter) Python environment as well.

The analysis environment should support the interactive computation of aggregate results at user-selected higher levels of the HEALPix hierarchy, based on user-provided (Python) aggregation kernels.
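Because a NESTED HEALPix pixel's parent at a coarser level is obtained by an integer shift, such aggregation reduces to grouping by parent pixel and applying the user's kernel; a minimal sketch (np.nanmedian is only an example kernel):

    # Sketch: aggregating per-pixel results to a coarser HEALPix level.
    # In the NESTED scheme the parent pixel is obtained by dropping two bits
    # per level. The aggregation kernel is user-supplied; nanmedian is only
    # an example.
    import numpy as np

    def aggregate(pixel_ids, values, fine_nside, coarse_nside, kernel=np.nanmedian):
        levels = int(np.log2(fine_nside // coarse_nside))
        parents = np.asarray(pixel_ids) >> (2 * levels)
        values = np.asarray(values)
        return {p: kernel(values[parents == p]) for p in np.unique(parents)}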

The analysis environment must then support the selection of individual HEALPixels and the display of the underlying data used to perform the stellar locus analysis for the selected pixel(s). The data should be displayable in a variety of coordinate systems, but most particularly in color-color space (displaying the stellar locus itself) and in sky coordinates, overlaid on images. It should be possible to recover the specific object/source selection used for the persisted stellar locus results, not just the full set of objects/sources within the selected HEALPixel. It should be possible to repeat the locus fit itself and to display it as, e.g., a histogram projected across the line of the fitted locus.
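A hedged sketch of this drill-down step, reusing the illustrative StellarLocusTask and select_stars sketches above (column names remain assumptions): refit the locus for the stars in one selected pixel and histogram the offsets perpendicular to the fitted locus.

    # Drill-down sketch: repeat the locus fit for one selected pixel and show
    # the distribution of perpendicular offsets from the fitted locus.
    # Reuses the illustrative StellarLocusTask; column names are assumptions.
    import numpy as np
    import matplotlib.pyplot as plt

    def drill_down(stars):
        gr = stars["g_mag"] - stars["r_mag"]
        ri = stars["r_mag"] - stars["i_mag"]
        result = StellarLocusTask().run(gr, ri)
        xy = np.vstack([gr - np.mean(gr), ri - np.mean(ri)])
        perp = np.array([-result.direction[1], result.direction[0]])
        offsets = perp @ xy                      # distance across the locus
        plt.hist(offsets, bins=50)
        plt.xlabel("offset perpendicular to g-r-i locus (mag)")
        plt.ylabel("number of stars")
        plt.show()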

It should be possible to configure a user’s interactive or batch analysis environment to use the same versions of code used in the pipeline processing and/or the bulk QA analysis, or to repeat elements of this processing with new/modified versions of code.

Thus, for a selected subset of the data used for the QAP, it should be possible to interactively repeat the stellar locus analysis, or even elements of the pipeline processing, with changed data selection parameters, other configuration parameters, or changed code, or for a color-color plane not included in the original bulk analysis. Having done so interactively, it should be straightforward to do the same (i.e., with the same Task/SuperTask code and configurations) at a larger scale using batch resources. The reanalyzed QA result data produced in this way should be as easy to access and visualize as the results of the bulk analysis. This requires the ability for users to create new database tables and access them through the same Butler and other APIs used for the original processing and bulk QA analysis results.

It should be possible to perform a selection of a particular HEALPixel graphically in the HEALPix-display component, with the resulting pixel ID available to the user’s Python code. This should be possible either in the Firefly-web-application case (via an afw.display-like interface), or in the Jupyter-widget case (via the widget’s API or an afw.display-like interface). Once retrieved, it should be easy to use the pixel ID to perform a query on the object/source catalog data.

Because of the limited time available to construct the QAP, the principal new visualization developed by the SUIT group for the QAP will be limited to the HEALPix image viewer. All the other basic plotting needs should be supportable with existing Firefly capabilities, with advances to the widget and afw.display APIs during this period to support the QAP. On this time scale, if other more complex visualizations are needed (e.g., vector field plots) they can always be performed in the notebook or other Python environments using common tools like matplotlib or Bokeh.

All of these capabilities should be available (up to natural constraints due to CPU, memory, and network resources) from remote laptops (or other computers) as well as on centrally provided servers at NCSA. Remote access may require the use of a VPN or other tunneling capability.

Centrally provided services

The centrally provided system should include:

  • hosting and service of the input dataset (images and image metadata tables)
  • hosting and service of one or more reference catalogs (including the relevant data from Gaia)
  • batch execution of the pipeline analysis
  • hosting and service of the resulting catalog and image data products
  • batch execution of the bulk QA analysis
  • hosting and service of the resulting QA catalog data
  • a Firefly server providing both a web portal view of the data and support for visualization widgets
    • a Python back end service for extensions to Firefly
  • a JupyterHub or equivalent environment for supporting interactive logins, with access to all the above data and services
  • a batch system for supporting larger-scale user analyses
  • access to the code versions and configurations used for the pipeline processing and QA bulk processing (as images/containers available to be activated by users, and/or in a shared-stack environment), with the ability for users to further customize the code and configurations for interactive analysis
  • the capability for users to persist tabular data resulting from their own analyses, and have that data accessible for visualization and further analysis

Additional information

(still to be posted)

  • Summary of major deliverables from each DM subsystem (all of these are already mentioned in the above narrative)
  • Design sketches