The use cases on this page extend beyond butler and data access to the kinds of processing and analysis work we will want to do to develop, integrate, verify, and validate DRP and AP code.  The intent is to inform the high-level architecture of the Data Facility, so that its design can set the context for the lower-level data access and butler use cases.

Developer-Initiated Processing

A DM developer wants to test an alternate configuration for an algorithm that is part of DRP.  The first step involves submitting a workflow defined by a single Pipeline consisting of multiple SuperTasks on a small patch of sky, with the work expected to consume up to a few hundred core-hours of compute effort but only a couple of hours of wall-clock time, producing a few tens of terabytes of output data products.  Ideally the batch submission could be performed from the notebook environment described in Direct Ad-Hoc Analysis, but this is not critical.

The inputs to the processing could be:

  • raw data
  • an official data release
  • a large-scale processing run initiated and managed by an operator (as per Operator-Initiated Processing)
  • another small-scale processing run initiated by the same developer (as per Developer-Initiated Processing)
  • another small-scale processing run initiated by a different developer (as per Developer-Initiated Processing)
  • a processing run executed automatically on a predefined cadence by the continuous integration system.

The output data products are typically needed for only a few weeks, but should only be deleted by the developer or with the developer's permission.  Some mechanism (possibly policy-only) should ensure that data products are not deleted if they are still in use by another developer.
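The deletion rule above can be sketched as a small policy check. This is only an illustration: the RunOutputs record, its field names, and the may_delete function are hypothetical and do not correspond to any existing Data Facility interface.

```python
from dataclasses import dataclass, field


@dataclass
class RunOutputs:
    """Hypothetical record describing the outputs of one processing run."""
    owner: str
    # Developers currently using these outputs (assumed to be tracked somehow,
    # possibly only by policy rather than by software).
    users: set = field(default_factory=set)
    # Set when the owning developer has agreed to let others delete the outputs.
    owner_approved_deletion: bool = False


def may_delete(run: RunOutputs, requester: str) -> bool:
    """Return True if `requester` is allowed to delete the outputs of `run`."""
    # Outputs still in use by another developer are never deleted.
    if run.users - {run.owner}:
        return False
    # The owning developer may delete their own outputs.
    if requester == run.owner:
        return True
    # Anyone else needs the owning developer's permission.
    return run.owner_approved_deletion
```

For example, may_delete(RunOutputs(owner="alice"), "bob") is False until alice sets owner_approved_deletion, and any deletion is refused while another developer is listed in users.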

Variant 1

Instead of testing a configuration change, the developer wishes to test a code change.

1a) The new code has already been included in a managed software release.

1b) The new code has been merged to master for a few days, and hence is guaranteed to have been built by the CI system, but has not been included in a managed software release.

1c) The new code has been committed to a branch, but has not yet been merged to master.

Variant 2

Instead of testing a configuration change, the developer wishes to test a new set of calibration products.

2a) The new calibration products have already been made available to the operations production system.

2b) The new calibration products are experimental and have not been made available to the operations production system.

Variant 3

The developer wants to test a totally new algorithm that produces new data product types.

3a) The new data products are defined along an existing combination of units of data (e.g. tract+patch+filter, which is already used by many coadd data products).

3b) The new data products are defined along a new combination of existing units of data (e.g. sensor+filter, which are used individually to define other data products but have not previously appeared together without visit).
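The distinction between 3a and 3b can be illustrated with a small sketch that checks a proposed data product type against the unit combinations already in use. The set of units and the list of existing combinations below are assumptions made for the example (only tract+patch+filter is taken from the text above), and classify_new_product is a hypothetical helper, not part of any existing API.

```python
# Units of data assumed to exist already (illustrative, not exhaustive).
KNOWN_UNITS = {"visit", "sensor", "filter", "tract", "patch"}

# Combinations assumed to be in use by existing data products.
# tract+patch+filter comes from the text; visit+sensor+filter is an assumption.
EXISTING_COMBINATIONS = {
    frozenset({"tract", "patch", "filter"}),
    frozenset({"visit", "sensor", "filter"}),
}


def classify_new_product(units):
    """Classify a new data product type as sub-variant '3a' or '3b'.

    '3a': the combination of units is already used by other data products.
    '3b': every unit exists, but this combination has not been used before.
    """
    units = frozenset(units)
    unknown = units - KNOWN_UNITS
    if unknown:
        # Entirely new units of data are outside the scope of Variant 3.
        raise ValueError(f"unknown units of data: {sorted(unknown)}")
    return "3a" if units in EXISTING_COMBINATIONS else "3b"
```

With these assumptions, classify_new_product(["tract", "patch", "filter"]) gives "3a" and classify_new_product(["sensor", "filter"]) gives "3b".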

Operator-Initiated Processing

Like Developer-Initiated Processing, but involving a much larger processing run that will consume up to thousands of core-weeks and tens of petabytes of storage.  The developer requesting the run starts by obtaining permission for a large run and defining the storage duration of the outputs according to some TBD policy.  An operator other than the developer may then be needed to initiate and manage the new processing run, depending on the actual size of the job and the need for human intervention in executing it.  In general, this could involve multiple Pipelines and many batch submissions.

If the change being tested is a code change (Variant 1), it can be assumed that the new code has been included in a managed software release, and if it is a new set of calibration products (Variant 2), the calibration products can be assumed to have been made available to the operations production system.  New data product types (Variant 3) to support a single processing run need not be supported, as long as there is a separate mechanism for adding new data product types to the production system, e.g. as part of a managed software release.

Direct Ad-Hoc Analysis

Following one or more processing runs, a developer uses a notebook environment to analyze the results, running a combination of predefined and ad-hoc computation, plotting, and display code.

The input data may come from any combination of:

  • official data releases
  • small-scale processing runs (as per Developer-Initiated Processing)
  • large-scale processing runs (as per Operator-Initiated Processing)
  • processing runs executed automatically on a predefined cadence by the continuous integration system
  • local data repositories in the notebook environment's filesystem.

Any combination of these may be compared in an analysis session.

The data products analyzed will frequently include catalog data stored in a database.  In other cases, the processing run may not have included database ingest originally, but the developer performing the analysis wants to use database-like query functionality, requiring either a way to easily perform ingest at this stage or otherwise aggregate the data and support SQL-like query operations.
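One possible way to provide SQL-like queries over catalogs that were never ingested is to load them into a temporary database at analysis time. The sketch below uses an in-memory SQLite database from the Python standard library purely as a stand-in; the table layout, column names, and the ingest_catalogs and compare_runs helpers are hypothetical, and a real catalog would have many more columns.

```python
import sqlite3


def ingest_catalogs(catalogs):
    """Load catalogs into an in-memory SQLite database.

    `catalogs` maps a run name to a list of (object_id, mag) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source (run TEXT, object_id INTEGER, mag REAL)")
    for run, rows in catalogs.items():
        conn.executemany(
            "INSERT INTO source VALUES (?, ?, ?)",
            [(run, object_id, mag) for object_id, mag in rows],
        )
    conn.commit()
    return conn


def compare_runs(conn, run_a, run_b):
    """Return (object_id, mag_b - mag_a) for objects present in both runs."""
    query = """
        SELECT a.object_id, b.mag - a.mag
        FROM source AS a JOIN source AS b ON a.object_id = b.object_id
        WHERE a.run = ? AND b.run = ?
        ORDER BY a.object_id
    """
    return conn.execute(query, (run_a, run_b)).fetchall()
```

A session might call ingest_catalogs once per analysis and then issue arbitrary queries against the returned connection, for instance to compare the outputs of a baseline run with those of a candidate run.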

The analysis environment should include an interactive Python prompt in which loaded data products can be introspected, image and optionally tabular data displays can be manipulated, and plots can be created and modified.  It should also include high-level APIs that link points in scatter plots, overplotted symbols in the image display, and records in the tabular data and in interactive Python, so that they can be selected, highlighted, and filtered jointly in all contexts.  These APIs may be used both by library code (for predefined metrics and plots) and by interactive Python users.
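The joint selection behaviour described above could be built around a shared selection object that every view observes. The following is a minimal sketch of that idea only; SharedSelection and View are hypothetical names, and a real display or plotting backend would redraw itself where this example merely stores the highlighted record IDs.

```python
class SharedSelection:
    """Holds the currently selected record IDs and notifies all linked views."""

    def __init__(self):
        self._selected = frozenset()
        self._listeners = []

    def subscribe(self, callback):
        # Each view registers a callback that receives the new selection.
        self._listeners.append(callback)

    def select(self, record_ids):
        # Called by any view, by library code, or from the interactive prompt.
        self._selected = frozenset(record_ids)
        for callback in self._listeners:
            callback(self._selected)

    @property
    def selected(self):
        return self._selected


class View:
    """Stand-in for a scatter plot, image overlay, or table display."""

    def __init__(self, name, selection):
        self.name = name
        self.highlighted = frozenset()
        selection.subscribe(self._on_selection_changed)

    def _on_selection_changed(self, record_ids):
        # A real view would redraw; here we only record what is highlighted.
        self.highlighted = record_ids
```

Because every view subscribes to the same SharedSelection, a call such as selection.select([3, 5]) from any context highlights records 3 and 5 in the scatter plot, the image overlay, and the table together.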

Plots, tabular data, and overplotted images may also be saved or loaded in the same manner as more direct pipeline outputs, allowing code written using the same APIs to be run in a non-interactive context.

TODO: move some of the bits about connected display/plotting/notebooks from the next section here.

Multi-Scale Spatial Analysis

Like Direct Ad-Hoc Analysis, but the developer wishes to inspect values of a set of metrics on multiple spatial scales in an image-like display, reflecting different binning in aggregate calculations.  To enable efficient inspection of large areas of sky, the metrics to be inspected may need to be precomputed on the coarser grids of interest before interactive analysis begins.

As with Direct Ad-Hoc Analysis, the inputs to this processing will usually be catalog values stored in a database, but they may be catalog values that have not yet been ingested into a database.  The metric values will often be compared to (binned) image data on the same spatial scales, which may also require some preprocessing of image data.

In addition to the gridded metric values themselves, the outputs may involve intermediate data products that can be used for more in-depth analysis. For example, the summary metric may be the width of a color-color scatter diagram in a certain direction, and the intermediate data product might be the filtered set of data points used to construct the color-color scatter diagram.
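As a rough illustration of precomputing a metric on several grids while keeping the intermediate data, the sketch below bins points into square cells and reports a width (population standard deviation) per cell along with the values that went into it. It assumes a flat (ra, dec) grid that ignores spherical geometry, and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def bin_metric(points, cell_size):
    """Bin (ra, dec, value) points into square cells of side `cell_size`.

    Returns a dict mapping the cell index (i, j) to a tuple of
    (width, values): the summary metric for the cell and the intermediate
    data points used to compute it.
    """
    cells = defaultdict(list)
    for ra, dec, value in points:
        key = (int(ra // cell_size), int(dec // cell_size))
        cells[key].append(value)
    # pstdev returns 0.0 for a cell that contains a single value.
    return {key: (pstdev(values), values) for key, values in cells.items()}


def multi_scale(points, cell_sizes):
    """Precompute the binned metric on each requested grid."""
    return {size: bin_metric(points, size) for size in cell_sizes}
```

Calling multi_scale(points, [1.0, 2.0]) would produce one grid per cell size, so the display can switch between scales without recomputing, and the per-cell values remain available for drilling into a suspicious cell.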

After the necessary preprocessing, the developer utilizes the same display tools as Direct Ad-Hoc Analysis, but with the image display able to zoom out to much larger areas using binned (and stitched) images and overlays of the binned metric values included.

Provenance-Driven Debugging

In the environment described in Direct Ad-Hoc Analysis, a developer discovers a problem in a data product that may be due to a problem in earlier stages of processing.  By following the provenance chain back from that data product, she reconstructs the sequence of SuperTasks and configuration needed to reproduce it as a Pipeline, starting from some previous step that is assumed to have had no problems (which may be raw data), and submits it as a new batch job (as in Developer-Initiated Processing) while configuring the control system to write several intermediate data products that are not persisted by default.  The developer then inspects the intermediate data products until the problem is found, which may involve additional cycles of job execution with increasingly fine-grained intermediate data products persisted over smaller subsets of the full pipeline; some of these may be sufficiently small that they are run via the notebook environment's own Python kernel, or as another local job in a separate process, rather than via a batch submission.
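Following the provenance chain back amounts to a walk over a dependency graph. The sketch below assumes provenance is available as a simple mapping from each data product to the task that produced it and that task's inputs; this representation, the product and task names in the usage note, and the reconstruct_pipeline function are all hypothetical.

```python
def reconstruct_pipeline(provenance, target, trusted):
    """Return the tasks to rerun, in execution order, to reproduce `target`.

    `provenance` maps a data product name to (task_name, [input products]).
    `trusted` is the set of products assumed to have no problems; the walk
    stops there. Products with no provenance entry are treated as raw data.
    """
    ordered = []
    visited = set()

    def visit(product):
        if product in trusted or product in visited:
            return
        if product not in provenance:
            # No recorded producer: treat as raw data and stop.
            return
        visited.add(product)
        task, inputs = provenance[product]
        # Make sure every upstream task is scheduled before this one.
        for upstream in inputs:
            visit(upstream)
        if task not in ordered:
            ordered.append(task)

    visit(target)
    return ordered
```

For a chain such as raw -> image -> warp -> coadd produced by tasks calibrate, warp_image, and stack, trusting "image" yields ["warp_image", "stack"], while trusting nothing yields all three tasks starting from raw data.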

The developer then tests possible solutions to the problem by rerunning the offending Pipeline or SuperTask (possibly as a batch job, again as in Developer-Initiated Processing, or possibly using the notebook environment's own Python kernel for very small processing jobs), and again inspecting the intermediate outputs.  

Calibration Products Validation


  • Not clear that calibration products are that different from other data products that can be used as an input to pipelines.
  • But we do have named/labeled sets of calibration products in the operations system to choose from in batch processing and suck into AP processing, and we need to define how we validate those.
  • Might want to get Robert Lupton to fill this in.

Automated Metrics Testing and Tracking


  • Definitely Simon Krughoff's domain.
  • Probably relevant for both DRP and AP.

Drill-Down from Metrics Tracking


  • Maybe Simon Krughoff's domain.
  • Probably relevant for both DRP and AP.
  • Need to make this seamless with ad-hoc analysis environment.
  • A special case of this (where the metrics are spatially-binned summary statistics) is covered by Multi-Scale Spatial Analysis.

AP Change Validation


  • I think many AP algorithmic changes are developed and tested like DRP ones, but some changes are more about how the live system updates itself, and all changes ultimately need to be tested in some kind of live-like environment.
  • Need Eric Bellm or someone else from UW to think about how we do this.
