Meeting began at 10am; ended at 12:00pm.

Attending

Next meeting: @ 11am, tentative, pending sufficient discussion topics being raised before the meeting.

Use Case vs. Design Walkthrough

Most of the meeting was spent walking through an extended commissioning use case from Simon Krughoff, as it would be implemented with the proposed design.  The use case is reproduced below in italics, along with comments from the meeting.

  • Commissioning scientist takes a set of observations at the beginning of commissioning. These observations are planned to cover a broad range of parameters so that the set is generally useful in testing pipelines.
    • Data from the camera during commissioning will be automatically transferred to NCSA, but this transfer may be delayed, so the data may not go through NCSA on its way to the commissioning cluster.
    • Since we'll be making copies of the same raw data before it is ingested into a Registry, we need to make sure we can identify when raw datasets are the same across Registries.  Unique filenames should provide the necessary information (see the raw-file identity sketch after this use case list).
  • Those data are reduced on the comm cluster several times to iterate to a set of configs that produce the best results.
    • Not clear exactly what the comm cluster Registry and Datastore will look like; "clone of LSP" (including its batch environment), but that can't be strictly true for all details.
    • Otherwise this is a straightforward example of executing SuperTasks with outputs going into multiple collections in the same Registry.
  • The final repository and associated configs are archived at NCSA (with the configs living in a git repo I guess?)
    • Ingesting already-processed data into the DBB isn't currently planned to be supported, but might be possible if needed.
    • Not obvious this needs to go into the DBB vs. some other persistent storage.  Assumption for the rest of this discussion is that it goes into a Registry that uses a private schema within the DBB Database and a separate but DBB-like Datastore.
    • Configs will be saved as regular datasets for persistence, but human-curated "recommended configs" are more like software versions and are stored in git.
  • The curated dataset is re-reduced using a modern stack on a regular cadence. The outputs of those reductions are labeled by reruns.
    • More standard SuperTask execution into multiple collections in a single Registry.
  • The commissioning team discover they need to look more closely into glint features.
  • They use the data discovery service to discover chips with likely glint features (for example, by looking for images in visits with a suspiciously large number of detections).
    • This revealed a gap in the current design: "number of rows in a catalog" is metadata that is associated with Datasets with a certain StorageClass, not a DataUnit.
      • Jim Bosch and pschella to add per-StorageClass tables containing additional metadata to the design, and investigate moving some DataUnit metadata to these (particularly regions).
    • Further discussion revealed the need for additional metadata tables containing per-Dataset metrics that are not associated with a StorageClass but still need to be accessible to Registry queries (e.g. astrometry scatter).  In general, these would be per-DatasetType tables, but that's too many tables, and it gets in the way of making DatasetType creation dynamic.  We discussed several possibilities; a sketch of two of them appears after the use case list below.
      • We could add columns to the per-StorageClass tables that may not be used by all DatasetTypes with that StorageClass.  This is an approach used by DESDM.
      • We could split up StorageClasses (e.g. make coadd images different from visit-level images).
      • We could add tables that store arbitrary key-value pairs.  This would require intercepting and rewriting user queries to make it easy to query them, but Brian Van Klaveren has successfully implemented this approach in other projects.
      • We could add metadata tables associated with certain "blessed" DatasetTypes, with no expectation that dynamically-created DatasetTypes would have metadata tables.
      • Jim Bosch to add text to design document acknowledging this gap but leaving determination of exactly what metadata to add as TBD.
    • Need to allow queries to use tables that are not in the Common Schema (but are added by users to their own database for development purposes, or by operators to the production database).  Should have a controlled process for adding tables to the Common Schema.
  • They spin up a notebook in the notebook environment to analyze the specific chips that exhibit the problem.
    • Should be supported by generic interactive use of the Butler against a particular collection.
  • Once they figure out where the glints fall, they schedule specific observations to maximize the number of glints so they have lots of data to work with.
  • The visits are on the commissioning cluster and need to be reduced using the latest stack.
    • As with earlier steps, need to make sure ingest handles raw data consistently across the commissioning cluster and NCSA.
    • Reductions are again standard SuperTask executions.
  • The team spin up notebooks in the notebook environment to examine the new data and try to, e.g., come up with an algorithm that correlates the glints with observing conditions.
    • Also supported by generic interactive use of the Butler.
  • Once they have a model, they want to retrieve archived data to see if historical observations fit with their model.
    • Requires lookups against the Registry using the same sort of criteria discussed in previous steps.
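
Aside on raw-data identity (from the ingest discussion above): a minimal sketch, assuming raw filenames are globally unique as the minutes suggest; the checksum is only an extra cross-check, and all function names are invented for illustration rather than part of the proposed design.

    # Hypothetical sketch: decide whether raw files ingested into two different
    # Registries refer to the same on-disk data.  The unique filename is the
    # identifier relied on above; the checksum guards against accidental reuse
    # of a name for different content.
    import hashlib
    from pathlib import Path

    def raw_identity(path: Path) -> tuple[str, str]:
        """Return (unique filename, SHA-256 hex digest) for a raw file."""
        return path.name, hashlib.sha256(path.read_bytes()).hexdigest()

    def same_raw_dataset(path_a: Path, path_b: Path) -> bool:
        """True if two ingested copies correspond to the same raw exposure."""
        name_a, sum_a = raw_identity(path_a)
        name_b, sum_b = raw_identity(path_b)
        return name_a == name_b and sum_a == sum_b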
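
Aside on the per-Dataset metadata options discussed above: an illustrative sketch of two of them (a per-StorageClass metadata table, and a generic key-value table), using an in-memory SQLite database.  All table and column names here are invented for the example and are not part of the proposed Registry schema.

    # Illustrative only: compare a per-StorageClass metadata table with a
    # generic key-value table for per-Dataset metrics.
    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Per-StorageClass approach: one metadata table per StorageClass, e.g.
    # catalogs get a row-count column.  Some columns may go unused by some
    # DatasetTypes sharing the StorageClass.
    conn.execute("""
        CREATE TABLE CatalogDatasetMetadata (
            dataset_id INTEGER PRIMARY KEY,
            n_rows     INTEGER
        )
    """)

    # Key-value approach: arbitrary per-Dataset metrics not tied to a
    # StorageClass (e.g. astrometry scatter).  Flexible, but user queries
    # would need rewriting to treat keys like ordinary columns.
    conn.execute("""
        CREATE TABLE DatasetMetadata (
            dataset_id INTEGER,
            key        TEXT,
            value      REAL,
            PRIMARY KEY (dataset_id, key)
        )
    """)

    conn.execute("INSERT INTO CatalogDatasetMetadata VALUES (1, 125000)")
    conn.execute("INSERT INTO DatasetMetadata VALUES (1, 'astrometry_scatter', 0.012)")

    # Example discovery query: catalogs with a suspiciously large number of
    # detections and poor astrometry scatter.
    rows = conn.execute("""
        SELECT c.dataset_id
        FROM CatalogDatasetMetadata AS c
        JOIN DatasetMetadata AS m ON m.dataset_id = c.dataset_id
        WHERE c.n_rows > 100000
          AND m.key = 'astrometry_scatter'
          AND m.value > 0.01
    """).fetchall()
    print(rows)  # -> [(1,)]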


We then continued with Michelle Gower's review of the design from the previous meeting.

  • Made several clarifications in the design.
  • Expressed a need for a "middle layer" in the documentation describing the design, between the too-brief overview and the too-detailed schema and class documentation.  Some of this will need to be done by Tim Jenness and/or Jim Bosch in presentations to the DMLT anyway, but those presentations are unlikely to go into enough detail.  All WG members are actioned to suggest organizing principles for that middle layer.  A database relationship diagram would be helpful.

We finished with a brief discussion of the final outputs and closing conditions of the WG:

  • Primary deliverables are the Use Case and Requirements documents, which have been stable for a while now (aside from formal writeup by Tim Jenness).
  • Design document is a "bonus" deliverable.
    • WG members will not be able to digest design enough to sign off on it completely before Nov. 1.
    • We will probably be able to say that we have not been able to rule out the design as of Nov. 1.
    • Further design work after Nov. 1 should probably be normal ticketed work scheduled by T/CAMs instead of WG work, but we need to make sure we do not drop the broad, cross-team validation of the design that the WG has permitted.
  • Deliverables will not include a full specification of what "backends" will be needed; design only specifies a common interface for databases and storage systems, not what database and storage systems need to exist.
    • These will hopefully represent components that need to be built independent of the design (though we need to test further whether design makes any of them significantly harder to implement).
    • Some backends may be clearly needed to build the DM system, but are not clearly identified in any existing WBS (WG identified some backends whose owners were unclear to WG members).
    • WG could make some progress on specifying these, but does not have time (and it would probably be more efficient to have smaller conversations that bring in T/CAMs).
  • Deliverables are to Wil O'Mullane, who decides what to do with them.
    • Presumably this will involve RFC(s) to formalize at least Use Cases and Requirements as LDM documents.
    • Scope and form of the review of WG outputs by DMLT or other non-WG members is up to Wil.
    • Michelle Gower reports that NCSA leadership is interested in having a detailed review of WG outputs (like the recent live review of the Level 1 System).

