UPDATE FROM 2018-11-12 MEETING WITH CADC REPS

Present: Tim Jenness, Gregory Dubois-Felsmann, Jim Bosch, Pat Dowler, Séverin Gaudet, Stephen Gwyn

ORIGINAL REVIEW, 2018-10-04

Collections

Both the Gen3 Butler and CAOM2 define a "collection" concept, and happily they're broadly consistent.

In Butler, collections are just groups of Datasets defined by a string tag, and they can be used for many purposes: sets of master calibrations to use, different selections of raw data, the outputs of processing runs, and any combination of these.  That will probably include a collection for each data release, and some way of grouping public prompt processing data products into one or more collections.

Most of these should probably not be mapped to CAOM2 at all, but all public data collections almost certainly should be, and for these we should be able to have a one-to-one relationship between the Butler collection and the CAOM2 collection, which would avoid a lot of confusion (but see the Alternatives section at the bottom of this page).

While Datasets are permitted to belong to multiple Collections in the Butler data model, the mapping to CAOM2 is probably easiest if the collections exposed in a particular CAOM2 view do not overlap in the Datasets they contain (possibly with an exception for raw data).
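A minimal sketch of how the non-overlap condition could be checked before exposing a set of collections through a CAOM2 view.  It is not Butler code: the function name, the plain dict standing in for a registry query, and the integer dataset IDs are all assumptions made for illustration.

```python
from itertools import combinations


def overlapping_collections(collections):
    """Return {(name_a, name_b): shared_ids} for every overlapping pair.

    `collections` maps a collection tag to the set of dataset identifiers
    tagged with it (a hypothetical in-memory stand-in for a registry query).
    """
    overlaps = {}
    # Sort by tag so the pair ordering in the result is deterministic.
    for (name_a, ids_a), (name_b, ids_b) in combinations(sorted(collections.items()), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Made-up example data: three collections, one shared dataset.
example_collections = {
    "DR1": {1, 2, 3},
    "DR2": {4, 5, 6},
    "prompt": {6, 7},
}
overlaps = overlapping_collections(example_collections)
```

An empty result would mean the chosen collections can be exposed side by side without any Dataset appearing in more than one of them; a raw-data exception could be handled by filtering those dataset IDs out before the check.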

Instruments, Telescopes, and Environment

Butler already has a Camera DataUnit, which we should at least consider renaming to Instrument for more generality, less conflict with cameraGeom.Camera, and consistency with CAOM2.  We can put any static CAOM2 Telescope or Instrument information there easily.  It doesn't seem to me that we gain anything by normalizing out those concepts in the Butler schema; it's probably more important to try to keep the number of DataUnit dimensions small.

Non-static Telescope "keywords" and CAOM2 Environment information should not go in the Butler Camera/Instrument table.  Butler would normally put time-dependent observatory state information in Datasets, such as a persisted afw.cameraGeom.Camera instance (which, like any master calibration, will use valid timestamp ranges for its data ID).  We expect to copy some of that information into the Exposure and/or Visit tables to enable Dataset lookups and Preflight queries on it (fields TBD).

Observations

The CAOM2 Observation concept is extremely general, and maps to multiple very distinct concepts on the Butler side.  The CAOM2 view into Butler content is thus essentially a union of several entirely different views, some of which work much better than others.

I'm interpreting the CAOM2 docs as saying that the unique key for an Observation is the (observationID, collection) tuple, and hence the same observationID can appear in multiple collections.  That also maps slightly better to the Butler data model than the interpretation that says that observationIDs must be unique across all collections, but we can support that interpretation too by just appending the collection name to the observationID definitions below.
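The two interpretations can be sketched as follows.  The function names and the "/" separator are assumptions for illustration; nothing here is taken from the CAOM2 or Butler code.

```python
def observation_key(observation_id, collection):
    """Interpretation 1: the unique key is the (observationID, collection) tuple."""
    return (observation_id, collection)


def globally_unique_observation_id(observation_id, collection, sep="/"):
    """Interpretation 2: make the ID itself unique across collections
    by appending the collection name (the separator is an assumption)."""
    return f"{observation_id}{sep}{collection}"


# The same observationID appearing in two (made-up) collections.
key_a = observation_key("visit-42", "DR1")
key_b = observation_key("visit-42", "DR2")
id_a = globally_unique_observation_id("visit-42", "DR1")
id_b = globally_unique_observation_id("visit-42", "DR2")
```

Under either interpretation the two entries remain distinct; the second simply moves the collection name into the identifier string.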

CAOM2 Observations that represent Butler Exposures:

CAOM2 Observations that represent Butler Visits:

CAOM2 Observations that represent coadds:

CAOM2 Observations that represent difference images:

CAOM2 Observations that represent master calibrations:

Analysis

CAOM2 maps well to our Exposure and Visit concepts, and it does so in a consistent and intuitive way.  We can easily implement CAOM2 views into a Butler schema for these.

I toyed a bit with adding an Observation table to the Butler schema, with a string primary key field that would be used as both a foreign key and a primary key for both Exposure and Visit; it seemed a bit messier than what we have now on the whole, but I wouldn't rule it out entirely.  It wouldn't necessarily make the mapping to CAOM2 much easier, anyway, as the primary difficulties there are in non-Exposure, non-Visit Observations and the relationship between Observation and Collection, which I definitely don't want to change on the Butler side.

Unless it can be extended to (or already does) support nesting of CompositeObservations, CAOM2 maps sufficiently poorly to coadds and difference images that I'm not sure we should even try.

I'm moderately confident we can make CAOM2 CompositeObservations for at least some master calibrations, but this may be confusing if they sometimes represent full-focal-plane concepts and sometimes do not.

Planes

Once Observation mappings are established, expanding to Planes is relatively straightforward.  For a CAOM2 Observation defined by a Butler Data ID and a Collection (which includes Observations associated with Exposures, Visits, and coadds), there is one CAOM2 Plane for every Butler DatasetType that has at least one Dataset in the Collection with a Data ID that is a superset of the Observation's Data ID (that sounds more complex than it is; need example or diagram).
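As a rough illustration of the rule, the sketch below selects the Planes for one Observation from a small in-memory list of Datasets.  The Dataset class, the data ID keys ("camera", "visit", "detector"), and the dataset type and collection names are all hypothetical stand-ins, not the Butler schema.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical stand-in for a Butler Dataset record."""
    dataset_type: str
    data_id: dict      # e.g. {"camera": ..., "visit": ..., "detector": ...}
    collections: set   # a Dataset may belong to several collections


def planes_for_observation(observation_data_id, collection, datasets):
    """Return the dataset types that would each become one CAOM2 Plane.

    A dataset type qualifies if at least one of its Datasets is in
    `collection` and has a data ID that is a superset of the Observation's.
    """
    wanted = observation_data_id.items()
    return sorted({
        d.dataset_type
        for d in datasets
        if collection in d.collections and wanted <= d.data_id.items()
    })


# Made-up content: a Visit-level Observation in collection "DR1".
observation = {"camera": "DummyCam", "visit": 42}
datasets = [
    Dataset("image", {"camera": "DummyCam", "visit": 42, "detector": 1}, {"DR1"}),
    Dataset("image", {"camera": "DummyCam", "visit": 42, "detector": 2}, {"DR1"}),
    Dataset("catalog", {"camera": "DummyCam", "visit": 42, "detector": 1}, {"DR1"}),
    Dataset("image", {"camera": "DummyCam", "visit": 43, "detector": 1}, {"DR1"}),
    Dataset("summary", {"camera": "DummyCam", "visit": 42}, {"DR2"}),
]
planes = planes_for_observation(observation, "DR1", datasets)
```

Here the Observation for visit 42 in "DR1" gets two Planes, one for "image" and one for "catalog": the two per-detector "image" Datasets collapse into a single Plane, the visit 43 Dataset is excluded by its data ID, and "summary" is excluded because it is only in "DR2".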

Attributes:

Associated Entities:

Artifacts, Parts, and Chunks

It seems straightforward to map Butler Dataset directly to CAOM2 Artifact, and I see no problem with doing exactly that.  Artifact.productType can work like Plane.calibrationLevel: something the DatasetType table can define.  While Artifact's (mostly optional) fields seem to assume a bit more of a file-based system than the Butler requires, it will probably map reasonably well to the concrete, mostly file-based system that (I think) we're actually building, and to the extent it doesn't, I imagine we'll still be able to expose something that looks more-or-less file-based to users.
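One way to picture "something the DatasetType table can define" is a per-DatasetType lookup that supplies both values, sketched below.  The table contents, dataset type names, URI, and function names are hypothetical, and the productType and calibrationLevel values shown are only plausible placeholders that would need checking against the CAOM2 vocabularies.

```python
# Hypothetical per-DatasetType metadata; names and values are illustrative only.
DATASET_TYPE_TABLE = {
    "raw_image": {"calibrationLevel": 0, "productType": "science"},
    "calibrated_image": {"calibrationLevel": 2, "productType": "science"},
    "master_flat": {"calibrationLevel": 2, "productType": "calibration"},
}


def artifact_for_dataset(dataset_type, dataset_uri):
    """Build a minimal Artifact-like record for one Dataset.

    productType comes from the dataset type, not from the individual Dataset.
    """
    meta = DATASET_TYPE_TABLE[dataset_type]
    return {"uri": dataset_uri, "productType": meta["productType"]}


def calibration_level_for_plane(dataset_type):
    """Plane.calibrationLevel is looked up from the same table."""
    return DATASET_TYPE_TABLE[dataset_type]["calibrationLevel"]


# The URI below is a made-up placeholder, not a real scheme.
artifact = artifact_for_dataset("master_flat", "example://datastore/master_flat/0001")
level = calibration_level_for_plane("raw_image")
```

Keeping both values on the dataset type means every Dataset of that type yields an Artifact with the same productType, which matches the one-to-one Dataset-to-Artifact mapping described above.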

The CAOM2 Part concept can probably be used to expose the Butler's Dataset composition system, at least in some cases, and Chunks appear to be something we can straightforwardly define for at least Datasets with image-like StorageClasses.

Alternatives

Defining a CAOM2 Observation's "collection" to map directly to a Butler collection implies that (unlike Butler DataUnits) CAOM2 Observations are collection-dependent, and hence there are no CAOM2 concepts that span multiple Butler collections.  But the notion that Butler's collection-independent DataUnits are static even across the collections that define different data releases is at some level aspirational (i.e. I imagine we'll prepare for schema evolution by adding levels of indirection above or outside the Butler schema).  It might make sense to treat CAOM2 Observations the same way, and map all Butler collections to a much smaller number of CAOM2 collections (i.e. start with just one, and add more only when really necessary).  While I'd need to work through the implications of that in detail to be certain, I think that would make the mapping from Visit/Exposure to Observation much more natural (at least within any scope in which each Exposure is not assigned to multiple Visits).  The cost is that the already more tenuous Observation mapping from coadds, difference images, and master calibrations would probably become unworkable: the CompositeObservation membership of those quantities is definitely Butler-collection-specific (though coadd and difference image CompositeObservations might be salvageable if we define their children to be possible rather than actual inputs).

The other clear mismatch here is in Plane, which might work better if we let it be a Dataset-level quantity (and hence always have a 1-1 relationship with Artifact in our realization of CAOM2).  That would let its productID and Provenance map to much more meaningful Butler-side entities.  On the other hand, Plane has other attributes we could define at the Observation level (Energy, Time, Position, Metrics), in some cases (all but Metrics) even if we follow the last paragraph and extend the domain of each CAOM2 Observation across multiple Butler collections.  Those would be denormalized even more if we pushed the Plane definition down to match Dataset.