Overview of Relevant Changes

Data Models for Observational Data

In the Gen2 Butler, we permit each camera to have its own data model for observational data - it can have its own system for labeling exposures, sensors, etc.  This makes things very natural at the very highest level when we want to specify which units of data should be processed or analyzed, but it is a problem for pipeline developers in two ways:

  • There is no camera-generic way to express a concept like "visit", even though grouping by such concepts is frequently necessary in pipeline code.
  • Each camera must provide its own path template for any dataset that includes data ID keys related to observational data, making the definition of new DatasetTypes much heavier.

In the Gen3 Butler, we instead expect each camera to conform itself to a common data model: Exposures, Visits, and Sensors are defined as generic concepts that pipeline code can utilize, and cameras must provide their own definitions of these terms that are consistent with the generic concept.

Data IDs → DataUnits

The flexible dictionary data IDs used in the Gen2 Butler have been both generalized and more strictly enumerated into what we call DataUnits (for "Units of Data") in the Gen3 Butler.  Dictionary data IDs will still be used, but the keys allowed in those dictionaries are limited to a predefined set of DataUnit types, and we are making a much stronger distinction between keys that uniquely identify certain DataUnits (such as the unique ID number associated with a visit or tract, or the name of a filter) and metadata that can be used to look them up (such as the date a particular visit was observed).  The latter can no longer be used in data IDs, as we are making a strong separation between "complete" data IDs that can uniquely identify a dataset and expressions that in general yield multiple datasets (they may also yield a single dataset, but this cannot be easily guaranteed before the query is actually performed); the Gen2 concept of a "partial" data ID no longer exists.  In exchange, expressions will be much more powerful: they will support at least a significant subset of SQL.
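The distinction between complete data IDs and expressions can be illustrated with a minimal sketch.  Everything here is an assumption made for illustration: the key names in ALLOWED_KEYS, the validate_data_id and select_visits helpers, and the use of a Python callable as a stand-in for an SQL-like expression are not part of the Gen3 design.

```python
# Hypothetical set of DataUnit key names; the real list is defined in DMTN-073.
ALLOWED_KEYS = frozenset(
    {"camera", "visit", "sensor", "physical_filter", "skymap", "tract", "patch"}
)


def validate_data_id(data_id, required):
    """Return a copy of data_id if it is "complete" for the required keys.

    Raises KeyError for keys that are not DataUnit identifiers (e.g. metadata
    such as an observation date) and ValueError if a required key is missing.
    """
    unknown = set(data_id) - ALLOWED_KEYS
    if unknown:
        raise KeyError(f"not DataUnit keys: {sorted(unknown)}")
    missing = set(required) - set(data_id)
    if missing:
        raise ValueError(f"incomplete data ID, missing: {sorted(missing)}")
    return dict(data_id)


def select_visits(visits, predicate):
    """Stand-in for an expression: may match zero, one, or many visits."""
    return [v for v in visits if predicate(v)]


# Toy metadata table (assumed values, for illustration only).
visits = [
    {"camera": "HSC", "visit": 1, "date_obs": "2014-03-28"},
    {"camera": "HSC", "visit": 2, "date_obs": "2014-03-28"},
    {"camera": "HSC", "visit": 3, "date_obs": "2014-03-29"},
]

# A complete data ID names exactly one visit.
complete = validate_data_id({"camera": "HSC", "visit": 1}, required=("camera", "visit"))

# A metadata query is an expression, not a data ID, and may match several visits.
same_night = select_visits(visits, lambda v: v["date_obs"] == "2014-03-28")
```

In this sketch, passing date_obs to validate_data_id is an error rather than a "partial" data ID; the date can only be used through select_visits.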

A full description of the set of DataUnits is beyond the scope of this document (see DMTN-073; still WIP).  The relevant ones for this document will be discussed at various points below.  Some other important facts about DataUnits:

  • DataUnits can have dependencies: for example, a Patch is always associated with a particular Tract.
  • Each DataUnit has one or more value fields, which uniquely identify it when combined with the value fields of the DataUnits it depends on.  For example, a Patch's (x,y) index uniquely identifies it only when combined with its Tract's number (and, now, its SkyMap's name, since SkyMaps are themselves DataUnits).  A data ID like (skymap="rings-120", tract=8766, patch=(4,5)) thus uniquely identifies a SkyMap DataUnit, a Tract DataUnit, and a Patch DataUnit.
  • Some DataUnits are associated with a table in the Registry that provides additional metadata (such as the timestamp or airmass of a Visit) as well as relationships between DataUnits (such as the PhysicalFilter associated with a Visit).
  • We often need to be able to assign a unique reversible integer ID to a particular combination of DataUnits.  For example, when labeling Objects, we want to combine an auto-increment number within a particular tract and patch with a unique identifier for that tract-patch combination.  That means we need a way to generate those IDs from many different combinations of DataUnits, and a natural way to do this is for each DataUnit to have a mapping to an integer and a way to get at the maximum number of bytes or bits that integer occupies (much like we do in Gen2 with e.g. bypass_ccdExposureId).
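The reversible integer IDs described in the last point could be built by bit-packing, roughly as sketched here.  This is only one possible approach, under the assumption that each DataUnit has already been mapped to a non-negative integer with a known maximum bit width; PackedField, pack, and unpack are hypothetical names, and the bit widths in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackedField:
    """One DataUnit's contribution to a packed ID (hypothetical helper)."""

    name: str
    bits: int  # maximum number of bits this unit's integer may occupy


def pack(fields, values):
    """Combine per-unit integers into one integer, first field most significant."""
    result = 0
    for field in fields:
        value = values[field.name]
        if not 0 <= value < (1 << field.bits):
            raise ValueError(f"{field.name}={value} does not fit in {field.bits} bits")
        result = (result << field.bits) | value
    return result


def unpack(fields, packed):
    """Invert pack(): recover the per-unit integers from the combined ID."""
    values = {}
    for field in reversed(fields):
        values[field.name] = packed & ((1 << field.bits) - 1)
        packed >>= field.bits
    return values


# Assumed layout: tract, then patch (already flattened to one integer),
# then an auto-increment object counter within that tract-patch combination.
layout = [PackedField("tract", 16), PackedField("patch", 8), PackedField("object", 32)]
object_id = pack(layout, {"tract": 8766, "patch": 49, "object": 12345})
recovered = unpack(layout, object_id)
```

Because every field declares its width, the total width of the combined ID (56 bits in this layout) is known before any values are packed, which is what makes it possible to reserve the remaining bits for something else.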

No Mappers, No Special Mapping Hooks

There will be no mappers in the Gen3 Butler.  The template used to write a dataset (assuming files are even involved) with a particular DatasetType will usually be defined by the Butler client configuration (though this could load default configuration from a repository, and we do not expect camera-level template specializations to be used very often).  The concept of a template for reading will no longer be meaningful - we will record the actual filename (or equivalent) of every stored dataset in a database, and look it up directly.
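A minimal sketch of this write-with-template, read-by-lookup pattern follows.  The template string, the table layout, and the record_write and lookup_path helpers are all hypothetical; an in-memory SQLite database stands in for the Registry, and no files are actually written.

```python
import sqlite3

# Hypothetical template; real templates would come from Butler client configuration.
TEMPLATE = "{dataset_type}/{camera}/v{visit}_s{sensor}.fits"


def make_registry():
    """Create a toy in-memory table that records where each dataset was written."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset ("
        " dataset_type TEXT, camera TEXT, visit INTEGER, sensor INTEGER,"
        " path TEXT NOT NULL,"
        " PRIMARY KEY (dataset_type, camera, visit, sensor))"
    )
    return conn


def record_write(conn, dataset_type, data_id):
    """Expand the template once, at write time, and store the resulting path."""
    path = TEMPLATE.format(dataset_type=dataset_type, **data_id)
    conn.execute(
        "INSERT INTO dataset VALUES (?, ?, ?, ?, ?)",
        (dataset_type, data_id["camera"], data_id["visit"], data_id["sensor"], path),
    )
    return path


def lookup_path(conn, dataset_type, data_id):
    """Read side: no template involved, just a lookup of the stored path."""
    row = conn.execute(
        "SELECT path FROM dataset"
        " WHERE dataset_type = ? AND camera = ? AND visit = ? AND sensor = ?",
        (dataset_type, data_id["camera"], data_id["visit"], data_id["sensor"]),
    ).fetchone()
    return None if row is None else row[0]


conn = make_registry()
data_id = {"camera": "HSC", "visit": 903334, "sensor": 16}
written = record_write(conn, "calexp", data_id)
found = lookup_path(conn, "calexp", data_id)
```

Since lookup_path never consults TEMPLATE, changing the template later only affects where new datasets are written; previously stored datasets are still found at their recorded paths.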

The special mapping hooks (std_, bypass_, etc.) that are used in the Gen2 Butler to override how certain datasets are read will be gone as well.  Instead, composite datasets and a more flexible system for specifying serialization code (and automatically saving an associated reader object) will be used for many of these customizations.  In Gen3, these customizations are generally applied when datasets are ingested or otherwise written, and they are stored with the datasets (at the level of individual datasets, that is; this metadata is not necessarily on the same storage, e.g. the same disk, as the dataset).  That means no special configuration should be needed in a Butler client to read something once it has been persisted, but it also means that it is much more difficult to change how a dataset is read after it has been written.  That limitation suggests a model in which raw data is not automatically augmented with auxiliary information on read, but is instead stored in its original form and augmented by the creation of "virtual" (no-new-files) datasets during (or even prior to) ISR.  This may in some contexts be less convenient than what we have now, but I believe it is the only way to rigorously capture the provenance of that auxiliary information (e.g. which version of the camera geometry, as it looked at a particular point in time, should be associated with a raw dataset).

Data Repositories vs. Collections

In the Gen2 Butler, different processing runs are represented by different data repositories, and the relationships between inputs and outputs are represented by chained repositories that are searched in a particular order.

In Gen3, a single data repository will contain both raw data (possibly from multiple cameras) and a large number of processing runs.  Different processing runs or manually-curated groups of related datasets are represented by Collections, which are simply database tags (they have no representation in the filesystem/Datastore).  Datasets can belong to any number of Collections, and typically the Collection associated with a processing Run will include all Datasets in the Collection(s) that were used as inputs to that processing Run.  Within a Collection, however, there can only be a single Dataset with a particular DatasetType and Data ID.  A Butler instance is associated with only one Collection, so any Butler.get call should have a unique and unambiguous result.
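The tagging behavior can be sketched with a toy in-memory structure.  ToyRegistry and its associate and find methods are hypothetical names used only for illustration, as are the Collection names and dataset ID numbers; the real Registry would keep these tags in database tables.

```python
class ToyRegistry:
    """In-memory stand-in for the Registry's Collection tags (hypothetical)."""

    def __init__(self):
        # collection name -> {(dataset_type, data ID key): dataset_id}
        self._collections = {}

    @staticmethod
    def _key(dataset_type, data_id):
        # Data ID values are assumed hashable (ints and strings here).
        return (dataset_type, tuple(sorted(data_id.items())))

    def associate(self, collection, dataset_type, data_id, dataset_id):
        """Tag a dataset with a Collection, enforcing uniqueness within it."""
        tagged = self._collections.setdefault(collection, {})
        key = self._key(dataset_type, data_id)
        if tagged.get(key, dataset_id) != dataset_id:
            raise ValueError(
                f"{collection!r} already has a different {dataset_type!r} for {data_id}"
            )
        tagged[key] = dataset_id

    def find(self, collection, dataset_type, data_id):
        """Return the single matching dataset ID in this Collection, or None."""
        return self._collections.get(collection, {}).get(self._key(dataset_type, data_id))


registry = ToyRegistry()
data_id = {"camera": "HSC", "visit": 903334, "sensor": 16}
registry.associate("raw", "raw", data_id, dataset_id=1)
# The run's Collection also tags the input it consumed, plus its own output.
registry.associate("run-A", "raw", data_id, dataset_id=1)
registry.associate("run-A", "calexp", data_id, dataset_id=2)
```

The same dataset (ID 1) carries two tags without being copied, while a second, different "calexp" for the same data ID could only be tagged in another Collection such as "run-B".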

Populating Instrumental DataUnits

Gen3 data repositories will not be limited to data from a single Camera, and this means that Camera is itself a DataUnit - a data ID that includes observational data must include a key-value pair that identifies the camera (by a short string name, like "HSC" or "DECam").1

Each Camera is also responsible for defining a set of associated Sensor DataUnits and a set of associated PhysicalFilter DataUnits.

  • For Obs WG: review the proposed schemas for Camera, Sensor, and PhysicalFilter in DMTN-073 (blocked by DM-12620).  Note that the Gen3 design also permits additional per-Camera metadata tables for DataUnits, but we'd prefer to put as much as possible in the common, camera-generic DataUnit tables (and ideally avoid having per-Camera metadata).
  • For Obs WG / Gen3 MW Team: design a class interface that obs_* packages can specialize to provide the descriptions of Cameras, Sensors, and PhysicalFilters.  This should include:
    • What needs to be provided when adding a Camera to a Registry (i.e. adding entries for that Camera to the Sensor and PhysicalFilter tables for the first time).
    • Keeping the Registry information up-to-date (e.g. what happens when a new filter is added).
    • How to convert Sensor and PhysicalFilter DataUnits to unique integer IDs and back, along with a description of how many bits/bytes those integers occupy for this Camera.
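As a starting point for that discussion, the interface might look something like the sketch below.  This is not a proposed design: CameraDescription, its method names, and ToyCamera are hypothetical, and the integer mapping shown (position in the sensor list) is just the simplest choice that is reversible.

```python
from abc import ABC, abstractmethod


class CameraDescription(ABC):
    """Hypothetical interface an obs_* package might specialize."""

    name: str  # short string used in data IDs, like "HSC" or "DECam"

    @abstractmethod
    def sensors(self):
        """Return the identifiers of all Sensors, in a fixed order."""

    @abstractmethod
    def physical_filters(self):
        """Return the names of all PhysicalFilters."""

    def registry_rows(self):
        """Entries to add when the Camera is first added to a Registry."""
        return {
            "Sensor": [(self.name, s) for s in self.sensors()],
            "PhysicalFilter": [(self.name, f) for f in self.physical_filters()],
        }

    def sensor_bits(self):
        """Number of bits needed to hold any Sensor integer for this Camera."""
        return max(1, (len(self.sensors()) - 1).bit_length())

    def sensor_to_int(self, sensor):
        return self.sensors().index(sensor)

    def sensor_from_int(self, value):
        return self.sensors()[value]


class ToyCamera(CameraDescription):
    """Made-up four-sensor camera used only to exercise the interface."""

    name = "ToyCam"

    def sensors(self):
        return ["S00", "S01", "S10", "S11"]

    def physical_filters(self):
        return ["g", "r", "i"]


camera = ToyCamera()
rows = camera.registry_rows()
```

This sketch leaves the second sub-item open: appending a new filter to physical_filters() says nothing about how an existing Registry would learn of it, so a real interface would need an explicit update path.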

1. We'll try to infer the camera name whenever possible, and make it usually unnecessary to include it in high-level, user-facing code.  But pipeline code will typically work with data IDs that do include the camera explicitly.

Producing and Registering Master Calibration Datasets

Pipeline-Produced Master Calibrations


Human-Curated Master Calibrations


Raw Data Ingest and Observational DataUnits


Configuration/Pipeline Overrides

