
Use cases from Jim Bosch

Here's the detailed list of use cases I can think of for composite datasets.  It's a bit more exhaustive than what I sent you before, so some of it may be new to you, but I don't think there's anything here that can't be met (at least initially) by your current design.  Some of these will require significant changes to Science Pipelines code as well, and hence aren't likely to happen soon.  When you put this into an RFC, you might want to include YAML snippets showing how the composite datasets described in each of these cases would be defined.  We should also invite others to come up with additional use cases, in case there's functionality that would have to be added to support them.

1. Access Exposure With Updated WCS  

In jointcal (and meas_mosaic) we generate a Wcs and (soon) Calib that supersede those included in the Exposure "calexp" dataset generated by single-visit processing. Downstream tasks should be able to access an updated Exposure as a different named dataset, via e.g. butler.get('calexp:jointcal', ...) or butler.get('calexp', dataId={'version':'jointcal', ...})

Because jointcal will be run at a tract level (or some other grouping of Exposures), and some "calexp" datasets will be associated with multiple tracts, the data ID of the new Wcs, the new Calib, and the updated Exposure will need to include a tract key even though the original "calexp" data ID does not.

Requirements to Satisfy This Use Case

  • Composable and Decomposable & Pluggable
    • Use: allows the jointcal Wcs and the jointcal Calib to be stored individually, without duplicating data already stored in the exposure from single-visit processing.
  • Datasets Findable by Processing Stage
    • This allows later processing stages to ask for a dataset that has been processed by a specific processing step, and to fail if the dataset is present but has not been processed to the extent expected by the current step.
    • When new Wcs and Calib components are written by the jointcal task, the output repository from jointcal should contain a definition of an Exposure dataset type that includes the pixels from the original calexp and the new Wcs and Calib components.

Pseudocode

Create a butler where the input repository contains the calexp datasets from single frame processing as Type 1 datasets, and the output repository will contain the jointcal processed images.

import lsst.daf.persistence as dafPersist
inputRepoArgs = dafPersist.RepositoryArgs(root='single_visit_processing')
outputRepoArgs = dafPersist.RepositoryArgs(root='jointcal_processing')
butler = dafPersist.Butler(inputs=inputRepoArgs, outputs=outputRepoArgs)

The policy needs entries for the existing Type 1 calexp dataset, for the Type 2 dataset getters that extract components from the Type 1 dataset, and for the Type 3 dataset that persists the composite as component outputs.

Note that calexp_wcs and calexp_calib have the same template as calexp; this is because they come from the same persisted file, but sharing a template is neither required nor implied.

datasets: {
    calexp: {
        template:      "sci-results/%(run)d/%(camcol)d/%(filter)s/calexp/calexp-%(run)06d-%(filter)s%(camcol)d-%(field)04d.fits"
        python:        "lsst.afw.image.ExposureF"
        persistable:   "ExposureF"
        storage:       "FitsStorage"
        level:         "None"
        tables:        raw
        tables:        raw_skyTile
    }
    # Type 2 getter for wcs in a Type 1 calexp
    calexp_wcs: {
        template:    "sci-results/%(run)d/%(camcol)d/%(filter)s/calexp/calexp-%(run)06d-%(filter)s%(camcol)d-%(field)04d.fits"
        python:      "lsst.afw.image.Wcs" 
        persistable: "Wsc"
        storage:     "FitsStorage"
    }
    # Type 2 getter for calib in a Type 1 calexp
    calexp_calib: {
        template:    "sci-results/%(run)d/%(camcol)d/%(filter)s/calexp/calexp-%(run)06d-%(filter)s%(camcol)d-%(field)04d.fits"
        python:      "lsst.afw.image.Calib"
        persistable: "ignored"
        storage:     "FitsStorage"
    }
    # Type 1 datasets for components of the Type 3 jointcalexp
    jointcal_wcs: {
        template:    "wcs/.../filename.fits"
        python:      "lsst.afw.image.Wcs"
        persistable: "Wcs"
        storage:     "FitsStorage"
    }
    jointcal_calib: {
        template:    "calib/.../filename.fits"
        python:      "lsst.afw.image.Calib"
        persistable: "Calib"
        storage:     "FitsStorage"
    }
    # Type 3 dataset definition for jointcalexp
    jointcalexp: {
        python:      "lsst.afw.image.ExposureF"
        persistable: "ExposureF"
        composite: { 
            wcs: {
                datasetType: "jointcal_wcs"
            }
            calib: {
                datasetType: "jointcal_calib"
            }
        }
        assembler: "lsst.mypackage.jointcal.JointcalAssembler"
        disassembler: "lsst.mypackage.jointcal.JointcalDisassembler"
    }
}

The Assembler and Disassembler for jointcalexp are written and registered as specified by the policy. The Assembler/Disassembler API is still TBD; the following is one proposal.

I think it will work well for the code that calls the dis/assembler to first get each Type 1 dataset (butler should do this recursively, getting the components of Type 2 and Type 3 datasets), and then pass the group of them to the assembler as a dict keyed by the component item names.

def JointcalAssembler(butler, dataId, componentDict):
    # componentDict is expected to contain Wcs and Calib objects under the
    # keys 'wcs' and 'calib'. In this case, load the 'base' object and then
    # overlay the components.
    exposure = butler.get('calexp', dataId)
    exposure.setWcs(componentDict['wcs'])
    exposure.setCalib(componentDict['calib'])
    return exposure

Calling butler.put to serialize the Exposure into the repository will decompose the Exposure into many datasets.
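
A disassembler counterpart might look like the following sketch. Since the API above is still TBD, this simply mirrors the proposed assembler: it returns the components keyed by their names in the composite definition, and butler would then put each one as its own Type 1 dataset (jointcal_wcs, jointcal_calib).

def JointcalDisassembler(butler, dataId, exposure):
    # Extract the components named in the policy's composite definition and
    # return them under the same keys, so that butler can persist each one
    # as its own Type 1 dataset (jointcal_wcs, jointcal_calib).
    return {'wcs': exposure.getWcs(),
            'calib': exposure.getCalib()}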

Conversation

Component Lookup by Processing Stage

Unknown User (npease) asked:

If new Wcs and Calib component datasets are written, but other component datasets are not replaced: is it important (or possible?) to specify which component datasets should be from calexp.jointcal and which are ok to have come from the single-visit calexp? I'm currently imagining a lookup algorithm for components:

butler.get(datasetType='calexp:jointcal', ...)
    for each component in composite:
        location = butler.map(datasetType, dataId)
        if location:
            # I figure it will find locations for 'wcs' and 'calib'
        else:
            # There are lots of other possible components in MaskedImage
            # and ExposureInfo (psf, detector, validPolygon, filter,
            # coaddInputs, etc.). What to do?

In the "What to do" case, would butler fall back to the next-previous processing step name? (how would it know?) Or, would butler remove 'stage' from the dataset type and search for component types in the repo-lookup order? If so, why bother with a stage qualification on the named composite? (if it can go ahead and use prior components if a component with the name does not exist)?

Jim Bosch said: 

I believe there is always a strict ordering of stages, with the most recent stage always overriding previous stages.  The stage lookup order should correspond to repo order, so at first glance it seems we could just drop the stage names entirely and rely on repo lookup order.  As we discussed in Tucson, I believe that's a little dangerous in terms of provenance (especially in the case where some data units are run through all processing stages before other data units are run through any stages), which is why I'd prefer a design in which the stages are explicitly encoded in either the datasetType or the data ID.

All composite datasetType definitions would explicitly define which stages they'd get their components from, and once we have the ability to define datasetTypes within repos, we could have each stage add an alias that redirects datasetTypes without explicit stage names to the one with the most recent stage.  In the meantime, we can always just use the stage-qualified datasetTypes.  I think that with this design we don't actually require a strict ordering of stages for lookup, which could be nice if I'm wrong about that always being true.

I don't think this is the only way to meet our requirements, but it seems safer and more explicit than more "automatic" approaches that would utilize the repo chain for component lookups or otherwise automatically assemble composites.  That also makes it more fragile - several datasetType definitions would need to be updated when pipeline stages are reordered - but I think that's an acceptable cost for the safety.  Other developers should probably weigh in on this.
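
To make the explicit-stage idea concrete, a stage-qualified composite definition and alias might look something like the following. This is a hypothetical sketch: the alias mechanism does not exist yet, and the names are illustrative.

"calexp:jointcal": {
    python:      "lsst.afw.image.ExposureF"
    persistable: "ExposureF"
    composite: {
        wcs:   { datasetType: "calexp.wcs:jointcal" }
        calib: { datasetType: "calexp.calib:jointcal" }
    }
}
aliases: {
    # each stage would redefine the bare name to point at its own output
    calexp: "calexp:jointcal"
}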

Additional DataId Keys

Unknown User (npease) asked:

  • Does the key (and value) have to be in the policy's dataset type template, or could it be in registry metadata only? What about FITS metadata? (note that use of FITS is not always guaranteed)
  • Is it possible to declare early that the dataId has a new expected key (perhaps we could keep a policy override in the repository?)
  • Is it possible to know all the dataId keys at ingest time? (Then we could init the keys to 'null' whose values will be determined by an intermediate processing step, but not have to override policy data at an intermediate processing step.)

Jim Bosch said:

As for additional data ID keys, I believe these would always have to be present in the datasetType template of at least one of the updated components, and would always be necessary to load datasets after the stage that added the data ID key (because it would be necessary to disambiguate).  For example, we might start with a single "calexp:sfm" dataset with (visit, ccd) keys.  A later "jointcal" stage would define a "calexp.wcs:jointcal" dataset with (visit, ccd, tract) keys, implying that (visit, ccd) is no longer sufficient to identify a "calexp:jointcal".  The additional data ID key would nearly always be "tract", actually - I can't think of a use case for anything else - and I don't think it will ever be anything that is present in a raw data registry.  It is closely related to the spatial registry functionality we've asked for in the butler, however - and I could certainly imagine us asking for all "calexp:jointcal" datasets in a specific region on the sky, which then implies one or more tracts and allows the spatial query system to find all overlapping (visit, ccd) IDs.
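
As a sketch of the disambiguation point (the dataId values here are made up):

# (visit, ccd) suffices to identify a calexp before jointcal...
exp = butler.get('calexp:sfm', dataId={'visit': 1234, 'ccd': 56})
# ...but not after, since one (visit, ccd) may overlap several tracts,
# each with its own fitted Wcs:
wcs = butler.get('calexp.wcs:jointcal',
                 dataId={'visit': 1234, 'ccd': 56, 'tract': 8766})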

2. Load Exposure Components Individually

Analysis code should be able to load components of an Exposure dataset individually without loading the full Exposure, using something like butler.get('calexp.psf', ...). Whether the Exposure is stored as a single file on disk or as separate files for each component should be transparent to the user.
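
Following the pattern of the calexp_wcs entry in the policy above, a Type 2 getter for the Psf might be declared as follows (a sketch only; the persistable and storage values are illustrative):

calexp_psf: {
    template:    "sci-results/%(run)d/%(camcol)d/%(filter)s/calexp/calexp-%(run)06d-%(filter)s%(camcol)d-%(field)04d.fits"
    python:      "lsst.afw.detection.Psf"
    persistable: "Psf"
    storage:     "FitsStorage"
}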

Use Case Questions

  • Can't the user code ask for a WCS dataset type? I think with the same dataId as would be used for the calexp?
    • Jim Bosch: I don't think we can get away with just a "wcs" dataset type, as there will be multiple WCSs (one for every Exposure dataset type).

Requirements to Satisfy This Use Case

3. Individual Storage of Exposure & SourceCatalog Components, Plus Caching

Some components of an Exposure (including Wcs and Calib) are also conceptually associated with any SourceCatalog generated from that Exposure, and in the future we would like to actually attach these components to in-memory SourceCatalogs. When loading a SourceCatalog via the butler, these components should be read as well, whether they are persisted internally as part of the Exposure, as part of the SourceCatalog, or independently. This must not involve reading the full Exposure, as this would be much more expensive. Ideally when both a SourceCatalog and its associated Exposure are loaded through the butler the component shared_ptrs attached to each would point to the same underlying object, but this is probably an optimization, not a requirement.

Requirements to Satisfy This Use Case

  • Composable and Decomposable & Pluggable
    • Use: given plugin code for building a SourceCatalog and an Exposure from component datasets, the Butler can assemble each of these composite objects from their component datasets.
  • Caching
    • Use: if a component has already been loaded and is needed again as a component of another dataset, the butler can install a pointer/ref to the already-constructed object in the new composite instead of creating another instance of the same object (see the sketch below).
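
A minimal sketch of the caching idea, assuming a hypothetical butler-internal cache keyed on (datasetType, dataId); this is not an existing daf_persistence API:

_component_cache = {}

def get_component(butler, datasetType, dataId):
    # Return the same in-memory object for repeated requests, so that an
    # Exposure and a SourceCatalog sharing a Wcs component end up holding
    # references to one underlying instance.
    key = (datasetType, tuple(sorted(dataId.items())))
    if key not in _component_cache:
        _component_cache[key] = butler.get(datasetType, dataId)
    return _component_cache[key]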

4. Store Processed-Exposure Dataset Components Separately

When processing coadds, the first stage (detection) modifies only the background (a lightweight object that can be added or subtracted from the image) and a single mask plane, and no subsequent steps modify the coadd Exposure object at all. We would like to define Exposure datasets that represent both the coadd prior to detection and the coadd after detection, without duplicating on-disk the Exposure components that are shared.

Here the delta between the two images is currently represented as operations beyond just get/set (add/subtract for backgrounds, bitwise AND/OR/NOT for masks), but modifying Science Pipelines code to treat these as get/set (by adding a Background component to Exposure and treating Mask planes more atomically) may be an acceptable solution, as sketched below.
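
A sketch of what the post-detection assembler might do, assuming the get/set-style treatment described above; the component keys ('background', 'detectedMask') are hypothetical:

def DetectedCoaddAssembler(butler, dataId, componentDict):
    # Rebuild the post-detection coadd from the shared pre-detection pixels
    # plus the two deltas that the detection stage produced.
    exposure = butler.get('deepCoadd', dataId)
    mi = exposure.getMaskedImage()
    mi -= componentDict['background'].getImage()   # re-apply the subtraction
    mi.getMask() |= componentDict['detectedMask']  # restore the DETECTED plane
    return exposure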

Requirements to Satisfy This Use Case

5. Composite Object Tables

Object tables are currently written as a set of several SourceCatalog datasets that column-partition the full table. It should be possible to define an Object dataset that combines these SourceCatalog datasets in the column direction. This will require changes to the table library as well as the butler, and may not be implemented soon, but it should be considered in the composite dataset design (see the sketch below).
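
A very rough sketch of the assembler side, with hypothetical column-partition dataset keys; the column-direction join itself is the piece that needs new table-library support:

def ObjectTableAssembler(butler, dataId, componentDict):
    # Each component is one column partition of the full Object table
    # (the key names here are hypothetical).
    partitions = [componentDict[k] for k in ('meas', 'forced', 'ref')]
    # join_columns is a hypothetical table-library function that would
    # stack catalogs with identical row order in the column direction.
    return join_columns(partitions)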

Requirements to Satisfy This Use Case

6. Aggregates (as opposed to Composites) and Aggregate Dataset Type Name

Most catalog datasets and some image datasets can be trivially aggregated according to nested data IDs. For example, a catalog dataset for a tract (or visit) can be generated simply by combining, in the row direction, all catalogs for the patches (or CCDs) within that tract (or visit). Access to these aggregate datasets should probably use a different name than their constituents, to avoid confusion with other interpretations of partial data IDs.

Aggregating datasets may include combining structured metadata or components that they share (which should already be consistent but may be duplicated).
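
A sketch of such an aggregation plugin for the catalog case, assuming the components arrive as per-patch SourceCatalogs with a common schema:

def TractCatalogAssembler(butler, dataId, componentDict):
    # Concatenate the per-patch catalogs in the row direction.
    catalogs = list(componentDict.values())
    result = catalogs[0].copy(deep=True)
    for catalog in catalogs[1:]:
        result.extend(catalog, deep=True)
    return result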

Requirements to Satisfy This Use Case

  • Composable and Decomposable & Pluggable
    • Use: for the dataset name specified for the aggregate case there will be a plugin that knows how to combine (or knows how to call the constructor in a way that will cause it to combine) the datasets into a single aggregate object.

7. Access to Precursor Datasets (via Persisted DataId)

Coadd Exposures require access to some of the components of the sensor-level Exposures that were used to construct them. These are currently persisted (duplicated) directly within the coadd Exposure dataset, but we could also use butler composite datasets to assemble these from the original files, which would require support for composite datasets in which the data IDs for the components must themselves be unpersisted. There are potential disadvantages to normalizing persistence in this case (many small files are less convenient and sometimes less performant) that may preclude actually making this change even if the butler supported it, however.

Requirements to Satisfy This Use Case

Use: Storing dataset+dataId lookup information in a registry or in a persisted object's metadata will allow butler to find component objects, even when the dataId passed to a butler function provides enough information for a composite to be found but not for its components to be found.
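
A sketch of the persisted-dataId idea, using the coadd's metadata as the carrier; the key name and JSON encoding are assumptions, not an existing convention, and 'coadd' and 'butler' are assumed from context:

import json

# at put time: record each input sensor-level dataId in the coadd metadata
inputs = [{'visit': 1234, 'ccd': 56}, {'visit': 1235, 'ccd': 56}]
md = coadd.getMetadata()
md.set('COMPONENT_DATAIDS', json.dumps(inputs))

# at get time: unpersist the dataIds, then fetch precursor components
# without the caller having to supply them
for componentId in json.loads(md.get('COMPONENT_DATAIDS')):
    wcs = butler.get('calexp_wcs', dataId=componentId)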

Other Use Cases

8. Subsectioning Repositories Along the Dataset Axis (from Simon Krughoff)

The calexp is a composite dataset comprising a MaskedImage and other things like the Psf, Wcs, and Calib objects. The src is effectively a composite dataset (though it is not stored this way) because it needs the Calib object from the calexp in order to put the instrumental fluxes in a calibrated system.

We know of cases where we want to create a new repository with just the src catalogs for all the dataIds because the images are so large. This implies that composite datasets should be able to share components and be able to (un)persist the shared components independently.

Notes Regarding Item 8

calexp is a dataset type, typically associated with an ExposureF or ExposureD object type. src is a dataset type, typically associated with a SourceCatalog object type. Both contain a Calib object, which can be a common component of the two.

Requirements to Satisfy This Use Case

Use: when performing a get operation, while searching for components, butler can look in more than one input repository for the component objects and assemble them assuming a unique match was found for each component object's dataset type and the dataId provided.
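
A sketch of the multi-repository lookup (the repository names are illustrative): 'catalogs_only' holds just the src catalogs, while 'full_repo' still holds the calexps and hence the shared Calib component.

import lsst.daf.persistence as dafPersist

butler = dafPersist.Butler(inputs=['catalogs_only', 'full_repo'])
# a get of the composite src would search both inputs for each component
src = butler.get('src', dataId={'visit': 1234, 'ccd': 56})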

9. WCS Stored Separately from Exposure, and Replacing Existing Images In Exposure (from Jim Bosch)

(This is the original example from Jim Bosch)
Exposure consists of a MaskedImage and an ExposureInfo; the former contains three separate images, and the latter is just a bucket of more complex objects held by std::shared_ptr (Psf, Wcs, Calib, …). All of these are currently persisted together (i.e. it's a type 1 composite). We already have clear use cases for:

  • Defining a new dataset that takes an existing Exposure dataset and replaces its Wcs with one persisted elsewhere.
  • Defining a new dataset that takes an existing Exposure dataset and replaces one or two of its images with one(s) persisted elsewhere. Single-image replacements (specifically mask replacements) are probably the most common.

It might be useful to start with just these two use cases, or it might be useful to start by just decomposing all of Exposure into its constituents.
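
For the mask-replacement case, a composite definition might look like this sketch, following the jointcalexp pattern above (the dataset and plugin names are hypothetical):

calexp_newmask: {
    python:      "lsst.afw.image.ExposureF"
    persistable: "ExposureF"
    composite: {
        mask: {
            datasetType: "improved_mask"
        }
    }
    assembler: "lsst.mypackage.MaskReplacementAssembler"
}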

Requirements to Satisfy This Use Case

10. Full Camera Visit Metadata Shared by All the CCD Exposures Associated With That Visit (from John Parejko)

Our Exposure and SourceCatalog objects are designed to hold the data from one CCD. Much of the exposure metadata (e.g. ExposureInfo) is visit-wide (e.g. observatory information, telescope boresight, some components of the WCS) and we may often want to work with full focal plane visits (e.g. initial astrometric fitting to prevent single CCD catastrophic failures, jointcal's future full focal plane fits). Having a "VisitExposure" type of object would help to manage the data, and would let questions about caching exposures, catalogs and metadata across the visit be managed at the butler level.

 
