This page emerged initially from conversations in the hack sessions at the 2019 PCW. It collects some use cases and indicative pseudocode relating to metadata queries on Gen3 Butler repositories.
Use Cases
Use cases are labeled "Tier 0" through "Tier 3" as a rough indication of priority. "Tier 0" use cases are ones that are already satisfied in some way (even if not directly or elegantly) through public interfaces of the Gen3 middleware, i.e., without the end user coding directly against the registry schema (e.g., by writing SQL).
- Get a list (iterable over strings?) of all the collections within a repository. (Tier 0)
- Get a list (iterable over ?) of all the known DatasetType values within a repository. (Tier 0)
- Get a list (iterable over ?) of all the DatasetType values within a repository for which a dataset actually exists. (Tier 2)
- Queries for a list (iterable over DatasetRef):
  - of all the extant datasets of a particular DatasetType in a repository, or in a named collection within that repository. (Tier 1)
  - of all the extant datasets (i.e., of any DatasetType) in a repository, or in a named collection within that repository. (Tier 2)
  - of all the extant datasets of a set of DatasetTypes, defined either as a list of specific types or as some form of wildcard expression, in a repository, or in a named collection within that repository. The most interesting use case is probably tail wildcarding, e.g., "calexp*". (Tier 2)
  - of all the extant, or, selectably, all the possible, datasets in a repository, or in a named collection within that repository, that are directly associated with a specific DataId or DataId wildcard expression. "Directly associated" means that if, say, a visit ID is specified, only datasets from that visit and that visit alone (e.g., "raw", "calexp", a difference image, a source table) would be reported; "indirectly associated" datasets, such as a coadd tile partially derived from that visit's image, would not be reported. "All the possible datasets" means a list based on all the DatasetType values known to the repository. Optionally, limit the query to a specific DatasetType or a DatasetType wildcard/list. (Tier 2)
  - of all the extant, or, selectably, all the possible, datasets in a repository, or in a named collection within that repository, that are indirectly associated with a specific DataId or DataId wildcard expression. "Indirectly associated" includes relationships based on joins across dimensions, but should not take into account dependencies relating to specific DatasetTypes. That is, a query for "all datasets indirectly associated with patch X" would return, among others, all datasets from visits overlapping the patch, even datasets that are not used in any actual generation of coadds (e.g., a DIASource table dataset). Said another way, "indirectly associated" is derived solely from registry data, not from any pipeline specification or QuantumGraph. Optionally, limit the query to a specific DatasetType or a DatasetType wildcard/list. (Tier 3)
  - of all the datasets "touched" by (i.e., used as inputs, intermediates, or outputs of) a specified QuantumGraph (or a task specification from which one can be derived), whether they already exist or not. (Tier 0)
  - of all the inputs (selectably: ultimate "leaf node" inputs, or all inputs including intermediates) on which a specified dataset (a DataId-DatasetType pair) depends, given a QuantumGraph (or a task specification from which one can be derived), whether they already exist or not. (Tier 0)
  - of all the outputs derived from a specified dataset (a DataId-DatasetType pair), given a QuantumGraph (or a task specification from which one can be derived), whether they already exist or not. (Tier 0)
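The tail-wildcarding mentioned above (e.g., "calexp*") behaves like shell-style pattern matching, which Python's standard fnmatch module already provides. A minimal sketch, using an illustrative list of dataset type names (not taken from any real repository):

```python
from fnmatch import fnmatchcase

# Hypothetical sample of DatasetType names known to a repository.
known_dataset_types = ["calexp", "calexpBackground", "raw", "src", "deepCoadd"]

def match_dataset_types(pattern: str) -> list[str]:
    """Return the dataset type names matching a shell-style wildcard."""
    return sorted(n for n in known_dataset_types if fnmatchcase(n, pattern))

# Tail wildcard as in the use case above:
print(match_dataset_types("calexp*"))  # → ['calexp', 'calexpBackground']
```

A real implementation would more likely translate such patterns into SQL LIKE expressions inside the registry, but the matching semantics would be the same.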
Secondary support use cases
- Given a DatasetRef, without actually loading the dataset into memory, obtain the Python type that would result from a get() on that DatasetRef. (Tier 0)
- Given a DatasetType, obtain the Python type that would result from a get() on a dataset of that type. (Tier 0)
- Given a DatasetRef, determine whether the associated dataset actually exists, without actually loading the dataset into memory. (With the understanding that "actually exists" can therefore only be an estimate, and cannot take into account the possibility that an error might occur if an actual get() were attempted.) (Tier 0)
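The first two use cases amount to following an attribute chain from a DatasetRef (or DatasetType) through its storage class to a Python type, with no I/O at all. A minimal mock of that chain, using toy stand-in classes (the real ones live in lsst.daf.butler; only the attribute names are mirrored here):

```python
from dataclasses import dataclass

# Toy stand-ins for the middleware classes; attribute names mirror the
# chain DatasetRef.datasetType.storageClass.pytype described later on
# this page. The pytype chosen below (dict) is purely illustrative.
@dataclass
class StorageClass:
    name: str
    pytype: type

@dataclass
class DatasetType:
    name: str
    storageClass: StorageClass

@dataclass
class DatasetRef:
    datasetType: DatasetType
    dataId: dict

catalog = StorageClass("SourceCatalog", pytype=dict)
ref = DatasetRef(DatasetType("src", catalog), {"visit": 42})

# The Python type a get() would produce, without loading any dataset:
assert ref.datasetType.storageClass.pytype is dict
```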
API Possibilities
As a starting point, I'm suggesting that these discovery APIs operate on a Butler, either as member functions or as utility functions that take a Butler. However, many, perhaps all, of the capabilities described could be executed directly on a Registry, which is of course available from a Butler. It is an implementation detail whether they are implemented as functions on a Registry, with the Butler versions implemented as simple call-throughs.
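The call-through arrangement can be sketched with toy classes (these are not the real lsst.daf.butler interfaces; only getAllCollections, which this page says already exists on Registry, is borrowed as the example method):

```python
# Sketch of the call-through pattern: the discovery API lives on a toy
# Registry, and the corresponding Butler method simply delegates to it.
class Registry:
    def __init__(self, collections):
        self._collections = set(collections)

    def getAllCollections(self):
        # Returns a set of str, as described in item 1 below.
        return set(self._collections)

class Butler:
    def __init__(self, registry):
        self.registry = registry

    def getAllCollections(self):
        # Simple call-through to the Registry implementation.
        return self.registry.getAllCollections()

butler = Butler(Registry(["HSC/raw", "HSC/calib"]))
assert butler.getAllCollections() == {"HSC/raw", "HSC/calib"}
```

The advantage of this shape is that the Butler surface stays thin: discovery logic has a single home on Registry, and the Butler versions exist purely for user convenience.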
1. Registry.getAllCollections() already exists and returns a set of str; the same API on a Butler would be fine.
2. Registry.getAllDatasetTypes() already exists and returns a frozenset of DatasetType; the same API on a Butler would be fine.
3. No current simple API. Possibly no efficient way to implement this without an exhaustive search? Something like getAllDatasetTypes(extant=True) would be a reasonable API if it could be provided.
4. ...
   a. listDatasets(datasetType=DatasetType) and listDatasets(collection=str, datasetType=DatasetType), returning an iterable of DatasetRef, would be reasonable interfaces.
   b. listDatasets() and listDatasets(collection=str), returning an iterable of DatasetRef, would be reasonable interfaces.
   c. listDatasets(datasetTypeFilter=wildcard-string) and listDatasets(datasetTypeFilter=wildcard-string, collection=str).
   d. listDatasets(dataQuery=str), listDatasets(collection=str, dataQuery=str), listDatasets(dataQuery=str, datasetType=DatasetType), etc., where the dataQuery argument is a string in the same format accepted by the --data-query parameter to the pipetask command.
   e. listRelatedDatasets(collection=str, dataQuery=str, datasetType=DatasetType, maxDepth=int), returning an iterable of (DatasetRef, depth) tuples, where the depth indicates the minimum number of "hops" required to get from a DataId that matches the dataQuery parameter to the DataId of an output dataset. For example, going from a tract/patch specification to a visit is depth 1, going from a visit DataId to a single-frame calibration DataId is depth 1, and going from a tract/patch to a single-frame calibration DataId is depth 2. It is not obvious that there is any requirement for supporting depth > 1 searches at all. Note that the "depth" does not refer in any way to a number of stages of processing needed to get from one dataset to another. An alternative might be to overload listDatasets() (see 4d above) with the maxDepth parameter, defaulting to 0, but the indirect-resolution functionality seems so different from the direct case that a different function name seems warranted.
5. (Appears to be possible already by iterating over the quanta in the graph.)
6. (Appears to be possible already by iterating over the quanta in the graph.)
7. (Appears to be possible already by iterating over the quanta in the graph.)
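To make the proposed signatures concrete, here is a toy in-memory registry implementing getAllDatasetTypes(extant=True) from item 3 and the listDatasets variants of 4a-4c. Everything here is a sketch under stated assumptions: the class, data, and the use of fnmatch for the wildcard filter are illustrative, and a real implementation would translate these calls into registry database queries.

```python
from fnmatch import fnmatchcase

# Toy in-memory registry: maps a collection name to a list of
# (datasetTypeName, dataId) pairs standing in for DatasetRefs.
class ToyRegistry:
    def __init__(self, known_types, datasets):
        self._known_types = frozenset(known_types)
        self._datasets = datasets  # {collection: [(type_name, data_id), ...]}

    def getAllDatasetTypes(self, extant=False):
        if not extant:
            return self._known_types
        # Exhaustive search over all stored datasets, as the text above
        # suspects may be unavoidable for the extant=True case.
        return frozenset(t for refs in self._datasets.values()
                         for t, _ in refs)

    def listDatasets(self, collection=None, datasetType=None,
                     datasetTypeFilter=None):
        # Covers the 4a-4c variants: optional collection, exact type,
        # or shell-style wildcard filter on the type name.
        collections = [collection] if collection else list(self._datasets)
        for c in collections:
            for t, data_id in self._datasets.get(c, []):
                if datasetType is not None and t != datasetType:
                    continue
                if (datasetTypeFilter is not None
                        and not fnmatchcase(t, datasetTypeFilter)):
                    continue
                yield (t, data_id)

registry = ToyRegistry(
    known_types={"raw", "calexp", "deepCoadd"},
    datasets={"HSC/runs/demo": [("raw", {"exposure": 1}),
                                ("calexp", {"visit": 1})]},
)
assert registry.getAllDatasetTypes(extant=True) == {"raw", "calexp"}
assert [t for t, _ in registry.listDatasets(datasetTypeFilter="cal*")] \
    == ["calexp"]
```

The 4d and 4e variants (dataQuery strings and depth-annotated indirect association) are deliberately omitted here, since their semantics depend on the dimension system rather than anything a toy can demonstrate faithfully.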
Secondary cases
1. DatasetRef.datasetType.storageClass.pytype already exists.
2. DatasetType.storageClass.pytype already exists.
3. There are two levels of support for this. Butler.datasetExists tests whether the dataset exists in both the Registry and at least one Datastore. I could imagine the Datastore test being as slow as actually loading the dataset into memory (say, for a Datastore pointing at data on tape); there is also a Registry-only test: Registry.find. A dataset may exist in a Registry but not in a Datastore either because it was (e.g.) a processing intermediate that was not saved in any repository (though the database contains the provenance necessary to reconstruct it exactly), or simply because the Butler client wasn't configured with the Datastore that holds it.