This page emerged initially from conversations in the hack sessions at the 2019 PCW. It collects some use cases and indicative pseudocode relating to metadata queries on Gen3 Butler repositories.
Use Cases
Use cases are labeled "Tier 0" through "Tier 3" as a rough indication of priority. "Tier 0" use cases are ones that are already satisfied in some way (even if not directly or elegantly) through public interfaces of the Gen3 middleware, i.e., without the end user coding directly against the registry schema (e.g., by writing SQL).
- Get a list (iterable over strings?) of all the collections within a repository. (Tier 0)
- Get a list (iterable over ?) of all the known DatasetType values within a repository. (Tier 0)
- Get a list (iterable over ?) of all the DatasetType values within a repository for which a dataset actually exists. (Tier 2)
- Queries for a list (iterable over DatasetRef):
  - of all the extant datasets of a particular DatasetType in a repository, or in a named collection within that repository. (Tier 1)
  - of all the extant datasets (i.e., of any DatasetType) in a repository, or in a named collection within that repository. (Tier 2)
  - of all the extant datasets of a set of DatasetTypes, defined either as a list of specific types or as some form of wildcard expression, in a repository, or in a named collection within that repository. The most interesting use case is probably tail wildcarding, e.g., "calexp*". (Tier 2)
  - of all the extant, or, selectably, all the possible, datasets in a repository, or in a named collection within that repository, that are directly associated with a specific DataId or DataId wildcard expression. "Directly associated" means that if, say, a visit ID is specified, only datasets from that visit and that visit alone (e.g., "raw", "calexp", a difference image, a source table) would be reported; "indirectly associated" datasets, such as a coadd tile partially derived from that visit's image, would not be reported. "All the possible datasets" means a list based on all the DatasetType values known to the repository. Optionally, limit the query to a specific DatasetType or a DatasetType wildcard/list. (Tier 2)
  - of all the extant, or, selectably, all the possible, datasets in a repository, or in a named collection within that repository, that are indirectly associated with a specific DataId or DataId wildcard expression. "Indirectly associated" includes relationships based on joins across dimensions, but should not take into account dependencies relating to specific DatasetTypes. That is, a query for "all datasets indirectly associated with patch X" would return, among others, all datasets from visits overlapping the patch, even datasets that are not used in any actual generation of coadds (e.g., a DIASource table dataset). Said another way, "indirectly associated" is derived solely from registry data, not from any pipeline specification or QuantumGraph. Optionally, limit the query to a specific DatasetType or a DatasetType wildcard/list. (Tier 3)
  - of all the datasets "touched" by (i.e., used as inputs, intermediates, or outputs of) a specified QuantumGraph (or a task specification from which one can be derived), whether they already exist or not. (Tier 0)
  - of all the inputs (selectably: ultimate "leaf node" inputs, or all inputs including intermediates) on which a specified dataset (a DataId-DatasetType pair) depends, given a QuantumGraph (or a task specification from which one can be derived), whether they already exist or not. (Tier 0)
  - of all the outputs derived from a specified dataset (a DataId-DatasetType pair), given a QuantumGraph (or a task specification from which one can be derived), whether they already exist or not. (Tier 0)
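The tail-wildcarding mentioned above (e.g., "calexp*") behaves like shell-style pattern matching, which Python's standard fnmatch module already provides. A minimal sketch, using an illustrative list of dataset type names (not taken from any real repository):

```python
from fnmatch import fnmatchcase

# Hypothetical sample of DatasetType names known to a repository.
known_dataset_types = ["calexp", "calexpBackground", "raw", "src", "deepCoadd"]

def match_dataset_types(pattern: str) -> list[str]:
    """Return the dataset type names matching a shell-style wildcard."""
    return sorted(n for n in known_dataset_types if fnmatchcase(n, pattern))

# Tail wildcard as in the use case above:
print(match_dataset_types("calexp*"))  # → ['calexp', 'calexpBackground']
```

A real implementation would more likely translate such patterns into SQL LIKE expressions inside the registry, but the matching semantics would be the same.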
Secondary support use cases
- Given a DatasetRef, without actually loading the dataset into memory, obtain the Python type that would result from a get() on that DatasetRef. (Tier 0)
- Given a DatasetType, obtain the Python type that would result from a get() on a dataset of that type. (Tier 0)
- Given a DatasetRef, determine whether the associated dataset actually exists, without actually loading the dataset into memory. (With the understanding that "actually exists" can therefore only be an estimate, and cannot take into account the possibility that an error might occur if an actual get() were attempted.) (Tier 0)
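The first two use cases amount to following an attribute chain from a DatasetRef (or DatasetType) through its storage class to a Python type, with no I/O at all. A minimal mock of that chain, using toy stand-in classes (the real ones live in lsst.daf.butler; only the attribute names are mirrored here):

```python
from dataclasses import dataclass

# Toy stand-ins for the middleware classes; attribute names mirror the
# chain DatasetRef.datasetType.storageClass.pytype described later on
# this page. The pytype chosen below (dict) is purely illustrative.
@dataclass
class StorageClass:
    name: str
    pytype: type

@dataclass
class DatasetType:
    name: str
    storageClass: StorageClass

@dataclass
class DatasetRef:
    datasetType: DatasetType
    dataId: dict

catalog = StorageClass("SourceCatalog", pytype=dict)
ref = DatasetRef(DatasetType("src", catalog), {"visit": 42})

# The Python type a get() would produce, without loading any dataset:
assert ref.datasetType.storageClass.pytype is dict
```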
API Possibilities
As a starting point, I'm suggesting that these discovery APIs operate on a Butler, either as member functions or as utility functions that take a Butler. However, many, perhaps all, of the capabilities described could be executed directly on a Registry, which is of course available from a Butler. It is an implementation detail whether they are implemented as functions on a Registry, with the Butler versions implemented as simple call-throughs.
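The call-through arrangement can be sketched with toy classes (these are not the real lsst.daf.butler interfaces; only getAllCollections, which this page says already exists on Registry, is borrowed as the example method):

```python
# Sketch of the call-through pattern: the discovery API lives on a toy
# Registry, and the corresponding Butler method simply delegates to it.
class Registry:
    def __init__(self, collections):
        self._collections = set(collections)

    def getAllCollections(self):
        # Returns a set of str, as described in item 1 below.
        return set(self._collections)

class Butler:
    def __init__(self, registry):
        self.registry = registry

    def getAllCollections(self):
        # Simple call-through to the Registry implementation.
        return self.registry.getAllCollections()

butler = Butler(Registry(["HSC/raw", "HSC/calib"]))
assert butler.getAllCollections() == {"HSC/raw", "HSC/calib"}
```

The advantage of this shape is that the Butler surface stays thin: discovery logic has a single home on Registry, and the Butler versions exist purely for user convenience.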
1. Registry.getAllCollections() already exists and returns a set of str; the same API on a Butler would be fine.
2. Registry.getAllDatasetTypes() already exists and returns a frozenset of DatasetType; the same API on a Butler would be fine.
3. No current simple API. Possibly no efficient way to implement this without an exhaustive search? Something like getAllDatasetTypes(extant=True) would be a reasonable API if it could be provided.
4. ...
   a. listDatasets(datasetType=DatasetType) and listDatasets(collection=str, datasetType=DatasetType), returning an iterable of DatasetRef, would be reasonable interfaces.
   b. listDatasets() and listDatasets(collection=str), returning an iterable of DatasetRef, would be reasonable interfaces.
   c. listDatasets(datasetTypeFilter=wildcard-string) and listDatasets(datasetTypeFilter=wildcard-string, collection=str).
   d. listDatasets(dataQuery=str), listDatasets(collection=str, dataQuery=str), listDatasets(dataQuery=str, datasetType=DatasetType), etc., where the dataQuery argument is a string in the same format accepted by the --data-query parameter to the pipetask command.
   e. listRelatedDatasets(collection=str, dataQuery=str, datasetType=DatasetType, maxDepth=int), returning an iterable of (DatasetRef, depth) tuples, where the depth indicates the minimum number of "hops" required to get from a DataId that matches the dataQuery parameter to the DataId of an output dataset. For example, going from a tract/patch specification to a visit is depth 1, going from a visit DataId to a single-frame calibration DataId is depth 1, and going from a tract/patch to a single-frame calibration DataId is depth 2. It is not obvious that there is any requirement for supporting depth > 1 searches at all. Note that the "depth" does not refer in any way to a number of stages of processing needed to get from one dataset to another. An alternative might be to overload listDatasets() (see 4d above) with the maxDepth parameter, defaulting to 0, but the indirect-resolution functionality seems so different from the direct case that a different function name seems warranted.
5. (Appears to be possible already by iterating over the quanta in the graph.)
6. (Appears to be possible already by iterating over the quanta in the graph.)
7. (Appears to be possible already by iterating over the quanta in the graph.)
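To make the proposed signatures concrete, here is a toy in-memory registry implementing getAllDatasetTypes(extant=True) from item 3 and the listDatasets variants of 4a-4c. Everything here is a sketch under stated assumptions: the class, data, and the use of fnmatch for the wildcard filter are illustrative, and a real implementation would translate these calls into registry database queries.

```python
from fnmatch import fnmatchcase

# Toy in-memory registry: maps a collection name to a list of
# (datasetTypeName, dataId) pairs standing in for DatasetRefs.
class ToyRegistry:
    def __init__(self, known_types, datasets):
        self._known_types = frozenset(known_types)
        self._datasets = datasets  # {collection: [(type_name, data_id), ...]}

    def getAllDatasetTypes(self, extant=False):
        if not extant:
            return self._known_types
        # Exhaustive search over all stored datasets, as the text above
        # suspects may be unavoidable for the extant=True case.
        return frozenset(t for refs in self._datasets.values()
                         for t, _ in refs)

    def listDatasets(self, collection=None, datasetType=None,
                     datasetTypeFilter=None):
        # Covers the 4a-4c variants: optional collection, exact type,
        # or shell-style wildcard filter on the type name.
        collections = [collection] if collection else list(self._datasets)
        for c in collections:
            for t, data_id in self._datasets.get(c, []):
                if datasetType is not None and t != datasetType:
                    continue
                if (datasetTypeFilter is not None
                        and not fnmatchcase(t, datasetTypeFilter)):
                    continue
                yield (t, data_id)

registry = ToyRegistry(
    known_types={"raw", "calexp", "deepCoadd"},
    datasets={"HSC/runs/demo": [("raw", {"exposure": 1}),
                                ("calexp", {"visit": 1})]},
)
assert registry.getAllDatasetTypes(extant=True) == {"raw", "calexp"}
assert [t for t, _ in registry.listDatasets(datasetTypeFilter="cal*")] \
    == ["calexp"]
```

The 4d and 4e variants (dataQuery strings and depth-annotated indirect association) are deliberately omitted here, since their semantics depend on the dimension system rather than anything a toy can demonstrate faithfully.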
Secondary cases
1. DatasetRef.datasetType.storageClass.pytype already exists.
2. DatasetType.storageClass.pytype already exists.
3. There are two levels of support for this. Butler.datasetExists tests whether the dataset exists in both the Registry and at least one Datastore. I could imagine the Datastore test being as slow as actually loading the dataset into memory (say, for a Datastore pointing at data on tape); there is also a Registry-only test: Registry.find. A dataset may exist in a Registry but not in a Datastore either because it was (e.g.) a processing intermediate that was not saved in any repository (though the database contains the provenance necessary to reconstruct it exactly), or simply because the Butler client wasn't configured with the Datastore that holds it.