What is the Data Butler
- Manages repositories of datasets
- Finds datasets by scientifically-meaningful key/value pairs
- Can automatically "rendezvous" one dataset with another based on key values
- Example: calibrations linked by time
- Retrieves datasets as in-memory objects
- Persists in-memory objects to datasets
- Implemented in Python; no access from C++
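The "rendezvous" idea above (a calibration linked to an observation by time) might be sketched as follows. This is only an illustration under stated assumptions: the function name rendezvousByTime, the (validFrom, name) tuples, and the integer timestamps are all hypothetical and are not part of the Butler design.

```python
import bisect


def rendezvousByTime(calibs, obsTime):
    """Pick the calibration whose validity started most recently before obsTime.

    calibs is a list of (validFrom, name) tuples sorted by validFrom.
    Both the layout and the function name are hypothetical.
    """
    starts = [start for start, _ in calibs]
    # Index of the last calibration with validFrom <= obsTime.
    index = bisect.bisect_right(starts, obsTime) - 1
    if index < 0:
        raise LookupError("no calibration valid at %r" % (obsTime,))
    return calibs[index][1]


# Hypothetical calibration validity table.
calibs = [(100, "bias-A"), (200, "bias-B"), (300, "bias-C")]
chosen = rendezvousByTime(calibs, 250)  # "bias-B"
```

In the design described here the caller would not perform this lookup by hand; the point is only to show what linking datasets by a key value means.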
Definitions
Repository
- Collection of datasets
- Configuration for accessing datasets
- Metadata databases for finding datasets
- Version (e.g. as of particular time)
Dataset
- The persisted form of an in-memory object
- Can be a single item, a composite, or a collection
- Examples: int/long, PropertySet, ExposureF, WCS, PSF, set/list/dict
Persistable class
- A Python class (often SWIGged from C++) that can be persisted and retrieved
- Must provide methods for doing persistence and retrieval
Dataset type
- A label given to a group of datasets reflecting their meaning or usage
- Used by convention by Tasks for their inputs and outputs
- Examples: calexp, src, icSrc
Dataset class
- A labeled set of basic access characteristics serving as the basis for a group of dataset types
- Used to define new dataset types
Storage
- A mechanism for reading/writing a dataset to/from an in-memory object
- Examples: FitsStorage, SqlStorage
Transport
- A mechanism for providing access to data
- Examples: file:, http:, sqlite:, mysql:
DataId
- A dictionary of key/value pairs
DataRef
- A DataId packaged with a Butler for access to datasets
- Can be used with multiple dataset types (if the keys are appropriate)
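A minimal sketch of the DataId and DataRef definitions above, assuming a toy dict-backed butler. ToyButler and DataRefSketch are hypothetical names used only for illustration; the real classes are not specified on this page beyond the bullets above.

```python
class ToyButler:
    """Hypothetical stand-in for a Butler, backed by an in-memory dict."""

    def __init__(self):
        self._store = {}

    def put(self, obj, datasetType, dataId):
        self._store[(datasetType, frozenset(dataId.items()))] = obj

    def get(self, datasetType, dataId):
        return self._store[(datasetType, frozenset(dataId.items()))]


class DataRefSketch:
    """A DataId packaged with a butler (illustrative only)."""

    def __init__(self, butler, dataId):
        self.butler = butler
        self.dataId = dict(dataId)  # a DataId is a dictionary of key/value pairs

    def get(self, datasetType):
        return self.butler.get(datasetType, self.dataId)

    def put(self, obj, datasetType):
        self.butler.put(obj, datasetType, self.dataId)


ref = DataRefSketch(ToyButler(), {"visit": 1, "ccd": 2})
# The same reference works with more than one dataset type,
# because both types are keyed by visit and ccd here.
ref.put("pixel data", "calexp")
ref.put(["source 1", "source 2"], "src")
```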
DataRefSet
- Logically, a set of DataRefs
- May be implemented as an iterator/generator
- Based on an input dataset type, but DataRefs can be used with other dataset types
- All DataRefs point to an existing input dataset at time of generation
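The generator option mentioned above could look roughly like this. It is a sketch under assumptions: iterMatchingDataIds is a hypothetical helper, the list of existing DataIds stands in for whatever the repository reports, and it yields plain DataIds rather than full DataRefs to stay self-contained.

```python
def iterMatchingDataIds(existing, partialDataId=None, **kwArgs):
    """Lazily yield the existing DataIds that match a partial DataId.

    Keyword arguments override entries in partialDataId.
    """
    wanted = dict(partialDataId or {})
    wanted.update(kwArgs)
    for dataId in existing:
        if all(dataId.get(key) == value for key, value in wanted.items()):
            yield dict(dataId)


# Hypothetical list of DataIds for datasets that already exist.
existing = [
    {"visit": 1, "ccd": 1},
    {"visit": 1, "ccd": 2},
    {"visit": 2, "ccd": 1},
]
visitOne = list(iterMatchingDataIds(existing, {"visit": 1}))
oneCcd = list(iterMatchingDataIds(existing, {"visit": 1}, ccd=2))
```

Because only DataIds drawn from the existing list are yielded, every result refers to a dataset that existed when it was generated, which is the property the last bullet above asks for.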
Mapper
- Not used by application code; only used via Butler
- Driven by per-repository configuration
- Camera-specific subclasses recorded in repository configuration
- Obtains a location template based on dataset type
- Inherits from dataset class
- Includes URL path with transport, storage method, optionally Python type
- Includes filesystem locations and database tables/queries
- Read-only and write-only types
- Expands an input DataId with additional key/value pairs (fixed and/or as-needed) needed to expand location template
- Queries registry databases in input repositories as needed
- Globs in filesystem if needed
- Expands location template into a ButlerLocation
- Optionally can provide methods for standardizing (post-processing) retrieved data
- Can be used to bidirectionally map DataIds to numeric identifiers
- By treating numeric identifier as a dataset or as a single DataId key's value
- Uses special IdStorage
- Provides utilities for subclasses:
- Maintain templates for dataset types in repository configuration
- Look up key/value pairs using equality or range joins in registry databases
- Glob for key/value pairs in filesystem
- Record metadata of new datasets in registries
- Maintain registry of registries
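The two expansion steps described above (filling in missing DataId keys from a registry database, then expanding the location template) might be sketched like this. Everything specific here is an assumption made for illustration: the raw_visit table, the filterName column, the template syntax, and the helper names are not taken from the design.

```python
import sqlite3


def expandDataId(registry, dataId):
    """Add a filterName key by looking the visit up in a registry table."""
    expanded = dict(dataId)
    if "filterName" not in expanded:
        row = registry.execute(
            "SELECT filterName FROM raw_visit WHERE visit = ?",
            (expanded["visit"],),
        ).fetchone()
        if row is None:
            raise LookupError("visit %r not in registry" % (expanded["visit"],))
        expanded["filterName"] = row[0]
    return expanded


def expandTemplate(template, dataId):
    """Fill a location template from an expanded DataId."""
    return template.format(**dataId)


# Hypothetical in-memory registry database.
registry = sqlite3.connect(":memory:")
registry.execute("CREATE TABLE raw_visit (visit INTEGER PRIMARY KEY, filterName TEXT)")
registry.execute("INSERT INTO raw_visit VALUES (1234, 'r')")

# Hypothetical template: transport prefix, then a path built from DataId keys.
template = "file:calexp/{filterName}/v{visit:07d}/c{ccd:02d}.fits"
fullId = expandDataId(registry, {"visit": 1234, "ccd": 5})
path = expandTemplate(template, fullId)  # "file:calexp/r/v0001234/c05.fits"
```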
ButlerLocation
- All location information needed for a Storage
- May include:
- Expanded path template(s)
- Python object class name
- Storage class name
- DataId
- Additional key/value pairs
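As a rough picture of the contents listed above, a ButlerLocation could be modeled as a simple record. The class name ButlerLocationSketch and all field names are hypothetical; only the kinds of information come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class ButlerLocationSketch:
    """Illustrative record of what a Storage would need (field names assumed)."""

    paths: list        # expanded path template(s)
    pythonType: str    # Python object class name
    storageName: str   # Storage class name
    dataId: dict       # the DataId that was mapped
    additionalData: dict = field(default_factory=dict)  # extra key/value pairs


location = ButlerLocationSketch(
    paths=["file:calexp/v0001234/c05.fits"],  # hypothetical expanded path
    pythonType="ExposureF",
    storageName="FitsStorage",
    dataId={"visit": 1234, "ccd": 5},
)
```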
Butler
- Obtains mapper class name from repository
- Calls Mapper to translate DataId into location
- Calls appropriate Storage to retrieve or persist data
- Repository identified by root (URL) path
- Zero or more read-only input repositories
- Input repositories identified by role
- Role can be used in output repository configuration
- One output repository
- Input repositories recorded in output repository with roles
- Initial output repository configuration derived from camera-specific defaults and input repository overrides
- User can provide overrides for output repository configuration
- Tasks can add to output repository configuration
- Calibration (and other?) repositories permitted
- Input and output repositories
- Provides utility for searching read-only parent repositories
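The repository arrangement above (one output repository plus input repositories identified by role) suggests a lookup along the following lines. This is a sketch only: findDataset is a hypothetical helper, repositories are modeled as plain dicts, and the search order (output first, then inputs in the order given) is an assumption that the page does not state.

```python
def findDataset(key, outputRepo, inputRepos):
    """Return (role, value) for the first repository that holds key.

    Search order is assumed: the output repository, then each input
    repository in the order given.
    """
    if key in outputRepo:
        return "output", outputRepo[key]
    for role, repo in inputRepos.items():
        if key in repo:
            return role, repo[key]
    raise LookupError("dataset %r not found in any repository" % (key,))


# Hypothetical repositories, keyed by (dataset type, visit).
outputRepo = {}
inputRepos = {
    "raw": {("raw", 1): "raw pixels"},
    "calib": {("bias", 1): "bias frame"},
}
role, value = findDataset(("bias", 1), outputRepo, inputRepos)
```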
Butler Interface
__init__(self, outputRepo, inputRepos=None)
- outputRepo is a repository URL (string)
- inputRepos is a map from role name (string) to repository URL
get(self, datasetType, dataId={}, **kwArgs)
- returns object retrieved using dataId with keyword argument overrides
put(self, obj, datasetType, dataId={}, **kwArgs)
- persists obj using dataId with keyword argument overrides
- The Butler (or actually either its Mapper or a Storage) is allowed to notice that the identical obj has been persisted before and not persist it again
- This also applies to components of composite objs, which can be persisted as a reference to the original
getKeys(self, datasetType=None)
- returns list of DataId keys appropriate for datasetType or all keys known for output repository
getDatasetTypes(self)
- returns list of known dataset types
createDatasetType(self, datasetType, datasetClass, pathTemplate, **kwArgs)
- creates a new datasetType based on the datasetClass using the provided path template and keyword arguments
getRefSet(self, datasetType, partialDataId={}, **kwArgs)
- returns DataRefSet enumerating all existing datasets of datasetType using partialDataId with keyword argument overrides
defineAlias(self, alias, datasetType)
- Henceforth, any use of "@alias" with this Butler becomes equivalent to datasetType
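To make the get, put, and defineAlias semantics above concrete, here is a toy dict-backed sketch. It is not the real implementation: SketchButler, its internal dictionaries, and the repository URLs are hypothetical, and the Mapper and Storage layers are omitted entirely.

```python
class SketchButler:
    """Toy stand-in showing keyword overrides and "@alias" resolution."""

    def __init__(self, outputRepo, inputRepos=None):
        self.outputRepo = outputRepo
        self.inputRepos = dict(inputRepos or {})
        self._aliases = {}
        self._datasets = {}

    def defineAlias(self, alias, datasetType):
        self._aliases["@" + alias] = datasetType

    def _resolve(self, datasetType):
        # "@alias" becomes the dataset type it was defined as.
        return self._aliases.get(datasetType, datasetType)

    @staticmethod
    def _key(dataId, kwArgs):
        merged = dict(dataId or {})
        merged.update(kwArgs)  # keyword arguments override dataId entries
        return tuple(sorted(merged.items()))

    # dataId defaults to None here (rather than {}) to avoid a shared
    # mutable default in this sketch.
    def put(self, obj, datasetType, dataId=None, **kwArgs):
        key = (self._resolve(datasetType), self._key(dataId, kwArgs))
        self._datasets[key] = obj

    def get(self, datasetType, dataId=None, **kwArgs):
        key = (self._resolve(datasetType), self._key(dataId, kwArgs))
        return self._datasets[key]


# Hypothetical repository URLs and role name.
butler = SketchButler("file:/tmp/output", {"raw": "file:/tmp/input"})
butler.defineAlias("sources", "icSrc")
butler.put(["source table"], "@sources", {"visit": 1, "ccd": 2})
sources = butler.get("icSrc", {"visit": 1}, ccd=2)
```

A Task written against "@sources" would then work unchanged if the alias were pointed at src instead of icSrc.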
What's new?
- ButlerFactory is gone.
- getDatasetTypes and createDatasetType are passed through from the Mapper.
- subset is renamed to getRefSet to be more descriptive; its functionality subsumes the old queryMetadata.
- datasetExists is gone, since getRefSet only returns datasets that exist.
- level arguments have been removed, as the concept turned out to be useless in practice.
- put can handle duplicates (for configurations and provenance or for sharing objects).
- Much-requested dataset type aliasing facility enables Tasks to handle, e.g., any "src"-like dataset.
Mapper Interface
Interface used by Butler:
__init__(self, repo)
- repo is an output repository
map(self, datasetType, dataId)
- returns the ButlerLocation corresponding to the dataId for the given datasetType
getKeys(self, datasetType)
- returns list of DataId keys appropriate for datasetType or all keys known for output repository
getDatasetTypes(self)
- returns list of known dataset types
createDatasetType(self, datasetType, datasetClass, **kwArgs)
- creates a new datasetType based on the datasetClass using the keyword arguments
listDatasets(self, datasetType, partialDataId={}, **kwArgs)
- returns or generates set of DataIds enumerating all existing datasets of datasetType using partialDataId with keyword argument overrides
canStandardize(self, datasetType)
- returns True if the datasetType can be standardized
standardize(self, obj, datasetType, dataId)
- returns the standardized version of obj, given its datasetType and dataId
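One plausible way a Butler could drive the map, canStandardize, and standardize methods above during retrieval is sketched below. ToyMapper, retrieve, and the dict standing in for storage are hypothetical, and map returns a plain string here rather than a ButlerLocation to keep the example short.

```python
class ToyMapper:
    """Hypothetical mapper in which only "raw" datasets are standardized."""

    def map(self, datasetType, dataId):
        # Stand-in for a ButlerLocation: just a lookup key.
        return "%s/%d" % (datasetType, dataId["visit"])

    def canStandardize(self, datasetType):
        return datasetType == "raw"

    def standardize(self, obj, datasetType, dataId):
        # Post-process the retrieved object; here, attach the visit number.
        return dict(obj, visit=dataId["visit"])


def retrieve(mapper, readFromStorage, datasetType, dataId):
    """Map to a location, read it, then standardize if the mapper can."""
    location = mapper.map(datasetType, dataId)
    obj = readFromStorage(location)
    if mapper.canStandardize(datasetType):
        obj = mapper.standardize(obj, datasetType, dataId)
    return obj


# A dict stands in for a Storage that reads from some location.
files = {"raw/7": {"pixels": [1, 2, 3]}, "calexp/7": {"pixels": [4, 5, 6]}}
raw = retrieve(ToyMapper(), files.__getitem__, "raw", {"visit": 7})
calexp = retrieve(ToyMapper(), files.__getitem__, "calexp", {"visit": 7})
```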
Interface for subclasses:
- (TBWritten)
What's new?
- createDatasetType has been added.
- listDatasets replaces the old queryMetadata.
- validate was never used and is gone. (It was originally supposed to do something like CameraMapper's Mapping's need().)
- Much lookup functionality is now intended to be performed by custom Storage classes, like IdStorage.
2 Comments
Gregory Dubois-Felsmann
In a chat on the "Butler design 2015-06-02" room on HipChat (wishing it were easier to embed a reference), K-T said
Butlers are created by a constructor call that just takes a string (or strings):
How will creating multiple instances be avoided if the constructor is so simple? Have every instance be a thin facade for a singleton?
Gregory Dubois-Felsmann
It appears that the old "immediate" and delayed-loading proxy concepts have been removed from the interface. (Presumably there is still nothing stopping the Butler configuration for a particular dataset type in a particular repository from returning a proxy.)
In discussions with others this week it has occurred to me that it might be of interest to retain some equivalent of either the proxy or datasetExists() so that a butler-using unit of work could be interrogated by a processing control framework along the lines of "if I were to call you what I/O would you do?".
This comment is basically a note to myself and Kian-Tat Lim to discuss this in more detail later.