What is the Data Butler

  • Manages repositories of datasets
  • Finds datasets by scientifically-meaningful key/value pairs
  • Can automatically "rendezvous" one dataset with another based on key values
    • Example: calibrations linked by time
  • Retrieves datasets as in-memory objects
  • Persists in-memory objects to datasets
  • Implemented in Python; no access from C++
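
For concreteness, a minimal usage sketch of the interface defined later on this page; the repository paths and the DataId keys (visit, ccd) are hypothetical:

    # Open an output repository with one read-only input repository
    # (paths, roles, and keys below are invented for illustration).
    butler = Butler("file:///repo/run1",
                    inputRepos={"raw": "file:///repo/raw"})

    # Retrieve a dataset as an in-memory object by meaningful keys.
    calexp = butler.get("calexp", visit=12345, ccd=7)

    # Persist an in-memory object as a dataset of some type.
    butler.put(calexp, "calexp", visit=12345, ccd=7)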

Definitions

Repository

  • Collection of datasets
  • Configuration for accessing datasets
  • Metadata databases for finding datasets
  • A version (e.g., as of a particular time)

Dataset

  • The persisted form of an in-memory object
  • Can be a single item, a composite, or a collection
  • Examples: int/long, PropertySet, ExposureF, WCS, PSF, set/list/dict

Persistable class

  • A Python class (often SWIGged from C++) that can be persisted and retrieved
  • Must provide methods for doing persistence and retrieval

Dataset type

  • A label given to a group of datasets reflecting their meaning or usage
  • Used by convention by Tasks for their inputs and outputs
  • Examples: calexp, src, icSrc

Dataset class

  • A labeled set of basic access characteristics serving as the basis for a group of dataset types
  • Used to define new dataset types

Storage

  • A mechanism for reading a dataset into an in-memory object and for writing an in-memory object out as a dataset
  • Examples: FitsStorage, SqlStorage
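
The Storage contract is not spelled out on this page; the sketch below only illustrates the idea, and the class layout and method names are assumptions:

    class FitsStorage:
        """Sketch: translates between a persisted dataset and an
        in-memory object (names and signatures are assumed)."""

        def read(self, butlerLocation):
            # Open the file(s) named by the location and build the
            # in-memory object of the Python type it specifies.
            ...

        def write(self, butlerLocation, obj):
            # Serialize obj to FITS at the path(s) named by the location.
            ...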

Transport

  • A mechanism for providing access to data
  • Examples: file:, http:, sqlite:, mysql:

DataId

  • A dictionary of key/value pairs
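
For example (the keys are camera-specific and purely illustrative):

    dataId = {"visit": 12345, "ccd": 7, "filter": "r"}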

DataRef

  • A DataId packaged with a Butler for access to datasets
  • Can be used with multiple dataset types (if the keys are appropriate)

DataRefSet

  • Logically, a set of DataRefs
  • May be implemented as an iterator/generator
  • Based on an input dataset type, but DataRefs can be used with other dataset types
  • All DataRefs point to an existing input dataset at time of generation
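
A sketch of intended usage, assuming the Butler interface below plus get/put convenience methods on DataRef (the match_to_reference step is invented):

    # Enumerate DataRefs for all existing "src" datasets in one filter.
    for ref in butler.getRefSet("src", filter="r"):
        src = ref.get("src")                # read via the originating type
        matched = match_to_reference(src)   # hypothetical processing
        ref.put(matched, "srcMatch")        # same keys, different type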

Mapper

  • Not used by application code; only used via Butler
  • Driven by per-repository configuration
  • Camera-specific subclasses recorded in repository configuration
  • Obtains a location template based on dataset type
    • Inherits from dataset class
    • Includes a URL path with transport, a storage method, and optionally a Python type
    • Includes filesystem locations and database tables/queries
    • Read-only and write-only types
  • Augments an input DataId with additional key/value pairs (fixed and/or looked up as needed) required to fill in the location template (see the template sketch after this list)
    • Queries registry databases in input repositories as needed
    • Globs in filesystem if needed
  • Expands location template into a ButlerLocation
  • Optionally can provide methods for standardizing (post-processing) retrieved data
  • Can be used to bidirectionally map DataIds to numeric identifiers
    • By treating numeric identifier as a dataset or as a single DataId key's value
    • Uses special IdStorage
  • Provides utilities for subclasses
    • Maintain templates for dataset types in repository configuration
    • Look up key/value pairs using equality or range joins in registry databases
    • Glob for key/value pairs in filesystem
    • Record metadata of new datasets in registries
    • Maintain registry of registries
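
To make the template mechanism concrete, a toy expansion using Python %-formatting; the template syntax and the "calexp" template itself are assumptions, since templates live in repository configuration:

    # Hypothetical location template for the "calexp" dataset type.
    template = "calexp/v%(visit)07d-f%(filter)s/c%(ccd)02d.fits"

    # The Mapper fills the template from the expanded DataId; a missing
    # key (e.g. "filter") would first be looked up in a registry.
    dataId = {"visit": 12345, "ccd": 7, "filter": "r"}
    path = template % dataId   # -> "calexp/v0012345-fr/c07.fits"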

ButlerLocation

  • All location information needed for a Storage
  • May include:
    • Expanded path template(s)
    • Python object class name
    • Storage class name
    • DataId
    • Additional key/value pairs
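
Conceptually this could be as simple as the following record; the field names are assumptions for illustration:

    class ButlerLocation:
        """Sketch: the bundle of information a Storage needs."""
        def __init__(self, paths, pythonType, storageName, dataId,
                     extra=None):
            self.paths = paths              # expanded path template(s)
            self.pythonType = pythonType    # Python object class name
            self.storageName = storageName  # Storage class name
            self.dataId = dataId            # the (expanded) DataId
            self.extra = extra or {}        # additional key/value pairs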

Butler

  • Obtains mapper class name from repository
  • Calls Mapper to translate DataId into location
  • Calls appropriate Storage to retrieve or persist data
  • Repository identified by root (URL) path
  • Zero or more read-only input repositories
    • Input repositories identified by role
    • Role can be used in output repository configuration
  • One output repository
    • Input repositories recorded in output repository with roles
    • Initial output repository configuration derived from camera-specific defaults and input repository overrides
    • User can provide overrides for output repository configuration
    • Tasks can add to output repository configuration
  • Calibration (and other?) repositories permitted
  • Input and output repositories
    • Provides utility for searching read-only parent repositories

Butler Interface

  • __init__(outputRepo, inputRepos=None)
    • outputRepo is a repository URL (string)
    • inputRepos is a map from role name (string) to repository URL
  • get(self, datasetType, dataId={}, **kwArgs)
    • returns object retrieved using dataId with keyword argument overrides
  • put(self, obj, datasetType, dataId={}, **kwArgs)
    • persists obj using dataId with keyword argument overrides
    • The Butler (in practice, either its Mapper or a Storage) may notice that an identical obj has been persisted before and skip persisting it again
      • This also applies to components of composite objs, which can be persisted as a reference to the original
  • getKeys(self, datasetType=None)
    • returns list of DataId keys appropriate for datasetType or all keys known for output repository
  • getDatasetTypes(self)
    • returns list of known dataset types
  • createDatasetType(self, datasetType, datasetClass, pathTemplate, **kwArgs)
    • creates a new datasetType based on the datasetClass using the provided path template and keyword arguments
  • getRefSet(self, datasetType, partialDataId={}, **kwArgs)
    • returns DataRefSet enumerating all existing datasets of datasetType using partialDataId with keyword argument overrides
  • defineAlias(alias, datasetType)
    • Henceforth, any use of "@alias" with this Butler becomes equivalent to datasetType
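
Putting the interface together, a hedged end-to-end sketch; the repository URLs, the dataset class name, and the compute_metadata helper are invented for illustration:

    butler = Butler("file:///repo/run1",
                    inputRepos={"raw": "file:///repo/raw",
                                "calib": "file:///repo/calib"})

    # Define a new dataset type from a dataset class and a path template.
    butler.createDatasetType("visitMetadata", "PropertySetClass",
                             "metadata/v%(visit)07d.boost")

    # Alias: "@source" now resolves to "src" for this Butler.
    butler.defineAlias("source", "src")

    for ref in butler.getRefSet("@source", filter="r"):
        src = ref.get("@source")
        md = compute_metadata(src)    # hypothetical processing
        ref.put(md, "visitMetadata")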

What's new?

  • ButlerFactory is gone.
  • getDatasetTypes and createDatasetType are passed through from the Mapper.
  • subset is renamed to getRefSet to be more descriptive; its functionality subsumes the old queryMetadata.
  • datasetExists is gone, since getRefSet only returns datasets that exist.
  • level arguments have been removed, as the concept turned out to be useless in practice.
  • put can handle duplicates (for configurations and provenance or for sharing objects).
  • Much-requested dataset type aliasing facility enables Tasks to handle, e.g., any "src"-like dataset.

Mapper Interface

Interface used by Butler:

  • __init__(self, repo)
    • repo is an output repository
  • map(self, datasetType, dataId)
    • returns the ButlerLocation corresponding to the dataId for the given datasetType
  • getKeys(self, datasetType)
    • returns list of DataId keys appropriate for datasetType or all keys known for output repository
  • getDatasetTypes(self)
    • returns list of known dataset types
  • createDatasetType(self, datasetType, datasetClass, **kwArgs)
    • creates a new datasetType based on the datasetClass using the keyword arguments
  • listDatasets(self, datasetType, partialDataId={}, **kwArgs)
    • returns or generates set of DataIds enumerating all existing datasets of datasetType using partialDataId with keyword argument overrides
  • canStandardize(self, datasetType)
    • returns True if the datasetType can be standardized
  • standardize(self, obj, datasetType, dataId)
    • returns the standardized version of obj, given its datasetType and dataId
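
A minimal subclass sketch conforming to the interface above; the base class, the template table, and the ButlerLocation construction are assumptions (see the ButlerLocation sketch earlier):

    class MinimalMapper(Mapper):   # base class provided by the framework
        # Templates would normally come from repository configuration.
        TEMPLATES = {"calexp": "calexp/v%(visit)07d/c%(ccd)02d.fits"}

        def __init__(self, repo):
            self.repo = repo       # the output repository

        def getDatasetTypes(self):
            return list(self.TEMPLATES)

        def getKeys(self, datasetType=None):
            return ["visit", "ccd"]

        def map(self, datasetType, dataId):
            # A real Mapper would expand dataId via registry lookups first.
            path = self.TEMPLATES[datasetType] % dataId
            return ButlerLocation(paths=[path], pythonType="ExposureF",
                                  storageName="FitsStorage", dataId=dataId)

        def canStandardize(self, datasetType):
            return False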

Interface for subclasses:

  • (TBWritten)

What's new?

  • createDatasetType has been added.
  • listDatasets replaces the old queryMetadata.
  • validate was never used and is gone.  (It was originally supposed to do something like CameraMapper's Mapping's need().)
  • Much lookup functionality is now intended to be performed by custom Storage classes, like IdStorage.

2 Comments

  1. In a chat on the "Butler design 2015-06-02" room on HipChat (wishing it were easier to embed a reference), K-T said

    The original intent was to allow multiple Old Butlers per process.  But I think the New Butler is going to have to disallow this on the grounds that provenance and other tracking will be impossible, unless we behind-the-scenes link the two Butler objects together with common state (and configuration).  Thanks for pointing this out.

    Butlers are created by a constructor call that just takes a string (or strings):

    • __init__(outputRepo, inputRepos=None)
      • outputRepo is a repository URL (string)
      • inputRepos is a map from role name (string) to repository URL

    How will creating multiple instances be avoided if the constructor is so simple?  Have every instance be a thin facade for a singleton?

  2. It appears that the old "immediate" and delayed-loading proxy concepts have been removed from the interface.  (Presumably there is still nothing stopping the Butler configuration for a particular dataset type in a particular repository from returning a proxy.)

    In discussions with others this week it has occurred to me that it might be of interest to retain some equivalent of either the proxy or datasetExists() so that a butler-using unit of work could be interrogated by a processing control framework along the lines of "if I were to call you, what I/O would you do?".

    This comment is basically a note to myself and Kian-Tat Lim to discuss this in more detail later.