Whilst PosixDatastore works reasonably well, there are a number of open questions about how datastores should behave when they try to do something clever. This page collects some of the open issues.

InMemoryDatastore is very convenient now that per-datasetType acceptance can control which datasets use the datastore. There are still problems, though.

  1. If an in-memory datastore is the only datastore receiving a dataset, what happens to the registry when the process ends?  There has been some discussion indicating that registry should never delete datasets: deletion should solely involve removal from the datastore and removal from the dataset_storage table.  Currently, if the process ends, the dataset_storage table will not be updated and will still list datasets that are no longer present.  One solution could be to give the Datastore constructor two Registry objects (one ephemeral, presumably an in-memory SQLite registry, and one permanent) and let the datastore decide which one to use.
  2. In-memory datastores do not currently do anything with the datasets they receive other than store them in an internal dict.  The object returned is exactly the object that was given.  This leads to the possibility that a Python object is retrieved from the datastore, updated by the user, and stored as a different dataset type, so the original object will have changed by the time it is next retrieved.  Should the in-memory datastore always do a deepcopy on get?  Should that be configurable per dataset type (a deepcopy might take a while for large images)? Does that require that every Python type supported by butler support deepcopy?
  3. Should the in-memory datastore support cache expiry? Only keep N items? Remove an item when it is retrieved (allowing an intermediate dataset to be stored and retrieved almost instantaneously without taking up memory after it is used)?
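To make the deepcopy-on-get and expiry questions above concrete, here is a minimal sketch of what such options could look like. The class name and the `copy_on_get`/`max_items` arguments are illustrative assumptions, not the real daf_butler API:

```python
import copy
from collections import OrderedDict


class InMemoryDatastore:
    """Hypothetical sketch of an in-memory datastore with optional
    deep-copy on get and simple oldest-first cache expiry.  Names are
    illustrative, not the real daf_butler interface."""

    def __init__(self, copy_on_get=True, max_items=None):
        self._store = OrderedDict()
        self.copy_on_get = copy_on_get
        self.max_items = max_items

    def put(self, dataset_id, obj):
        self._store[dataset_id] = obj
        # Simple expiry policy: drop the oldest entry once over the limit.
        if self.max_items is not None and len(self._store) > self.max_items:
            self._store.popitem(last=False)

    def get(self, dataset_id):
        obj = self._store[dataset_id]
        # Deep-copying protects the cached object from mutation by the
        # caller, at the cost of time and memory for large datasets.
        return copy.deepcopy(obj) if self.copy_on_get else obj
```

The deepcopy option only helps, of course, for Python types that actually support deepcopy, which is exactly the open question above.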

There are multiple types of caching datastores that can be conceived:

  1. A butler that can retrieve data from a remote system (such as the data backbone) but stores it in a local file cache (or presumably an in-memory cache), so that the next time the same data are requested they are read locally.  This sort of composable datastore (effectively an in-memory datastore forwarding requests to a POSIX datastore that forwards requests to a remote datastore) has been advocated by Pim for a long time and would be incredibly useful.
  2. A temporary caching datastore for intermediate files (and, in theory, raw data files).  This could act like a PosixDatastore, but files are removed from the cache as the allocated disk space runs low or as files age out. If registry does not care when files are removed (because a full record should exist), then cache expiry could be handled solely by the datastore removing entries from dataset_storage – expiry could be triggered explicitly, or some kind of garbage collection could happen during put().  A chained datastore that has read-only access to the data backbone, uses a file cache for specific datasetTypes, and uses a non-cached POSIX datastore for important output datasets seems reasonable.  That way intermediate files can be stored and examined without affecting the primary datastore. A caching datastore should use the permanent registry, not an ephemeral one, but should the dataset_storage table have a field to indicate that the dataset might disappear?
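The forwarding behaviour of such a composable datastore can be sketched roughly as follows. `ChainedDatastore` and `DictStore` here are illustrative stand-ins under assumed `put`/`get` signatures, not the real daf_butler classes:

```python
class DictStore:
    """Trivial stand-in for a child datastore (illustrative only)."""

    def __init__(self):
        self._store = {}

    def put(self, dataset_id, obj):
        self._store[dataset_id] = obj

    def get(self, dataset_id):
        return self._store[dataset_id]  # raises KeyError if missing


class ChainedDatastore:
    """Hypothetical sketch of a chained datastore: get() tries each
    child in order (fastest first) and back-fills the earlier caches;
    put() writes to every child."""

    def __init__(self, datastores):
        self.datastores = list(datastores)

    def put(self, dataset_id, obj):
        for ds in self.datastores:
            ds.put(dataset_id, obj)

    def get(self, dataset_id):
        for i, ds in enumerate(self.datastores):
            try:
                obj = ds.get(dataset_id)
            except KeyError:
                continue
            # Back-fill the faster caches so the next get() is local.
            for cache in self.datastores[:i]:
                cache.put(dataset_id, obj)
            return obj
        raise KeyError(dataset_id)
```

A real implementation would also need per-datasetType acceptance on each child (so only some dataset types are cached) and read-only children, which this sketch omits.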

Other thoughts:

  1. Parameter usage has also been mostly untested.  Parameters can be passed through to the assemblers (so image cutouts are supported), but we have not yet demonstrated how image compression parameters are to be handled.  You should not have to specify compression parameters in Butler.put.  These are effectively per-datasetType parameters for formatters. Where do we put them? It sounds like a separate section in datastore.yaml (per storage class, per dataset type, per dimensions as usual): formatterParams?
  2. Is it at all possible to override a Formatter on read? Currently we register the formatter that was used to write and ignore the config system when reading a dataset back in.  This makes it impossible for someone to switch their butler to return an astropy.table when reading a FITS table, or an astropy image when reading an Exposure. This, of course, requires that the definition of the StorageClass also changes to a different Python type (which I think will work if the override definition is merged in first).
  3. We've talked about a datastore that maps to database tables, but we've never tried it. Should we try it before we attempt to finalize the datastore APIs?
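As a strawman for the formatterParams idea in item 1 above, the new datastore.yaml section might look something like this. The section name, keys, and values are purely hypothetical, not an agreed schema:

```yaml
# Hypothetical sketch only: neither the section name nor the keys are
# an agreed schema.
formatterParams:
  ExposureF:          # per storage class
    calexp:           # per dataset type, with dimension overrides as usual
      compression: gzip
      quantizeLevel: 10
```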


  1. One more Datastore issue that might merit some attention: I've always been a little uncomfortable with the state of the ingest method, which fits quite nicely in PosixDatastore but is a bit weird as an interface on Datastore itself, because it deals with, well, files and formatters.  I don't think we can just drop it, though – it is important to be able to ingest files from disk into other Datastores, or at least some other Datastores.  I don't know whether what we need is to promote Formatters somehow to admit that they're bigger than just PosixDatastore, or to have a Datastore subclass that's still abstract but adds interfaces for dealing with files, but it would be good to have someone step back and think about the situation a bit.

  2. S3Datastore does have an ingest method for files, and that works because S3 objects map easily to individual files. In the new S3 world, formatters can serialize to/from bytes as well as files.

  3. Jim Bosch, thanks for your comments on StorageClass (which were a side comment; I'm expanding here as a main comment to make them more visible).  I've been wondering about that for a while. Currently StorageClass really has nothing to do with storage.  It provides a butler label for a specific Python type, defines the components, and specifies the Python class required to assemble from components and disassemble into components.  At this point it really does seem that StorageClass is not the best name.

    Serialization by the formatters obviously has to be compatible with the Python type being serialized, but there are many cases where, once the bytes are on disk, you get a choice of how to read them back in (another example of this is the defects that I am writing out in standard region form, so in theory someone could read them back in as astropy regions).

    Currently we store the formatter used to write the dataset in the datastore internal registry, to guarantee that you get back what you put in regardless of any change that may have been made to the configuration system (assuming the formatter is still compatible).  Not all formatters can change type on read: pickle presumably can't.  YAML might be possible, although at the moment we have to pass the YAML to a constructor because we can't assume that special YAML handlers have been registered (and JSON doesn't have pyyaml-style registration).  FITS and text files can change their type on read. For get we do currently know both the storage class used to write the file and the storage class we are using for the read (that's so that we can understand when composites are being extracted).

    The person calling get wants their specified formatter to be used to read the dataset, and if that returns a different Python type then we also need to have changed the definition of the StorageClass itself.

    In thinking about this, I assume the use case here is not that people have a mixed system where they put an afw Exposure and want to get back an astropy NDData straight away?  Supporting this seems hard, because a dataset type is defined in terms of the dimensions and StorageClass, which by definition means that a dataset type can't change its Python type in the middle of a program.

    Use Case 1:

    A user in a single script wants to write an afw table to FITS and then later read it back from the butler as an astropy table.  Here the storage class has to somehow be different for the put and the get, but also the dataset types defined in the butler have to understand that the thing that used to be "defects", mapping to one storage class, is now also called "defects" but maps to another storage class. Butler DatasetType definitions don't know how to handle that.

    Use Case 2:

    A butler repository has afw Exposure FITS files in it. You want to read a calexp as an afw Exposure, then write it out as a calexp using a different formatter because you want to convert it to HDF5.  Unless we choose a different dataset type, is the only way to do this to use two distinct Butlers?

    Use Case 3:

    A pre-populated butler repository exists that has been written using afw.  Now, without any afw involved, the user wants to be able to read datasets back as astropy objects.

    I think Use Case 3 is the important one, and it is easier to consider than #1 because it doesn't require that the butler repository see a change in the definition of a dataset type.  In this scheme I think we can deal with it by having a new section in the datastore config that specifies explicit read formatters; these would override the one stored in registry and would be looked up by dataset type and storage class just like anything else.  There would need to be an override of the storage class definition in the config as well, to change the Python type, but that should work.

    Does this seem okay? It should be relatively easy to implement, since it's only a few lines of code to read the new formatters section and check for overrides on read.
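As a strawman, the read-override section described above might look like this in datastore.yaml. All section and key names here are hypothetical, not an existing schema:

```yaml
# Hypothetical sketch only: section and key names are illustrative.
readFormatters:
  SourceCatalog:      # looked up by dataset type / storage class as usual
    formatter: some.module.AstropyFitsTableFormatter
storageClasses:
  SourceCatalog:
    pytype: astropy.table.Table   # override so get() returns the new type
```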

    1. I think we already support your Use Case 2, because PosixDatastore remembers the Formatter originally used to write the thing (and hence the one used now to read), and there's no need for that to match the one configured for writing new data in the client.  We just need to make sure we don't break it.

      I think we can probably get by without supporting Use Case 1, but should support it if the opportunity presents itself.

      That brings us to Use Case 3, which is at least really nice to have, enough so that I agree it's worth working on a bit.  Your solution would probably work well enough for the relevant cases in practice, but I'm worried that it breaks the 1+3 case – i.e., if you provide a custom formatter to the read client, you can't take advantage of the repository's information about which file format was used.  So if the repo had both FITS and HDF5 files written by afw.table, the astropy formatter (which is explicitly either FITS or HDF5) would only work on some of them.  That's why I suggested formatter lookup based on two keys – both file format and Python type – instead.
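The two-key lookup could be sketched like this; the table contents, formatter names, and the function name are illustrative assumptions:

```python
# Hypothetical sketch of formatter lookup keyed on (file format,
# Python type) rather than dataset type alone; all names illustrative.
READ_FORMATTERS = {
    ("FITS", "astropy.table.Table"): "AstropyFitsTableFormatter",
    ("HDF5", "astropy.table.Table"): "AstropyHdf5TableFormatter",
}


def find_read_formatter(file_format, pytype):
    """Pick a reader matching both the on-disk format recorded in the
    repository and the in-memory type the caller wants."""
    try:
        return READ_FORMATTERS[(file_format, pytype)]
    except KeyError:
        raise LookupError(
            f"No formatter reads {file_format} into {pytype}") from None
```

Because the on-disk format comes from the repository's own records, a repo that mixes FITS and HDF5 files for the same dataset type would still resolve to the right reader for each file.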

      Another option might be to have the read-override config specify a name that's passed to the in-repo formatter, to a method that returns a new formatter that works on the same file but generates the in-memory type specified by that string.  So an afw.table FitsCatalogFormatter would have, say, a rebind method that, when given the string "astropy", returns an astropy.table formatter that also reads FITS.
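The rebind idea could be sketched as follows. Both classes and the `rebind` method are hypothetical illustrations of the proposal, not existing afw.table API:

```python
class AstropyTableFitsFormatter:
    """Hypothetical reader producing an astropy.table.Table from FITS."""

    def read(self, path):
        # A real implementation might use
        # astropy.table.Table.read(path, format="fits"); elided here to
        # keep the sketch self-contained.
        raise NotImplementedError


class FitsCatalogFormatter:
    """Hypothetical sketch of the rebind() idea: the formatter recorded
    in the repo returns an alternative formatter that reads the same
    FITS bytes but produces a different in-memory type."""

    def rebind(self, target):
        # The writing formatter knows the on-disk format is FITS, so it
        # can hand back a reader that understands FITS but builds the
        # requested Python type.
        if target == "astropy":
            return AstropyTableFitsFormatter()
        raise ValueError(f"No rebind target {target!r}")
```

This keeps the file-format knowledge inside the in-repo formatter, so the read client only has to name the in-memory type it wants.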

  4. Tim Jenness and I just had a conversation about how to support the option of obtaining multiple alternative in-memory representations of a dataset on get().  From an LSP perspective, it would be very nice to give people this option.  One user-level API possibility that emerged from the discussion was:

    b.get(id, "calexp", storageClass="astropy-NDData")

    to return an image in a canonical astropy representation for a dataset that by default would come back in afw.image form.

  5. After talking to Gregory Dubois-Felsmann yesterday, I came to the conclusion that programmatic switching of storageClass in get()  is far safer than configuration-based changes to the storage class.  Using configuration will break programs that assume that a "calexp" is an afwExposure, whereas a change to the API means that scripts written assuming astropy will work and those assuming afwExposure will work and no edits will be required (assuming the formatter is available and the storage class defined).