Whilst PosixDatastore works reasonably well, there are a number of open questions about how datastores should behave when they try to do something clever. This page collects some of the open issues.

InMemoryDatastore is very convenient now that per-datasetType acceptance can control which datasets use the datastore. There are still problems, though.

  1. If an in-memory datastore is the only datastore receiving a dataset, what happens to the registry when the process ends?  There has been some discussion indicating that registry should never delete datasets: deletion should solely involve removal from the datastore and removal from the dataset_storage table.  Currently, if the process ends, the dataset_storage table will not be updated and will still list datasets that are no longer present.  One solution could be to give the Datastore constructor two Registry objects (one ephemeral, presumably an in-memory SQLite registry, and one permanent) and let the datastore decide which one to use.
  2. In-memory datastores do not currently do anything with the datasets they receive other than store them in an internal dict.  The object returned is exactly the object that was given.  This leads to the possibility that a Python object is retrieved from the datastore, updated by the user, and stored as a different dataset type, so the original object will have changed by the time it is next retrieved.  Should the in-memory datastore always do a deepcopy on get?  Should that be configurable per dataset type (a deepcopy might take a while for large images)? Does that require that every Python type supported by butler support deepcopy?
  3. Should the in-memory datastore support cache expiry? Only keep N items? Remove an item when it is retrieved (allowing an intermediate dataset to be stored and retrieved almost instantaneously without taking up memory after it is used)?
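To make the deepcopy-on-get and expiry questions above concrete, here is a minimal sketch of what such options could look like. The class name and the `copy_on_get`/`max_items` arguments are illustrative assumptions, not the real daf_butler API:

```python
import copy
from collections import OrderedDict


class InMemoryDatastore:
    """Hypothetical sketch of an in-memory datastore with optional
    deep-copy on get and simple oldest-first cache expiry.  Names are
    illustrative, not the real daf_butler interface."""

    def __init__(self, copy_on_get=True, max_items=None):
        self._store = OrderedDict()
        self.copy_on_get = copy_on_get
        self.max_items = max_items

    def put(self, dataset_id, obj):
        self._store[dataset_id] = obj
        # Simple expiry policy: drop the oldest entry once over the limit.
        if self.max_items is not None and len(self._store) > self.max_items:
            self._store.popitem(last=False)

    def get(self, dataset_id):
        obj = self._store[dataset_id]
        # Deep-copying protects the cached object from mutation by the
        # caller, at the cost of time and memory for large datasets.
        return copy.deepcopy(obj) if self.copy_on_get else obj
```

The deepcopy option only helps, of course, for Python types that actually support deepcopy, which is exactly the open question above.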

There are multiple types of caching datastores that can be conceived:

  1. A butler that can retrieve data from a remote system (such as the data backbone) but stores it in a local file cache (or presumably an in-memory cache), so that the next time the same data are requested they are read locally.  This sort of composable datastore (effectively an in-memory datastore forwarding requests to a POSIX datastore that forwards requests to a remote datastore) has been advocated by Pim for a long time and would be incredibly useful.
  2. A temporary caching datastore for intermediate files (and, in theory, raw data files).  This could act like a PosixDatastore, but files are removed from the cache as the allocated disk space runs low or as files age out. If registry does not care when files are removed (because a full record should exist), then cache expiry could be handled solely by the datastore removing entries from dataset_storage – expiry could be triggered explicitly, or some kind of garbage collection could happen during put().  A chained datastore that has read-only access to the data backbone, uses a file cache for specific datasetTypes, and uses a non-cached POSIX datastore for important output datasets seems reasonable.  That way intermediate files can be stored and examined without affecting the primary datastore. A caching datastore should use the permanent registry, not an ephemeral one, but should the dataset_storage table have a field to indicate that the dataset might disappear?
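The forwarding behaviour of such a composable datastore can be sketched roughly as follows. `ChainedDatastore` and `DictStore` here are illustrative stand-ins under assumed `put`/`get` signatures, not the real daf_butler classes:

```python
class DictStore:
    """Trivial stand-in for a child datastore (illustrative only)."""

    def __init__(self):
        self._store = {}

    def put(self, dataset_id, obj):
        self._store[dataset_id] = obj

    def get(self, dataset_id):
        return self._store[dataset_id]  # raises KeyError if missing


class ChainedDatastore:
    """Hypothetical sketch of a chained datastore: get() tries each
    child in order (fastest first) and back-fills the earlier caches;
    put() writes to every child."""

    def __init__(self, datastores):
        self.datastores = list(datastores)

    def put(self, dataset_id, obj):
        for ds in self.datastores:
            ds.put(dataset_id, obj)

    def get(self, dataset_id):
        for i, ds in enumerate(self.datastores):
            try:
                obj = ds.get(dataset_id)
            except KeyError:
                continue
            # Back-fill the faster caches so the next get() is local.
            for cache in self.datastores[:i]:
                cache.put(dataset_id, obj)
            return obj
        raise KeyError(dataset_id)
```

A real implementation would also need per-datasetType acceptance on each child (so only some dataset types are cached) and read-only children, which this sketch omits.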

Other thoughts:

  1. Parameter usage has also been mostly untested.  Parameters can be passed through to the assemblers (so image cutouts are supported), but we have not yet demonstrated how image compression parameters are to be handled.  You should not have to specify compression parameters in Butler.put.  These are effectively per-datasetType parameters for formatters. Where do we put them? It sounds like a separate section in datastore.yaml (per storage class, per dataset type, per dimensions as usual): formatterParams?
  2. Is it at all possible to override a Formatter on read? Currently we register the formatter that was used to write and ignore the config system when reading a dataset back in.  This makes it impossible for someone to switch their butler to return an astropy.table when reading a FITS table, or an astropy image when reading an Exposure. This, of course, requires that the definition of the StorageClass also changes to a different Python type (which I think will work if the override definition is merged in first).
  3. We've talked about a datastore that maps to database tables, but we've never tried it. Should we try it before we attempt to finalize the datastore APIs?
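As a strawman for the formatterParams idea in item 1 above, the new datastore.yaml section might look something like this. The section name, keys, and values are purely hypothetical, not an agreed schema:

```yaml
# Hypothetical sketch only: neither the section name nor the keys are
# an agreed schema.
formatterParams:
  ExposureF:          # per storage class
    calexp:           # per dataset type, with dimension overrides as usual
      compression: gzip
      quantizeLevel: 10
```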


  1. One more Datastore issue that might merit some attention: I've always been a little uncomfortable with the state of the ingest method, which fits quite nicely in PosixDatastore but is a bit weird as an interface on Datastore itself, because it deals with, well, files and formatters.  I don't think we can just drop it, though – it is important to be able to ingest files from disk into other Datastores, or at least some other Datastores.  I don't know whether what we need is to promote Formatters somehow to admit that they're bigger than just PosixDatastore, or to have a Datastore subclass that's still abstract but adds interfaces for dealing with files, but it would be good to have someone step back and think about the situation a bit.

  2. S3Datastore does have an ingest method for files, and that works because S3 objects map easily to individual files. In the new S3 world, formatters can serialize to/from bytes as well as files.

  3. Jim Bosch, thanks for your comments on StorageClass (which were a side comment; I'm expanding here as a main comment to make them more visible).  I've been wondering about that for a while. Currently StorageClass really has nothing to do with storage.  It provides a butler label for a specific Python type, defines the components, and specifies the Python class required to assemble from components and disassemble into components.  At this point it really does seem that StorageClass is not the best name.

    Serialization by the formatters obviously has to be compatible with the Python type being serialized, but there are many cases where, once the bytes are on disk, you get a choice of how to read them back in (another example of this is the defects that I am writing out in standard region form, so in theory someone could read them back in as astropy regions).

    Currently we store the formatter used to write the dataset in the datastore internal registry, to guarantee that you get back what you put in regardless of any change that may have been made to the configuration system (assuming the formatter is still compatible).  Not all formatters can change type on read: pickle presumably can't.  YAML might be possible, although at the moment we have to pass the YAML to a constructor because we can't assume that special YAML handlers have been registered (and JSON doesn't have pyyaml-style registration).  FITS and text files can change their type on read. For get we do currently know both the storage class used to write the file and the storage class we are using for the read (that's so that we can understand when composites are being extracted).

    The person calling get wants their specified formatter to be used to read the dataset, and if that returns a different Python type then we also need to have changed the definition of the StorageClass itself.

    In thinking about this, I assume the use case here is not that people have a mixed system where they put an afw Exposure and want to get back an astropy NDData straight away?  Supporting this seems hard, because a dataset type is defined in terms of the dimensions and StorageClass, which by definition means that a dataset type can't change its Python type in the middle of a program.

    Use Case 1:

    A user in a single script wants to write an afw table to FITS and then later read it back from the butler as an astropy table.  Here the storage class has to somehow be different for the put and the get, but also the dataset types defined in the butler have to understand that the thing that used to be "defects", mapping to one storage class, is now also called "defects" but maps to another storage class. Butler DatasetType definitions don't know how to handle that.

    Use Case 2:

    A butler repository has afw Exposure FITS files in it. You want to read a calexp as an afw Exposure, then write it out as a calexp using a different formatter because you want to convert it to HDF5.  Unless we choose a different dataset type, is the only way to do this to use two distinct Butlers?

    Use Case 3:

    A pre-populated butler repository exists that has been written using afw.  Now, without any afw involved, the user wants to be able to read datasets back as astropy objects.

    I think Use Case 3 is the important one, and it is easier to consider than #1 because it doesn't require that the butler repository see a change in the definition of a dataset type.  In this scheme I think we can deal with it by having a new section in the datastore config that specifies explicit read formatters; these would override the one stored in registry and would be looked up by dataset type and storage class just like anything else.  There would need to be an override of the storage class definition in the config as well, to change the Python type, but that should work.

    Does this seem okay? It should be relatively easy to implement, since it's only a few lines of code to read the new formatters section and check for overrides on read.
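As a strawman, the read-override section described above might look like this in datastore.yaml. All section and key names here are hypothetical, not an existing schema:

```yaml
# Hypothetical sketch only: section and key names are illustrative.
readFormatters:
  SourceCatalog:      # looked up by dataset type / storage class as usual
    formatter: some.module.AstropyFitsTableFormatter
storageClasses:
  SourceCatalog:
    pytype: astropy.table.Table   # override so get() returns the new type
```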

    1. I think we already support your Use Case 2, because PosixDatastore remembers the Formatter originally used to write the thing (and hence the one used now to read), and there's no need for that to match the one configured for writing new data in the client.  We just need to make sure we don't break it.

      I think we can probably get by without supporting Use Case 1, but should support it if the opportunity presents itself.

      That brings us to Use Case 3, which is at least really nice to have, enough so that I agree it's worth working on a bit.  Your solution would probably work well enough for the relevant cases in practice, but I'm worried that it breaks the 1+3 case – i.e., if you provide a custom formatter to the read client, you can't take advantage of the repository's information about which file format was used.  So if the repo had both FITS and HDF5 files written by afw.table, the astropy formatter (which is explicitly either FITS or HDF5) would only work on some of them.  That's why I suggested formatter lookup based on two keys – both file format and Python type – instead.
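The two-key lookup could be sketched like this; the table contents, formatter names, and the function name are illustrative assumptions:

```python
# Hypothetical sketch of formatter lookup keyed on (file format,
# Python type) rather than dataset type alone; all names illustrative.
READ_FORMATTERS = {
    ("FITS", "astropy.table.Table"): "AstropyFitsTableFormatter",
    ("HDF5", "astropy.table.Table"): "AstropyHdf5TableFormatter",
}


def find_read_formatter(file_format, pytype):
    """Pick a reader matching both the on-disk format recorded in the
    repository and the in-memory type the caller wants."""
    try:
        return READ_FORMATTERS[(file_format, pytype)]
    except KeyError:
        raise LookupError(
            f"No formatter reads {file_format} into {pytype}") from None
```

Because the on-disk format comes from the repository's own records, a repo that mixes FITS and HDF5 files for the same dataset type would still resolve to the right reader for each file.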

      Another option might be to have the read-override config specify a name that's passed to the in-repo formatter, to a method that returns a new formatter that works on the same file but generates the in-memory type specified by that string.  So an afw.table FitsCatalogFormatter would have, say, a rebind method that, when given the string "astropy", returns an astropy.table formatter that also reads FITS.
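The rebind idea could be sketched as follows. Both classes and the `rebind` method are hypothetical illustrations of the proposal, not existing afw.table API:

```python
class AstropyTableFitsFormatter:
    """Hypothetical reader producing an astropy.table.Table from FITS."""

    def read(self, path):
        # A real implementation might use
        # astropy.table.Table.read(path, format="fits"); elided here to
        # keep the sketch self-contained.
        raise NotImplementedError


class FitsCatalogFormatter:
    """Hypothetical sketch of the rebind() idea: the formatter recorded
    in the repo returns an alternative formatter that reads the same
    FITS bytes but produces a different in-memory type."""

    def rebind(self, target):
        # The writing formatter knows the on-disk format is FITS, so it
        # can hand back a reader that understands FITS but builds the
        # requested Python type.
        if target == "astropy":
            return AstropyTableFitsFormatter()
        raise ValueError(f"No rebind target {target!r}")
```

This keeps the file-format knowledge inside the in-repo formatter, so the read client only has to name the in-memory type it wants.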

  4. Tim Jenness and I just had a conversation about how to support the option of obtaining multiple alternative in-memory representations of a dataset on get().  From an LSP perspective, it would be very nice to give people this option.  One user-level API possibility that emerged from the discussion was:

    b.get(id, "calexp", storageClass="astropy-NDData")

    to return an image in a canonical astropy representation for a dataset that by default would come back in afw.image form.

  5. After talking to Gregory Dubois-Felsmann yesterday, I came to the conclusion that programmatic switching of storageClass in get()  is far safer than configuration-based changes to the storage class.  Using configuration will break programs that assume that a "calexp" is an afwExposure, whereas a change to the API means that scripts written assuming astropy will work and those assuming afwExposure will work and no edits will be required (assuming the formatter is available and the storage class defined).