We want to be able to serialize (and deserialize) Python objects to different file formats (e.g. FITS, HDF5) and to different storage types (e.g. local filesystem, Amazon S3, database).

Right now there is hard-coded dispatch in the butler for different file formats, but no support for storage types other than the local filesystem. We need to add butler support for multiple formats and storage types.

Requirements

TODO: I haven't written a formal definition of the requirements yet, beyond the understanding described above.

Deserializing Subclass

There is a request that, when a serialized object is a subclass of the type specified by the policy, the data be deserialized into an instance of the subclass (as opposed to the policy-specified base class). I think this is best handled explicitly by the serializer and deserializer.

This request was made specifically for pex.Config; right now there is code in the deserializer that verifies that the deserialized object data is of the same type as the object that is deserializing that data into itself. When the object is serialized into a stream, the serializer (Config.saveToStream) writes some verification code that will be executed on import:

configType = type(self)
typeString = _typeStr(configType)
print >> outfile, "import %s" % (configType.__module__)
print >> outfile, "assert type(%s)==%s, 'config is of type %%s.%%s" % (root, typeString), \
                  "instead of %s' %% (type(%s).__module__, type(%s).__name__)" % (typeString, root, root)

Notice that this puts an assert statement into the serialized data.
When loading, Config.loadFromStream executes this verification code (using exec), and if the instance doing the loading is not of the recorded type (typeString) an AssertionError is raised.
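
For illustration, if root is "config" and the object being saved is an lsst.pipe.tasks.processCcd.ProcessCcdConfig (values chosen for this sketch only), the verification lines written to the stream would look something like:

import lsst.pipe.tasks.processCcd
assert type(config)==lsst.pipe.tasks.processCcd.ProcessCcdConfig, 'config is of type %s.%s instead of lsst.pipe.tasks.processCcd.ProcessCcdConfig' % (type(config).__module__, type(config).__name__)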

TBD: how does this interact with composites?

Proposal

This proposal builds on the serializers and the serializer registry that we have in the C++ domain for writing objects to and from the filesystem. It moves the serializer registry and the registered serialization functions into the Python domain and adds a dimension to the registry for different types of storage location. Users will be able to specify serialization and deserialization functions for the following combination:

  • type - the type of the Python object to be serialized or deserialized
  • format - the persistence format of the serialized data. For example: FITS or HDF5.
  • storage - the butler storage backend to be used. Storage backends typically represent different storage location types such as posix, Amazon S3, or database storage engines.

Object Serializers

For type-format-storage dispatch, Python serialization functions must be written for the classes that are to be serialized and deserialized by butler. The intention is that a serialization function may be specific to a type of Python object, written to a particular format, on a particular storage. The functions may also be usable across a range of objects that share a common API.

For example, serializer functions could be written for the ExposureF Python type, to FITS format, on posix storage.

Serializer Registration

These serializers get registered with the butler’s serializer registry. At runtime they are looked up based on what kind of object is being read/written, what kind of format is indicated by the policy, and what kind of storage the butler is using for that repository.

This resembles what was done previously in C++ with the FormatterRegistry but is moved from C++ to pure Python and adds support for pluggable storage back ends.

Base Class Serializers

Python types that subclass base classes may rely on base class serializers. The serializer registry will look for base class serializers by walking the object's MRO.
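
A minimal sketch of that lookup, assuming the registry stores entries in a dict keyed by (class, storage, format); the function and variable names here are illustrative, not final API:

def lookupSerializer(registry, objType, storageName, formatName):
    # Walk the MRO so that, e.g., a serializer registered for Config
    # is found when a ProcessCcdConfig instance is looked up.
    for cls in objType.__mro__:
        entry = registry.get((cls, storageName, formatName))
        if entry is not None:
            return entry
    raise KeyError("no serializer registered for %s / %s / %s"
                   % (objType.__name__, storageName, formatName))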

Serializer API

Whatever API we settle on must allow custom mappers to be written for custom storage types requiring custom parameters.

Proposed: both the serializer and deserializer take a ButlerLocation, which is created and populated by the mapper with the parameters that are needed by the serialization function.

TODO: verify how this will be handled for edge cases.

Serialize API

def write(obj, butlerLocation):
...

The serializer takes an object to be serialized and a butlerLocation that describes the location at which to serialize that object, along with any other data needed by the serializer.

Deserializer API

def read(butlerLocation):
...

The deserializer takes a butlerLocation that contains parameters needed to access the dataset in storage.
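
As a minimal sketch of a matching serializer/deserializer pair conforming to these signatures (a pickle-on-posix pair is used purely for illustration; it assumes ButlerLocation.getLocations() returns directly usable paths, as in the examples near the end of this page):

import pickle

def writePickle(obj, butlerLocation):
    # Serialize obj to the first mapped location.
    with open(butlerLocation.getLocations()[0], "wb") as f:
        pickle.dump(obj, f)

def readPickle(butlerLocation):
    # Deserialize one object per mapped location.
    results = []
    for location in butlerLocation.getLocations():
        with open(location, "rb") as f:
            results.append(pickle.load(f))
    return results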

Serializer Registry

We will add a singleton registry to the Butler framework that keeps track of serialization functions as described above. The following will be registered (a sketch of a registration call follows the list):

  • object type - This is the object type that will be serialized or deserialized. The value should be an importable string or a class object.
  • storage - Names the type of storage that the serializer & deserializer will be used for, for example 'posix', 'database', 's3'. (To support this, Butler Storage classes must name the type of storage they are; for example PosixStorage.getType() would return 'posix'.) The value must be a string (NOT a type).
  • format - Names the type of formatting and/or file format to use, e.g. 'FitsFormat' (aka 'FitsStorage'), 'BoostFormat'. The value must be a string (NOT a type).
  • serializer - The function used to serialize the object. The value must be a callable function object, or an importable string that names a callable, that takes an object to be serialized and a ButlerLocation.
  • deserializer - The function used to deserialize the object. The value must be a callable function object, or an importable string that names a callable, that takes a ButlerLocation.
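
A sketch of a registration call carrying those five pieces of information, here using importable strings for the type and the functions (ButlerRegistry and the lsst.daf.serializers package are the names proposed elsewhere on this page; the exact signature and string resolution are not final):

from lsst.daf.persistence import ButlerRegistry

ButlerRegistry.register(
    'lsst.afw.image.ExposureF',                        # object type (class or importable string)
    'posix',                                           # storage (string)
    'FitsFormat',                                      # format (string)
    'lsst.daf.serializers.ExposureFPosixFits.write',   # serializer (callable or importable string)
    'lsst.daf.serializers.ExposureFPosixFits.read')    # deserializer (callable or importable string)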

Serializer Registration

Serializers:

  • must be registered with the daf_persistence serializer registry.
  • do not have to be in the same package as the object they serialize.
  • should generally be registered where they are declared.

Registration via Butler Package

If needed, serializers can be registered by an assigned manager object.

For example, serializers could be registered in the __init__.py file of a package, which may be the right place to register AFW base class serializers & deserializers. TBD: we should determine a proper location for this, if it's needed.

Replacing Registered Serializers

TBD: what should happen if user code wants to use butler with a different set of serializers? Options (a sketch of the overwrite-argument option follows this list):

  • Raise an exception when a second serializer for the same (type, storage, format) is registered.
  • Overwrite the previous registration. Overwrite behavior TBD, options include:
    • Silently (but that could lead to confusion and hunting for which serializer was registered last.)
    • Require an ‘overwrite’ argument
    • Raise a custom exception that can be caught and handled by the registering script.
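
For concreteness, a sketch of the overwrite-argument option combined with a catchable custom exception; the names here are illustrative only and no decision is implied:

class SerializerAlreadyRegistered(RuntimeError):
    pass

def register(registry, objType, storageName, formatName, serializer, deserializer,
             overwrite=False):
    key = (objType, storageName, formatName)
    if key in registry and not overwrite:
        # A registering script could catch this and decide whether to re-register.
        raise SerializerAlreadyRegistered("serializer already registered for %s" % (key,))
    registry[key] = (serializer, deserializer)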

Passing Custom Arguments to Serializers

We thought users might want to be able to set arguments to be passed to the serializer functions. Is this needed? It seems that, so long as the mapper puts the needed parameters into the ButlerLocation, the serializer would have all the information it needs to perform serialization operations.

If we do need to be able to customize the signature of the serialization functions, it would be good to start with a concrete example. However, one idea for how we might implement this via the policy is below.

Serialization Function Signature Customization by Policy

TBD this should be discussed, but the basic idea follows:

It seems reasonable that the policy could contain a key under the dataset type whose value is a map from storage name to a list of argument names for the serializer: {<storage name>: [arg1Name, arg2Name, ...]}.

For example:

datasetType
  serializerArgs : {'posix': ['foo', 'bar']}

This would declare a serializer function with the signature:

mySerializerFunc(obj, butlerLocation, foo, bar)

The mapper would add the args to the ButlerLocation that it returns from mapping, and when the serialization function is called those arguments will be passed along as well.
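
A sketch of how the call site could pass those extra arguments through, assuming a hypothetical accessor (getSerializerArgs is not an existing ButlerLocation method) that returns the argument values the mapper attached:

def callSerializer(serializer, obj, butlerLocation):
    # Hypothetical: the mapper attached e.g. {'foo': ..., 'bar': ...}
    # for the policy example above; empty dict if none were declared.
    extraArgs = butlerLocation.getSerializerArgs() or {}
    serializer(obj, butlerLocation, **extraArgs)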

Changes to ButlerLocation

After mapping, the ButlerLocation should contain enough data to call the serialization function (including a reference to the function). I think it should explicitly not need to communicate with the Repository, and possibly not with Storage classes, again. Butler._read could then call a method on ButlerLocation directly, instead of calling through the repository to the storage to reach a read method in storage.
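
A sketch of that delegation, assuming the mapper stashes the registered functions on the location; the attribute and method names are illustrative only:

class ButlerLocation(object):
    # ...existing mapper-populated state (locations, python type, additional data, etc.)...

    def setSerializers(self, serializer, deserializer):
        # Hypothetical: set by the mapper after the registry lookup.
        self._serializer = serializer
        self._deserializer = deserializer

    def write(self, obj):
        self._serializer(obj, self)

    def read(self):
        # Butler._read would call this directly instead of going
        # through the repository and storage classes.
        return self._deserializer(self)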

Changes to Storage subclasses

The role of Storage changes: it stores storage-type-specific details (e.g. a database connection), and is used by mappers to access the physical repository (e.g. to determine whether a dataset exists) and, if needed, for storage actions (e.g. inspection, retrieval, modification) that are not directly related to dataset I/O. Since the (de)serializer to use will be carried on the ButlerLocation, Storage classes will no longer contain the methods to serialize or deserialize a dataset.

Naming Issue

Right now the name "storage" is used to indicate Fits, Boost, Pickle, etc. I think "format" is a better name for this; it frees the name "storage" to refer to the type of storage, e.g. posix, database, s3. We will change the key storage and related values (FitsStorage, BoostStorage) to 'format', 'FitsFormat', and 'BoostFormat'.

Examples

For ExposureF to/from FITS on local filesystem

The following would go in a new package, lsst.daf.serializers.

# Imports assumed to come from daf_persistence and pex_policy (exact module
# paths may need adjustment).
from lsst.daf.persistence import LogicalLocation, Persistence, StorageList
from lsst.daf.persistence.safeFileIo import SafeFilename
import lsst.pex.policy as pexPolicy


class ExposureFPosixFits(object):
    @staticmethod
    def write(obj, butlerLocation):
        additionalData = butlerLocation.getAdditionalData()
        locations = butlerLocation.getLocations()
        persistencePolicy = pexPolicy.Policy()
        persistence = Persistence.getPersistence(persistencePolicy)
        with SafeFilename(locations[0]) as locationString:
            logLoc = LogicalLocation(locationString, additionalData)
            # Create a list of Storages for the item.
            storageList = StorageList()
            storage = persistence.getPersistStorage('FitsStorage', logLoc)
            storageList.append(storage)
            # KT says this ends up calling .writeFits; need to investigate, but maybe this serializer
            # does not need to exist and instead a generic serializer like FitsCatalogPosixFits can be
            # used for (ExposureF, Posix, Fits) too.
            # Per KT: we want to pull out the C++ persistence registry (replacing it with a python registry).
            persistence.persist(obj, storageList, additionalData)

    @staticmethod
    def read(butlerLocation):
        additionalData = butlerLocation.getAdditionalData()
        locations = butlerLocation.getLocations()
        pythonType = butlerLocation.getPythonType()
        if pythonType is not None:
            pythonType = typeify(pythonType)  # import if given as a string
        persistencePolicy = pexPolicy.Policy()
        persistence = Persistence.getPersistence(persistencePolicy)
        results = []
        for locationString in locations:
            logLoc = LogicalLocation(locationString, additionalData)
            storageList = StorageList()
            storage = persistence.getRetrieveStorage('FitsStorage', logLoc)
            storageList.append(storage)
            itemData = persistence.unsafeRetrieve(butlerLocation.getCppType(),
                                                  storageList,
                                                  additionalData)
            finalItem = pythonType.swigConvert(itemData)
            results.append(finalItem)
        return results

FitsCatalog to/from FITS on local filesystem

This serializer is used for SourceCatalog, but it can be registered and used for writing any of our objects that have the methods writeFits(location, flags) and readFits(location, hdu, flags); see the registration example after the code below.

import os  # plus the daf_persistence imports from the ExposureF example above


class FitsCatalogPosixFits(object):
    @staticmethod
    def write(obj, butlerLocation):
        additionalData = butlerLocation.getAdditionalData()
        locations = butlerLocation.getLocations()
        if len(locations) < 1:
            raise RuntimeError("no location passed in butlerLocation")
        with SafeFilename(locations[0]) as locationString:
            logLoc = LogicalLocation(locationString, additionalData)
            flags = additionalData.getInt("flags", 0)
            obj.writeFits(logLoc.locString(), flags=flags)

    @staticmethod
    def read(butlerLocation):
        additionalData = butlerLocation.getAdditionalData()
        pythonType = typeify(butlerLocation.getPythonType())  # import if given as a string
        results = []
        for locationString in butlerLocation.getLocations():
            logLoc = LogicalLocation(locationString, additionalData)
            if not os.path.exists(logLoc.locString()):
                raise RuntimeError("No such FITS catalog file: " + logLoc.locString())
            hdu = additionalData.getInt("hdu", 0)
            flags = additionalData.getInt("flags", 0)
            finalItem = pythonType.readFits(logLoc.locString(), hdu, flags)
            results.append(finalItem)
        return results
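
Because it relies only on the writeFits/readFits API, the same serializer could be registered for several catalog types. For example (ButlerRegistry is the proposed registry used in the Config example below; the format name follows the Naming Issue section above):

from lsst.daf.persistence import ButlerRegistry
from lsst.afw.table import SourceCatalog

ButlerRegistry.register(SourceCatalog, 'posix', 'FitsFormat',
                        FitsCatalogPosixFits.write, FitsCatalogPosixFits.read)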

ConfigStorage for lsst.pipe.tasks.processCcd.ProcessCcdConfig to/from the local filesystem

In this example a base class serializer is registered; ProcessCcdConfig inherits from Config. When the registry does not find an entry for ProcessCcdConfig, it will search through the class's MRO until it finds the serializer that is registered for Config.

We need to define a location for the pex Config serializer functions that can depend on both daf_persistence and pex_config. I think daf_butlerUtils would be OK for this. Or possibly this wants to go into a separate package, something like pex_serialization?

 
from lsst.daf.persistence import ButlerRegistry, LogicalLocation
from lsst.daf.persistence.safeFileIo import SafeFilename
from lsst.pex.config import Config


def write(obj, butlerLocation):
    additionalData = butlerLocation.getAdditionalData()
    locations = butlerLocation.getLocations()
    with SafeFilename(locations[0]) as locationString:
        logLoc = LogicalLocation(locationString, additionalData)
        obj.save(logLoc.locString())


def read(butlerLocation):
    additionalData = butlerLocation.getAdditionalData()
    pythonType = typeify(butlerLocation.getPythonType())  # import if given as a string
    results = []
    for locationString in butlerLocation.getLocations():
        logLoc = LogicalLocation(locationString, additionalData)
        finalItem = pythonType()
        finalItem.load(logLoc.locString())
        results.append(finalItem)
    return results


# Register for the Config base class; subclasses such as ProcessCcdConfig
# will be dispatched to this serializer via the MRO lookup described above.
ButlerRegistry.register(Config, 'posix', 'ConfigFormat', write, read)

SourceCatalog to/from database

TODO

