Preamble

During measurement, a series of plug-in algorithms are used to measure the raw properties of each source (e.g. fluxes, positions in pixel coordinates). The "calibration and ingest" system provides a means of transforming those raw measurements to calibrated units, such as magnitudes or positions in celestial coordinates.

An important consideration is that the information required to perform the transformation may not be available at measurement time. For example, it may depend on a global astrometric or photometric calibration which has not yet been performed. For this reason, measurement algorithms cannot be expected to simply write calibrated measurements to their output.

This work is tracked as DM-1074 and DM-1598.

Goals

A task should be available which takes as input:

  • An lsst::afw::table::SourceCatalog describing raw measurements of sources from a particular image;
  • An lsst::afw::image::Calib describing the photometric calibration of that image;
  • An lsst::afw::image::Wcs describing the world coordinate system of the image.

The task should produce an lsst::afw::table::BaseCatalog containing calibrated measurements. Note that:

  • The transformation from raw to calibrated units is not known a priori, but must rather be defined on a per-measurement plugin basis;
  • The relationship between input and output fields is not necessarily one-to-one – rather, some raw measurements may be combined to produce derived quantities;
  • The transformation from raw to calibrated units may depend on the photometric and WCS information supplied to the task, and on the configuration of the plugin which performed the raw measurement;
  • We do not copy slots from the input SourceTable.

In addition, one or more command line tasks should be produced which provide an appropriate interface between the calibration transformation and the end user, making it possible to specify the appropriate input data.

Design

We tackle this problem by:

  1. Defining "transformation plugins", which describe the means of transforming the output of a measurement plugin.
  2. Augmenting measurement plugins with a method which provides the caller with a transformation plugin appropriate to the measurement.
  3. Providing a task which transforms an input catalog to an output catalog using the infrastructure defined above.

Transformation Plugins

Transformation plugins are written in Python and inherit from the base class TransformPlugin. All transformation plugins adhere to the following interface:

class TransformPlugin(object):
    def __init__(self, name, mapper, cfg, wcs, calib):
        ...

    def __call__(self, oldRecord, newRecord):
        ...

The name and cfg arguments to __init__ describe the name and configuration of the measurement plugin whose results we will transform. The wcs and calib arguments describe the WCS and calibration which will be available for use in the transformation. These four arguments are stored as instance variables within the object. The mapper argument is a SchemaMapper which describes the input (raw measurement list) and output (calibrated) schemas and the relationship between them. The __init__ method should:

  1. Add mappings between fields which should be directly copied from input to output;
  2. Add field definitions for those quantities which will be calculated during transformation and store their keys.

The __call__ method will be invoked once with each pair of raw and corresponding calibrated table records. It should perform whatever transformation is necessary to populate the latter given the former as well as the stored information in the transformation plugin (measurement plugin name & configuration, WCS, calibration).

For example, the following would:

  1. Copy the contents of all fields beginning with the name of the measurement plugin from the raw source list to the calibrated output;
  2. Add an additional field with the name example and the value 10.0 to all output records.
class Example(TransformPlugin):
    def __init__(self, name, mapper, cfg, wcs, calib):
        TransformPlugin.__init__(self, name, mapper, cfg, wcs, calib)
        # Map across all fields whose names begin with the measurement plugin name
        for item in mapper.getInputSchema().extract(name + "*").itervalues():
            mapper.addMapping(item.key)
        # Define the new output field, storing its key for use in __call__
        outputSchema = mapper.editOutputSchema()
        self.newKey = outputSchema.addField("example", type="D")

    def __call__(self, oldRecord, newRecord):
        newRecord.set(self.newKey, 10.0)

Note that hierarchies of transformation plugins can be built up in this way – for example, a FluxTransformer plugin could provide some basic transformation for flux fields which could be inherited & augmented by PsfFluxTransformer, etc.
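
For instance, such a hierarchy might be sketched as follows. This is illustrative only: the "_flux" and "_flag" field suffixes and the class bodies are assumptions, not existing plugins.

class FluxTransformer(TransformPlugin):
    """Hypothetical base class handling a generic flux field."""
    def __init__(self, name, mapper, cfg, wcs, calib):
        TransformPlugin.__init__(self, name, mapper, cfg, wcs, calib)
        self.fluxKey = mapper.getInputSchema().find(name + "_flux").key
        self.magKey = mapper.editOutputSchema().addField(name + "_mag", type="D")

    def __call__(self, oldRecord, newRecord):
        # Use the stored Calib to convert the raw flux to a calibrated magnitude
        newRecord.set(self.magKey, self.calib.getMagnitude(oldRecord.get(self.fluxKey)))

class PsfFluxTransformer(FluxTransformer):
    """Hypothetical derivative: reuse the flux logic and also copy a flag field."""
    def __init__(self, name, mapper, cfg, wcs, calib):
        FluxTransformer.__init__(self, name, mapper, cfg, wcs, calib)
        mapper.addMapping(mapper.getInputSchema().find(name + "_flag").key)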

Mapping measurement plugins to transformations

Measurement plugins are expected to provide a static method which returns the class of the transformation plugin to be applied to their outputs. We modify BasePlugin to return NullTransform, a transformation which copies no data to the output:

class NullTransform(TransformPlugin):
    def __call__(self, oldRecord, newRecord):
        pass

class BasePlugin(object):
    ...
    @staticmethod
    def getTransformClass():
        return NullTransform

This null operation is then the default for any measurements which do not define their own transformations.
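
For example, a measurement plugin could declare its own transformation as in the following sketch (ExampleCentroidPlugin and CentroidTransformer are illustrative names, not existing classes):

class ExampleCentroidPlugin(BasePlugin):
    ...
    @staticmethod
    def getTransformClass():
        # CentroidTransformer (defined elsewhere) would use the stored Wcs to
        # convert pixel positions to celestial coordinates
        return CentroidTransformer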

Transformation task design

TransformTask defines an __init__ which takes two additional arguments: the configuration of the task which was used to perform the raw measurements, and the registry of available plugins. The former is used to derive the list of plugins which performed the measurements, together with their configurations; this list is stored as an instance variable.

class TransformTask(pipeBase.Task):
    def __init__(self, *args, **kwargs):
        # Pop our extra arguments before initializing the base Task
        measConfig = kwargs.pop('measConfig')
        self.pluginRegistry = kwargs.pop('pluginRegistry')
        pipeBase.Task.__init__(self, *args, **kwargs)
        self.measPlugins = [(name, measConfig.value.plugins.get(name))
                            for name in measConfig.value.plugins.names]
 

The run method of TransformTask takes as arguments the list of sources to be transformed and the WCS and calibration to be applied. It constructs a mapper and adds some (configurable) fields which are always copied – in this way, data which is not the output of a measurement plugin can be preserved in the output. It then uses the registry to look up each measurement plugin by name, retrieve the corresponding TransformPlugin, and configure it appropriately:

class TransformTask(pipeBase.Task):
    def run(self, sourceCat, wcs, calib):
        mapper = afwTable.SchemaMapper(sourceCat.schema)
        mapper.addMapping(sourceCat.schema.find('id').key)
        transforms = [self.pluginRegistry.get(name).PluginClass.getTransformClass()(name, mapper, cfg, wcs, calib)
                      for name, cfg in self.measPlugins]
 

Finally, we iterate over all sources, using a combination of the mapper and the transformation plugins to transform old to new:

class TransformTask(pipeBase.Task):
    def run(...):
        ...
        newSources = afwTable.BaseCatalog(mapper.getOutputSchema())
        newSources.reserve(len(sourceCat))
        for oldSource in sourceCat:
            newSource = newSources.addNew()
            newSource.assign(oldSource, mapper)  # copy directly-mapped fields
            for transform in transforms:
                transform(oldSource, newSource)  # populate derived fields
        return newSources
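
Putting the pieces together, using the task might look like the following sketch, assuming the measurement configuration, plugin registry, source catalog, WCS and calibration have all been obtained elsewhere:

task = TransformTask(measConfig=measConfig, pluginRegistry=pluginRegistry)
calibratedCat = task.run(sourceCat, wcs, calib)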

Command line tasks

A series of command line tasks can be defined which feed the appropriate inputs to TransformTask, loading whatever plugin registry, source table, calibration and WCS the end user requires.
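
For instance, a driver operating on single-frame measurements might look something like this sketch. SrcTransformTask and its configuration are illustrative assumptions; subtask creation and error handling are omitted.

class SrcTransformTask(pipeBase.CmdLineTask):
    """Hypothetical driver: transform the 'src' catalog of a single calexp."""
    _DefaultName = "transformSrc"
    ...

    def run(self, dataRef):
        calexp = dataRef.get("calexp")  # provides the WCS and photometric calibration
        sources = dataRef.get("src")    # raw measurement catalog
        # self.transform is a TransformTask created as a subtask in __init__
        return self.transform.run(sources, calexp.getWcs(), calexp.getCalib())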

Discussion & criticisms

The way in which the __init__ method of TransformPlugin derivatives takes a reference to a mapper and modifies it is ugly: it's unfortunate for a constructor to modify an object other than the one it's constructing, and, if a function modifies something, it would ideally return the thing being modified. An alternative would be to add another method (TransformPlugin.configure(mapper), say) which avoids the above, but this involves more code for little practical benefit.

All transformations (other than a trivial copy) are performed by Python code. It is assumed that this is not a major bottleneck. However, future improvements to SchemaMapper could increase the variety of operations that can be performed directly inside the mapper by C++ code, thereby mitigating this. It will likely always be necessary to iterate over the source list in Python, since it will be desirable to continue to define some transformations in Python code.

Implementation

A prototype (lacking documentation, tests, etc) implementation of the above system is available on the u/swinbank/DM-1598 branch in meas_base and pipe_tasks.


21 Comments

  1. Joshua Hoblitt writes (on Hipchat):

    I don't have much constructive to say other than it would be nice for "dummy" lsst::afw::image::Calib & lsst::afw::image::Wcs objects to be available that are essentially no-ops. Maybe something that is automatically used if None is passed in? I suspect this might be an API that users will want to test by hand to evaluate the transforms, particularly users in the context of level 3/external processing.

    1. Unfortunately I think that will be difficult, as both Calib and Wcs are hard-coded to change the units as well as the values, and at least in the case of Wcs, those units are captured in the input types of the classes they deal with.

      1. It depends what the detailed requirements are here, but doing this at least on a superficial level is pretty easy. In Python code, you can always mock up something that looks enough like a Wcs or Calib to enable you to test a simple transformation. If you need to test a transformation which is dependent on the detailed characteristics of either of the above, you'll presumably need to construct a "real" one with the properties you need.

        The tests for my code will include generating trivial/default Wcs and Calib objects to demonstrate that the interface works anyway.
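
        For reference, constructing such trivial objects only takes a few lines – a sketch against the afw API of the era, with arbitrary values:

          import lsst.afw.coord as afwCoord
          import lsst.afw.geom as afwGeom
          import lsst.afw.image as afwImage

          # A trivial Calib: a single magnitude zero point for the whole image
          calib = afwImage.Calib()
          calib.setFluxMag0(1e12)

          # A simple TAN Wcs: 0.2 arcsec pixels, centred on an arbitrary position
          crval = afwCoord.IcrsCoord(45.0 * afwGeom.degrees, 45.0 * afwGeom.degrees)
          crpix = afwGeom.Point2D(0.0, 0.0)
          wcs = afwImage.makeWcs(crval, crpix, 0.2 / 3600.0, 0.0, 0.0, 0.2 / 3600.0)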

  2. You can access a SourceTable by row or by column, and it's a good deal faster to do so by column (and it's easier too: no need to save Keys for efficiency).  Could your calibration loop work by column?

    1. While operating by column might still be a bit faster, I think it'd be cleaner to keep the current design of row-by-row manipulation, but allow TransformPlugins to be implemented in C++ (and in fact to implement all the standard ones we'll use 99% of the time in C++).  That might already be the case, given that Python will just duck-type them unless you have explicit isinstance checks.  Being able to use column-based setters in analysis code is very useful, but I sort of dislike the fact that being efficient in NumPy forces us to write things in a way that is less natural to read.  Of course, I could be an outlier here, as someone who is pretty comfortable with C++.

      1. Adapting my current prototype to implement Robert's idea is very straightforward; I think you should simply be able to change the definition of __call__ in the transformation plugins so it looks something like this:

        class Example(TransformPlugin):
            def __call__(self, oldCatalog, newCatalog):
                newCatalog.getColumnView()[self.newKey] = 10.0

        And then the task becomes even simpler:

        class TransformTask(...):
            def run(...):
                ...
                newSources = afwTable.BaseCatalog(mapper.getOutputSchema())
                newSources.extend(sourceCat, mapper=mapper)
                for transform in transforms:
                    transform(sourceCat, newSources)

        (I've not checked if I need to worry about deep-copies in extend when using a mapper, but that's beside the point for now.)

        While I broadly agree with Jim that for performance we should use C++ rather than sacrifice readability, I don't actually find this any less natural (indeed, perhaps even more natural) than the row-based approach.

        1. I agree - now that I see it, it's not as bad as I thought, and I think it's probably best to just go with this interface.

        2. Actually, can we do this interface and implement the predefined standard ones in C++?  In C++, we could do the loop over rows first (with probably marginally better performance than doing it in columns, because it traverses the memory in the right order).

          Now that I think about it, though, we can't just write a C++ class that has the same interface and use it, because the config instance the constructor takes is pure Python, and even when that's based on a C++ control class, we'd need to call makeControl() on the config object and pass the result to the C++ code.  So, instead, I think we should probably just provide some C++ classes that Python TransformPlugin classes could delegate to via composition - there'd logically be one of these for each of the existing Result/ResultMapper pairs in meas_base, which I think would make this system very intuitive.

          Of course, none of that affects the interface you've described here, except that the typical example subclass would now look something like this:

          class Example(TransformPlugin):
           
              def __init__(self, name, mapper, cfg, wcs, calib):
                  TransformPlugin.__init__(self, name, mapper, cfg, wcs, calib)
                  self.fluxTransformer = lsst.meas.base.FluxTransformer(cfg.makeControl(), name, wcs, calib)
                  # add flags to mapper here
                  
              def __call__(self, oldCatalog, newCatalog):
                  self.fluxTransformer(oldCatalog, newCatalog)
  3. This is a very minor issue, but when a class constructor takes a config object, it's conventional to make it the first argument, and call it "config".

  4. When writing the example code for my last comment, I realized that the interface as you have it won't quite work: the wcs and calib arguments need to be passed to __call__, not __init__, as they'll be different for every catalog processed.

    1. I think this isn't a problem in the current design: __init__ on the transformation plugins is called in run on the task, which is called once per catalog.

      In the original design, __call__ was called once per row; rather than passing in wcs and calib on every call, it seemed better to store them in the plugin during construction. Further, the overhead of constructing the plugins should be pretty insignificant compared to everything else that happens when the task is run.

      However, assuming that Robert's suggestion above is adopted, there's only one call to __call__ per run. Assuming that the source tables being processed in subsequent calls to run have the same schema (which I think is a reasonable assumption), it should be possible to move the plugin construction into __init__ on the task and move the wcs and calib arguments to __call__ as you suggest.
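
      Concretely, that revised interface would look something like this sketch:

        class TransformPlugin(object):
            def __init__(self, name, mapper, cfg):
                ...  # schema manipulation only; no per-catalog state

            def __call__(self, oldCatalog, newCatalog, wcs, calib):
                ...  # wcs and calib are now supplied with each catalog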

      1. Ah, of course.  But I do think that now that we have __call__ operating on a whole catalog at a time, it makes more sense to move the wcs and calib arguments there and move TransformPlugin initialization to the task constructor.  You'll find that when it comes time to write the command-line task that calls these, you'll actually need that, because command-line tasks that create catalogs are required to save the schema they'll use before any data is actually processed (we actually guarantee that all source tables being processed in subsequent calls to run must have the same schema).

        1. I prototyped this, and it works.

          However, in this scheme I need to have the schema of the input catalog available in the task __init__, which means I can't simply read it from the first dataref I come across. My workaround was this, which is functional but not elegant. Suggestions for a better approach welcome.

          By the way, I am continuing to push reworked prototypes to the u/swinbank/DM-1598 branches, but I'm not updating the document above on the assumption that it'll rapidly get confusing if these comments refer to different revisions of the doc.

          1. You should be able to use ButlerInitializedTaskRunner directly for this; it's basically what it was designed for.  I think the missing piece is that for any catalog dataset (e.g. "src") there's a corresponding schema dataset ("src_schema"), and you can load that from the butler with no data ID required.  So in the constructor for the command-line driver, you get a butler from the keyword arguments, you can use that to get the input schema, and then pass that on to the TransformTask constructor.
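
            In other words, the driver constructor would look something like this sketch (the 'transform' subtask name and the inputSchema argument are assumptions about the prototype):

              class SrcTransformTask(pipeBase.CmdLineTask):
                  RunnerClass = pipeBase.ButlerInitializedTaskRunner
                  ...
                  def __init__(self, *args, **kwargs):
                      butler = kwargs.pop('butler')
                      pipeBase.CmdLineTask.__init__(self, *args, **kwargs)
                      # No data ID is needed to load the schema dataset
                      inputSchema = butler.get('src_schema', immediate=True).schema
                      self.makeSubtask('transform', inputSchema=inputSchema)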

            1. That was indeed the missing link – thanks!

  5. FWIW, I'm not too bothered by the fact that mapper is an input/output argument for the constructor - while I agree that it's a bit unusual, especially in Python, it's at least not a problem for this design review to address, as it's a very common pattern for Schema and SchemaMapper objects throughout our codebase.

  6. For TransformPlugin: I agree that:

    • It is better to pass the full catalogs to __call__ instead of single records
    • __init__ should be called once from the task's initialization, if at all possible, and that wcs and calib should be passed to __call__
    • The config argument should be named "config", not "cfg", and should come first in __init__

    Would it make any sense for these plugins to be subclasses of Task? This gives you a log attribute and a time-and-resource measuring decorator. If you go this route I would not add a "run" method.

    1. Thanks for your comments; I certainly agree with the first three points.

      I also quite like the suggestion of subclassing Task. However, I'd like to make it possible to write "first class" transformations in C++ – that is, they have exactly the same capabilities as those written in Python. This is currently done with a tiny Python wrapper (which effectively just calls makeControl() on the config object). It wouldn't be possible to maintain this equivalence if we make the Python code into a derivative of Task, and for that reason I'm not going to implement this idea unless there's a big demand for it.

  7. I don't like the name TransformPlugin (transform what? and everything's a plugin in our system); would prefer something like MeasurementTransformer or MeasurementCalibrater.

    How does this account for position-dependent photometric calibration (e.g., output from son-of-meas_mosaic)?  Through a subclass of Calib?  That might require updating the Calib interface to accept x,y.

    1. +1 on a name change.

      I think Calib definitely should be the place we put position-dependent photometric effects, and yes, that will require a different interface on the base class.  I think that means we don't worry about it for this particular design, since it just passes the Calib down to the plugins, and it will be their job to use them correctly.
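
      For illustration only, such an interface might look like the following purely hypothetical sketch – no such class exists in afw today:

        class SpatialCalib(Calib):
            """Hypothetical Calib subclass with a spatially-varying zero point."""
            def getMagnitude(self, flux, x, y):
                """Return the calibrated magnitude for a flux measured at pixel (x, y)."""
                ...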

    2. Agree re naming. I was trying to avoid the term "calibration", since we already have a calibration task; I quite like MeasurementTransformer though.

      I'll defer to Jim's expertise re your second point.