Preamble
During measurement, a series of plug-in algorithms are used to measure the raw properties of each source (e.g. fluxes, positions in pixel coordinates). The "calibration and ingest" system provides a means of transforming those raw measurements to calibrated units, such as magnitudes or positions in celestial coordinates.
An important consideration is that the information required to perform the transformation may not be available at measurement time. For example, it may depend on a global astrometric or photometric calibration which has not yet been performed. For this reason, measurement algorithms cannot be expected to simply write calibrated measurements to their output.
This work is tracked as DM-1074 and DM-1598.
Goals
A task should be available which takes as input:
- An afw::table::SourceCatalog describing raw measurements of sources from a particular image;
- An lsst::afw::image::Calib describing the photometric calibration of that image;
- An lsst::afw::image::Wcs describing the world coordinate system of the image.
The task should produce an afw::table::BaseCatalog containing calibrated measurements. Note that:
- The transformation from raw to calibrated units is not known a priori, but must rather be defined on a per-measurement plugin basis;
- The relationship between input and output fields is not necessarily one-to-one – rather, some raw measurements may be combined to produce derived quantities;
- The transformation from raw to calibrated units may depend on the photometric and WCS information supplied to the task, and on the configuration of the plugin which performed the raw measurement;
- We do not copy slots from the input SourceTable.
In addition, one or more command line tasks should be produced which provide an appropriate interface between the calibration transformation and the end user, making it possible to specify the appropriate input data.
Design
We tackle this problem by:
- Defining "transformation plugins", which describe the means of transforming the output of a measurement plugin.
- Augmenting measurement plugins with a method which provides the caller with a transformation plugin appropriate to the measurement.
- Providing a task which transforms an input catalog to an output catalog using the infrastructure defined above.
Transformation Plugins
Transformation plugins are written in Python, and inherit from the base class TransformPlugin. All transformation plugins are expected to adhere to the following interface:
```python
class TransformPlugin(object):
    def __init__(self, name, mapper, cfg, wcs, calib):
        ...

    def __call__(self, oldRecord, newRecord):
        ...
```
The name and cfg arguments to __init__ describe the name and configuration of the measurement plugin whose results we will transform. The wcs and calib arguments describe the WCS and calibration which will be available for use in the transformation. These four arguments are stored as instance variables within the object. The mapper argument is a SchemaMapper which describes the input (raw measurement list) and output (calibrated) schemas and the relationship between them. The __init__ method should:
- Add mappings between fields which should be directly copied from input to output;
- Add field definitions for those quantities which will be calculated during transformation and store their keys.
The __call__ method will be invoked once with each pair of raw and corresponding calibrated table records. It should perform whatever transformation is necessary to populate the latter, given the former as well as the information stored in the transformation plugin (measurement plugin name and configuration, WCS, calibration).
For example, the following would:
- Copy the contents of all fields beginning with the name of the measurement plugin from the raw source list to the calibrated output;
- Add an additional field with the name example and the value 10.0 to all output records.
```python
class Example(TransformPlugin):
    def __init__(self, name, mapper, cfg, wcs, calib):
        TransformPlugin.__init__(self, name, mapper, cfg, wcs, calib)
        # Map all fields belonging to this measurement plugin directly
        # from the input schema to the output schema.
        for key, field in mapper.getInputSchema().extract(name + "*").itervalues():
            mapper.addMapping(key)
        # Define an additional output-only field and store its key.
        outputSchema = mapper.editOutputSchema()
        self.newKey = outputSchema.addField("example", type="D")

    def __call__(self, oldRecord, newRecord):
        newRecord.set(self.newKey, 10.0)
```
Note that hierarchies of transformation plugins can be built up in this way – for example, a FluxTransformer plugin could provide some basic transformation for flux fields which could be inherited and augmented by PsfFluxTransformer, etc.
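Such a hierarchy might be sketched as follows. This is an editorial illustration, not code from the prototype: plain Python dicts stand in for afw table records, and the constructor signature follows the TransformPlugin interface above. The getMagnitude call mirrors the lsst::afw::image::Calib API.

```python
class TransformPlugin(object):
    """Base class per the interface above; records are plain dicts here."""
    def __init__(self, name, mapper, cfg, wcs, calib):
        self.name = name
        self.cfg = cfg
        self.wcs = wcs
        self.calib = calib

class FluxTransformer(TransformPlugin):
    """Generic flux handling shared by all flux measurements:
    convert the raw flux to a magnitude using the stored calibration."""
    def __call__(self, oldRecord, newRecord):
        newRecord[self.name + "_mag"] = self.calib.getMagnitude(
            oldRecord[self.name + "_flux"])

class PsfFluxTransformer(FluxTransformer):
    """Inherits the generic flux-to-magnitude logic and augments it,
    here by also copying the measurement's failure flag."""
    def __call__(self, oldRecord, newRecord):
        FluxTransformer.__call__(self, oldRecord, newRecord)
        newRecord[self.name + "_flag"] = oldRecord[self.name + "_flag"]
```

PsfFluxTransformer thus reuses the flux conversion from its parent and adds only the PSF-specific behaviour, which is the pattern the paragraph above describes.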
Mapping measurement plugins to transformations
Measurement plugins are expected to provide a static method which returns the class of the transformation plugin which should be applied to their outputs. We modify BasePlugin to return a NullTransform which causes no data to be copied to the output:
```python
class NullTransform(TransformPlugin):
    def __call__(self, oldRecord, newRecord):
        pass

class BasePlugin(object):
    ...
    @staticmethod
    def getTransformClass():
        return NullTransform
```
This null operation is then the default for any measurements which do not define their own transformations.
Transformation task design
TransformTask defines an __init__ which takes two additional arguments: the configuration of the task which was used to perform the raw measurements, and the registry of available plugins. The former is used to derive a list of the plugins which performed the measurements, together with their configurations; this list is stored as an instance variable.
```python
class TransformTask(pipeBase.Task):
    def __init__(self, *args, **kwargs):
        ...
        measConfig = kwargs.pop('measConfig')
        self.pluginRegistry = kwargs.pop('pluginRegistry')
        self.measPlugins = [(name, measConfig.value.plugins.get(name))
                            for name in measConfig.value.plugins.names]
```
The run method of TransformTask takes as arguments the list of sources to be transformed and the WCS and calibration to be applied. It constructs a mapper and adds some (configurable) fields which are copied as standard – in this way it is possible to preserve, in the output, arbitrary data which is not the output of a measurement plugin. It then uses the registry to look up the measurement plugins by name, retrieves the corresponding TransformPlugin for each, and configures it appropriately:
```python
class TransformTask(pipeBase.Task):
    def run(self, sourceCat, wcs, calib):
        mapper = afwTable.SchemaMapper(sourceCat.schema)
        mapper.addMapping(sourceCat.schema.find('id').key)
        transforms = [self.pluginRegistry.get(name).PluginClass.getTransformClass()(name, mapper, cfg, wcs, calib)
                      for name, cfg in self.measPlugins]
```
Finally, we iterate over all sources, using a combination of the mapper and the transformation plugins to transform old to new:
```python
class TransformTask(pipeBase.Task):
    def run(...):
        ...
        newSources = afwTable.BaseCatalog(mapper.getOutputSchema())
        newSources.reserve(len(sourceCat))
        for oldSource in sourceCat:
            newSource = newSources.addNew()
            newSource.assign(oldSource, mapper)
            for transform in transforms:
                transform(oldSource, newSource)
```
Command line tasks
A series of command line tasks can be defined which feed the appropriate inputs to TransformTask, loading whatever plugin registry, source table, calibration and WCS is required by the end user.
Discussion & criticisms
The way in which the __init__ method of TransformPlugin derivatives takes a reference to a mapper and modifies it is ugly: it's unfortunate for a constructor to modify an object other than the one it is constructing, and, if a function modifies something, it would ideally return the thing being modified. An alternative would be to add another method (TransformPlugin.configure(mapper), say) which avoids the above, but this involves more code for little practical benefit.
All transformations (other than a trivial copy) are performed by Python code. It is assumed that this is not a major bottleneck. However, future improvements to SchemaMapper could increase the variety of operations that can be performed directly inside the mapper in C++, thereby mitigating this. It will likely always be necessary to iterate over the source list in Python, since it will be desirable to continue defining some transformations in Python code.
Implementation
A prototype (lacking documentation, tests, etc.) implementation of the above system is available on the u/swinbank/DM-1598 branch in meas_base and pipe_tasks.
21 Comments
John Swinbank
Joshua Hoblitt writes (on Hipchat):
I don't have much constructive to say other than it would be nice for "dummy" lsst::afw::image::Calib & lsst::afw::image::Wcs objects to be available that are essentially no-ops. Maybe something that is automatically used if None is passed in?
I suspect this might be an API that users will want to test by hand to evaluate the transform (users in the context of level 3 / external).
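The no-op objects Joshua asks for could be approximated by duck typing in a handful of lines. This is an editorial sketch, not part of the proposal: the method names mirror a small subset of the lsst.afw.image Calib and Wcs interfaces, and everything else is an assumption.

```python
import math

class MockCalib(object):
    """Duck-typed stand-in for lsst.afw.image.Calib: a fixed
    flux-to-magnitude zero point, so that transformations run
    without any real photometric calibration being applied."""
    def __init__(self, fluxMag0=1e12):
        self._fluxMag0 = fluxMag0

    def getMagnitude(self, flux):
        return -2.5 * math.log10(flux / self._fluxMag0)

class MockWcs(object):
    """Duck-typed stand-in for lsst.afw.image.Wcs: degenerate
    'sky' coordinates that are just the pixel coordinates."""
    def pixelToSky(self, x, y):
        return (x, y)
```

Passing such mocks in place of real Calib and Wcs objects would let a transformation be exercised by hand without a full calibration being available.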
Jim Bosch
Unfortunately I think that will be difficult, as both Calib and Wcs are hard-coded to change the units as well as the values, and at least in the case of Wcs, those units are captured in the input types of the classes they deal with.
John Swinbank
It depends what the detailed requirements are here, but doing this at least on a superficial level is pretty easy. In Python code, you can always mock up something that looks enough like a Wcs or Calib to enable you to test a simple transformation. If you need to test a transformation which is dependent on the detailed characteristics of either of the above, you'll presumably need to construct a "real" one with the properties you need.
The tests for my code will include generating trivial/default Wcs and Calib objects to demonstrate that the interface works anyway.
Robert Lupton
You can access a SourceTable by row or by column, and it's a good deal faster to do so by column (and it's easier too, no need to save Keys for efficiency). Could your calibration loop work by column?
Jim Bosch
While operating by column might still be a bit faster, I think it'd be cleaner to keep the current design of row-by-row manipulation, but allow TransformPlugins to be implemented in C++ (and in fact to implement all the standard ones we'll use 99% of the time in C++). That might already be the case, given that Python will just duck-type them unless you have explicit isinstance checks. Being able to use column-based setters in analysis code is very useful, but I sort of dislike the fact that being efficient in NumPy forces us to write things in a way that is less natural to read. Of course, I could be an outlier here, as someone who is pretty comfortable with C++.
John Swinbank
Adapting my current prototype to implement Robert's idea is very straightforward; I think you should simply be able to change the definition of __call__ in the transformation plugins so it looks something like this:
And then the task becomes even simpler:
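(The code snippets that originally accompanied this comment did not survive export. The following is an editorial sketch of the column-based variant being discussed, with dicts of NumPy arrays standing in for the afw catalogs; the class name and fluxMag0 parameter are assumptions.)

```python
import numpy as np

class ColumnFluxTransform(object):
    """Column-based TransformPlugin variant: __call__ receives whole
    catalogs (dicts of NumPy arrays in this sketch) rather than being
    invoked once per record."""
    def __init__(self, name, fluxMag0):
        self.name = name
        self.fluxMag0 = fluxMag0

    def __call__(self, oldCat, newCat):
        # One vectorized operation per column replaces the per-row loop.
        flux = oldCat[self.name + "_flux"]
        newCat[self.name + "_mag"] = -2.5 * np.log10(flux / self.fluxMag0)
```

In this scheme the task's inner loop reduces to a single call per transform, e.g. `for transform in transforms: transform(sourceCat, newSources)`.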
(I've not checked if I need to worry about deep copies in extend when using a mapper, but that's beside the point for now.)
While I broadly agree with Jim that for performance we should use C++ rather than sacrifice readability, I don't actually find this any less natural (indeed, perhaps even more natural) than the row-based approach.
Jim Bosch
I agree - now that I see it, it's not as bad as I thought, and I think it's probably best to just go with this interface.
Jim Bosch
Actually, can we do this interface and implement the predefined standard ones in C++? In C++, we could do the loop over rows first (with probably marginally better performance than doing it in columns, because it traverses the memory in the right order).
Now that I think about it, though, we can't just write a C++ class that has the same interface and use it, because the config instance the constructor takes is pure Python, and even when that's based on a C++ control class, we'd need to call makeControl() on the config object and pass the result to the C++ code. So, instead, I think we should probably just provide some C++ classes that Python TransformPlugin classes could delegate to via composition; there'd logically be one of these for each of the existing Result/ResultMapper pairs in meas_base, which I think would make this system very intuitive.
Of course, none of that affects the interface you've described here, except that the typical example subclass would now look something like this:
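(Jim's example subclass did not survive export. The following is an editorial sketch of the composition pattern he describes; FluxTransformImpl is a pure-Python stand-in for a hypothetical wrapped C++ helper, and cfg.makeControl() follows the convention he mentions.)

```python
class FluxTransformImpl(object):
    """Stand-in for a C++ helper class (which would be wrapped for use
    from Python); it takes a control object, not a pure-Python config."""
    def __init__(self, ctrl):
        self.ctrl = ctrl

    def transform(self, name, oldRecord, newRecord, calib):
        # The real work would happen in C++; here, a simple flux-to-mag.
        newRecord[name + "_mag"] = calib.getMagnitude(oldRecord[name + "_flux"])

class PsfFluxTransform(object):
    """Python TransformPlugin delegating to the helper via composition.
    makeControl() converts the Python config to the C++-side control."""
    def __init__(self, name, mapper, cfg, wcs, calib):
        self.name = name
        self.calib = calib
        self._impl = FluxTransformImpl(cfg.makeControl())

    def __call__(self, oldRecord, newRecord):
        self._impl.transform(self.name, oldRecord, newRecord, self.calib)
```

The Python class keeps the interface described in the document, while the per-record arithmetic lives in the (notionally C++) helper.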
Jim Bosch
This is a very minor issue, but when a class constructor takes a config object, it's conventional to make it the first argument and call it "config".
Jim Bosch
When writing the example code for my last comment, I realized that the interface as you have it won't quite work: the wcs and calib arguments need to be passed to __call__, not __init__, as they'll be different for every catalog processed.
John Swinbank
I think this isn't a problem in the current design: __init__ on the transformation plugins is called in run on the task, which is called once per catalog.
In the original design, __call__ was called once per row; rather than passing in wcs and calib on every call, it seemed better to store them in the plugin during construction. Further, the overhead of constructing the plugins should be pretty insignificant compared to everything else that happens when the task is run.
However, assuming that Robert's suggestion above is adopted, there's only one call to __call__ per run. Assuming that the source tables being processed in subsequent calls to run have the same schema (which I think is a reasonable assumption), it should be possible to move the plugin construction into __init__ on the task and move the wcs and calib arguments to __call__ as you suggest.
Jim Bosch
Ah, of course. But I do think that now that we have __call__ operating on a whole catalog at a time, it makes more sense to move the wcs and calib arguments there and move TransformPlugin initialization to the task constructor. You'll find that when it comes time to write the command-line task that calls these, you'll actually need that, because command-line tasks that create catalogs are required to save the schema they'll use before any data is actually processed (we actually guarantee that all source tables being processed in subsequent calls to run must have the same schema).
John Swinbank
I prototyped this, and it works.
However, in this scheme I need to have the schema of the input catalog available in the task __init__, which means I can't simply read it from the first dataref I come across. My workaround was this, which is functional but not elegant. Suggestions for a better approach are welcome.
By the way, I am continuing to push reworked prototypes to the u/swinbank/DM-1598 branches, but I'm not updating the document above on the assumption that it'll rapidly get confusing if these comments refer to different revisions of the doc.
Jim Bosch
You should be able to use ButlerInitializedTaskRunner directly for this; it's basically what it was designed for. I think the missing piece is that for any catalog dataset (e.g. "src") there's a corresponding schema dataset ("src_schema"), and you can load that from the butler with no data ID required. So in the constructor for the command-line driver, you get a butler from the keyword arguments, you can use that to get the input schema, and then pass that on to the TransformTask constructor.
John Swinbank
That was indeed the missing link – thanks!
Jim Bosch
FWIW, I'm not too bothered by the fact that mapper is an input/output argument for the constructor. While I agree that it's a bit unusual, especially in Python, it's at least not a problem for this design review to address, as it's a very common pattern for Schema and SchemaMapper objects throughout our codebase.
Russell Owen
For TransformPlugin: I agree that:
Would it make any sense for these plugins to be subclasses of Task? This gives you a log attribute and a time-and-resource measuring decorator. If you go this route I would not add a "run" method.
John Swinbank
Thanks for your comments; I certainly agree with the first three points.
I also quite like the suggestion of subclassing Task. However, I'd like to make it possible to write "first class" transformations in C++; that is, transformations with exactly the same capabilities as those written in Python. This is currently done with a tiny Python wrapper (which effectively just calls makeControl() on the config object). It wouldn't be possible to maintain this equivalence if we made the Python code into a derivative of Task, and for that reason I'm not going to implement this idea unless there's a big demand for it.
Paul Price
I don't like the name TransformPlugin (transform what? and everything's a plugin in our system); I would prefer something like MeasurementTransformer or MeasurementCalibrater.
How does this account for position-dependent photometric calibration (e.g., output from son-of-meas_mosaic)? Through a subclass of Calib? That might require updating the Calib interface to accept x, y.
? That might require updating the Calib interface to accept x,y.Jim Bosch
+1 on a name change.
I think Calib definitely should be the place we put position-dependent photometric effects, and yes, that will require a different interface on the base class. I think that means we don't worry about it for this particular design, since it just passes the Calib down to the plugins, and it will be their job to use them correctly.
John Swinbank
Agree re naming. I was trying to avoid the term "calibration", since we already have a calibration task; I quite like MeasurementTransformer, though.
I'll defer to Jim's expertise re your second point.