InstanceCatalog Redesign
Jake Vanderplas, June 20, 2013
In this document I outline the redesign of the instance catalog generation and measures APIs. The overarching goal of this redesign is to provide an efficient, easily configurable, and extensible interface to the catalogs generated from LSST simulation runs. The redesign covers two main pieces of the stack, within the generation and measures codebases.
Goals of Review
The goals of this review are to evaluate the effectiveness of the design in terms of extensibility and maintainability. We want users to be able to define custom catalog outputs in a very seamless way: no config files, just Python code. So please address the following questions in this review:
- Does the method of introspection for defining the catalog class seem appropriate?
- Are there use cases where introspection may cause problems?
- Does the custom class interface make sense from a user's perspective?
- Is the code written clearly enough?
Core Principles
Both pieces have a core design principle, which is the idea of subclasses as a primary interface. That is, users who wish to modify the default behavior can do so through the creation of subclasses of the core functionality. In order to make these subclasses as terse as possible, metaclasses are used to process specified class attributes and build the appropriate methods.
Database Interface
So, for example, the Star database object is defined (in StarModels.py)
as follows:
    from dbConnection import DBObject

    class StarObj(DBObject):
        objid = 'msstars'
        tableid = 'starsMSRGB_forceseek'
        # ... several more string attributes
        columns = [('id', 'simobjid', int),
                   ('umag', None),
                   ('gmag', None),
                   # ... several more column attributes
                   ]
The class declaration contains no methods: all applicable methods are defined within the DBObject base class. Additionally, upon creation the new class is registered with the DBObject base class, so that it can be referred to by its specified object id:
star_obj = DBObject.from_objid('msstars', address=...)
This is true not only of built-in types, but also of user-defined types. The result is a very simple user-facing API for extending the behavior of the built-in objects, with no separate config files needed.
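The registration mechanism described above can be sketched with a small metaclass that keeps a registry keyed on each subclass's ``objid``. The names ``DBObject``, ``from_objid``, and ``objid`` follow the document; everything else here (the metaclass name, the registry dict) is an illustrative guess at the internals, not the actual catsim implementation:

```python
class DBObjectMeta(type):
    """Hypothetical metaclass: registers each subclass under its objid."""
    registry = {}

    def __new__(mcs, name, bases, attrs):
        cls = super(DBObjectMeta, mcs).__new__(mcs, name, bases, attrs)
        objid = attrs.get('objid')
        if objid is not None:
            # register the new class so it can be found by object id later
            mcs.registry[objid] = cls
        return cls


class DBObject(metaclass=DBObjectMeta):
    objid = None  # base class itself is not registered

    @classmethod
    def from_objid(cls, objid, **kwargs):
        # look up the registered subclass and instantiate it
        return DBObjectMeta.registry[objid](**kwargs)


class StarObj(DBObject):
    objid = 'msstars'
    tableid = 'starsMSRGB_forceseek'


# user code never names StarObj directly: the objid is enough
star_obj = DBObject.from_objid('msstars')
```

Because registration happens in the metaclass, a user-defined subclass is registered simply by being defined, with no config file or explicit registration call.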
Instance Catalog Interface
Where this approach really becomes interesting is within the instance catalog definitions. The InstanceCatalog object has been rewritten with the same principles, but with a very flexible way to define the source of data for columns. As above, the column names are specified as class attributes. So, a new type of test catalog can be specified like this:
    class TestCatalog(InstanceCatalog):
        catalog_type = 'test_catalog'
        column_outputs = ['galID', 'raJ2000', 'decJ2000']
        refIdCol = 'galid'
        transformations = {'raJ2000': np.degrees,
                           'decJ2000': np.degrees}
When the catalog is instantiated from a database and the ``write_catalog`` method is called, the column_outputs are searched and the appropriate data is written. This data can be sourced from the database, from ``get_`` methods in base classes, or from ``get_`` methods in the class itself. The core of the logic required for this is in the ``column_by_name()`` method. It takes a column name as an argument, and returns the column values.
For example, imagine you have a database and are searching for the column called 'gmag'. The search path is as follows:
- If the class has a method called ``get_gmag``, then return the result of this method. Note that this will search the entire method resolution order of the class, so that parent classes and class mixins can be used in an intuitive way to extend built-in functionality.
- If there is no getter for the column, the next place to be searched is within compound column names (see below). Compound column names are groups of attributes which, for efficiency and convenience, may be computed together. If 'gmag' exists in a compound column, then its value will be computed and returned.
- Finally, if 'gmag' has neither an associated getter nor a compound column, its value is taken from the database.
Compound Columns
For cleaner syntax and better efficiency, a single getter function can return multiple columns. For example, when correcting magnitude systems, it may be more efficient to apply photometric corrections to all magnitudes at once. So your 'gmag' column may be defined as follows:
    class TestCatalog(object):
        # ...
        @compound('gmag', 'rmag', 'imag')
        def get_magnitudes(self):
            g_raw, r_raw, i_raw = map(self.column_by_name,
                                      ('g_raw', 'r_raw', 'i_raw'))
            # convert the raw magnitudes using some function
            gmag, rmag, imag = convert(g_raw, r_raw, i_raw)
            return gmag, rmag, imag
Because the results of compound column calculations will in general be accessed one at a time, the compound decorator also automatically caches the results the first time the function is computed.
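A minimal sketch of a ``compound`` decorator with this caching behavior is below. This is an assumption about how the real decorator works, inferred from the description above; the actual catsim implementation may differ (e.g., in how it interacts with the metaclass):

```python
import functools

def compound(*column_names):
    """Hypothetical sketch: mark a getter as producing several columns,
    and cache its results so repeated per-column access runs it once."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self):
            # per-instance cache, keyed on the getter's name
            cache = self.__dict__.setdefault('_compound_cache', {})
            if func.__name__ not in cache:
                cache[func.__name__] = dict(zip(column_names, func(self)))
            return cache[func.__name__]
        wrapper._colnames = column_names  # recorded for introspection
        return wrapper
    return decorator


class Demo(object):
    calls = 0  # count how often the expensive computation runs

    @compound('gmag', 'rmag')
    def get_magnitudes(self):
        Demo.calls += 1
        return (20.0, 19.0)


d = Demo()
d.get_magnitudes()['gmag']
d.get_magnitudes()['rmag']
# the second access hits the cache, so the body ran only once
```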
Validation: Introspecting the Database
One potential problem with this approach is the validation of user-designed classes. For example, what if the catalog asks for an attribute that has neither a getter, nor a compound column entry, nor a database entry? We'd like this failure to be caught as early as possible, before (for example) the necessary data is pulled from the database. Also, if we simply query the database each time a column is requested, it will lead to a lot of overhead. It would be better to query the database only once to pull down all the required columns, after which the remaining derived attributes can be computed.
This behavior is implemented within the InstanceCatalog metaclass through a clever bit of introspection. When the class is instantiated, the metaclass does a dry-run of outputting the table. To do this dry run, it temporarily replaces the real database with a duck-typed standin (_MimicRecordArray) which logs the columns that are requested. At the end of the dry run, we are left with a list of the columns that are required to be available in the database, and a quick check is performed to make sure the database has all the required columns. If not, an error is raised before any computation or data movement takes place.
Toy Examples
The result is a very intuitive Python API for defining various catalog types. Because the API takes advantage of Python's built-in method resolution order, it allows for very clean and powerful catalogs. Here is an example of a series of basic catalog types defined with this API:
    class BasicCatalog(InstanceCatalog):
        """Simple catalog with columns directly from the database"""
        catalog_type = 'basic_catalog'
        refIdCol = 'id'
        column_outputs = ['id', 'ra_J2000', 'dec_J2000',
                          'g_raw', 'r_raw', 'i_raw']
        # transformations specify conversions when moving from the database
        # to the catalog. In this case, we take RA/DEC in radians and
        # convert to degrees.
        transformations = {"ra_J2000": np.degrees,
                           "dec_J2000": np.degrees}

    class AstrometryMixin(object):
        @compound('ra_corrected', 'dec_corrected')
        def get_points_corrected(self):
            ra_J2000 = self.column_by_name('ra_J2000')
            dec_J2000 = self.column_by_name('dec_J2000')
            # ... do the conversions: these are just standins
            ra_corrected = ra_J2000 + 0.001
            dec_corrected = dec_J2000 - 0.001
            return ra_corrected, dec_corrected

    class PhotometryMixin(object):
        @compound('g', 'r', 'i')
        def get_mags(self):
            g_mag, r_mag, i_mag = map(self.column_by_name,
                                      ['g_mag', 'r_mag', 'i_mag'])
            # ... do conversions here: these are just standins
            g = g_mag + 0.01
            r = r_mag - 0.01
            i = i_mag + 0.02
            return g, r, i

    class CustomCatalog(BasicCatalog, AstrometryMixin, PhotometryMixin):
        column_outputs = ['id', 'redshift', 'points_corrected', 'mags']
        transformations = {"ra_corrected": np.degrees,
                           "dec_corrected": np.degrees}
Now, to create a catalog, we connect to a database and call ``write_catalog``:
    db = GalaxyObj()
    catalog = CustomCatalog(db)
    catalog.write_catalog("out.txt")

    # out.txt has the following columns:
    #   id  redshift  ra_corrected  dec_corrected  g  r  i
2 Comments
Andrew Connolly
Comments from Jim Chiang
italics: response to comments and actions
General Comments:
* Use of metaclasses does make it easier on the client/user to define
subclasses with the desired properties.
* What code was meant to be reviewed? I only looked in the rewrite
subdirectories in the generation and measures trees.
* Is any of the older code in those trees used anymore?
This will be the primary codebase (needs to be transitioned)
Database Interface:
* Will users commonly create DBObject subclasses? If so, it would be
useful to have the columns attribute able to be generated automatically
via a db query.
Possibly, but on their own databases. Jim: if this becomes a common case, we should auto-generate this from the schema
* Implementing columns attribute requires access to the db schema. It's
not clear to me how users are meant to find that info. Some general
function for inspecting the db (i.e., which tables are available and their
schema) would be useful.
Simon - difficult to address, as there are conversions done in the database, so it is hard to return the columns. Need to know the columns.
* The DBObject.spatialModel attribute (and its later usage in
InstanceCatalog.get_spatialmodel) does not seem sufficiently general
to handle FITS files as spatial models.
Can you support FITS images in this model (rather than just stating the format as a string)? Could make the attribute a spatial model class. FITS files are treated as a special case in the phosim interface, which should be addressed.
* The name "appendint" seems insufficiently descriptive of what it is
supposed to be in the trim file entry. Maybe rename it to
"objectTypeId"?
Action: change this name (Simon)
Instance Catalog Interface:
* This implementation works well, and seems fairly intuitive to me,
but I only tried out the functionality at the level of the toy
examples plus adding some constraints in the write_catalog(...)
method.
* It would be nice to see an example where a trim file is produced.
Action: example code to generate trim file (Simon)
Non-design-related issues:
* Documentation!
Action: need to address this ASAP. Debbie: DM stack setup to make this easy. Simon: possible. Do we need to go through the TCC? Jake: the Python community is dealing with this (e.g., Anaconda out of Continuum)
* External dependencies that are not natively included in the DM stack
distributions need to be listed:
slalib, sqlalchemy, pymssql, (probably others).
* Even with a complete set of externals and DM stack installation,
what packages need to be setup? It's possible to figure out these
dependencies from the import errors, but these sorts of things
should be well-documented.
* slalib dependency should be addressed ASAP.
* Connecting via
mssql+pymssql://LSST-2:L$$TUser@fatboy.npl.washington.edu:1433/LSST
never worked for me. See my email to lsst-imsim.
This prevented me from determining if I had a working installation
until I created my own set of db tables.
* Unit tests should be implemented (and run!) so that (possible) bugs
like the one that Debbie ran into ("AttributeError: 'list' object
has no attribute 'dtype'") are less likely to happen.
Action: set up unit tests. Jim: just have sqlite in the repository
* For the unit tests, one could tuck a small sqlite db file in the
repository and use that. With something like that, the lack of
connectivity to the UW db wouldn't be an issue.
* Almost all of the code snippets on the design review page had coding
errors (I think I fixed all of them.) Part of the issue was
transcription to confluence from Jake's document (e.g., incorrect
indent levels). Others were real syntax errors. Example code like
this should be correct!
Final Comments:
Overall, the design looks solid and well-implemented, but because of
the limited documentation and difficulties I had in getting a useable
distribution and a useable set of db tables, I didn't have time to
exercise or understand the code as well as I would have liked.
In addition to smoothing over the documentation and distribution
issues, more typical use cases should have been provided, ideally
through the unit tests to some degree. I had never used catsim
before, so my attempts to exercise the code may be too simplistic.
Since this code is meant to be used in conjunction with phosim, a set
of tests that generate output for all possible types of trim file
entries should be implemented (with the output successfully fed to
phosim). There may be other unsupported features like the spatial
model issue for FITS files. I would say that this particular case
should be resolved on the phosim side, i.e., by rationalizing the way
those models are specified, but if not, then that would probably mean
a non-trivial redesign of the spatial model handling in the new catsim
code.
Unknown User (djbard)
Comments from Debbie:
italics: response to comments and actions
This is based on my experience setting up and using the catalogue generation code, with an aim to use the code to access my own galaxy catalogue database.
General comments:
** This is a vast improvement over the previous code! In principle, I only need to modify one file instead of several (as was the case in the old code). In principle, this is very easy to modify. In practice, however...
** ...the issues I've run in to, as a user, are down to the lack of documentation and worked examples. I can provide examples based on my own experience, and hopefully other users can also contribute their own worked examples. It'd still be VERY helpful to have developer-provided documentation, including more comments in the code!
Action - documentation and worked examples, including comments in the code. Debbie will provide some (need to seed them)
Specific comments:
** Can't use it at SLAC because of slalib.
** It takes forever to check out measures - is there much extraneous information?
Measures takes a long time to check out. Simon - would working from tags be acceptable?
Action - will look at removing superfluous code from the git repo (to reduce the size of the repo)
** Why are there three packages associated with the catalogue simulation? It's not intuitive what each package does.
Originally to separate the database from interactions on catalogs. Useful stuff in measures (should separate out the functionality), such as photometry, astrometry, etc. OK to move some of the packages out of measures to standalone packages
** When running the catalogue generation example scripts, there are a lot of messages stating: "Either appendint or spatialModel has not been set. Input files for phosim are not possible." These are not useful, since evidently a catalogue is produced anyway.
Can't produce trim files from some of the source types (e.g. for the full galaxy). If in the metaclass, it happens at the construction of the class. One warning message per type of object. Simon - will run this
** Bug in generation/tests_rewrite/runConn.py related to "AttributeError: 'list' object has no attribute 'dtype'"?
Simon will look at this (test not keeping up with the base class)
** (perhaps beyond the scope of this review) Why must bulge and disk be defined separately? Could we not have a galaxy class for ellipticals, and a class for bulge+disk objects? This is particularly relevant when we're trying to do catalogue-level analysis, and we have to try to match up, by hand, the bulges and the disks that are specified at the same location. That's very inefficient.
Because you query bulge and then disk. Simon - a galaxy class combines bulge and disk. Should be able to subclass this
Debbie: Total magnitude of the combination is difficult – helper scripts in measures package?
Simon: Spatial model class may help with this
** It's very complicated to figure out how to make a catalogue from your own database. This is largely due to the total lack of documentation.
** Why are there so many different ID numbers? I needed to make a galtileid, even though I'm not tiling galaxies.
Simon: That's clearly a bug
** I was unsuccessful in getting new quantities into my catalogue, i.e., shear parameters from my database.
Quantities added to the galaxy object definition, but the catalog contains zero for these parameters.
Maybe a bug with the default value overwriting the input value? Jake: this is probably a bug and should be addressed
Debbie: perhaps talk about details offline
Jake: the more user-specifiable something is, the more precedence that should have
Simon: the metaclass overrides methods on construction. This should be fixed.
Documentation
How to move it all over (all are in agreement)
generation: should be easy - literally replace everything with rewrite
measures: we might need to think about this. Still need to create mixins for certain subpackages
Nice to add unit tests before we move code over - timing with review in 3 weeks?