InstanceCatalog Redesign

Jake Vanderplas, June 20, 2013

In this document I will outline the elements of the redesign of the instance catalog generation and measures API. The overarching goal of this redesign was to provide an efficient, easily configurable, and extensible interface to the catalogs generated from LSST simulation runs. The redesign covers two main pieces of the stack, within the generation and measures codebases.

Goals of Review

The goals of this review are to evaluate the effectiveness of the design in terms of extensibility and maintainability.  We want users to be able to define custom catalog outputs in a very seamless way: no config files, just Python code.  So please address the following questions in this review:

  • Does the method of introspection for defining the catalog class seem appropriate?
  • Are there use cases where introspection may cause problems?
  • Does the custom class interface make sense from a user's perspective?
  • Is the code written clearly enough?
Checking out the code

The code is in the "rewrite" directory under catalogs and measures. For catalogs, use the git branch origin/u/vanderplas/queryDB_rewrite; for measures, the git branch is origin/u/vanderplas/instance_rewrite.

Core Principles

Both pieces share a core design principle: subclassing as the primary interface. That is, users who wish to modify the default behavior do so by creating subclasses of the core functionality. To keep these subclasses as terse as possible, metaclasses are used to process specified class attributes and build the appropriate methods.
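As a rough illustration of this principle (a minimal sketch, not the actual metaclass used in the code), a registering metaclass that processes a class attribute at definition time might look like this:

# Minimal sketch of the subclass-registration idea; names here are illustrative
# and do not correspond to the real catalogs/measures metaclasses.

class RegisteringMeta(type):
    registry = {}

    def __init__(cls, name, bases, dct):
        super(RegisteringMeta, cls).__init__(name, bases, dct)
        objid = dct.get('objid')
        if objid is not None:
            RegisteringMeta.registry[objid] = cls


class BaseObject(object):
    # Python 2 metaclass syntax, matching the stack's Python version at the time
    __metaclass__ = RegisteringMeta
    objid = None


class ExampleObject(BaseObject):
    objid = 'example'


# The subclass can now be looked up by the string it declared:
assert RegisteringMeta.registry['example'] is ExampleObject

The DBObject and InstanceCatalog machinery described below follows this pattern: subclasses are registered at definition time and their declared attributes are wired up automatically.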

Database Interface

So, for example, the Star database object is defined (in StarModels.py)
as follows:

 
from dbConnection import DBObject

class StarObj(DBObject):
    objid = 'msstars'
    tableid = 'starsMSRGB_forceseek'
    # ... several more string attributes
    columns = [('id', 'simobjid', int),
               ('umag', None),
               ('gmag', None),
               # ... several more column attributes
               ]

 

The class declaration contains no methods: all applicable methods are defined within the DBObject base class. Additionally, upon creation the new class is registered with DBObject, so that it can be referred to by its specified object id:

star_obj = DBObject.from_objid('msstars', address=...)

This is true not only of built-in types, but also of user-defined types. The result is a very simple user-facing API for extending the behavior of the built-in objects, with no separate config files needed.
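For example, a user-defined database object might look like the following; the object id, table, and column names here are hypothetical, chosen only to illustrate the mechanism:

from dbConnection import DBObject

class MyGalaxyObj(DBObject):
    # hypothetical identifiers pointing at a user's own table
    objid = 'mygalaxies'
    tableid = 'my_galaxy_table'
    columns = [('id', 'galid', int),
               ('raJ2000', None),
               ('decJ2000', None)]

# Because the subclass is registered when it is defined, it can be retrieved
# by its object id just like the built-in objects:
gal_obj = DBObject.from_objid('mygalaxies', address='...')  # connection address elided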

Instance Catalog Interface

Where this approach really becomes interesting is within the instance catalog definitions. The InstanceCatalog object has been rewritten with the same principles, but with a very flexible way to define the source of data for columns. As above, the column names are specified as class attributes. So, a new type of test catalog can be specified like this:

class TestCatalog(InstanceCatalog):
    catalog_type = 'test_catalog'
    column_outputs = ['galID', 'raJ2000', 'decJ2000']
    refIdCol = 'galid'
    transformations = {'raJ2000':np.degrees,
                       'decJ2000':np.degrees}

 

When the catalog is instantiated from a database and the ``write_catalog`` method is called, the column_outputs are searched and the appropriate data is written. This data can be sourced from the database, from ``get_`` methods in base classes, or from ``get_`` methods in the class itself. The core of the logic required for this is in the ``column_by_name()`` method. It takes a column name as an argument, and returns the column values.

For example, imagine you have a database and are searching for the column called 'gmag'. The search path is as follows (a toy sketch of this search order is given after the list):

  • If the class has a method called ``get_gmag``, then return the result of this method. Note that this will search the entire method resolution order of the class, so that parent classes and class mixins can be used in an intuitive way to extend built-in functionality.
  • If there is no getter for the column, the next place to be searched is within compound column names (see below). Compound column names are groups of attributes which, for efficiency and convenience, may be computed together. If 'gmag' exists in a compound column, then its value will be computed and returned.
  • Finally, if 'gmag' does not have an associated getter or compound column, its value is taken from the database.
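The following standalone toy (not the actual implementation) illustrates that search order, with a plain dictionary standing in for the database query:

class ColumnSearchToy(object):
    """Toy illustration of the getter -> compound -> database search order."""

    def __init__(self, db_columns):
        self._db = db_columns          # dict standing in for the database query
        self._compound_cache = {}      # would hold cached compound-column results

    def column_by_name(self, name):
        # 1. a get_<name> method anywhere in the MRO takes precedence
        getter = getattr(self, 'get_' + name, None)
        if getter is not None:
            return getter()
        # 2. next, check whether the column was produced by a compound getter
        if name in self._compound_cache:
            return self._compound_cache[name]
        # 3. otherwise, fall back to the database
        return self._db[name]


class ToyCatalog(ColumnSearchToy):
    def get_gmag(self):
        # a derived column computed from a raw database column
        return self._db['g_raw'] - 0.1


cat = ToyCatalog({'g_raw': 21.3, 'rmag': 20.8})
cat.column_by_name('gmag')   # resolved via the getter
cat.column_by_name('rmag')   # no getter or compound entry, so taken from the database

In the real InstanceCatalog the final step is a database query rather than a dictionary lookup, but the precedence is the same.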

Compound Columns

For a cleaner syntax and better efficiency, a single getter function can return multiple columns. For example, when correcting magnitude systems, it may be more efficient to apply photometric corrections to all magnitudes at once. So your 'gmag' column may be defined as follows:

class TestCatalog(object):
    # ...
    @compound('gmag', 'rmag', 'imag')
    def get_magnitudes(self):
        g_raw, r_raw, i_raw = map(self.column_by_name,
                                  ('g_raw', 'r_raw', 'i_raw'))
        # convert the raw magnitudes using some conversion function
        gmag, rmag, imag = convert(g_raw, r_raw, i_raw)
        return gmag, rmag, imag

 

Because the results of compound column calculations will in general be accessed one at a time, the compound decorator also automatically caches the results the first time the getter is called.
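As a sketch of how such a caching compound decorator could behave (the real decorator in the measures code may be implemented differently):

import functools

def compound_sketch(*colnames):
    """Toy compound-column decorator: the wrapped getter runs once, and its
    per-column results are cached on the instance so that later requests for
    individual columns reuse the stored values."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self):
            cache = getattr(self, '_compound_cache', None)
            if cache is None:
                cache = self._compound_cache = {}
            if colnames[0] not in cache:
                cache.update(zip(colnames, func(self)))
            return tuple(cache[name] for name in colnames)
        wrapper._compound_columns = colnames  # lets the catalog map columns to this getter
        return wrapper
    return decorator

With a decorator like this, asking the catalog for 'gmag' runs the full computation once, and later requests for 'rmag' and 'imag' are served from the cache.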

Validation: Introspecting the Database

One potential problem with this approach is the validation of user-defined classes. For example, what if the catalog asks for a column that has no getter, no compound column entry, and no database entry? We'd like this failure to be caught as early as possible, before (for example) the necessary data is pulled from the database. Also, if we simply query the database each time a column is requested, it will lead to a lot of overhead. It would be better to query the database only once to pull down all the required columns, after which the remaining derived columns can be computed.

This behavior is implemented within the InstanceCatalog metaclass through a clever bit of introspection. When the class is instantiated, the metaclass does a dry-run of outputting the table. To do this dry run, it temporarily replaces the real database with a duck-typed standin (_MimicRecordArray) which logs the columns that are requested. At the end of the dry run, we are left with a list of the columns that are required to be available in the database, and a quick check is performed to make sure the database has all the required columns. If not, an error is raised before any computation or data movement takes place.
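The idea behind the stand-in is roughly the following (a simplified sketch; the real _MimicRecordArray and schema check may differ):

import numpy as np

class ColumnLogger(object):
    """Duck-typed stand-in for the database record array: every column lookup
    returns an empty array and records the requested name, so a dry run of the
    catalog reveals exactly which database columns will be needed."""

    def __init__(self):
        self.referenced_columns = set()

    def __getitem__(self, column_name):
        self.referenced_columns.add(column_name)
        return np.empty(0)


# After the dry run, the logged names can be checked against the database
# schema ('available_columns' is a placeholder for however the DBObject
# exposes its column list):
# missing = logger.referenced_columns - set(available_columns)
# if missing:
#     raise ValueError("required columns not in the database: %s" % sorted(missing))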

Toy Examples

The result is a very intuitive Python API for defining various catalog types. Because the API takes advantage of Python's built-in method resolution order, it allows for very clean and powerful catalogs. Here is an example of a series of basic catalog types defined with this API:

class BasicCatalog(InstanceCatalog):
    """Simple catalog with columns directly from the database"""
    catalog_type = 'basic_catalog'
    refIdCol = 'id'
    column_outputs = ['id', 'ra_J2000', 'dec_J2000', 'g_raw', 'r_raw', 'i_raw']
    # transformations specify conversions when moving from the database
    # to the catalog.  In this case, we take RA/DEC in radians and convert
    # to degrees.
    transformations = {"ra_J2000":np.degrees,
                       "dec_J2000":np.degrees}

class AstrometryMixin(object):
    @compound('ra_corrected', 'dec_corrected')
    def get_points_corrected(self):
        ra_J2000 = self.column_by_name('ra_J2000')
        dec_J2000 = self.column_by_name('dec_J2000')
        # ... do the conversions: these are just stand-ins
        ra_corrected = ra_J2000 + 0.001
        dec_corrected = dec_J2000 - 0.001
        return ra_corrected, dec_corrected

class PhotometryMixin(object):
    @compound('g', 'r', 'i')
    def get_mags(self):
        g_mag, r_mag, i_mag = map(self.column_by_name,
                                  ['g_mag', 'r_mag', 'i_mag'])
        # ... do conversions here: these are just stand-ins
        g = g_mag + 0.01
        r = r_mag - 0.01
        i = i_mag + 0.02
        return g, r, i

class CustomCatalog(BasicCatalog, AstrometryMixin, PhotometryMixin):
    column_outputs = ['id', 'redshift', 'points_corrected', 'mags']
    transformations = {"ra_corrected":np.degrees,
                       "dec_corrected":np.degrees}

 

Now, to create a catalog, we connect to a database and call write_catalog:

db = GalaxyObj()
catalog = CustomCatalog(db)
catalog.write_catalog("out.txt")
# out.txt has the following columns:
# id redshift ra_corrected dec_corrected g r i 


2 Comments

  1. Comments from Jim Chiang

    italics: response to comments and actions

    General Comments:
    * Use of metaclasses does make it easier on the client/user to define
    subclasses with the desired properties.

    * What code was meant to be reviewed? I only looked in the rewrite
    subdirectories in the generation and measures trees.

    * Is any of the older code in those trees used anymore?

    This will be the primary codebase (needs to be transitioned)

    Database Interface:
    * Will users commonly create DBObject subclasses? If so, it would be
    useful to have the columns attribute able to be generated automatically
    via a db query.

    Possibly, but on their own databases. Jim: if this becomes a common case, this should be auto-generated from the schema

    * Implementing columns attribute requires access to the db schema. It's
    not clear to me how users are meant to find that info. Some general
    function for inspecting the db (i.e., which tables are available and their
    schema) would be useful.

    Simon: difficult to address, as conversions are done in the database, so it is hard to return the columns. Users need to know the columns.

    * The DBObject.spatialModel attribute (and its later usage in
    InstanceCatalog.get_spatialmodel) does not seem sufficiently general
    to handle FITS files as spatial models.

    Can you support FITS images in this model (rather than just stating the format as a string)? The attribute could be made a spatial model class. FITS files are treated as a special case in the phosim interface, which should be addressed.

     

    * The name "appendint" seems insufficiently descriptive of what it is
    supposed to be in the trim file entry. Maybe rename it to
    "objectTypeId"?

    Action: change this name (Simon)

    Instance Catalog Interface:
    * This implementation works well, and seems fairly intuitive to me,
    but I only tried out the functionality at the level of the toy
    examples plus adding some constraints in the write_catalog(...)
    method.

    * It would be nice to see an example where a trim file is produced.

    Action: example code to generate trim file (Simon)

    Non-design-related issues:
    * Documentation!

    Action: need to address this ASAP. Debbie: DM stack setup should make this easy. Simon: possible; do we need to go through the TCC? Jake: the Python community is dealing with this (e.g. Anaconda from Continuum)

    * External dependencies that are not natively included in the DM stack
    distributions need to be listed:
    slalib, sqlalchemy, pymssql, (probably others).

    * Even with a complete set of externals and DM stack installation,
    what packages need to be setup? It's possible to figure out these
    dependencies from the import errors, but these sorts of things
    should be well-documented.

    * slalib dependency should be addressed ASAP.

    * Connecting via
    mssql+pymssql://LSST-2:L$$TUser@fatboy.npl.washington.edu:1433/LSST
    never worked for me. See my email to lsst-imsim.
    This prevented me from determining if I had a working installation
    until I created my own set of db tables.

    * Unit tests should be implemented (and run!) so that (possible) bugs
    like the one that Debbie ran into ("AttributeError: 'list' object
    has no attribute 'dtype'") are less likely to happen.

    Action: set up unit tests. Jim: just have SQLite in the repository

    * For the unit tests, one could tuck a small sqlite db file in the
    repository and use that. With something like that, the lack of
    connectivity to the UW db wouldn't be an issue (a sketch of this idea is given after this comment list).

    * Almost all of the code snippets on the design review page had coding
    errors (I think I fixed all of them.) Part of the issue was
    transcription to Confluence from Jake's document (e.g., incorrect
    indent levels). Others were real syntax errors. Example code like
    this should be correct!
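
    As a sketch of what such a fixture-based test could look like (the table name, column name, and fixture file here are purely illustrative):

import os
import sqlite3
import unittest


class SqliteFixtureTest(unittest.TestCase):
    """Illustrative test against a small SQLite file shipped with the repo,
    avoiding any dependence on connectivity to the UW database."""

    def setUp(self):
        # 'test_catalog.db' is a hypothetical fixture file checked into the repository
        self.db_file = os.path.join(os.path.dirname(__file__), 'test_catalog.db')
        self.conn = sqlite3.connect(self.db_file)

    def tearDown(self):
        self.conn.close()

    def test_star_table_columns(self):
        # the 'stars' table and 'simobjid' column are illustrative names
        cursor = self.conn.execute("SELECT * FROM stars LIMIT 1")
        columns = [col[0] for col in cursor.description]
        self.assertIn('simobjid', columns)


if __name__ == '__main__':
    unittest.main()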

    Final Comments:
    Overall, the design looks solid and well-implemented, but because of
    the limited documentation and difficulties I had in getting a useable
    distribution and a useable set of db tables, I didn't have time to
    exercise or understand the code as well as I would have liked.

    In addition to smoothing over the documentation and distribution
    issues, more typical use cases should have been provided, ideally
    through the unit tests to some degree. I had never used catsim
    before, so my attempts to exercise the code may be too simplistic.

    Since this code is meant to be used in conjunction with phosim, a set
    of tests that generate output for all possible types of trim file
    entries should be implemented (with the output successfully fed to
    phosim). There may be other unsupported features like the spatial
    model issue for FITS files. I would say that this particular case
    should be resolved on the phosim side, i.e., by rationalizing the way
    those models are specified, but if not, then that would probably mean
    a non-trivial redesign of the spatial model handling in the new catsim
    code.

  2. Unknown User (djbard)

    Comments from Debbie:

    italics: response to comments and actions

    This is based on my experience setting up and using the catalogue generation code, with an aim to use the code to access my own galaxy catalogue database.

    General comments:

    ** This is a vast improvement over the previous code! In principle, I only need to modify one file instead of several (as was the case in the old code). In principle, this is very easy to modify. In practice, however...

    ** ...the issues I've run in to, as a user, are down to the lack of documentation and worked examples. I can provide examples based on my own experience, and hopefully other users can also contribute their own worked examples. It'd still be VERY helpful to have developer-provided documentation, including more comments in the code!

    Action: documentation and worked examples, including comments in the code. Debbie will provide some (need to seed them)

    Specific comments:

    ** Can't use it at SLAC because of slalib.

    ** It takes forever to check out measures - is there much extraneous information?

    Measures takes a long time to check out. Simon: would working from tags be acceptable?

    Action: will look at removing superfluous code from the git repo (to reduce the size of the repo)

    ** Why are there three packages associated with the catalogue simulation? It's not intuitive what each package does.

    Originally, this was to separate the database from interactions on catalogs. There is useful functionality in measures (photometry, astrometry, etc.) that should be separated out. It is OK to move some of the packages out of measures into standalone packages

    ** When running the catalogue generation example scripts, there are a lot of messages stating: "Either appendint or spatialModel has not been set.  Input files for phosim are not possible." These are not useful, since evidently a catalogue is produced anyway.

    Can't produce trim files for some of the source types (e.g. for the full galaxy). If the check is in the metaclass, it happens at the construction of the class, giving one warning message per type of object. Simon: will run this

    ** Bug in generation/tests_rewrite/runConn.py related to "AttributeError: 'list' object has no attribute 'dtype'"?

    Simon will look at this (test not keeping up with the base class)

    ** (perhaps beyond the scope of this review) Why must bulge and disk be defined separately? Could we not have a galaxy class for ellipticals, and a class for bulge+disk objects? This is particularly relevant when we're trying to do catalogue-level analysis, and we have to try to match up, by hand, the bulges and the disks that are specified at the same location. That's very inefficient.

    Because you query the bulge and then the disk. Simon: a galaxy class combines bulge and disk; it should be possible to subclass this

    Debbie: Total magnitude of the combination is difficult – helper scripts in measures package?

    Simon: Spatial model class may help with this

    ** It's very complicated to figure out how to make a catalogue from your own database. This is largely due to the total lack of documentation.

    ** Why are there so many different ID numbers? I needed to make a galtileid, even though I'm not tiling galaxies.

    Simon: That's clearly a bug

    ** I was unsuccessful in getting new quantities into my catalogue, i.e. shear parameters from my database.

    Quantities added to galaxy object definition, but the catalog contains zero for these parameters

    Maybe a bug with default value overwriting the input value?  Jake: this is probably a bug and should be addressed

    Debbie: perhaps talk about details offline

    Jake: the more user-specifiable something is, the more precedence it should have

    Simon: the metaclass overrides methods on construction.  This should be fixed.

    Documentation

    How to move it all over (all are in agreement)

    generation: should be easy - literally replace everything with rewrite

    measures: we might need to think about this.  Still need to create mixins for certain subpackages

    Nice to add unit tests before we move code over - timing with review in 3 weeks?