Nomenclature and style changes

Before Gen2 deprecation, or not at all.

I'm pretty close to merging DM-21147, which is what I've been considering the blocker for starting to use (or even moving completely to) snake_case in daf_butler. I'm aware that we haven't all been on the same page about this (others have started using snake_case here and there already), but I think we should make a joint decision one way or the other soon.

I've done all of my prototyping on this branch in snake_case, and I actually found it remarkably helpful; I have an irrational aversion to long camelCase names, especially those with conjunctions, and I've learned to accept that I write much better method and especially attribute/property names when I do it in snake_case (first).  If we agree to stick with camelCase, I'll hold my nose and just directly translate the names I've come up with on this branch and force myself not to revisit them.

With or without the style change, I'd also really like to rename the DimensionGraph class and any attributes that use the "graph" nomenclature (e.g. DataCoordinate.graph). While it's possible to view this class as representing a graph conceptually, there are no Edge or Vertex classes and no way to really use it as a graph, and that makes the name at best unhelpful and probably actively confusing. The name I've been using in prototyping is DimensionGroup, which isn't precise, but it's at least not misleading; I don't think a precise name is really possible here - the object is somewhat set-like, but it can't actually implement the Python collections.abc.Set interface while maintaining its "automatically expand to include dependencies" invariants, and there are other containers of the same objects (e.g. NamedValueSet[Dimension]) that are in active use in the code for contexts where a true set is needed.
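To make that concrete, here's a minimal sketch of why the expansion invariant and the Set interface can't coexist; the dependency data and class body below are invented for illustration, not the real daf_butler definitions:

def expand(names, dependencies):
    """Return names plus all of their transitive required dependencies."""
    result = set(names)
    while True:
        more = {d for n in result for d in dependencies.get(n, ())}
        if more <= result:
            return frozenset(result)
        result |= more

DEPENDENCIES = {"visit": {"instrument"}, "patch": {"tract"}, "tract": {"skymap"}}

class DimensionGroup:
    """Set-like, but every construction re-applies the expansion invariant."""
    def __init__(self, names):
        self.names = expand(names, DEPENDENCIES)

# Set algebra can't hold: subtracting "instrument" is silently undone,
# because "visit" pulls it right back in, so difference (and hence the
# full collections.abc.Set contract) can't be implemented faithfully.
group = DimensionGroup({"visit"})                      # {visit, instrument}
diff = DimensionGroup(group.names - {"instrument"})
assert "instrument" in diff.names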

Dimension definition, persistence, and versioning

Start before Gen2 deprecation, finish after.

At a middleware telecon a few weeks ago (2020-08-20, I think), we generally agreed it'd be best to start saving the dimension definitions into the Registry database at repo construction, instead of either copying the dimensions configuration into the repo or loading it from daf_butler. This will ensure [this aspect of] the repository configuration can never get out of sync with the database content, while also making it unnecessary to migrate repositories immediately when the default/recommended dimensions in daf_butler change. I've sketched out a way to persist a DimensionUniverse to a suite of four new tables, and I think it's worth trying to make this change before we declare schema stability, to avoid a migration that includes a fairly fundamental change to how we version schemas.
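For concreteness, here's roughly the shape such a scheme could take; the four table and column names below are invented for illustration, not the ones on the branch:

import sqlalchemy

metadata = sqlalchemy.MetaData()

# One row per dimension.
dimension = sqlalchemy.Table(
    "dimension", metadata,
    sqlalchemy.Column("name", sqlalchemy.String(64), primary_key=True),
    sqlalchemy.Column("doc", sqlalchemy.Text),
)
# Primary/alternate key fields for each dimension.
dimension_key = sqlalchemy.Table(
    "dimension_key", metadata,
    sqlalchemy.Column("dimension", sqlalchemy.ForeignKey("dimension.name"), primary_key=True),
    sqlalchemy.Column("field", sqlalchemy.String(64), primary_key=True),
    sqlalchemy.Column("dtype", sqlalchemy.String(32)),
)
# Non-key metadata columns.
dimension_metadata = sqlalchemy.Table(
    "dimension_metadata", metadata,
    sqlalchemy.Column("dimension", sqlalchemy.ForeignKey("dimension.name"), primary_key=True),
    sqlalchemy.Column("field", sqlalchemy.String(64), primary_key=True),
    sqlalchemy.Column("dtype", sqlalchemy.String(32)),
)
# Required/implied relationships between dimensions.
dimension_dependency = sqlalchemy.Table(
    "dimension_dependency", metadata,
    sqlalchemy.Column("dimension", sqlalchemy.ForeignKey("dimension.name"), primary_key=True),
    sqlalchemy.Column("depends_on", sqlalchemy.ForeignKey("dimension.name"), primary_key=True),
    sqlalchemy.Column("implied", sqlalchemy.Boolean, nullable=False),
)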

I'm also proposing that we go further and make dimension definitions "dynamic" database content that can be modified (at least in "additive" ways) after repository creation, via Registry methods patterned after those for dataset type and collection registration. I think this moves the dimension definitions outside the purview of whatever Alembic-based versioning/migration system we devise (though there would still be a DimensionManager class that is part of that system, representing the mapping from conceptual dimension definitions to concrete DDL).
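In that world, additive changes might look something like this; the method names and signatures here are invented, just patterned after the existing dataset-type and collection registration methods:

# Hypothetical Registry calls; nothing here is an existing daf_butler API.
registry.register_dimension(
    "weather_station",
    keys={"id": int},
    metadata={"latitude": float, "longitude": float},
    requires=["instrument"],
)
# An additive change to an existing dimension: one new metadata column.
registry.register_dimension_metadata("exposure", {"humidity": float})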

The big advantage of this is that additive dimension changes of the sort I expect to be most common - adding new dimensions or new metadata columns to existing dimensions - become straightforward per-repo Registry operations, rather than something we'd consider a schema version change or migration. This also lets users experiment with their own dimensions in personal repos before we add them to multi-user repos, without having to modify daf_butler or "claim" a schema version number for those experiments. And other surveys like PFS or SPHEREx that want to define their own dimensions that are not necessarily a superset of LSST's would be able to do so simply by starting from their own initial dimensions config.

The other advantage is just that it should simplify the rest of the schema versioning/migration system by removing something that never fit well with it.

The downside is that the complexity of managing dimension definition changes doesn't go away - it just goes elsewhere. That's not a big deal for isolated individual changes, which this proposal makes much easier - or at least no harder, in the case of non-additive changes that would require tricky hand-written migration scripts in any proposal. But it may make it harder to deal with the _sequences_ of changes that represent a more general diff between two sets of dimension definitions. My impression has been that Alembic does provide functionality to help with that problem (though I'm not sure exactly what), and this proposal may make it harder or even impossible to use that for dimensions changes.

After the initial work to persist the dimensions to the database, I don't think the further work to make the content dynamic has to be a schema change; this depends on how it affects the schema-versioning part of the schema, and whether we can future-proof that.  Ideally we'd do that initial work before declaring schema stability and only make dimensions dynamic after Gen2 deprecation, while praying that no one needs a dimensions change in that time window.

Alias and label dimensions

After Gen2 deprecation.

By making the DimensionUniverse more dynamic and mutable, we also make it easier to support more lightweight dimensions that don't have database representations. On the prototyping branch, I've sketched out both a "label" dimension - just an opaque string that can mean whatever the user wants it to mean - and a system of "alias" dimensions, which are essentially a way to give one or more existing dimensions a new name in a particular dataset type. That opens the door to having the same (original) dimension appear multiple times in the same data ID, which may provide another way to deal with pairwise processing. Aliases also offer a way to give a data ID key a more context-specific name (which would be particularly useful for the label dimension). In the database, alias dimensions would have no tables of their own, but could still appear in queries as new subqueries of the original dimension tables. The label dimension would be represented in the database much like the skypix (e.g. htm) dimensions: it would appear as a column in the dataset/collection tables that relate datasets to their data IDs, but it would have no table of its own, and hence it would not be usable for PipelineTask output datasets until additional changes to the QuantumGraph generation algorithm are made (i.e. on DM-21904).
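A rough sketch of how these might be used follows; every name here is hypothetical, as none of this is implemented:

# Two aliases for "visit" let the same underlying dimension appear twice
# in one data ID, e.g. for pairwise visit processing (hypothetical API).
universe.register_alias("visit_a", original="visit")
universe.register_alias("visit_b", original="visit")
pair_data_id = {"instrument": "HSC", "visit_a": 1228, "visit_b": 1230}

# A label is just an opaque string key; aliasing it gives a dataset type
# a context-specific data ID key (again, hypothetical names throughout).
universe.register_alias("ssp_run", original="label")
solar_system_summary = DatasetType(
    "ssp_summary",
    dimensions=["instrument", "ssp_run"],
    storageClass="StructuredDataDict",
)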

I don't think these features are critical for Gen2 deprecation, and I don't plan to implement them anytime soon; the prototyping I've done already is enough to assure me that they can be added later with little disruption.

Explicitly-materialized spatial relationships

Before Gen2 deprecation.

A while ago we discussed ways to improve our spatial joins, and decided against trying to use database-native spatial functionality for this, in favor of (someday, as needed) explicitly materializing desired spatial joins up front. Users would be required to declare via some registration operation that they were interested in a combination of skymap and instrument in order to use them together (same for HTM levels, etc).

I think the time to do this has come, but not (primarily) to speed up queries, though it may help there too. This approach also solves a number of other problems, including the one most directly blocking work on CALIBRATION collections: right now, QG gen queries on skypix/HTM datasets (i.e. reference catalogs) go through the same single-row `queryDatasets` call as calibration lookups, and the QG gen code can't easily tell which it's doing in any particular call, at least not without hard-coding a bunch of dataset types. So I can't vectorize the calibration lookups (at least not easily) without also vectorizing the refcat lookups, and I can't vectorize the refcat lookups because the database doesn't have enough information about general HTM-observation overlaps. By materializing all relationships (including those between HTM levels and instruments), we give the database the information it needs. Materializing spatial relationships is not the only solution to this problem, but it seems much simpler than any others I could think of, all of which involve splitting join logic between the database and our Python code.
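The registration operation itself could be quite small; in the sketch below the method name is invented, and only RelationshipCategory is borrowed from the query example later on this page:

# Hypothetical: declare up front which combinations will be joined spatially,
# so the overlap rows can be materialized once at registration time.
registry.register_relationship(RelationshipCategory.SPATIAL, "visit", "htm7")
registry.register_relationship(RelationshipCategory.SPATIAL, "visit_detector_region", "patch")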

The other problem this solves is going from approximate/conservative spatial relationships (currently all we can compute in the database) to more precise ones. Our query layer currently has to pull down all regions from any table involved in an intersection whenever we have a spatial query, and then reject some result rows by doing its own more precise overlap tests. That added some complexity to the query system that was originally tolerable, but with the addition of temporary tables things got much worse: because we cannot evaluate our queries fully within the database, queries evaluated into temporary tables cannot yield the same results as queries evaluated directly into Python objects. That behavior is also inefficient: the temporary tables have more rows than they need, because we cannot reject rows based on the more precise regions. By fully evaluating more precise overlaps up front when users register a relationship, the database gains all of the information it needs to fully evaluate queries, allowing us to drop the Python-side region filtering entirely.
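What evaluating those precise overlaps up front might look like at registration time, sketched with stand-ins for the pieces we haven't designed yet:

# Hedged sketch: regions_overlap() stands in for whatever exact predicate we
# use (e.g. built on lsst.sphgeom), and insert_overlap_rows() is hypothetical.
# A real implementation would prefilter candidate pairs with the coarse
# pixelization-based overlaps instead of testing every pair.
def materialize_spatial_overlaps(registry, element_a, element_b, regions_overlap):
    rows = [
        {"a": rec_a.dataId, "b": rec_b.dataId}
        for rec_a in registry.queryDimensionRecords(element_a)
        for rec_b in registry.queryDimensionRecords(element_b)
        if regions_overlap(rec_a.region, rec_b.region)
    ]
    registry.insert_overlap_rows(element_a, element_b, rows)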

Virtual SkyMap tables

After Gen2 deprecation.

Materializing spatial joins is also one step towards removing the large (often full-sky) tract and patch tables from the database, and instead relying on our existing SkyMap classes that can compute everything we need on the fly: the join tables would still exist, and provide everything we need to relate tracts and patches to other dimensions in queries.

There are a few other steps we'd need to take to enable this, however, and that's why I'm not proposing this as a pre-Gen2-deprecation goal:

  • We'd need to enable Registry to construct a SkyMap instance from information in the repository, keyed by information in the Registry skymap table. That could be a pickled SkyMap blob, some new database-oriented SkyMap persistence format, or maybe even a way for Registry to retrieve the SkyMap from a Datastore. In any case, it's not as simple as Instrument, because a SkyMap needs both a class name and a configuration object to be constructed.
  • We need to be sure any tract/patch columns that might appear in a dimension query expression can be related to their primary keys. I think that's mostly just patch cell coordinates, for which the relationship is trivial (see the sketch after this list); the tricky part is injecting that relationship into the query expression, not the actual calculation.
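For the second point, the relationship in question might look like this (a hypothetical indexing convention, just to show that the calculation is indeed the easy part):

# Hypothetical patch-indexing convention; the real one belongs to SkyMap.
def patch_index(cell_x, cell_y, num_patches_x):
    """Sequential patch ID from its cell coordinates within a tract."""
    return cell_y * num_patches_x + cell_x

def patch_cells(index, num_patches_x):
    """Inverse mapping, as a query expression referencing cells would need."""
    return index % num_patches_x, index // num_patches_x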

Expanding the Query DSL

Start before Gen2 deprecation, finish after.

We've had a custom (but very SQL-like) language for WHERE expressions for a long time, and I think it's been a huge success. On DM-24938 I added some classes to represent (lazily) the outputs of various types of queries, with interactions that could be considered the beginning of another domain-specific language (DSL) for queries: one that allows query results to be used as subqueries in new queries, hence providing more flexibility for JOIN clauses. Unlike our WHERE expression language (but like the SQLAlchemy system they are both built on), this one is based on interactions between Python objects, not string parsing. It's also extremely limited right now, but I think extending it will provide a much better interface going forward, especially compared to the "lots of kwargs" approach taken by queryDatasets and queryDataIds right now. I think an example of how I expect queries to look will explain this better than prose or APIs:

# Let's look for reference catalogs in a personal collection and a shared one...
refcats = registry.query_datasets(
    "gaia_dr2",
    collections=["u/jbosch/DM-26336", "refcats"],
).related_to(
    # ...whose HTM data IDs overlap visit+detector data IDs...
    registry.query_data_ids(
        ["visit", "detector"]
    ).related_to(
        # ...that overlap patches where there are coadd datasets in a collection...
        registry.query_datasets(
            "deepCoadd",
            collections=["HSC/RC2/w_2020_40"]
        ).get_data_ids()
    ).intersect_on(
        # ...but compute that overlap by just intersecting tract and visit regions,
        # not visit+detector regions and patch regions (which is the default)...
        RelationshipCategory.SPATIAL,
        instrument="visit",
        skymap="tract",
    )
# ...and just return the refcat from the first collection, not both, when both
# have a refcat with the particular data ID.
).deduplicate()

This example is a bit contrived (to show more complexity than I think we'll need in all but the rarest cases), and I think the method names need some work (we may also want to consider a bit of operator overloading). But overall I think this is quite doable, and after an initial refactor of the query system (which I've prototyped pretty thoroughly already, and want to do soon for QG gen optimization work), I think we can add most of this functionality gradually instead of trying to deliver it all by Gen2 deprecation day.

The new interface here isn't a high priority (in an ideal world it might be, but as this isn't a major pain point, I don't think it can be a priority for Gen2 deprecation). But it's something I wanted to at least have a vision for now, because rewriting the logic behind it does need to be a high priority - both QG gen optimization and DM-21904 require relaxing some of our assumptions about how spatial and temporal dimensions are related in queries, making those relationships more controllable by callers even if the default behavior remains what it is now. That kind of control would be awkward (at best) to fit into the current interface, and given the lower-level changes needed to implement it, I don't think it's worth trying.
