The design on this page should be considered obsolete - it doesn't actually solve the problems I was hoping it would solve.  Some of the discussion is still useful, as I think it provides at least some hints at a general path forward.  In the meantime, I think I have a much smaller modification to the current in algorithm in hand that should address the DM-19988 issue (while leaving more challenging problems - particularly the "jointcal problem" noted below) for the future.

The Main Problem

DM-19988 - Getting issue details... STATUS  is a symptom of a fairly deep problem in the current QuantumGraph generation algorithm: the algorithm at present assumes the set of data IDs to be processed is defined by a strict intersection of all relevant dimensions, according to their (usually spatial) relationships.  Because dimensions also provide a form of discretization that expands what's covered by that intersection, this does what we want as long as there is just one pairwise relationship in the query (i.e. visit_detector_region to patch), but it falls apart when that's not the case, such as:

  • In order to process a visit+detector, we also need to do a spatial query on skypix to find reference catalogs.  But strict intersection can filter out any skypix entries that don't overlap the tract or patch of interest, even if they do overlap a visit+detector that overlaps the tract or patch of interest.  This is the actual DM-19988 case.
  • If some task (e.g. jointcal) needs full visits, including some visit+detectors that may not overlap a tract or patch of interest.  And if the graph includes single-epoch processing that feeds jointcal, we need to generate quantums for those non-overlapping detectors, too.

Side Problems

The QuantumGraph generation algorithm has a few other known issues we'd love to fix at the same time if a rework presents an opportunity for this:

  • It's slow.  The Big Join Query (hereafter BJQ) is so big and complex that it must be confusing the query optimizer pretty badly (or perhaps it's confusing us, by making it unclear which indexes we need to add to get the right plan), because these queries take much longer than they should.
  • The strict-intersection problem actually hit us once already, in the temporal matching of calibration_label to exposure.  We worked around that by adding the "per-DatasetType dimension" concept, which was an attempt at generalization that still feels more like a case-specific band-aid.  Temporally matching in calibration products feels a lot like spatially matching in reference catalogs, and a common solution to both problems would make me a lot more confident that such a solution actually is more than a case-specific workaround.
  • We currently assume that a Quantum should be generated if there is at least one input dataset for each of its input dataset types.  That's usually what we want, but tasks that expect multiple inputs of a particular type may require more than one to exist, and some tasks may not need any instances of a particular dataset type (or even expect none!), such as fringe frames in ISR for bluer filters.  We already have a workaround for the first case, by having Tasks without enough inputs produce dummy outputs that can be correctly consumed by later stages, and a special-case fix for the latter is underway on DM-16805 - Getting issue details... STATUS .
  • This isn't a problem per se, but we've identified recently that it'd be nice if QuantumGraphs were a bit more reusable.  That's mostly taken the form of talking about updating the Task software versions or configuration without needing to regenerate the graph, which I think is mostly  orthogonal to the topics on this page, but I think being able to apply the same QuantumGraph to different input collections would also be quite useful, and is potentially related.  That might look like first generating a QuantumGraph with no fully-resolved dataset_ids, just data IDs (even if the presence or absence of some inputs in one collection is still part of generating the graph), and then being able to "apply" (need a better verb for this) that graph to other collections, yielding a new QuantumGraph with dataset_ids and possibly some Quanta filtered out (either because they represent completed processing already done in that collection, i.e. the manual-retry case, or because some needed inputs aren't present in that collection).

Proposal

At least for now, I think we should just encode a lot of knowledge about the concrete set of dimensions we have into the algorithm.  We can look for ways to generalize that later, but I think this plan will work for all current and known-future PipelineTasks in any combination, and it may well be easier to explicitly manually expand a bit rather than fully generalize when we encounter PipelineTasks that don't fit (like CPP tasks, which wouldn't have fit with the old algorithm, either).

We'll start with a BJQ formed with the following logic:

  • Join in any instrument-dependent dimension tables that are part of any Task's quantum dimensions, and then recursively expand to include implied dimensions (e.g. visit implies physical_filter).
  • Join in any skymap-dependent dimension tables that are part of any Task's quantum dimensions, [and then recursively expand, but this will always do nothing].
  • If the joined dimensions include both "instrument" and "skymap", relate the above dimensions only via the broadest spatial relationship: visit_tract_join (other spatial join tables are ignored).  As before, we will filter out non-overlaps using the actual visit and tract regions in Python processing of the result rows.
  • If abstract_filter is included in any Task's quantum dimensions, join it in if it has not already been included (by expanding physical_filter).
  • Join in a dataset subquery for any input datasets that are not also outputs and are not marked as "prerequisite" (these are the "required inputs").
  • SELECT DISTINCT from the joined dimensions, but do not attempt to fetch matched dataset_ids at all - instead we aggregate over them with that DISTINCT.  (Would a GROUP BY on all selected fields be better or worse than DISTINCT for this?  Seems like it ought to be identical.)

The BJQ now thus yields DataIds only, not dataset_ids, and the first-stage QuantumGraph we create will as well.  We'll construct that from the query results with the following logic, iterating over the Tasks in the Pipeline in reverse order:

  • Construct a set of proposal Quanta by selecting rows from the BJQ outputs that are unique over just the Quantum dimensions of this Task.
  • Optionally filter the proposal Quanta by keeping only those that produce outputs that are consumed as inputs by a (later) Quantum already in the graph.  Whether we do this step should be controllable by the user, probably via some kind of per-DatasetType "minimal vs. maximal" setting - a Quantum is not filtered if it produces any DatasetTypes configured as "maximal".
  • Attach unresolved (no dataset_id) DatasetRefs to each Quantum for all regular inputs and outputs - not prerequisite inputs.

This gives us a QuantumGraph that should be valid for the collection(s) used to create it, as long as we continue to assume that at least one input of each dataset type is enough to make a valid Quantum.  I think it makes sense to follow that up with a new stage that attempts to resolve input DatasetRefs to the datasets in a particular collection (which may now differ from the one used to construct the initial QuantumGraph).  This stage would iterate through Tasks in forward order:

  • Aggregate across all Quanta for the Task the lists of data IDs for each input and output DatasetType for that Task.
  • Perform a bulk follow-up query for all instances of each DatasetType.  Cache the results and use these to avoid duplicate queries as iteration proceeds
  • Iterate over all of the Task's Quanta again, attaching resolved DatasetRefs and removing Quanta for which inputs are not present in either the follow-up query results or the set of output DatasetRefs from previous Quanta.  Resolve cases where some or all outputs already exist, raising exceptions or filtering out those Quanta (as indicated by user settings).

This last stage would now be one that could be repeated (starting from the original QuantumGraph) with different sets of input collections.

Implementation Details

I'm currently planning to put most of the initial code for this in pipe_base rather than daf_butler, which may involve having a lot SQLAlchemy-dependent code there.  That will involve having that new code access private members of SqlRegistry for now, but I already have plans to make public interfaces for Registry to obtain SQLAlchemy "selectables" (tables, views, subqueries) that can be used for this in the future.  Unlike the previous incarnation of this code, it seems too specific to QuantumGraph generation to belong in daf_butler.  We may then be able to retire or simplify a lot of the QueryBuilder stuff in daf_butler (as it now seems at least some of that is more specific to QuantumGraph generation that it seemed at the time it was added).

Limitations

This design prohibits (for now) Tasks with label, calibration_label, or skypix in their quantum dimensions (note that we have no such Tasks right now) or output DatasetTypes.  Input DatasetTypes may use these dimensions, regardless of whether they are prerequisites (but right now all DatasetTypes that do use these dimensions are in fact marked as prerequisites).

Future Extensions

  • It would make a lot of sense to delegate to each Task the responsibility of filtering out Quanta that don't have the right number of inputs, as it's the Tasks that can determine that best.  This would probably work best as an abstract classmethod on PipelineTask, but an abstract instance method or a helper class should be on the table, too.  A complication is that the code that does this will probably want to retrieve information from the Registry in order to know how many inputs they could be getting, such as the number of detectors that could appear in a visit, or the number of visits that could have been included in a coadd.  It's not clear how best to define an API to do that.
  • If we have two stages of QuantumGraph generation, and the dataset_ids and prerequisite inputs are only included at the second stage, we should consider whether we should use the same QuantumGraph and Quantum classes to represent both stages.  One possibility might be to have a class hierarchy for each, separating common functionality into base classes.  This is closely related to the question of whether DatasetRefs should always represent "resolved" datasets.
  • If the QuantumGraph BJQ only ever uses visit_tract_join for spatial relationships, we may be able to drop the other join tables.  This is closely related to the refactoring underway (but now probably deferred until  DM-19988 - Getting issue details... STATUS is done) on DM-17023 - Getting issue details... STATUS .


  • No labels

3 Comments

  1. I need to read more to understand all details, for now just a general comment. My understanding (maybe incorrect) is that our biggest issue is that dimension constraints are applied to all dataset types at once, e.g. if user expression limits tracts/patches then it is applied not only to output datasets but also to all inputs. One way to fix this issue is to apply constraints to only specific dataset types, which I think effectively makes all dimensions "per-DatasetType dimensions". And I'm sort of convinced that this should be a correct way of handling things. (If we do it this way it means user expression will be more complicated, instead of "dimension = X" it would have to specify "DatasetType.dimension = X").

    1. Thanks, Andy Salnikov.  I've now gotten far enough along in implementing the above plan to realize it just doesn't work very well (if it works at all, it'd be because I added a bunch of little special-case hacks that may only apply to the current pipeline).  But I think you're definitely on to something with the solution being related to expanding the per-DatasetType dimensions concept.  And I think we can avoid making the user expressions more complicated in the common case by implicitly assuming that unqualified dimensions refer to the quantum data IDs of the last task in the pipeline, which would usually be correct.

      The thing that scares me about this approach is that it seems to blow up the size of the big query even more - it seems naively like it'd need to involve joining multiple aliased copies of each dimension table into the query.  I think we'd need to work pretty hard to (dynamically!) pre-simply the SQL to make that work.

      1. I thought about this a bit further, and I think there's a hybrid between what I originally proposed here and your per-DatasetType dimensions that might work.

        To start with, I think they already have one thing in common: my focus on a big join query that only includes quantum dimensions effectively declares all other dimensions to be per-DatasetType (and resolved independently per-Quantum).  That yields the correct behavior for both temporal calibration lookups and spatial refcat lookups, so I think that probably is a safe definition of a per-DatasetType dimension.  It also helpfully defines a set of dimensions that are not per-DatasetType, and we can restrict the user expression to just those.  And that's why I think my proposal on this page does effectively solve the refcat problem (and generalize the solution to the calibration lookup problem.  But it does that because both of those involve pre-existing inputs that are never produced by this Pipeline; it can't solve (on its own) the jointcal problem, where the datasets that would be filtered by strict dimension intersection are both produced and consumed by the Pipeline.

        To handle the jointcal case without blowing up the big join query with aliased copies of all of the dimensions, I'm now thinking about trying to chunk up the Pipeline into sub-graphs (Pipeline-style graphs, not QuantumGraph-style graphs), each of which would form a subquery in the big join query, with its own aliased copies of at least some dimensions.  I think the query for the current Pipeline (+jointcal) can probably be split up into only 2-3 sub-graphs, which is much better than the many tens of dataset types.  What I haven't fleshed out is:

        • How do you define the Pipeline sub-graphs?  I think you need to at least draw boundaries whenever there is a spatial join involved between a DatasetType's dimensions and the quantum dimensions of the Task that consumes or produces it.  There may need to be other kinds of boundaries as well.
        • Does each sub-graph sub-query have its own aliased copy of dimensions involved?  Or can some common dimensions ever be pulled out of the sub-queries into the larger query?  I think that's actually starting down the road to a recursively nested algorithm: sub-graph boundaries only involve changes in some dimensions, so any group of sub-graphs in which a set of dimensions stays constant can be considered a parent sub-graph.  That actually  ties in nicely with the per-DatasetType-dimensions idea if you think of a single DatasetType as just the lowest level of the hierarchy.

        I need to go run some careful thought experiments on this idea now.  Right now it feels both quite promising and an unfortunately large departure from the way we'd conceived of this problem in the past.