The Run-Chaining Problem

The Gen2 butler uses a "lazy" mechanism to connect its data repositories - a repository contains pointers to one or (rarely) more parent repositories, which are searched in order until a requested dataset is found.

In Gen3, we had long recognized that this was not sufficient as a way to combine collections; in other contexts, we needed something more fine-grained, and we planned to use a "database tag" mechanism to satisfy that need.  This would involve having a row in a database table for every dataset-collection combination, and the contents of a collection would be the set of rows for that collection.  We would still maintain the existing "one dataset per dataset type and data ID in a collection" invariant via a unique constraint on the tag table.

To simulate the Gen2 behavior in which one constructs a butler pointing to a leaf data repository and automatically gets access to its parents, we have thus far copied dataset references from input collections into output collections (i.e. created new tag rows).  But this isn't actually any better at rigorously handling the multiple-parent case than the Gen2 approach (as we once thought it might be).  Consider the following case:

  • visits 10 and 11 are run through single-epoch processing, putting calexps into collection A
  • visits 11 and 12 are run through single-epoch processing, putting calexps into collection B
  • patch 50 is coadded from visits 10, 11, and 12, using inputs from (A, B) in that order, putting coadds into collection C
  • patch 51 is coadded from visits 10, 11, and 12, using inputs from (B, A) in that order, putting coadds into collection D
  • patches 50 and 51 are run through coadd processing, using inputs from (C, D), putting an object catalog into collection E
  • what should Butler(collection="E").get("calexp", visit=11) return?
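
To make the ambiguity concrete, here is a minimal pure-Python sketch of an ordered collection search over the scenario above.  It is not the real Registry or Butler API - the dictionary of collection contents and the find function below are invented for illustration - but it shows that any single flattened search path has to commit to one ordering of A and B, and so cannot match the provenance of both coadds:

# Invented, minimal stand-in for the Registry (not the real daf_butler API).
# Collection contents map (dataset type, visit or patch) to a label recording
# which collection's processing produced the dataset.
contents = {
    "A": {("calexp", 10): "calexp made in A", ("calexp", 11): "calexp made in A"},
    "B": {("calexp", 11): "calexp made in B", ("calexp", 12): "calexp made in B"},
    "C": {("coadd", 50): "coadd made in C"},
    "D": {("coadd", 51): "coadd made in D"},
    "E": {("objectCatalog", 50): "catalog made in E", ("objectCatalog", 51): "catalog made in E"},
}

def find(search_path, dataset_type, data_id):
    """Return the first dataset found in an ordered search over collections."""
    for collection in search_path:
        found = contents[collection].get((dataset_type, data_id))
        if found is not None:
            return found
    return None

# E's inputs were C and D; C was built from (A, B) and D from (B, A).  Any
# single flattened search path must pick one ordering of A and B, so it can
# reproduce the provenance of patch 50's coadd or patch 51's, but not both:
print(find(["E", "C", "D", "A", "B"], "calexp", 11))  # calexp made in A
print(find(["E", "D", "C", "B", "A"], "calexp", 11))  # calexp made in B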

The fundamental problem here is of course that a collection search path - whether evaluated lazily or used to create tag rows aggressively - just isn't a substitute for fine-grained provenance (which we also plan to have in Gen3).  But it's still extremely useful - not least because it's usually what fine-grained provenance would give you - and hence something we'd want to support.  And while the tag approach isn't conceptually worse than the lazy approach (just no better), it also creates a ton of tag rows, most of which will never be used (raw datasets and master calibrations get tagged every time a new run is created, for example).

Problems with runs having collections...

The current Gen3 definitions associate a dataset with exactly one run, and this relationship can never be changed (the dataset can never be removed from the run or associated with a different run).  A dataset can also be associated with any number of collections, and those relationships can be changed at any time in any way.  But a run is also associated with a collection, which is used to add the run's datasets to that collection when they're inserted, and that opens the door to mayhem:

  • a dataset can be removed from its run's collection, without being removed from that run;
  • a dataset can be associated with a collection that is associated with some run other than its own run (and this is what always happens when we associate an input dataset into an output collection);
  • one can change the collection a run is associated with, and then use that collection in a new run.

This is at least a recipe for massive confusion.  Worse, because we also use the run's collection name in filename templates to (try to) ensure filenames are unique, and we don't change those filenames when we change what's in a collection, we don't actually have any guarantee of filename uniqueness at all - and that's a serious problem.
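
A tiny sketch of the last failure mode in the list above (the template string and collection names here are invented, not the real datastore template syntax): because the collection name is baked into the filename at write time and filenames are never rewritten, re-pointing a later run at an already-used collection name can produce a colliding path:

# Hypothetical filename template; only the shape matters for this argument.
template = "{collection}/{datasetType}/calexp-{visit}.fits"

# run1 writes visit 11 while associated with collection "nightly".
path_run1 = template.format(collection="nightly", datasetType="calexp", visit=11)

# The visit-11 calexp is later removed from "nightly" (but still exists in
# run1), so the one-dataset-per-data-ID constraint no longer applies, and a
# new run2 pointed at "nightly" writes its own visit-11 calexp.
path_run2 = template.format(collection="nightly", datasetType="calexp", visit=11)

print(path_run1 == path_run2)  # True: two distinct datasets, one filename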

...solved by runs being collections

The prototype I've recently put together for upcoming Registry changes addresses the problems in the run/collection relationship by making a run a (special) kind of collection.  Datasets are still added with exactly one run and associated with that run forever, but instead of also tagging the dataset, we treat that run association as a collection relationship in its own right (albeit one represented differently in the database, and hence one that has to be queried differently - but the Python code would take care of this distinction when generating queries).  And, crucially, it would be impossible to associate datasets into a run after they are inserted.
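
A minimal in-memory sketch of that distinction - not the prototype's actual schema or API; MockRegistry, CollectionType, and the other names are invented for illustration.  Run membership is a property of the dataset itself, fixed at insert time, while tagged membership lives in separate rows that can be added and removed freely (but never for a RUN collection):

from dataclasses import dataclass, field
from enum import Enum, auto

class CollectionType(Enum):
    RUN = auto()     # membership fixed at insert time, never changed
    TAGGED = auto()  # membership is arbitrary and freely editable

@dataclass(frozen=True)
class Dataset:
    dataset_type: str
    data_id: tuple
    run: str  # the one run this dataset belongs to, forever

@dataclass
class MockRegistry:
    collections: dict = field(default_factory=dict)  # name -> CollectionType
    datasets: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)  # (collection, type, data ID) -> Dataset

    def insert(self, run, dataset_type, data_id):
        assert self.collections[run] is CollectionType.RUN
        dataset = Dataset(dataset_type, data_id, run)
        self.datasets.append(dataset)
        return dataset

    def associate(self, collection, dataset):
        # Associating datasets into a RUN collection is simply not allowed.
        assert self.collections[collection] is CollectionType.TAGGED
        self.tags[collection, dataset.dataset_type, dataset.data_id] = dataset

    def find(self, collection, dataset_type, data_id):
        # Run membership is stored on the dataset; tagged membership lives in
        # the tag table.  A real implementation would hide this in its queries.
        if self.collections[collection] is CollectionType.RUN:
            return next((d for d in self.datasets
                         if d.run == collection
                         and d.dataset_type == dataset_type
                         and d.data_id == data_id), None)
        return self.tags.get((collection, dataset_type, data_id))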

This simplification solves all of the run vs. collection problems, but it means we can't associate the input datasets for some processing with the output run, unless we create both a run and a tagged collection to hold the outputs, with different names.  Having two names might seem redundant, but I'd argue that it's better to have different names for the recursive and non-recursive collections than to use the same name to mean something different in different contexts, and I think it's really good to have a single namespace for collections of different types: it means that all you need to rigorously define a group of datasets is a single string.

Putting it all together

Making runs a type of collection naturally opens the door to additional types of collections.  We've already implicitly discussed the other "normal" type of collection - tagged collections - and the prototype discusses a third (calibration collections) that isn't meaningfully different from tagged collections for the purpose of this discussion.

To switch processing back to lazy chaining, then, we can add a fourth type - we'll call them virtual collections for now - that is just a description of how to search other collections (of any type).  When we do some processing and create a run, we'd also create a virtual collection consisting of that run and all of its input collections.  Those would still be different collections, with different names - as noted in the last section, I consider that a feature, not a bug.

Implementing this should be pretty straightforward; we already support ordered multi-collection searches for datasets in Registry, because we allow multiple collections as inputs (and even different collection search paths for different dataset types) when generating QuantumGraphs.  It just makes sense to make that functionality available in Butler itself, and there's already a ticket for doing that (DM-19617).  Once that's done, we just need APIs and tables for defining and saving virtual collection definitions, and logic to recursively expand them into the kind of search path of non-virtual collections we already have code for.
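
Here is a sketch, under assumed names, of that last piece: a virtual collection is stored as nothing more than an ordered list of child collection names (which may themselves be virtual), and is expanded recursively into the flat, ordered search path of non-virtual collections that the existing query code already understands.  The collection names and the VIRTUAL mapping below are invented for illustration:

# Virtual collection definitions: name -> ordered child collections.  Anything
# not listed here is a non-virtual (run, tagged, or calibration) collection.
VIRTUAL = {
    "coadd-run-1/inputs": ["coadd-run-1", "calexp-run-A/inputs"],
    "calexp-run-A/inputs": ["calexp-run-A", "raw", "calib"],
}

def flatten(collection, _active=None):
    """Expand a (possibly virtual) collection into the ordered search path of
    non-virtual collections it stands for, de-duplicating repeats and refusing
    to follow cycles in the virtual-collection definitions."""
    if _active is None:
        _active = set()
    children = VIRTUAL.get(collection)
    if children is None:
        return [collection]  # non-virtual: it is its own search path
    if collection in _active:
        raise ValueError(f"cycle in virtual collection {collection!r}")
    _active.add(collection)
    path = []
    for child in children:
        for name in flatten(child, _active):
            if name not in path:
                path.append(name)
    _active.discard(collection)
    return path

print(flatten("coadd-run-1/inputs"))
# ['coadd-run-1', 'calexp-run-A', 'raw', 'calib']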

