Goals

I believe the architecture described below achieves all of these goals (even if it is sometimes vague about the details).  It also addresses (in part) some issues I didn't set out to work on at all:

Overview and Glossary

The new Registry is a concrete class that defines the only interface (of those described here) that is considered public to code outside daf_butler.  It is a thin wrapper around RegistryLayer that performs error checking, produces friendly error messages, and applies type coercion before delegating any real work to the classes below.  In the near term, Registry will have a single RegistryLayer; in a multi-layer Registry future, it will have an ordered list of RegistryLayer instances and delegate to them in turn.

RegistryLayer is a simple struct-like class that aggregates a Database instance and several "manager" classes responsible for different kinds of entities that can exist in a Registry.
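
As a rough illustration of that struct (the attribute and manager names here are placeholders, not necessarily the ones in the prototype), RegistryLayer might look something like this:

from __future__ import annotations

from abc import ABC
from dataclasses import dataclass


class Database(ABC):
    """Placeholder for the Database ABC described below."""


class CollectionManager(ABC):
    """Placeholder for one of the manager ABCs."""


class DatasetManager(ABC):
    """Placeholder for one of the manager ABCs."""


@dataclass(frozen=True)
class RegistryLayer:
    """Aggregates the Database and manager instances for a single layer."""
    db: Database
    collections: CollectionManager
    datasets: DatasetManager
    # ...additional managers (dimensions, opaque tables, quanta) would follow.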

Database is an abstract base class whose instances represent a particular schema/namespace and provide low-level database operations.  Database subclasses correspond to different database engines (SQLite, Oracle, etc).

The manager classes are all abstract base classes that act as containers and factories for an associated "record" or "records" class (also an ABC, unless otherwise noted below):

All manager and records classes have a reference to the Database that backs their layer.  Registry interacts directly with the manager and records objects, but rarely (if at all, at least after construction) with Database.

The prototype itself is on branch u/jbosch/prototyping, and mostly in the daf.butler.prototyping subpackage (more direct pointers to useful content appear later in this document).  None of this code can be expected to run or even import cleanly, but the documentation for the interface classes is ready for review.  The implementation classes are best considered an in-depth thought experiment to check that the interfaces were viable.

Prototype Details

Changes to Concepts and Primitives

The current codebase has classes for Run and Quantum that subclass Execution (which is not used otherwise), while collections are always represented as strings.  The relationship between a run and its collection is ambiguous and ultimately unsound - it provides a loophole that prevents us from guaranteeing unique filenames (at least).  The prototype drops the Execution class (and table), and instead treats a run as a kind of collection (see Dataset/Collection Constraints).  Both runs and collections are now represented by strings in high-level code, and by CollectionRecord and RunRecord (which provide access to the integer surrogate ID) within Registry and its component classes.  The host and timespan information provided by Execution has simply been moved directly into the run and quantum tables.
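
A minimal sketch of what those records classes could look like, assuming dataclass-style records with illustrative attribute names (the prototype's actual classes differ in detail):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class CollectionRecord:
    """In-Registry representation of a collection (sketch only)."""
    id: int    # integer surrogate key used in foreign keys
    name: str  # the string used by high-level code


@dataclass(frozen=True)
class RunRecord(CollectionRecord):
    """Runs are collections that also carry the provenance fields formerly
    held by Execution."""
    host: Optional[str] = None
    timespan: Optional[Tuple[datetime, datetime]] = None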

Registering collections (not just runs) is now explicit, not something that happens implicitly when a collection is first used.

The prototype codebase currently uses new DatasetHandle classes instead of DatasetRef; DatasetHandle is split into several classes that represent different levels of knowledge about a dataset (in particular, "resolved" means we know its ID and origin, and that it exists).  I'm not sure, in retrospect, that the split into multiple classes was worthwhile, and it may be reverted, but this is largely tangential to the rest of the architecture, and it is not worth reviewers' time to think about.

The prototype codebase also uses special classes to represent iterables of DatasetHandle and DataCoordinate.  These were motivated in large part by a desire to have containers/iterables that remember which "optional" aspects their elements have - in particular, whether datasets are resolved and data IDs are expanded.  The existing prototype code doesn't really use them much, and they may not be a part of the final design.  I'd like to see how they interact with the high-level Registry interface and particularly its query operations before deciding; they could also be a very nice way to represent query results.

Dataset/Collection Constraints

The prototype defines two new enumerations that together describe the kinds of constraints we place on the datasets that can be in a collection.

First, there are three types of collections:

class CollectionType(enum.IntEnum):
    """Enumeration used to label different types of collections.
    """

    RUN = 1
    """A ``RUN`` collection (also just called a 'run') is the initial
    collection a dataset is inserted into and the only one it can never be
    removed from.

    Within a particular run, there may only be one dataset with a particular
    dataset type and data ID, regardless of the `DatasetUniqueness` for that
    dataset type.
    """

    TAGGED = 2
    """Datasets can be associated with and removed from ``TAGGED`` collections
    arbitrarily.

    For `DatasetUniqueness.STANDARD` dataset types, there may be at most one
    dataset with a particular type and data ID in a ``TAGGED`` collection.
    There is no constraint on the number of `DatasetUniqueness.NONSINGULAR`
    datasets.
    """

    CALIBRATION = 3
    """Each dataset in a ``CALIBRATION`` collection is associated with a
    validity range (which may be different for the same dataset in different
    collections).

    There is no constraint on the number of datasets of any type or data ID
    in ``CALIBRATION`` collections, regardless of their `DatasetUniqueness`
    value.

    There is no database-level constraint on non-overlapping validity ranges,
    but some operations nevertheless assume that higher-level code that
    populates them ensures that there is at most one dataset with a particular
    dataset type and data ID that is valid at any particular time.
    """

and (in this respect) two different kinds of dataset types:

class DatasetUniqueness(enum.IntEnum):
    """Enumeration indicating the kind of uniqueness constraints to
    define on data IDs of this dataset type.
    """

    STANDARD = 2
    """There may only be one dataset with this dataset type and a particular
    data ID in a `~CollectionType.TAGGED` collection.
    """

    NONSINGULAR = 3
    """There may be any number of datasets with this dataset type and a
    particular data ID in a `~CollectionType.TAGGED` collection.
    """

This table summarizes the constraints we place, given both of these:


                              DatasetUniqueness.STANDARD    DatasetUniqueness.NONSINGULAR
CollectionType.RUN            data ID + dataset type        data ID + dataset type
CollectionType.TAGGED         data ID + dataset type        no constraint
CollectionType.CALIBRATION    no constraint*                no constraint

* there are no constraints on CALIBRATION collections in the database, but we expect the higher-level code that produces them and assigns validity ranges to ensure that there is no duplication within each validity range when that makes sense for the dataset type.

Implications for Calibration Products Production

The introduction of CALIBRATION collections with validity ranges is tied to the removal of validity ranges from the dimensions system, where they are currently linked to the calibration_label dimension.  Whether the calibration_label dimension itself is removed as well depends on how we want to use RUN collections when building master calibrations.

Once a validity range is known for a master calibration, it can be associated with a CALIBRATION collection, and the validity range attached to that association can be used along with the rest of its data ID (e.g. instrument+detector for bias, instrument+detector+physical_filter for flats) to retrieve it at any given point in time.  I am relatively confident that this is a solid model for using master calibration datasets, at least once a complete suite of calibrations has been produced.
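
A minimal sketch of that lookup model, assuming the associations in a CALIBRATION collection have already been fetched as (dataset, validity_begin, validity_end) tuples for the given dataset type and data ID; all names here are hypothetical:

from datetime import datetime
from typing import Optional


def find_calibration(associations, dataset_type, data_id, obs_time: datetime) -> Optional[object]:
    """Return the dataset whose validity range contains ``obs_time``.

    ``associations`` is assumed to be an iterable of
    (dataset, validity_begin, validity_end) tuples already restricted to one
    CALIBRATION collection, dataset type, and data ID.
    """
    matches = [d for d, begin, end in associations if begin <= obs_time < end]
    # The database does not enforce non-overlapping validity ranges; we rely
    # on the higher-level code that populated the collection to ensure at most
    # one match per point in time.
    if len(matches) > 1:
        raise RuntimeError(f"Overlapping validity ranges for {dataset_type} @ {data_id}.")
    return matches[0] if matches else None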

Before it is associated into a calibration collection, however, we still need a way to uniquely identify a master calibration, at the very least so we can later identify which dataset we're associating with each validity range.  In Gen2, we used a calibDate field for this, which no one was happy with, and that's what the calibration_label was intended to replace (with the content of that label still very much TBD; it could be date based, or computed from the constituent visits, or a counter...): the full data ID for a bias would be instrument+detector+calibration_label.  With calibration_label no longer needed for validity ranges, we now have two ways to proceed:

The Database Class and SQLAlchemy Usage

Database fully encapsulates the SQLAlchemy connection and engine objects, requiring all write operations to go through specialized methods.  That works because we seem to need only relatively simple versions of these operations, and in at least some cases it's necessary because we depend on non-standard functionality that most database engines nevertheless support, but in different ways (e.g. sync, replace).  Encapsulating write operations also gives Database complete control over transactions, including the handling of operations that can interrupt transactions (both table creation and sync, which may require retries inside their own transactions to be safely concurrent on some database engines).  We consider performing any operation that could interrupt a transaction within a transaction block to be a logic error, regardless of whether it actually would do so in the database in question.  Happily, all of our high-level operations that need to be implemented on top of transaction-breaking low-level operations (e.g. registering new dataset types, adding new collections) seem to be the kind where this is not a major problem.
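
A sketch of how that logic-error rule could be enforced; the method names and signatures here are illustrative rather than the prototype's actual Database API:

from contextlib import contextmanager


class Database:
    """Sketch of guarding transaction-interrupting operations."""

    def __init__(self):
        self._in_transaction = False

    @contextmanager
    def transaction(self):
        # Nested transactions would need SAVEPOINTs; elided in this sketch.
        self._in_transaction = True
        try:
            yield
        finally:
            self._in_transaction = False

    def _assert_not_in_transaction(self, op: str):
        # Raising unconditionally treats this as a logic error, regardless of
        # whether the current engine would actually break the transaction.
        if self._in_transaction:
            raise RuntimeError(f"{op} may interrupt a transaction and is not "
                               "permitted inside a transaction block.")

    def ensureTableExists(self, spec):
        self._assert_not_in_transaction("table creation")
        ...  # engine-specific DDL, possibly with retries in its own transaction

    def sync(self, table, keys, extra):
        self._assert_not_in_transaction("sync")
        ...  # engine-specific get-or-insert, possibly with retries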

Database does not attempt to provide much of an abstraction around SQLAlchemy's tools for constructing and executing SELECT queries.  Fully encapsulating SQLAlchemy for SELECT queries would be a huge waste of our time, as we'd essentially be reimplementing a large chunk of SQLAlchemy's own abstractions.  And we don't have to: while we will be writing fairly complex queries, we don't expect to need extensions to standard SQL.

Table definition and creation follows a middle path.  We have our own classes (not just in the prototype, but on master) for representing table, field, and foreign key specifications.  The manager and records objects use these to describe what they need to a Database, which can then customize the translation to SQLAlchemy before returning the SQLAlchemy representation to the manager and records objects.  A major change from the current Registry implementation is that the schema is now defined programmatically by the manager and records classes, instead of in YAML configuration.  This keeps them close to the code that depends on them, and it allows us to encapsulate normalization and other schema design changes behind stable Python interfaces.
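
For illustration only, a manager might describe a table it needs with specification classes along these lines; the classes below are simplified stand-ins, not the actual specification classes on master:

from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldSpec:
    """Stand-in for a field specification."""
    name: str
    dtype: str
    primaryKey: bool = False
    nullable: bool = True


@dataclass
class TableSpec:
    """Stand-in for a table specification (foreign keys elided)."""
    fields: List[FieldSpec] = field(default_factory=list)
    unique: List[tuple] = field(default_factory=list)


# A manager describes the table it needs like this, then hands the spec to
# Database, which translates it to SQLAlchemy in an engine-aware way.
COLLECTION_TABLE_SPEC = TableSpec(
    fields=[
        FieldSpec("collection_id", "int", primaryKey=True),
        FieldSpec("name", "str", nullable=False),
        FieldSpec("type", "int", nullable=False),
    ],
    unique=[("name",)],
)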

The result of these choices is that sqlalchemy.sql.FromClause and sqlalchemy.schema.Table objects (and the objects that comprise them) are quite common at the manager and records interface level.  Other SQLAlchemy objects are mostly hidden behind Database, and no SQLAlchemy usage is exposed in the public Registry interface.

The prototype for the Database interface is nominally complete, in that no further changes or additions seem necessary right now (aside perhaps from clearer expectations on exceptions that will be raised for different error conditions).  Small changes as it transitions from a prototype to production code are of course expected.  There is no concrete implementation of Database in the prototype, but the base class does provide default implementations for most methods (and will provide more in the production version).  The ABC does have complete API documentation, and that is worth reading for reviewers interested in more detail.

Manager and Records Classes

The manager and records interfaces are structured largely to encapsulate decisions about how to map our in-memory data structures to tables, including both normalization/denormalization and vertical/horizontal partitioning.  This encapsulation solves a number of problems:

In some cases, these interfaces also provide an abstraction over whether entities that are relatively few in number (collections, dataset types) are aggressively fetched from the database to avoid repeat queries, fetched only when needed, or something in between (i.e. cached after first use).
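
For example, a cache-after-first-use policy can be hidden entirely behind the manager interface; a purely illustrative sketch:

from typing import Callable, Dict, Optional


class CachingCollectionManager:
    """Illustrative cache-after-first-use wrapper; not the prototype's API.

    ``fetch`` stands in for whatever actually queries the layer's database.
    """

    def __init__(self, fetch: Callable[[str], Optional[object]]):
        self._fetch = fetch
        self._cache: Dict[str, object] = {}

    def find(self, name: str) -> Optional[object]:
        record = self._cache.get(name)
        if record is None:
            record = self._fetch(name)   # hits the database only on a miss
            if record is not None:
                self._cache[name] = record
        return record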

The separation into managers and records classes establishes a boundary between operations that depend only on "static" tables that can be expected to always be present in any layer (all a manager class interacts with) and those that may also depend on a "dynamic" table (one that is only created in a layer when needed).  Table creation can be transaction-interrupting, and the same is actually true of the Database.sync operation we would use on the static "meta" tables that remember which dynamic tables exist, so it's very important that the API make it clear when table creation could happen.  The two-class pattern achieves this by putting all transaction-interrupting operations inside the manager class's register method, which is one of two ways a records object can be obtained (and the other, a get method on the manager, simply returns None if no such records object already exists).  The CollectionManager is a partial exception to this pattern - we do not expect to need any dynamic tables for collections, but we do need to use Database.sync when registering a new collection.
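
A sketch of that two-method pattern for one of the managers (names and signatures are illustrative only):

from abc import ABC, abstractmethod
from typing import Optional


class OpaqueTableRecords(ABC):
    """Records object bound to one dynamic table (sketch)."""


class OpaqueTableManager(ABC):
    """Sketch of the two-method pattern for obtaining a records object."""

    @abstractmethod
    def get(self, name: str) -> Optional[OpaqueTableRecords]:
        """Return the records object if its table already exists, else None.

        Never creates tables, so it is always safe inside a transaction.
        """

    @abstractmethod
    def register(self, name: str, spec) -> OpaqueTableRecords:
        """Return the records object, creating its dynamic table (and the
        corresponding row in the static "meta" table, via Database.sync) if
        necessary.

        May interrupt a transaction and must not be called inside one.
        """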

The manager and records interfaces do not attempt to encapsulate decisions on what kinds of primary keys we would use for different entities.  The interfaces explicitly require each entity to have one of three distinct kinds of primary key; it turns out that the kind of primary key has implications even for the primitives that represent these entities in Python, so abstracting over those choices does not seem worth the effort.  In particular:

Another common pattern in the manager and records interfaces is the presence of one (or occasionally two) select methods that return sqlalchemy.sql.FromClause objects.  Those may represent tables directly if appropriate, but more frequently will represent SELECT queries that can be used as subqueries to form more sophisticated queries.  This approach has already been worked out to a large extent on master in the queries subpackage, and while that will need changes to work with the new prototype, I'm not concerned about the viability of the approach overall.  The select methods (and the higher-level query APIs that utilize them) are significant for the extension to multi-layer registries, however, because they represent the only operations for which a straightforward "try layers in order until one succeeds" (or an equally simple variant on that) doesn't work.  Instead, the query system for a multi-layer Registry would call select methods on all layers, drop those for which the result is None (which may happen if, e.g., the layer knows it has no datasets of a particular dataset type), and combine the rest via UNION or UNION ALL before passing the result to the query builder system (which could otherwise remain largely unchanged).  That means these are the only operations that cannot work in general on a Registry backed by layers with heterogeneous database engines, and even there they could sometimes work if the logic is clever enough to notice that all layers that actually contribute to a particular query are backed by the same connection.
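
A sketch of how that per-layer combination might look, assuming hypothetical layer.datasets.find and records.select accessors standing in for the actual manager/records methods:

import sqlalchemy


def dataset_subquery_for_all_layers(layers, datasetType):
    """Combine per-layer dataset subqueries with UNION ALL (sketch)."""
    subqueries = []
    for layer in layers:
        records = layer.datasets.find(datasetType)
        if records is None:
            continue  # this layer has never seen this dataset type
        sql = records.select()
        if sql is None:
            continue  # this layer has nothing to contribute to the query
        subqueries.append(sql)
    if not subqueries:
        return None
    if len(subqueries) == 1:
        return subqueries[0]
    # UNION ALL suffices when a given dataset row can only live in one layer;
    # plain UNION would be needed if duplicates were possible.
    return sqlalchemy.sql.union_all(*subqueries).alias()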

The manager and records interfaces sometimes depend on each other:

The abstract base classes for all of the manager/records pairs are mostly complete and fully documented in the prototype, and are worth looking at for further details:

There are probably some operations missing (querying collections and handling composite datasets come to mind), but enough is present to give me confidence that the overall architecture is stable.  I believe all operations that involve nontrivial interactions between different components have been included.

The prototype also includes reference implementations for all of these, but these are undocumented and completely untested; as noted above, they were written primarily as thought experiments to validate the interfaces (though I do intend to copy from them heavily when writing the production versions).  I'm not sure we will ever need more than one implementation for most of these (collections and opaque tables in particular are pretty straightforward), but it nevertheless seems prudent to separate interface from implementation for all of them.

Registry Itself

While there is a Registry class in the prototyping directory, it's not nearly as complete as the rest of the prototype I've described here - not only is it interface-only right now (despite the vision of it being a concrete class), but I haven't actually looked at it at all since the rest of the prototype came together in its current form.  I do expect there to be changes to the public Registry API associated with moving to the prototype, but these should be straightforward: separating transactional from transaction-breaking operations, adding vectorization, and using more consistent verbs for similar operations on different entities.

I'm of two minds about whether to continue the prototyping work to extend it to Registry itself, or to just dive into moving the existing prototype ideas into the production version and updating the Registry APIs in the process.  I don't think I need to treat the implementation of Registry itself as another thought experiment to establish the viability of the overall architecture of the prototype, but I do imagine that doing so would help clarify and stabilize the details of the manager/records and Database APIs before those go into active use.  This needs to be weighed against the need to start actually putting a lot of this functionality onto master sooner to unblock others (especially the vectorization and SAVEPOINT reduction that will speed up at-scale ingest to the point where it's actually usable).

Remaining Challenges for Multi-Layer Registry

In a single-layer Registry, all of our logical foreign key constraints and unique constraints can be real database constraints.  This is not true in a multi-layer Registry:

The other obvious gap in this design when moving to multiple layers is Registry construction and configuration.  I think this would need to involve, at some level, specifying concrete class names for the managers/records and Database in configuration, but there are simply a lot of details to work out.  It does seem to be a much simpler problem if we prohibit "rebasing" of layers - once some layer B has been created on top of some other layer A, one would never be able to load B without A.  Hopefully we will be able to maintain flexibility in the opposite direction (not requiring B in order to load A), of course.