Gen3 Database Schema Versioning

Identifiers for versions

Each repository database will include the following in metadata tables:

The daf_butler git repository will contain the general schema version as a Python constant. This is the number written into new repositories created with that version of daf_butler.

Any dimensions configuration file (e.g. config/dimensions.yaml in daf_butler) will contain a dimensions version.

Any repo configuration file (including the defaults in daf_butler) will contain Database, manager, and datastore class names.

Extension-class compatibility

Compatibility for class names is the simplest: all configuration and in-database class names for managers, the Database subclass, and datastores must be identical, and should be checked at Registry client startup.
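
The startup check described above could look roughly like the following. This is a minimal sketch, not daf_butler code: the function name, the exception type, and the role keys ("database", "datastore", and so on) are hypothetical, and it assumes both sets of class names have already been loaded into plain mappings.

```python
from typing import Mapping


class ExtensionClassMismatchError(RuntimeError):
    """Raised when configured and stored extension class names differ."""


def check_extension_classes(
    config_names: Mapping[str, str], stored_names: Mapping[str, str]
) -> None:
    """Require identical class names in the config and in the database.

    Both arguments map a role (hypothetical keys such as "database") to a
    fully qualified class name.
    """
    mismatches = []
    # Look at every role present on either side, so a role missing from one
    # side is reported as a mismatch too.
    for role in sorted(set(config_names) | set(stored_names)):
        configured = config_names.get(role)
        stored = stored_names.get(role)
        if configured != stored:
            mismatches.append(
                f"{role}: config={configured!r}, database={stored!r}"
            )
    if mismatches:
        raise ExtensionClassMismatchError("; ".join(mismatches))


# Usage with placeholder class names: identical mappings pass silently.
check_extension_classes(
    {"database": "example.SqliteDatabase", "datastore": "example.FileDatastore"},
    {"database": "example.SqliteDatabase", "datastore": "example.FileDatastore"},
)
```

Collecting every mismatch before raising means a misconfigured client reports all of its problems at once rather than one per startup attempt.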

On (or before), these configuration values will be written into the per-repo config file when a repository is created, rather than being linked to the default configuration in daf_butler.

Whenever reasonable, backward- or forward-incompatible schema changes should be handled by adding new extension classes. Old repositories will then be unaffected, as the previous extension classes they are configured to use will still exist (unless they are deprecated and ultimately removed - something we can manage on a case-by-case basis, and postpone as long as desirable).

The implementation of a named extension class may change arbitrarily between Python versions as long as those changes are not reflected in the schema (or the interpretation of the schema). If a named extension class does undergo a change that is reflected in the schema, the general schema version must be updated.

Dimension compatibility

The dimension configuration will (on DM-24660, or earlier) be written into the per-repo config file when a repository is created, rather than being linked to the default configuration in daf_butler, and we will start updating the version number in the default config file whenever it is modified.

If the dimension configuration file has version X.Y.Z1 and the database has version X.Y.Z2, with Z1 > Z2, they are fully compatible. Patch version changes are thus expected to cover only very minor changes, such as changing the length of string fields.

If the dimension configuration file has version X.Y1.* and the database has version X.Y2.*, they have limited compatibility:

Minor version changes are expected to cover the addition of new dimensions and the addition of new metadata columns. This will require code changes to implement, however; at present adding new dimensions to a DimensionUniverse is actually fairly disruptive, and almost any change would require a major version increment.

If the dimension configuration file and database differ in major version, they are incompatible and it may not be possible to initialize the Registry client at all (so a migration script will definitely be needed).
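
The patch/minor/major rules above can be sketched as a small classifier. All names here are hypothetical. The text only describes the case where the configuration is at least as new as the database, so reporting a newer database as a separate outcome is an assumption of this sketch rather than part of the proposal.

```python
from enum import Enum
from typing import NamedTuple


class Compatibility(Enum):
    FULL = "full"
    LIMITED = "limited"
    INCOMPATIBLE = "incompatible"
    # Assumption: the text does not cover a database newer than the config.
    REPO_NEWER = "repo_newer"


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)


def compare_dimension_versions(config: Version, database: Version) -> Compatibility:
    """Classify a dimensions config version against a database version."""
    if config.major != database.major:
        return Compatibility.INCOMPATIBLE
    if config < database:  # tuple ordering: minor first, then patch
        return Compatibility.REPO_NEWER
    if config.minor != database.minor:
        return Compatibility.LIMITED
    return Compatibility.FULL


print(compare_dimension_versions(Version.parse("1.2.3"), Version.parse("1.2.1")))
```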

Modifications to dimension configuration already in repositories (as opposed to the defaults in daf_butler) are effectively forks of the main configuration in daf_butler, and should be versioned as X.Y.Z+dX.dY.dZ, where X.Y.Z is fixed at the daf_butler version at which the fork diverged, and dX.dY.dZ is the version of the fork (starting at 0.0.0).
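
As an illustration of the X.Y.Z+dX.dY.dZ form, a parser might split the string at the plus sign and read each half as a three-part version. The names below are hypothetical, and treating a string without a "+" as an unforked default configuration is an assumption of this sketch.

```python
from typing import NamedTuple, Optional, Tuple

Triple = Tuple[int, int, int]


class ForkedVersion(NamedTuple):
    base: Triple            # X.Y.Z: version at which the fork diverged
    fork: Optional[Triple]  # dX.dY.dZ: version of the fork, None if unforked


def _parse_triple(text: str) -> Triple:
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated integers, got {text!r}")
    major, minor, patch = (int(part) for part in parts)
    return (major, minor, patch)


def parse_forked_version(text: str) -> ForkedVersion:
    """Parse 'X.Y.Z' or 'X.Y.Z+dX.dY.dZ' into its base and fork parts."""
    base_text, separator, fork_text = text.partition("+")
    base = _parse_triple(base_text)
    fork = _parse_triple(fork_text) if separator else None
    return ForkedVersion(base, fork)


print(parse_forked_version("1.2.3+0.0.0"))
```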

General schema compatibility

The major, minor, and patch components of the general schema version behave just like those of the dimensions schema version, but they correspond directly to a daf_butler software version rather than the dimensions configuration:

Not all daf_butler code changes should involve a schema version increment - only those that change the schema in a way that is not captured by an extension class or dimensions configuration should.

The general schema version is thus not the same as any software version applied to the daf_butler codebase (in EUPS/git/etc.). Both will increase monotonically, and hence the relationship between software and schema versions will always be straightforward, but it will not be one-to-one.

Migrations

Migrations always correspond to the complete repository schema (including datastores), and hence the begin and end points of a migration can only be identified by all of the above identifiers:

For convenience, we probably want to define a hash that combines all of these; this hash would ideally then have a one-to-one relationship with the hash of the actual DDL used by Alembic (I think). There would be no overall ordering for these hashes.
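
One way such a combined hash could be computed is to serialize the identifiers in a canonical form and digest the result. This is only a sketch: the function name, the field names, and the choice of SHA-256 over canonical JSON are all assumptions, and nothing here establishes the hoped-for correspondence with the DDL.

```python
import hashlib
import json
from typing import Mapping


def schema_digest(
    general_version: str,
    dimensions_version: str,
    class_names: Mapping[str, str],
) -> str:
    """Combine all schema identifiers into one hex digest.

    Keys are sorted during serialization so the digest does not depend on
    the order in which class names were supplied.
    """
    payload = json.dumps(
        {
            "general": general_version,
            "dimensions": dimensions_version,
            "classes": dict(class_names),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder identifiers, for illustration only.
digest = schema_digest("1.0.0", "1.2.3+0.0.0", {"database": "example.Db"})
print(len(digest))
```

Because the digest is an opaque string, it carries no ordering, which matches the expectation that these hashes have no overall order.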

We should not expect to have a migration between any pair of possible hashes. While we may want to eventually have a policy stating when a migration must be created in order to make a code or convention change, we should not expect a migration to be created for every change to the schema.

One of the main things I hope to get from Alembic would be a way to organize and compose these migrations on-demand, along with a way to store compound migrations after they have been created, especially in cases where composition isn't fully automatic.
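
To make the idea of composing migrations concrete, the sketch below treats each available migration as an edge between two schema hashes and searches for a chain from one hash to another. It does not use or imitate Alembic's API. The data layout and function name are hypothetical, and a breadth-first search is just one simple way to find the shortest chain.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def find_migration_path(
    migrations: Dict[str, Dict[str, str]], start: str, end: str
) -> Optional[List[str]]:
    """Return migration names leading from start to end, or None.

    `migrations` maps a source hash to {target hash: migration name}.
    """
    # For each reached hash, remember how we got there.
    came_from: Dict[str, Optional[Tuple[str, str]]] = {start: None}
    queue = deque([start])
    while queue and end not in came_from:
        current = queue.popleft()
        for target, name in migrations.get(current, {}).items():
            if target not in came_from:
                came_from[target] = (current, name)
                queue.append(target)
    if end not in came_from:
        return None  # no chain of existing migrations connects the two
    # Walk back from the end hash to recover the migration names in order.
    path: List[str] = []
    step = came_from[end]
    while step is not None:
        previous, name = step
        path.append(name)
        step = came_from[previous]
    path.reverse()
    return path


available = {"a": {"b": "a_to_b"}, "b": {"c": "b_to_c"}}
print(find_migration_path(available, "a", "c"))
```

A result of None corresponds to the case noted above, where no migration exists between a given pair of hashes.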

Preference order for versioning

Changes to the dimensions tables should always be made via changes to the dimensions configuration, except for large-scale structural changes to how the dimensions configuration is interpreted and represented in the database. These will generally have to be identified by changes to the general schema version, but in rare cases they may be implemented via a new DimensionRecordStorageManager subclass (which is preferred when it is possible).

For other changes, adding a new extension subclass (even when this results in some code duplication) should generally be preferred over incrementing the general schema version, as this makes those changes opt-in on a per-repository basis. Exceptions to this rule include:

Limitations

This proposal is at present only about backwards compatibility, and it has no notion of forward compatibility at all (i.e. no support for older code reading newer data repos). It's clear we could support that in at least a few cases, but I'm not sure the use cases for doing so are compelling enough to merit the additional complexity in the versioning logic. It would be much safer, for example, to check versions at client startup and fail immediately in the case that the repo is newer than the code/config, rather than expect each code component to carefully proceed and rigorously guard against repo corruption while working with a schema the code author cannot possibly have anticipated.
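
The fail-fast behavior suggested above might be as simple as the following guard run at client startup. The names are hypothetical, and the sketch assumes plain three-part version strings with no fork suffix.

```python
from typing import Tuple


class RepositoryTooNewError(RuntimeError):
    """Raised when a repository's schema is newer than the running code."""


def _as_tuple(version: str) -> Tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def check_repository_not_newer(code_version: str, repo_version: str) -> None:
    """Refuse to proceed if the repository is newer than the code/config."""
    if _as_tuple(repo_version) > _as_tuple(code_version):
        raise RepositoryTooNewError(
            f"repository schema {repo_version} is newer than supported "
            f"version {code_version}; refusing to open it"
        )


# An older or equal repository passes silently.
check_repository_not_newer("1.4.0", "1.3.2")
```

Comparing integer tuples rather than raw strings avoids ordering "1.10.0" before "1.9.0".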

This proposal also defines version increments as transitive: X.Y.2 is fully compatible not just with X.Y.1, but with X.Y.0 as well. This constrains when each version number should be incremented, and makes that decision case-by-case to some degree: a change that would otherwise be patch-level relative to the immediately previous version might need to be a minor version increment instead if it is not fully compatible with some earlier patch version in the same minor series.