Gen3 Database Schema Versioning

Identifiers for versions

Each repository database will include the following in metadata tables:

  • a semver (X.Y.Z) general schema version;
  • a semver (X.Y.Z) dimensions version;
  • the fully-qualified name of the registry.interfaces.Database subclass;
  • the fully-qualified names of all registry manager classes (i.e. the registry/managers section of the config);
  • the fully-qualified names of all datastores.

The daf_butler git repository will contain the general schema version as a Python constant. This is the number written into new repositories created with that version of daf_butler.

Any dimensions configuration file (e.g. config/dimensions.yaml in daf_butler) will contain a dimensions version.

Any repo configuration file (including the defaults in daf_butler) will contain Database, manager, and datastore class names.
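
As a concrete illustration, the identifiers above could be recorded as simple key/value pairs in a metadata table. This is only a sketch: the key names and class paths below are invented for illustration, not a decided convention.

    # Illustrative contents of a repository's version-metadata table;
    # the keys and class paths are placeholders, not a fixed schema.
    REPO_VERSION_METADATA = {
        "version:schema": "1.2.0",       # general schema version (semver)
        "version:dimensions": "3.0.1",   # dimensions version (semver)
        "class:database": "lsst.daf.butler.registry.databases.sqlite.SqliteDatabase",
        "class:manager:datasets": "lsst.daf.butler.registry.datasets.ByDimensionsManager",
        "class:datastore": "lsst.daf.butler.datastores.posixDatastore.PosixDatastore",
    }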

Extension-class compatibility

Compatibility for class names is the simplest case: the configured and in-database class names for managers, the Database subclass, and datastores must be identical, and this should be checked at Registry client startup.
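
A minimal sketch of that startup check, assuming the stored and configured names are both available as name-to-value mappings (the function name here is invented):

    def check_extension_classes(stored: dict, configured: dict) -> None:
        # Compare each class name recorded in the database against the one
        # the client is configured to use; any mismatch is fatal.
        for key, stored_name in stored.items():
            configured_name = configured.get(key)
            if configured_name != stored_name:
                raise RuntimeError(
                    f"Incompatible repository: {key} is {stored_name!r} in the "
                    f"database but {configured_name!r} in the configuration."
                )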

On DM-24660 (or before), these configuration values will be written into the per-repo config file when a repository is created, rather than being linked to the default configuration in daf_butler.

Whenever reasonable, backward- or forward-incompatible schema changes should be handled by adding new extension classes. Old repositories will then be unaffected, as the previous extension classes they are configured to use will still exist (unless they are deprecated and ultimately removed - something we can manage on a case-by-case basis, and postpone as long as desirable).

The implementation of a named extension class may change arbitrarily between daf_butler versions as long as those changes are not reflected in the schema (or the interpretation of the schema). If a named extension class does undergo a change that is reflected in the schema, the general schema version must be updated.

Dimension compatibility

The dimension configuration will (on DM-24660, or earlier) be written into the per-repo config file when a repository is created, rather than being linked to the default configuration in daf_butler, and we will start updating the version number in the default config file whenever it is modified.

If the dimension configuration file has version X.Y.Z1 and the database has version X.Y.Z2, with Z1 > Z2, they are fully compatible. Patch version changes are thus expected to cover only very minor changes, such as changing the length of string fields.

If the dimension configuration file has version X.Y1.* and the database has version X.Y2.* (with Y1 ≠ Y2), they have limited compatibility:

  • all read-only operations are supported;
  • all dataset insertion/deletion and collection manipulation are supported;
  • dataset type registration and dimension insertion may not be supported, depending on which dimensions are referenced; Registry is guaranteed to raise exceptions without modifying the repository when an unsupported operation is requested.

Minor version changes are expected to cover the addition of new dimensions and the addition of new metadata columns. This will require code changes to implement, however; at present adding new dimensions to a DimensionUniverse is actually fairly disruptive, and almost any change would require a major version increment.

If the dimension configuration file and database differ in major version, they are incompatible and it may not be possible to initialize the Registry client at all (so a migration script will definitely be needed).
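
The patch/minor/major rules above could be implemented as a simple comparison; this sketch (names invented, and the same logic would apply to the general schema version described below) also encodes the rule from the Limitations section that a repository newer than the configuration is rejected outright:

    from enum import Enum

    class Compatibility(Enum):
        FULL = "full"          # patch-level difference only
        LIMITED = "limited"    # minor-level difference: reads and some writes
        NONE = "incompatible"  # major-level difference, or repo newer than code

    def dimensions_compatibility(config_version: str, db_version: str) -> Compatibility:
        cfg = tuple(int(v) for v in config_version.split("."))
        db = tuple(int(v) for v in db_version.split("."))
        if cfg < db:
            return Compatibility.NONE  # no forward compatibility (see Limitations)
        if cfg[:2] == db[:2]:
            return Compatibility.FULL
        if cfg[0] == db[0]:
            return Compatibility.LIMITED
        return Compatibility.NONE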

Modifications to dimension configuration already in repositories (as opposed to the defaults in daf_butler) are effectively forks of the main configuration in daf_butler, and should be versioned as X.Y.Z+dX.dY.dZ, where X.Y.Z is fixed at the daf_butler version at which it diverged and dX.dY.dZ is the version of the fork (starting at 0.0.0).
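
A sketch of how that fork notation might be parsed (the function name is invented for this example):

    def parse_fork_version(version: str):
        # "3.0.1+0.1.0" -> ((3, 0, 1), (0, 1, 0)); an unforked version has
        # an implicit fork version of 0.0.0.
        base, _, fork = version.partition("+")
        def parse(s):
            return tuple(int(v) for v in s.split("."))
        return parse(base), (parse(fork) if fork else (0, 0, 0))

    assert parse_fork_version("3.0.1+0.1.0") == ((3, 0, 1), (0, 1, 0))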

General Schema compatibility

The major, minor, and patch components of the general schema version behave just like those of the dimensions schema version, but they correspond directly to a daf_butler software version rather than the dimensions configuration:

  • differences in the patch version do not break backwards compatibility at all;
  • differences in the minor version permit read-only access, and may permit some write access;
  • differences in the major version represent changes that are not backwards compatible at all.

Not all daf_butler code changes should involve a schema version increment - only those that change the schema in a way that is not captured by an extension class or dimensions configuration should.

The general schema version is thus not the same as any software version applied to the daf_butler codebase (in EUPS/git/etc.). Both will increase monotonically, and hence the relationship between software and schema versions will always be straightforward, but it will not be one-to-one.

Migrations

Migrations always correspond to the complete repository schema (including datastores), and hence the start and end points of a migration can only be identified by all of the above identifiers:

  • general schema version;
  • dimensions version;
  • all extension class names (Registry managers, Database, Datastores).

For convenience, we probably want to define a hash that combines all of these; this hash would ideally then have a one-to-one relationship with the hash of the actual DDL used by Alembic (I think). There would be no overall ordering for these hashes.
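
A sketch of what that combined hash might look like, assuming all identifiers are available as strings (the choice of SHA-1 and the newline encoding are placeholders):

    import hashlib

    def schema_digest(schema_version: str, dimensions_version: str,
                      class_names: list[str]) -> str:
        # Sort the class names so the digest does not depend on the order
        # in which they happen to be listed in the configuration.
        parts = [schema_version, dimensions_version, *sorted(class_names)]
        return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()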

We should not expect to have a migration between any pair of possible hashes. While we may want to eventually have a policy stating when a migration must be created in order to make a code or convention change, we should not expect a migration to be created for every change to the schema.

One of the main things I hope to get from Alembic is a way to organize and compose these migrations on demand, along with a way to store compound migrations after they have been created, especially in cases where composition isn't fully automatic.
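
For flavor, here is a minimal Alembic migration script for the kind of patch-level change discussed later (widening a string field); the revision identifiers stand in for the combined hashes above, and the table and column names are invented:

    import sqlalchemy as sa
    from alembic import op

    revision = "a1b2c3d4"       # hash of the schema after this migration
    down_revision = "e5f6a7b8"  # hash of the schema before it

    def upgrade():
        # Patch-level change: widen a string field.
        op.alter_column("dataset_type", "name",
                        type_=sa.String(128), existing_type=sa.String(64))

    def downgrade():
        op.alter_column("dataset_type", "name",
                        type_=sa.String(64), existing_type=sa.String(128))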

Preference order for versioning

Changes to the dimensions tables should always be made via changes to the dimensions configuration, except for large-scale structural changes to how the dimensions configuration is interpreted and represented in the database. These will generally have to be identified by changes to the general schema version, but in rare cases they may be implemented via a new DimensionRecordStorageManager subclass (which is preferred when it is possible).

For other changes, adding a new extension subclass (even when this results in some code duplication) should generally be preferred over incrementing the general schema version, as this makes those changes opt-in on a per-repository basis. Exceptions to this rule include:

  • New Database and Datastore subclasses should be created only when necessary.

  • Patch-level changes to the general schema version whose migrations can be generated automatically should be preferred to the creation of a new extension subclass. We don't want to create new Manager subclasses every time we add a non-key column or change a string field length, for example.

Limitations

This proposal is at present only about backwards compatibility, and it has no notion of forward compatibility at all (i.e. no support for older code reading newer data repos). It's clear we could support that in at least a few cases, but I'm not sure the use cases for doing so are compelling enough to merit the additional complexity in the versioning logic. It would be much safer, for example, to check versions at client startup and fail immediately in the case that the repo is newer than the code/config, rather than expect each code component to carefully proceed and rigorously guard against repo corruption while working with a schema the code author cannot possibly have anticipated.

This proposal also defines version increments as transitive: X.Y.2 is fully compatible not just with X.Y.1, but with X.Y.0 as well. This is really a statement about when we should increment the various version numbers, and it is at some level case-by-case: a change that might otherwise be a patch-level change when applied to the immediately previous version might need to be a minor version increment instead if it is not fully compatible with some earlier patch version on the same minor release.

Comments

  1. It still does not fit entirely into my brain, but here are a few random comments anyway:

    • With potentially everything being versioned, we need something stable to depend upon, so at least the metadata needs to be reliably readable by all possible client versions. It could be versioned too, but it still needs to be (at least partially) forward- and backward-compatible.
    • Regarding fully-qualified names - these are potentially unstable with regard to refactoring; e.g. simply moving a class to a different package or subclassing it would probably break things. For extension classes I think the name is supposed to define the kind of implementation for a particular manager type; would it be more stable/flexible to store some arbitrary string for an implementation and use that as an argument for a factory method?
    • Having the Database class name in metadata is a chicken-and-egg problem - to read the metadata we already need to use a specific Database type to access the database. I'd think that this is probably not very useful at all, we probably are not going to have more than one implementation for any given backend?
    • SemVer may be confusing when talking about a database schema; I think the usual convention is that a major version increment means an incompatible change, a minor version is for backward-compatible changes, and a patch is just a bug fix (a change in behavior not visible to the external world). If we expose (e.g. document) our schema then any change is potentially incompatible, even a trivial change in string length. It would be possible to write a backward-compatible version of the Python code that hides that simple change in the schema, but then we should probably avoid exposing the schema to the outside world.
    • Management of the migrations could be a complicated topic with lots of branching/forking. I can imagine that each DRP release has its own registry schema which is frozen at a point in time but still may need occasional bug fixes. If daf_butler is the single source of truth for all those schemas it could become an unholy mess in 10 years. It may be better to have a separate package (or packages) for that purpose, though right now I don't even understand how software releases are going to map onto data releases.
    • To repeat myself, I think that Alembic as a tool can probably be used for management of migrations (perhaps with some creative approach), but it does not control the process; that should be defined by ourselves.

    I need to read and understand more. Kian-Tat Lim, I think some input on this huge topic would be useful from the Architecture Team too.


    1. Regarding fully-qualified names - these are potentially unstable with regard to refactoring; e.g. simply moving a class to a different package or subclassing it would probably break things. For extension classes I think the name is supposed to define the kind of implementation for a particular manager type; would it be more stable/flexible to store some arbitrary string for an implementation and use that as an argument for a factory method?

      Having some kind of global mapping from string to class name is indeed more robust against refactoring, but it adds significant complexity in making sure that mapping is fully populated when you use it.  In pex_config-based extensions in the pipelines, we've used that approach extensively with import-time logic that lets extension packages add new things to the global mapping, and the result has been a lot of pain trying to make sure packages are imported.

      I think my attitude now is that if your hierarchy is closed for extension - or at least all subclasses must live in the same package - then a global mapping from string names to classes is a reasonable way to proceed.  If you want to support extensions outside the base package, then fully-qualified names are still better even though they are fragile.  The registry manager classes and datastores we're considering here are currently all in daf_butler, so a global mapping may well be the way to proceed, but if so we should avoid any import-time logic and probably just declare that extensions cannot be defined in other packages.  A middle ground might be to let the mapping be defined in per-repository config (with daf_butler defaults) instead of as a global variable in daf_butler; that's how the Database subclasses are already handled.
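
      A minimal sketch of that middle ground: the per-repository config maps short string keys to fully-qualified class names (all names below are illustrative), and the client resolves them with a plain import:

          from importlib import import_module

          # Would come from per-repo config, with defaults in daf_butler.
          MANAGER_CLASSES = {
              "datasets": "lsst.daf.butler.registry.datasets.ByDimensionsManager",
          }

          def load_manager(key: str) -> type:
              module_name, _, class_name = MANAGER_CLASSES[key].rpartition(".")
              return getattr(import_module(module_name), class_name)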

    2. I'd think that this is probably not very useful at all, we probably are not going to have more than one implementation for any given backend?

      I don't have a use case for having multiple Databases for one backend, and from a maintenance standpoint it's absolutely something we should try to avoid.  I only included Databases because I figured having another customization point we could use for backwards compatibility might someday be useful if it lets us avoid some really painful major migration.  That's totally hypothetical, and I think I agree with you that including the Database class with the other extendable classes here is probably more trouble than it's worth.

    3. If we expose (e.g. document) our schema then any change is potentially incompatible, even a trivial change in string length. It would be possible to write a backward-compatible version of the Python code that hides that simple change in the schema, but then we should probably avoid exposing the schema to the outside world.

      I have indeed been moving in the direction of considering our schema private and Python the only public interface.  We will probably have to make some exceptions (I'm thinking vaguely about interfacing with Rucio and DBB stuff, which I know little about).  But I have come to the conclusion that SQL just doesn't have the kind of abstractions necessary to hide implementation details, so trying to do so is just too painful to be worthwhile - the only tool you have is views, and those are unevenly supported in different database engines and not (directly) supported by SQLAlchemy at all.

  2. I think I would like stricter and more well-defined rules for version compatibility. Saying "some write access possible" is not very helpful; I would prefer that clients knew in advance that a write will fail, rather than trying and waiting until it fails. I thought about possible compatibility levels; here is a list of what I think are reasonable options:

    1. full backward and forward compatibility for both reads and writes
    2. backward compatibility for reads/writes, forward compatibility for reads
    3. backward compatibility for reads and writes
    4. backward compatibility for reads only
    5. no compatibility

    I think #1 logically maps to patch-level differences (X.Y.m and X.Y.n for any combination of m and n). There is probably a very limited set of changes that are fully compatible, but I still think it's worth supporting that option.

    #5 is the obvious candidate for major version changes.

    For minor version changes I'm not sure whether we want #2, #3, or #4. My assumption would be that if we can read an old schema from a new release then we can likely also write into it. Still, it may be safer to say that we only support reading, to avoid potential problems. And since we do not have any requirements for version management at this point, we can choose the strictest rules, which will also be the easiest to implement. So I'd vote for #4 as the option for minor version changes.