Goals

  • Normalize the dimension fields in the current (monolithic) dataset table.
  • Add a place to put per-DatasetType metadata, including regions and timespans.
  • Add support for special DatasetTypes:
    • "Global" DatasetTypes
      • Examples: "raw", maybe external reference catalogs.
      • Always ingested, never produced by pipelines.
      • Have data IDs that are unique across all collections.
    • "Standard" DatasetTypes
      • All we have now.
      • Data IDs are unique within any single collection.
    • "Nonsingular" DatasetTypes
      • Data IDs are not unique within "tagged" collections (see below); frequently have empty data IDs (no dimensions).
      • Cannot be used as PipelineTask inputs from "tagged" collections.
      • Must be used for PipelineTask initInputs and initOutputs.
      • May be used for PipelineTask prerequisite inputs.
      • Major use case is master calibrations (more on that later).
      • Using these for schemas and configs should solve some existing problems with those.
  • Add support for special Collections:
    • "Run" Collections
      • A Collection associated with a Run.
      • Datasets are associated with this collection implicitly via their Run, and cannot be associated or disassociated later.
      • Only collection that can be used in filename templates (others don't guarantee uniqueness).
      • Only collection that can be used for PipelineTask outputs.
    • "Tagged" Collections
      • All we have now.
      • Datasets can be associated and disassociated freely.
    • "Calibration" Collections
      • Associates each member dataset with a validity range.
      • Validity ranges for a particular DatasetType are non-overlapping (probably not enforceable via DB constraint).
      • Member datasets must be nonsingular.

Designs

All designs below use the following example DatasetTypes:

  • raw
    • global
    • dimensions=(instrument, exposure, detector)
    • metadata=(region, datetime_begin, datetime_end, npix)
  • object
    • standard
    • dimensions=(skymap, tract, patch)
    • metadata=(nrows)
  • bias
    • nonsingular
    • dimensions=(instrument, detector)
    • metadata=(npix)

Dimensions not used by these DatasetTypes are omitted below for simplicity.  Dimension tables, datastore tables, the quantum table, and the execution table are also not shown.
Database types shown as "enum" below may be either strings or ints - the point is that the set of possible values is fixed.
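
For example, one portable way to realize such an "enum" column is a string with a CHECK constraint; the table name below is invented purely for illustration, and the values mirror the collection types listed in the goals above:

-- Sketch: an "enum" column realized as a constrained string.
CREATE TABLE "enum_example" (
  "type" varchar NOT NULL CHECK ("type" IN ('run', 'tagged', 'calibration'))
);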

Current Schema + (uid, site) Split

This is the current schema, with autoincrement fields (e.g. dataset_id) split into (uid, site), and a table added that lists all collections.  We'll want to make those changes regardless.

This schema does not meet any of the goals stated above; it's shown here for reference as a starting point.

One field here that merits explanation is dataset_ref_hash.  This is a secure (i.e. we require and assume no collisions) hash of the dataset's dataset_type_name and data ID (i.e. dimension) fields.  It's included in dataset as the source of truth, and duplicated in dataset_collection so it can be used in the unique constraint there.
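
For concreteness, here is a hedged sketch of how such a hash might be formed.  In practice it would likely be computed in Python; the separator, field ordering, and choice of SHA-256 below are illustrative assumptions, not part of the design (shown as PostgreSQL, which needs the pgcrypto extension for digest()):

-- Sketch only: hash of dataset_type_name plus ordered data ID key/value pairs.
-- The example values ('raw', HSC, 903334, 16) are invented for illustration.
SELECT encode(digest('raw;instrument=HSC;exposure=903334;detector=16',
                     'sha256'),
              'hex') AS dataset_ref_hash;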

CREATE TABLE "dataset" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "dataset_type_name" varchar NOT NULL,
  "run_uid" integer NOT NULL,
  "quantum_uid" integer,
  "dataset_ref_hash" varchar NOT NULL,
  "instrument" varchar,
  "exposure" integer,
  "detector" integer,
  "skymap" varchar,
  "tract" integer,
  "patch" integer,
  FOREIGN KEY ("dataset_type_name") REFERENCES "dataset_type" ("dataset_type_name"),
  FOREIGN KEY ("run_uid", "site") REFERENCES "run" ("uid", "site")
);

CREATE TABLE "dataset_type" (
  "dataset_type_name" varchar PRIMARY KEY,
  "storage_class" varchar NOT NULL
);

CREATE TABLE "collection" (
  "name" varchar PRIMARY KEY,
  "type" enum NOT NULL,
  UNIQUE (name)
);

CREATE TABLE "run" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "collection" varchar NOT NULL,
  UNIQUE ("collection"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name")
);

CREATE TABLE "dataset_collection" (
  "dataset_uid" integer PRIMARY KEY,
  "dataset_site" integer PRIMARY KEY,
  "collection" varchar PRIMARY KEY,
  "dataset_ref_hash" varchar NOT NULL,
  UNIQUE ("dataset_ref_hash", "collection"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name"),
  FOREIGN KEY ("dataset_uid", "dataset_site") REFERENCES "dataset" ("uid", "site")
);
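
For later comparison, a typical data ID lookup in a tagged collection under this schema might look like the following (a sketch; the real query-generation logic is more general):

-- Sketch: find an "object" dataset by data ID in a tagged collection.
-- All data ID fields live in the monolithic dataset table, so the fields
-- not used by this DatasetType are simply NULL.
SELECT dataset.uid, dataset.site
  FROM dataset
  JOIN dataset_collection ON (dataset.uid = dataset_collection.dataset_uid
                              AND dataset.site = dataset_collection.dataset_site)
  WHERE dataset_collection.collection = ?
    AND dataset.dataset_type_name = 'object'
    AND dataset.skymap = ? AND dataset.tract = ? AND dataset.patch = ?;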


Partition datasets (Only)

This design partitions the dataset table into a common table and one extra per-DatasetType table.  That addresses the first two goals - data ID field normalization and metadata - but not the special kinds of dataset uniqueness and lookups we need.

CREATE TABLE "dataset" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "dataset_type_name" varchar NOT NULL,
  "dataset_ref_hash" varchar NOT NULL,
  "run_uid" integer NOT NULL,
  "quantum_uid" integer,
  FOREIGN KEY ("dataset_type_name") REFERENCES "dataset_type" ("dataset_type_name"),
  FOREIGN KEY ("run_uid", "site") REFERENCES "run" ("uid", "site")
);

CREATE TABLE "dataset_type" (
  "dataset_type_name" varchar PRIMARY KEY,
  "storage_class" varchar NOT NULL
);

CREATE TABLE "collection" (
  "name" varchar PRIMARY KEY,
  "type" enum NOT NULL,
  UNIQUE (name)
);

CREATE TABLE "run" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "collection" varchar NOT NULL,
  UNIQUE ("collection"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name")
);

CREATE TABLE "dataset_collection" (
  "dataset_uid" integer PRIMARY KEY,
  "dataset_site" integer PRIMARY KEY,
  "collection" varchar PRIMARY KEY,
  "dataset_ref_hash" varchar NOT NULL,
  UNIQUE ("dataset_ref_hash", "collection"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name"),
  FOREIGN KEY ("dataset_uid", "dataset_site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_raw" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "instrument" varchar NOT NULL,
  "exposure" integer NOT NULL,
  "detector" integer NOT NULL,
  "region" varchar,
  "datetime_begin" datetime,
  "datetime_end" datetime,
  "npix" integer,
  UNIQUE ("instrument", "exposure", "detector"),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_object" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "skymap" varchar NOT NULL,
  "tract" integer NOT NULL,
  "patch" integer NOT NULL,
  "nrows" integer,
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_bias" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "instrument" varchar NOT NULL,
  "detector" integer NOT NULL,
  "npix" integer,
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);
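
The same tagged-collection lookup now needs one extra join to pick up the data ID fields from the per-DatasetType table (again a sketch, following the query conventions used later on this page):

-- Sketch: data ID search for object in a tagged collection with this design.
-- No dataset_type_name filter is needed: dataset_object only holds objects.
SELECT dataset_object.uid, dataset_object.site
  FROM dataset_object
  JOIN dataset_collection ON (dataset_object.uid = dataset_collection.dataset_uid
                              AND dataset_object.site = dataset_collection.dataset_site)
  WHERE dataset_collection.collection = ?
    AND dataset_object.skymap = ? AND dataset_object.tract = ? AND dataset_object.patch = ?;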


Specialized uniqueness, with hashes

This is the first design that should meet all of the goals stated above.  The primary difference is that dataset_collection is split into dataset_collection_tagged (for "tagged" collections) and dataset_collection_calibration (for "calibration" collections).  Associations for "run" collections would be via the run_uid in the common dataset table (we could start doing this now with no schema changes, actually - just changes to query-generation logic).

The other changes involve the addition of UNIQUE constraints to several tables:

  • "Global" DatasetTypes have a constraint in their per-DatasetType table (e.g. dataset_raw, below).  Note that because that the table is restricted to a single DatasetType and all data ID fields are present, there's no need for dataset_ref_hash  in this constraint.

  • "Run" collections use the constraint involving dataset_ref_hash  in the common dataset  table (this is redundant but harmless for DatasetTypes with uniqueness="global").

  • "Tagged" collections use the constraint involving dataset_ref_hash the dataset_collections_tagged table.

In addition, we define the dataset_ref_hash for "nonsingular" DatasetTypes to be a combination of their uid and site.  That effectively makes the dataset_ref_hash constraints on them do nothing, which is what we want for tagged collections, and maybe acceptable for run collections (see the sketch after the example queries below).


CREATE TABLE "dataset" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "dataset_type_name" varchar NOT NULL,
  "dataset_ref_hash" varchar NOT NULL,
  "run_uid" integer NOT NULL,
  "quantum_uid" integer,
  UNIQUE ("dataset_ref_hash", "run_uid", "site"),
  FOREIGN KEY ("dataset_type_name") REFERENCES "dataset_type" ("dataset_type_name"),
  FOREIGN KEY ("run_uid", "site") REFERENCES "run" ("uid", "site")
);

CREATE TABLE "dataset_type" (
  "dataset_type_name" varchar PRIMARY KEY,
  "storage_class" varchar NOT NULL,
  "uniqueness" enum NOT NULL
);

CREATE TABLE "collection" (
  "name" varchar PRIMARY KEY,
  "type" enum NOT NULL,
  UNIQUE (name)
);

CREATE TABLE "run" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "collection" varchar NOT NULL,
  UNIQUE ("collection"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name")
);

CREATE TABLE "dataset_collection_tagged" (
  "dataset_uid" integer PRIMARY KEY,
  "dataset_site" integer PRIMARY KEY,
  "collection" varchar PRIMARY KEY,
  "dataset_ref_hash" varchar NOT NULL,
  UNIQUE ("dataset_ref_hash", "collection"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name"),
  FOREIGN KEY ("dataset_uid", "dataset_site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_collection_calibration" (
  "dataset_uid" integer PRIMARY KEY,
  "dataset_site" integer PRIMARY KEY,
  "collection" varchar PRIMARY KEY,
  "datetime_begin" datetime NOT NULL,
  "datetime_end" datetime NOT NULL,
  FOREIGN KEY ("dataset_uid", "dataset_site") REFERENCES "dataset" ("uid", "site"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name")
);

CREATE TABLE "dataset_raw" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "instrument" varchar NOT NULL,
  "exposure" integer NOT NULL,
  "detector" integer NOT NULL,
  "region" varchar,
  "datetime_begin" datetime,
  "datetime_end" datetime,
  "npix" integer,
  UNIQUE ("instrument", "exposure", "detector"),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_object" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "skymap" varchar NOT NULL,
  "tract" integer NOT NULL,
  "patch" integer NOT NULL,
  "nrows" integer,
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_bias" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "instrument" varchar NOT NULL,
  "detector" integer NOT NULL,
  "npix" integer,
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

The most common queries for datasets with this schema will have one of the following forms.  Frequently these will be used as subqueries in much larger queries (such as in QuantumGraph generation); in that context the WHERE constraints below on data ID fields (instrument, exposure, detector, skymap, tract, patch) will move to JOIN ON clauses.  A key point for all of these queries is that we can expect the collections to be fully provided by the user; that means we'll be able to perform initial queries up front (caching results as needed) to identify each collection's type and look up its associated run_uid values, and hence the constraints below on collections will stay in the WHERE clause even when these queries are used as subqueries.

-- Data ID search for raw (uniqueness=GLOBAL) in any collection:
SELECT uid, site FROM dataset_raw WHERE instrument=? AND exposure=? AND detector=?;

-- Data ID search for object (uniqueness=STANDARD) in a type=RUN collection:
SELECT dataset.uid, dataset.site
  FROM dataset
  JOIN dataset_object ON (dataset.uid = dataset_object.uid AND dataset.site = dataset_object.site)
  WHERE dataset.run_uid = ? AND dataset.site = ?
        AND dataset_object.skymap=? AND dataset_object.tract=? AND dataset_object.patch=?;

-- Data ID search for object (uniqueness=STANDARD) in a type=TAGGED collection:
SELECT dataset_object.uid, dataset_object.site
  FROM dataset_object
  JOIN dataset_collection_tagged ON (dataset_object.uid = dataset_collection_tagged.dataset_uid
                                     AND dataset_object.site = dataset_collection_tagged.dataset_site)
  WHERE dataset_collection_tagged.collection = ?
    AND dataset_object.skymap=? AND dataset_object.tract=? AND dataset_object.patch=?;

-- Temporal search for bias (uniqueness=NONSINGULAR) in a type=CALIBRATION collection:
SELECT dataset_bias.uid, dataset_bias.site
  FROM dataset_bias
  JOIN dataset_collection_calibration ON (dataset_bias.uid = dataset_collection_calibration.dataset_uid
                                          AND dataset_bias.site = dataset_collection_calibration.dataset_site)
  WHERE dataset_collection_calibration.collection = ?
    AND dataset_bias.instrument=? AND dataset_bias.detector=?
    AND dataset_collection_calibration.datetime_begin < ? AND dataset_collection_calibration.datetime_end > ?;
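
To make the hash convention for nonsingular DatasetTypes concrete, here is a hedged sketch; the uid values, collection name, and hash strings below are invented for illustration:

-- Two bias datasets with the same data ID (instrument, detector) can be
-- tagged into the same collection, because their dataset_ref_hash values
-- are derived from (uid, site) and therefore differ:
INSERT INTO dataset_collection_tagged
    (dataset_uid, dataset_site, collection, dataset_ref_hash)
  VALUES (101, 1, 'calib/tagged-example', 'hash(101, 1)'),
         (102, 1, 'calib/tagged-example', 'hash(102, 1)');
-- For a standard DatasetType the two hashes would be equal (same data ID),
-- and the UNIQUE ("dataset_ref_hash", "collection") constraint would reject
-- the second row.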

Dropping Hashes and Eliminating Joins

This design also meets all of the stated goals, and it makes the most common queries join-free, at the expense of more per-DatasetType tables.

Differences from the previous design include:

  • The dataset_ref_hash field has been removed from all tables.  This is a nice simplification for the Python code, as we no longer need code to compute the hash, and it also makes the behavior of the unique constraints more natural and obvious on the database side, as they no longer depend on a field whose meaning can change from row to row.
  • We now have two (for most DatasetTypes) or three (for master calibrations) per-DatasetType tables that duplicate the data ID fields; the additional tables replace the monolithic dataset_collection_* tables.  The total number of rows is conserved between these designs, and the average row width should be about the same (depending on how many data ID fields are used on average, and how many bytes dataset_ref_hash occupied).
  • The run_uid field is duplicated in the per-DatasetType tables, allowing the unique constraint for run collections to be defined there.  That also leaves us free to choose whether to define such a constraint for nonsingular datasets in run collections.


CREATE TABLE "dataset" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "dataset_type_name" varchar NOT NULL,
  "run_uid" integer NOT NULL,
  "quantum_uid" integer,
  FOREIGN KEY ("dataset_type_name") REFERENCES "dataset_type" ("dataset_type_name"),
  FOREIGN KEY ("run_uid", "site") REFERENCES "run" ("uid", "site")
);

CREATE TABLE "dataset_type" (
  "dataset_type_name" varchar PRIMARY KEY,
  "storage_class" varchar NOT NULL,
  "uniqueness" enum NOT NULL
);

CREATE TABLE "collection" (
  "name" varchar PRIMARY KEY,
  "type" enum NOT NULL,
  UNIQUE (name)
);

CREATE TABLE "run" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "collection" varchar NOT NULL,
  UNIQUE ("collection"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name")
);

CREATE TABLE "dataset_raw" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "run_uid" integer NOT NULL,
  "instrument" varchar NOT NULL,
  "exposure" integer NOT NULL,
  "detector" integer NOT NULL,
  "region" varchar,
  "datetime_begin" datetime,
  "datetime_end" datetime,
  "npix" integer,
  UNIQUE ("instrument", "exposure", "detector"),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site"),
  FOREIGN KEY ("run_uid", "site") REFERENCES "run" ("uid", "site")
);

CREATE TABLE "dataset_tag_raw" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "collection" varchar PRIMARY KEY,
  "instrument" varchar NOT NULL,
  "exposure" integer NOT NULL,
  "detector" integer NOT NULL,
  UNIQUE ("collection", "instrument", "exposure", "detector"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name"),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_object" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "run_uid" integer NOT NULL,
  "skymap" varchar NOT NULL,
  "tract" integer NOT NULL,
  "patch" integer NOT NULL,
  "nrows" integer,
  UNIQUE ("run_uid", "site", "skymap", "tract", "patch"),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site"),
  FOREIGN KEY ("run_uid", "site") REFERENCES "run" ("uid", "site")
);

CREATE TABLE "dataset_tag_object" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "collection" varchar PRIMARY KEY,
  "skymap" varchar NOT NULL,
  "tract" integer NOT NULL,
  "patch" integer NOT NULL,
  UNIQUE ("collection", "skymap", "tract", "patch"),
  FOREIGN KEY ("collection") REFERENCES "collection" ("name"),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_bias" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "run_uid" integer NOT NULL,
  "instrument" varchar NOT NULL,
  "detector" integer NOT NULL,
  "npix" integer,
  UNIQUE (run_uid, site, instrument, detector),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site"),
  FOREIGN KEY ("run_uid", "site") REFERENCES "run" ("uid", "site")
);

CREATE TABLE "dataset_tag_bias" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "collection" varchar NOT NULL,
  "instrument" varchar NOT NULL,
  "detector" integer NOT NULL,
  FOREIGN KEY ("collection") REFERENCES "collection" ("name"),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

CREATE TABLE "dataset_validity_bias" (
  "uid" integer PRIMARY KEY,
  "site" integer PRIMARY KEY,
  "collection" varchar PRIMARY KEY,
  "instrument" varchar NOT NULL,
  "detector" integer NOT NULL,
  "datetime_begin" datetime NOT NULL,
  "datetime_end" datetime NOT NULL,
  FOREIGN KEY ("collection") REFERENCES "collection" ("name"),
  FOREIGN KEY ("uid", "site") REFERENCES "dataset" ("uid", "site")
);

As promised, the common queries are now much simpler:

-- Data ID search for raw (uniqueness=GLOBAL) in any collection:
SELECT uid, site FROM dataset_raw WHERE instrument=? AND exposure=? AND detector=?;

-- Data ID search for object (uniqueness=STANDARD) in a type=RUN collection:
SELECT uid, site FROM dataset_object
  WHERE run_uid=? AND site=? AND skymap=? AND tract=? AND patch=?;

-- Data ID search for object (uniqueness=STANDARD) in a type=TAGGED collection:
SELECT uid, site FROM dataset_tag_object
  WHERE collection=? AND skymap=? AND tract=? AND patch=?;

-- Temporal search for bias (uniqueness=NONSINGULAR) in a type=CALIBRATION collection:
SELECT uid, site FROM dataset_validity_bias
  WHERE collection=? AND instrument=? AND detector=? AND datetime_begin < ? AND datetime_end > ?;
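
Association with a tagged collection in this design copies the data ID fields into the tag table; a sketch (the placeholders stand for the collection name and the dataset's uid and site):

-- Sketch: tag an existing object dataset.  The copied data ID fields let the
-- UNIQUE ("collection", "skymap", "tract", "patch") constraint enforce
-- within-collection uniqueness without any hash.
INSERT INTO dataset_tag_object (uid, site, collection, skymap, tract, patch)
  SELECT uid, site, ?, skymap, tract, patch
    FROM dataset_object
    WHERE uid = ? AND site = ?;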




3 Comments

  1. Unknown User (cs2018)

    We've been discussing the changes suggested on this page internally at NCSA and we have a couple of larger concerns.  These discussions have focused on trying to strike a balance between normalization and denormalization.

    Normalization will improve insert/update/delete performance.  What we've observed from previous surveys, and what we know about OLTP-like systems that require high performance on large sets of data, makes this a primary concern for us.  It also helps ensure there are no unintended inconsistencies in the data.

    Denormalization/data redundancy will likely make the schema easier to navigate for operators and end users that need to query the system.

    • We think that separating datasets into distinct tables by type is a good thing in general.  A single wide table storing all dataset types, where each row has a different combination of nullable and non-null columns, would lead to wasted space, unmanageable indexing issues, missing constraints, possible row-chaining, and SQL that doesn't perform (as we have already seen).  We are a little worried about what that means for the total number of tables to navigate, given the number of known dataset types and recognizing that there are more to come.  We think storing types that have most attributes in common in a single table, or finding other ways to limit the total number of dataset-type tables, is worth discussing.
    • Storing "collection" in the tagged dataset tables means duplication when datasets are part of more than one collection. We think a relationship table with the following structure better handles the many-to-many relationships between datasets and collections:


    dataset_collections
    ===================
    dataset_uid NUMBER
    dataset_site_id NUMBER
    collection_name VARCHAR

    That eliminates the need for dataset_tag_<type> tables.
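
    Expressed as DDL, that proposal might look like this (a sketch only; the types and names follow the pseudo-schema above):

    -- Sketch: one row per (dataset, collection) membership, with uniqueness
    -- of membership enforced by the primary key.
    CREATE TABLE "dataset_collections" (
      "dataset_uid" integer NOT NULL,
      "dataset_site_id" integer NOT NULL,
      "collection_name" varchar NOT NULL,
      PRIMARY KEY ("dataset_uid", "dataset_site_id", "collection_name"),
      FOREIGN KEY ("dataset_uid", "dataset_site_id") REFERENCES "dataset" ("uid", "site"),
      FOREIGN KEY ("collection_name") REFERENCES "collection" ("name")
    );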


    Those are the two biggest concerns we have. The following are other changes we think are a good idea:

    • Following up on slack conversation with Jim, perhaps change site_id to owner_id
    • Until we have a concrete reason to denormalize the validity tables, we'd like to only store the additional data (dataset_uid, owner_id, datetime_begin, datetime_end). Possibly we should discuss consolidating all of the validity tables into a single table with an additional dataset_type_id column.
    • Adding a description column (collection_desc) to the collection table can help keep names short but still allow someone to provide a longer description of the collection.
    • Not sure we understand the convention behind the uid instead of id.
    • Do collections and dataset_types need a site_id?  Will either of these be “copied” via an import or merge from a different “site”?
    • We'd like to see consistent use of surrogate number IDs for PKs (enum) for the collection and dataset_type tables.


    collection
    ==========
    collection_id NUMBER
    collection_name VARCHAR

    dataset_type
    ============
    dataset_type_id NUMBER
    dataset_type_name VARCHAR

    dataset_collections
    ===================
    dataset_uid NUMBER
    dataset_site_id NUMBER
    collection_id NUMBER
    [collection_site_id NUMBER -- depends on answer above]

    • For the dataset subtype tables, we recommend changing the column name uid -> dataset_uid to make the FK/PK naming convention consistent across tables.  (The uid is not being created specifically for those subtype tables.  It really is the dataset_uid.)
    • There are a number of VARCHAR columns that will be used to identify datasets.  We recommend picking a standard way of storing that text with one of all-caps/init-caps/lower.  That keeps SQL from necessarily using full table scans (resulting in poor performance) or additional function-based indexes (that will need to be maintained during DML).  There may be scientific reasons that this can't be done on certain VARCHAR columns (a lower-case value and an upper-case value are both valid and are different things).
    • As time goes on, we think there will be an increasing accumulation of run collections which are no longer wanted in the registry, waste space, and cause performance issues.  Thinking about a process to handle this is worthwhile.  “Campaign” tag collections, added after running and possibly removed based upon QA, would be more typical for production.  Finding datasets in a run (for provenance or cleanup) can be done with run_uid.  If desired, a tag collection could be added (and thus be removable).
  2. We think storing types that have most attributes in common in a single table, or finding other ways to limit the total number of dataset-type tables, is worth discussing.

    I certainly intend to design things such that we could switch to this mode with little disruption later, but I would like to start by splitting up datasets into separate tables and then looking at how often merging them would be possible.


    Storing "collection" in the tagged dataset tables means duplication when datasets are part of more than one collection. We think a relationship table with the following structure better handles the many-to-many relationships between datasets and collections...

    While simplifying the queries is nice, the big reason I was duplicating the data IDs in the tag tables (and hence needing different tag tables for different dataset types) was to have a place to define the UNIQUE constraints.  The alternatives are bringing back dataset_ref_hash and including it in the dataset_collection table, or not enforcing those constraints at the database level at all (which might be doable, but we'd have to take a closer look at many more higher-level operations to see whether they can maintain those constraints efficiently and/or gracefully handle them being violated).

    • Following up on slack conversation with Jim, perhaps change site_id to owner_id

    I like "origin_id" even better; does that work for you?

    • Until we have a concrete reason to denormalize the validity tables, we'd like to only store the additional data (dataset_uid, owner_id, datetime_begin, datetime_end). Possibly we should discuss consolidating all of the validity tables into a single table with an additional dataset_type_id column.


    • Adding a description column (collection_desc) to the collection table can help keep names short but still allow someone to provide a longer description of the collection.


    (thumbs up)  Unlike the tagged collections, the data ID columns for the validity tables are not needed for constraints, so I was adding them only for query simplification, and that means making it a single table (per calibration collection) is unproblematic.
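
    A hedged sketch of that consolidated table (assuming the surrogate dataset_type_id proposed above; the names are illustrative, not settled):

    -- Sketch: one validity table for all calibration DatasetTypes, with no
    -- duplicated data ID columns.
    CREATE TABLE "dataset_validity" (
      "dataset_uid" integer NOT NULL,
      "dataset_site" integer NOT NULL,
      "collection" varchar NOT NULL,
      "dataset_type_id" integer NOT NULL,
      "datetime_begin" datetime NOT NULL,
      "datetime_end" datetime NOT NULL,
      PRIMARY KEY ("dataset_uid", "dataset_site", "collection"),
      FOREIGN KEY ("dataset_uid", "dataset_site") REFERENCES "dataset" ("uid", "site"),
      FOREIGN KEY ("collection") REFERENCES "collection" ("name")
    );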

    • Not sure we understand the convention behind the uid instead of id.

    I'm trying to think of a naming scheme that distinguishes autoincrement IDs (especially dataset_id) from data IDs, which are of course groups of fields, but whose name we're stuck with for historical reasons.  "uid" was something that proved useful for that purpose in an in-person discussion at Princeton a couple of weeks ago, but I'm not in love with it and would happily adopt some other scheme.


    • Do collections and dataset_types need a site_id?  Will either of these be “copied” via an import or merge from a different “site”?


    They will be, but I think it's desirable for the collection names themselves to already be globally unique as much as possible, and I was imagining we'd have naming conventions to enforce that (e.g. including user names in many cases), with little sympathy for names that clash because the conventions were ignored.

    • We'd like to see consistent use of surrogate number IDs for PKs (enum) for the collection and dataset_type tables.

    I've always thought that was a possibility but didn't want to assume that would be better without evidence.  Sounds like that's a safe assumption (or at least better than assuming we shouldn't have them).


    For the dataset subtype tables, we recommend changing the column name uid -> dataset_uid to make the FK/PK naming convention consistent across tables.  (The uid is not being created specifically for those subtype tables.  It really is the dataset_uid.)


    (thumbs up)

    There are a number of VARCHAR columns that will be used to identify datasets.  We recommend picking a standard way of storing that text with one of all-caps/init-caps/lower.  That keeps SQL from necessarily using full table scans (resulting in poor performance) or additional function-based indexes (that will need to be maintained during DML).  There may be scientific reasons that this can't be done on certain VARCHAR columns (a lower-case value and an upper-case value are both valid and are different things).

    Is the implication here that VARCHAR comparisons (and indices) are case-insensitive?  I think in all other contexts where these names appear (e.g. Python) we assume case-sensitive string comparisons, so it's very unlikely we'd ever have different-capitalization versions of any particular name.  I'm not sure whether that's sufficient to resolve the problem you're concerned about; we do have some systems of names (e.g. dataset types) in which, for historical reasons, we have a mixture of case conventions (but again no cases where the same name appears with different casing).  If that is also a problem, I think we'll be able to switch to more consistent conventions eventually, but it will be much easier to do after Gen2 is retired.

    • As time goes on, we think there will be an increasing accumulation of run collections which are no longer wanted in the registry, waste space, and cause performance issues.  Thinking about a process to handle this is worthwhile.  “Campaign” tag collections, added after running and possibly removed based upon QA, would be more typical for production.  Finding datasets in a run (for provenance or cleanup) can be done with run_uid.  If desired, a tag collection could be added (and thus be removable).

    It was very much a goal of this design that "run collections" would be implemented without any records appearing in any tag table, so they would only ever be linked to dataset via the run_uid columns; think of run collections as handled by Python query-generation logic that, when the given collection is a run collection, produces a search on dataset.run_uid (or dataset_x.run_uid, maybe) instead of a search that uses dataset_collections.  So I think the bloat you're worried about accumulating simply won't exist (after this redesign), which is even better than being able to get rid of it later.
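
    For example, with the join-free design above, a lookup of object datasets in a run collection would be rewritten by that logic into something like (a sketch):

    -- Sketch: the run collection name is resolved once to (run_uid, site) via
    -- the run table; the search then never touches any tag table.
    SELECT uid, site FROM dataset_object WHERE run_uid = ? AND site = ?;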

  3. Yes, origin.  I just wrote the wrong one in the description.