1. Objective

This document summarizes known requirements for the planned Qserv Ingest System. The information presented here is based on a series of conversations with members of the Databases and Data Access team as well as other members of LSST DM. The document is linked from the following ticket:

  •   DM-17655

NOTE: the document may be updated in the future as more requirements are captured.

2. Requirements

2.1. The general architecture (the top-level model)

From the point of view of good architectural principles, it seems wise to separate the proposed Ingest system into the following layers:

  • user data specific layer (loading workflow): to be responsible for handling a variable number of input data formats (such as https://parquet.apache.org/, FITS, HDF5, etc.) and translating these data into TSV files suitable for ingesting into Qserv. The layer is planned to be (as an option) implemented using an existing workflow management system, such as Apache Airflow. The workflows will be defined to suit the specific needs of particular ingests (the pace at which the input data will be made available to the Ingest System, input data formats, available temporary storage for the input data and intermediate products, desired aggregate performance of ingests, reliability concerns for the underlying infrastructure, resource utilization concerns, etc.). It would also be a responsibility of the workflows to spatially partition the input data into chunks (see the LSST GitHub package partition and the Qserv documentation for further details); a minimal sketch of such a step is shown after this list. It's expected that this layer may have multiple implementations (or customizations) to support various (beyond LSST) catalogs and input data formats. It's also possible that Qserv users might build their own ways of preparing data before loading them into Qserv. The latter may include an option of not having any formal workflow and using the below defined Qserv Ingest layer directly.
  • user data agnostic layer (within Qserv): to be built into the Qserv Replication and Ingest system, and responsible for loading the input TSV files into the corresponding catalog- and chunk-specific tables within Qserv workers. It will also be the responsibility of this layer to ensure the integrity of the loaded data during the ingest.
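
To make the split more concrete, below is a minimal, purely illustrative sketch (in Python) of one step of the user data specific layer: converting a single Parquet input file into per-chunk TSV files. The chunk_id() function, the column names, and the use of pyarrow are assumptions made for this example only; in practice the spatial partitioning would be performed with the tools from the partition package.

```python
# Illustrative sketch only: one "user data specific" workflow step that
# converts a Parquet input file into per-chunk TSV files. chunk_id() is a
# placeholder for the real chunker from the LSST 'partition' package, and
# the column names 'ra' / 'decl' are assumptions for the example.
import csv
import pyarrow.parquet as pq

def chunk_id(ra: float, dec: float) -> int:
    """Placeholder for the real spatial chunker (10x10 degree boxes here)."""
    return int((dec + 90.0) // 10.0) * 36 + int(ra // 10.0)

def parquet_to_chunk_tsv(in_path: str, out_dir: str) -> None:
    table = pq.read_table(in_path)
    writers = {}
    for row in table.to_pylist():
        cid = chunk_id(row["ra"], row["decl"])
        if cid not in writers:
            f = open(f"{out_dir}/chunk_{cid}.tsv", "w", newline="")
            writers[cid] = (f, csv.writer(f, delimiter="\t"))
        writers[cid][1].writerow(row.values())
    for f, _ in writers.values():
        f.close()
```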

The rest of this document is going to focus mostly on the latter layer of the Ingest system.

2.1.1. Interactions between the layers

The loading workflows will use specially prepared binary application(s) (yet to be developed in the context of the present development effort) for feeding TSV files into the Qserv Ingest and Replication system. The system would also be extended to accept catalog and schema definitions, which are necessary for creating the corresponding databases and tables within the MySQL/MariaDB servers of Qserv. Specific details on what this API would look like are beyond the scope of the present document.
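
As a purely hypothetical illustration of what the workflow-side interaction could look like, the sketch below prepares a table (schema) definition and invokes a command-line tool to push one partitioned TSV file to a worker. The tool name qserv-ingest-file, the JSON layout, and all parameters are invented for this example; the document deliberately leaves the actual API unspecified.

```python
# Hypothetical sketch of a workflow-side call into the (yet to be designed)
# ingest API. The tool name, JSON layout, and parameters are assumptions.
import json
import subprocess

table_spec = {
    "database": "my_catalog",
    "table": "Object",
    "is_director": True,
    "schema": [
        {"name": "objectId", "type": "BIGINT NOT NULL"},
        {"name": "ra",       "type": "DOUBLE"},
        {"name": "decl",     "type": "DOUBLE"},
    ],
}

# Register the table definition with the Ingest System (format assumed).
with open("Object.json", "w") as f:
    json.dump(table_spec, f, indent=2)

# Hypothetical invocation: push one partitioned TSV chunk file to a worker.
subprocess.run(
    ["qserv-ingest-file",
     "--worker", "worker-01",
     "--database", "my_catalog",
     "--table", "Object",
     "--chunk", "5123",
     "chunk_5123.tsv"],
    check=True,
)
```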

2.2. Loading campaigns and super-transactions

It's desirable to split the process of ingesting large catalogs into well-defined campaigns, where each such campaign would be associated with a certain amount of unique input data to be translated and loaded into Qserv as a set of TSV files within some kind of super-transaction. The mechanism of the super-transaction would ensure the catalog's integrity within Qserv once the transaction is committed (or successfully ended). It's going to be up to the designers of the loading workflow to decide how much data, and which data, are going to be loaded within a particular super-transaction. A basic assumption here is that the amount of effort required to repeat the loading is to be balanced against the corresponding risk of failing to complete the super-transaction. Parameters of the load are probably going to be derived empirically or be based on some existing constraints (available storage, timing requirements, etc.). It is also possible to envisage an option of supporting multiple simultaneous campaigns (each associated with its own unique super-transactions) in flight for performance or other reasons.

The super-transaction is going to be a mechanism built into the Qserv Ingest and Replication System. Each such transaction will accommodate a consistent increment of data added into MySQL/MariaDB tables across all (or some, depending on the spatial distribution of the input data and the chunk/replica disposition within Qserv) workers. It's required that the very first step made by a loading workflow would be to ensure the Ingest system has started a new transaction before feeding the properly prepared and partitioned TSV files into the system. Transactions will be uniquely identified by a small unsigned integer number - the transaction identifier. The Ingest System will provide an API allowing the loading workflows to obtain the number of the most recently started transaction, or to get this number directly from the operation requesting the new transaction to begin. Once started, a transaction could be either committed or rolled back. Once committed, no further data could be loaded into Qserv in the context of that transaction. Transactions could be rolled back only while they're still open. Rolling back would cleanly remove from Qserv any partially ingested data loaded in the context of the failed transaction and put the Qserv catalogs back into the same state they were in before the beginning of the transaction. Transactions can't be reused.
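
The lifecycle described above could, for example, look as follows from a workflow's point of view. The IngestClient class and its methods are hypothetical stand-ins for whatever interface the Ingest System eventually exposes; only the begin / load / commit (or roll back on failure) sequence follows the text.

```python
# Hedged sketch of the super-transaction lifecycle as seen by a workflow.
# 'IngestClient' and its methods are hypothetical; only the lifecycle
# (begin -> load -> commit, or roll back on failure) is taken from the text.
class IngestClient:
    def begin_transaction(self, database: str) -> int:
        """Ask the Ingest System to start a new super-transaction and
        return its small unsigned integer identifier."""
        raise NotImplementedError

    def load_chunk(self, trans_id: int, table: str, chunk: int, tsv_path: str) -> None:
        """Feed one partitioned TSV file into Qserv under the transaction."""
        raise NotImplementedError

    def commit(self, trans_id: int) -> None: ...
    def rollback(self, trans_id: int) -> None: ...

def run_campaign(client: IngestClient, database: str,
                 files: list[tuple[str, int, str]]) -> None:
    trans_id = client.begin_transaction(database)
    try:
        for table, chunk, tsv_path in files:
            client.load_chunk(trans_id, table, chunk, tsv_path)
        client.commit(trans_id)    # once committed, no more data may be added
    except Exception:
        client.rollback(trans_id)  # cleanly removes any partially ingested data
        raise
```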

There is also a requirement to make the transaction identifiers visible when querying catalogs for the purposes of bookkeeping or data provenance. It's expected to be up to the workflow designers to associate the transaction identifiers with the corresponding input data loaded into Qserv during the campaigns.
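
For illustration only, assuming the transaction identifier were exposed as a regular column of the ingested tables (named here qserv_trans_id, an invented name), bookkeeping and provenance queries could look like the following:

```python
# Hypothetical provenance queries; the column name 'qserv_trans_id' and the
# table name are assumptions made for the example.
provenance_query = """
    SELECT objectId, qserv_trans_id
    FROM Object
    WHERE qserv_trans_id = 42
"""

bookkeeping_query = """
    SELECT qserv_trans_id, COUNT(*) AS num_rows
    FROM Object
    GROUP BY qserv_trans_id
"""
```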

Another requirement was to allow workflow designers to designate a data column (or columns) of the input data as the transaction identifiers. Though, this idea has the following problems:

  • it's not guaranteed that input data will have consistent values of the keys across all campaigns
  • the keys may not be consistent across different tables
  • the size of the key may be large, which may result in a noticeable overhead for narrow tables (such as LSST's ForcedSource)
  • and, finally - this may make the implementation of the super-transactions inside the Ingest System more fragile

2.3. Metadata management

The current design of Qserv requires that certain metadata on catalogs served by Qserv be retained at (or be available to) the czar node (or nodes - in the yet to be developed multi-master configuration) of the service. This includes (as of today):

  • catalog schemas
  • the so-called secondary index (for a quick lookup of chunks/sub-chunks based on the known values of the object ID in the director (LSST's Object) table)
  • partitioning parameters of the director (LSST's Object) and dependent tables (LSST's Source, ForcedSource, and the like)
  • the empty chunk list (to become just a chunk list)
  • scheduling parameters of the worker query processing engines

These resources are used by Qserv's czar component to rewrite the original queries, distribute them across the worker nodes, and optimize operations on the worker nodes.
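
As a rough, non-authoritative illustration of why the secondary index matters: a point query on the director table's object ID can be routed to a single chunk-specific table instead of being broadcast to every worker. The index layout and the table naming below are assumptions made for the example.

```python
# Illustration only (not Qserv's actual implementation): routing a point
# query using a secondary index that maps object IDs to chunks.
secondary_index = {  # objectId -> (chunkId, subChunkId)
    123456789: (5123, 17),
}

def route_point_query(object_id: int) -> str:
    chunk_id, _ = secondary_index[object_id]
    # The czar would rewrite the user's query against the chunk-specific
    # table held by the worker that owns this chunk, e.g.:
    return f"SELECT * FROM Object_{chunk_id} WHERE objectId = {object_id}"
```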

Additional requirements may be imposed by the (yet to be formulated) Qserv Security Model (TODO: a link is required here). As of today, little is known about these requirements. Though, one of those might come from an R&D effort to implement query validation and authorization checking in the form of a preflight (or speculative) execution of the original queries on czar before submitting the secondary (rewritten) queries down to the workers. It may require the new Ingest System to support operations on the MySQL accounts and authorization records on the catalogs' prototype tables in czar before, during or after ingesting catalogs into Qserv. More details on the above-mentioned development effort can be found at: DM-18569

2.4. Non-functional requirements (resources, etc.)

The inner (data-agnostic) layer of the Ingest system is expected not to use more than 10% (the number needs to be aligned with the LSST Sizing Model) of extra disk space on top of the final size of the fully replicated (at the desired replication level) catalog during the ingest. This extra space may be required for temporary data kept within the Ingest System during the ingest. The space requirements for the workflow layer are beyond the scope of this document.
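
A back-of-the-envelope example of this space budget, with arbitrary numbers chosen purely for illustration:

```python
# Illustrative arithmetic only; the catalog size and replication level are
# made-up numbers, and the 10% figure is the one quoted in the text.
catalog_size_tb = 100.0      # size of one copy of the final catalog
replication_level = 3        # desired number of replicas
fully_replicated_tb = catalog_size_tb * replication_level
max_temp_space_tb = 0.10 * fully_replicated_tb  # <= 10% extra during ingest
print(max_temp_space_tb)     # 30.0 TB of temporary space allowed
```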

An implementation of the Ingest system is expected to utilize computing resources (worker nodes, disks, networks) allocated to Qserv. Though, this is not strictly required.

It's also possible to support ingesting data into new catalogs while serving previously loaded catalogs within the same instance of Qserv. It is presently NOT a requirement to allow reading from catalogs which are being ingested. Though, it's possible to consider making the in-progress catalogs readable between super-transactions (when no loading is happening).

The ingest system should tolerate failed worker nodes up to a level determined by the requested replication level of the partially ingested catalogs. The minimum replication level will always be 2, which means no more than one worker node would be allowed to be offline during the ingest.

3. Related developments

Here is a summary of modifications to Qserv and its Replication system which would be needed to implement the new system:
