The steps listed here describe what happens under the hood; they do not necessarily describe steps that would be apparent to the user, and certainly do not describe steps that would normally be invoked separately by the user.

Step 1: Apply Algorithm Config Overrides

Apply traditional configuration overrides to a Pipeline, coming from the command-line or an obs_* package.

Inputs:

  • Pipeline
  • dictionary of {name: ConfigOverride}
  • camera (optional)

Outputs:

  • Pipeline

Implementation

Applying a dict of ConfigOverrides or a (name, ConfigOverride) pair should probably be a method on Pipeline itself.

A shared component should exist that augments a command-line argument parser with config-override options, yielding a {name: ConfigOverride} dictionary when arguments are parsed.

A shared component should exist that looks up and applies configuration overrides from the obs_* package associated with a provided camera.

Execution frameworks should always apply obs_* overrides before command-line overrides (assuming both are actually supported by that execution framework).
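
A rough sketch of how these pieces might fit together follows; every name here (addConfigOverrideArguments, applyAllOverrides, loadObsOverrides, Pipeline.applyConfigOverrides) is a hypothetical placeholder for interfaces that do not exist yet, not a proposal for the final API.

  def addConfigOverrideArguments(parser):
      # Hypothetical shared helper: augment a command-line argument parser with
      # config-override options; the parsed values would later be turned into a
      # {name: ConfigOverride} dictionary.
      parser.add_argument("-c", "--config", action="append", default=[],
                          metavar="LABEL.OPTION=VALUE")
      parser.add_argument("-C", "--config-file", action="append", default=[],
                          metavar="LABEL:PATH")
      return parser

  def applyAllOverrides(pipeline, cmdLineOverrides, camera=None):
      # Hypothetical shared helper enforcing the ordering above: obs_* overrides
      # (when a camera is known) are applied before command-line overrides.
      if camera is not None:
          obsOverrides = loadObsOverrides(camera, pipeline)    # assumed obs_* lookup hook
          pipeline.applyConfigOverrides(obsOverrides)          # assumed Pipeline method
      pipeline.applyConfigOverrides(cmdLineOverrides)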

Complications

Applying obs_* overrides is much simpler in Gen2, where each Data Repository only holds data from one camera.  That is no longer the case in Gen3, so a data expression that yields Quanta for multiple cameras doesn't have a single obs_* package from which overrides should be loaded.

The simplest solution to this problem is to just limit the capabilities of Gen3, by requiring any multi-Camera QuantumGraphs to not use obs_* overrides.  That would in practice require those graphs to be split up and run separately, which would be inconvenient for users.

I don't have a complete solution for this in mind, but because this is new functionality beyond what Gen2/CmdLineTask could do, I suggest we do not try to solve this problem until we've reached parity with Gen2/CmdLineTask functionality.  The best idea I've come up with so far involves somehow restricting some SuperTasks in a Pipeline to certain data IDs (e.g. those with a particular camera or filter), essentially allowing one sub-Pipeline for some data IDs and other sub-Pipelines for others.  I have no idea how difficult this would be to implement in Preflight, and that's why I'd like to put it off for now.

Step 2: QuantumGraph Generation

Inputs:

  • Pipeline (read-only)
  • Registry (read-only - unless temporary tables are needed?)
  • User-provided data expression and/or custom SQL statements

Outputs:

  • QuantumGraph

Implementation:

Should be a shared Python component that is unencumbered by the needs of any other steps (e.g. command-line parsing, configuration handling).  All execution frameworks are expected to use the same code to generate QuantumGraphs.
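
A hedged sketch of what that shared entry point might look like; the function name and the Registry/QuantumGraph methods used here are assumptions, not existing interfaces.

  def makeQuantumGraph(pipeline, registry, userQuery=None, sql=None):
      # Hypothetical shared component used identically by every execution
      # framework; Registry.selectDataIds, Quantum, and QuantumGraph.connect are
      # assumed interfaces for the purpose of this sketch.
      graph = QuantumGraph()
      for taskDef in pipeline:          # assumed: a Pipeline iterates over its SuperTasks
          for dataId in registry.selectDataIds(taskDef, userQuery, sql):
              graph.add(Quantum(taskDef, dataId))
      graph.connect()                   # link Quanta through shared input/output Datasets
      return graph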

Step 3: Apply Resource Configuration

"Resource configuration" is configuration that should not affect the scientific content of any final data products.  This includes (but is not limited to):

  • how many cores are assigned to each Quantum of a particular SuperTask;
  • whether intermediate/diagnostic datasets are retrieved from worker nodes, or even persisted at all;
  • what file format should be used for certain dataset types;
  • options specific to particular SuperTasks (e.g. subregion size in AssembleCoadd).

Resource configuration options should not be included in a SuperTask's ConfigClass; it must be possible to construct a SuperTask before resource configuration is applied.  In any case, a relatively small fraction of resource configuration options will apply to specific SuperTasks.

Some resource configuration is Butler configuration, especially for worker-node Butlers - note that while QuantumGraph generation requires a read-only Registry, a full Butler need not be constructed until after Apply Resource Configuration.

Given the above, I think it probably makes sense to use YAML instead of pex_config for all resource configuration, and to define a new mechanism to pass SuperTask-specific resource configuration values to those SuperTasks (perhaps kwargs passed to SuperTask.runQuantum and SuperTask.run).

Inputs:

  • QuantumGraph
  • Resource Configuration (YAML)

Outputs:

  • execution framework dependent: includes Butler configuration, options for specific SuperTasks, and options entirely specific to the execution framework, in whatever form is most useful for that execution framework.

Implementation:

I don't see any obvious way yet to share code between different execution frameworks for this step.

At least some of the structure of a resource configuration YAML file will be specific to the execution framework, but we should probably try to design a common configuration structure for components and ideally some top-level parts of the configuration tree that may be common (if not required) across different execution frameworks.  This should include:

  • Butler configuration for execution environment (includes file formats, dataset types for which I/O should be elided, etc)
  • SuperTask-specific resource configuration options (relevant for all execution frameworks, I think); a sketch of both pieces follows this list
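
As a hedged illustration of both points, the sketch below shows one possible (entirely hypothetical) YAML layout and how an execution framework might forward the SuperTask-specific snippet to runQuantum as keyword arguments; none of the keys, labels, or attribute names here are settled.

  import yaml

  # Hypothetical resource-configuration layout; every key and label is illustrative.
  RESOURCE_CONFIG = yaml.safe_load("""
  butler:
    formats:
      Exposure: fits            # file format choice for the execution environment
    elide:
      - deblendedSources        # intermediates whose I/O should be skipped
  supertasks:
    assembleCoadd:
      cores: 4
      subregionSize: 2000
  """)

  def executeQuantum(task, quantum, butler, resourceConfig):
      # Sketch of the proposed kwargs hand-off: pull out the snippet for this
      # SuperTask (task.label is an assumed attribute) and pass it through.
      kwargs = resourceConfig.get("supertasks", {}).get(task.label, {})
      task.runQuantum(quantum, butler, **kwargs)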

Step 4: Global Initial I/O

Prepare input and output Collections and Runs.  Write software versions, compute environment information, the Pipeline itself (together called "Launch Provenance" for now), etc. to a global (i.e. "not worker scratch") data repository.

Inputs:

  • QuantumGraph
  • Launch Provenance (includes the Pipeline)
  • (global) Butler configuration

Outputs:

  • Butler instance (optional; not needed after this step if execution will write to scratch Butlers)
  • (in data repository) a new or existing Run to be associated with all outputs
  • (in data repository) any new or existing custom Execution records to be associated with the Run by this particular execution framework.
  • (in data repository) a new or existing Collection to be associated with all outputs, now associated with all pre-existing inputs
  • (in data repository) persisted Datasets recording all Launch Provenance (some associated with Runs, some associated with custom Execution records)

Implementation:

While this step isn't the same for all execution frameworks, there is a lot in common, and it would be highly desirable to have some shared code here.

At present, we need a Run in order to create a Butler that can write, but we also need to be able to write to a data repository to be able to create a Run and save the Launch Provenance datasets already associated with it.  I think this means we need a shared high-level component for simultaneously creating a Run, associating it with a Collection, writing the Launch Provenance datasets associated with it (see Run.environment_id and Run.pipeline_id in DMTN-073), and creating a Butler with that Run and Collection.  Ideally, that component would be extensible to allow custom Executions and Launch Provenance Datasets to be created at the same time.
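
A hedged sketch of that shared component follows; every method and dataset type name here (makeRun, the Butler constructor arguments, the "pipeline"/"environment" dataset types) is an assumption about APIs that are still being designed.

  def setupOutputRun(registry, collection, pipeline, environment, run=None):
      # Hypothetical shared component: create (or reuse) a Run, associate it with
      # the output Collection, write the Launch Provenance datasets referenced by
      # Run.pipeline_id / Run.environment_id (DMTN-073), and return a Butler able
      # to write with that Run and Collection.
      if run is None:
          run = registry.makeRun(collection)                               # assumed Registry method
      butler = Butler(registry=registry, run=run, collection=collection)   # assumed constructor
      butler.put(pipeline, "pipeline")                                     # assumed dataset type names
      butler.put(environment, "environment")
      return butler, run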

Code to walk a QuantumGraph and add all input Datasets to the Collection that will be used for output should also be a shared component (maybe the same one, maybe not).  Note that it's desirable that the output Collection contain not just the outputs of the Pipeline and its direct inputs, but all indirect inputs going all the way back to raw data.  I think that means using the recorded Quantum provenance for the direct inputs to find these and associate them as well.
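
A hedged sketch of that walk, assuming per-Quantum input lists and a Registry method for looking up the Quantum that produced a Dataset (neither the attribute nor the method names are settled):

  def associateAllInputs(graph, registry, collection):
      # Add every direct input of the QuantumGraph to the output Collection, then
      # follow recorded Quantum provenance back toward raw data so indirect inputs
      # are associated as well.
      stack = [ref for quantum in graph for ref in quantum.inputs]    # assumed attributes
      seen = set()
      while stack:
          ref = stack.pop()
          if ref in seen:
              continue
          seen.add(ref)
          registry.associate(collection, [ref])                       # assumed Registry method
          producer = registry.getProducingQuantum(ref)                # assumed provenance lookup
          if producer is not None:
              stack.extend(producer.inputs)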

Step 5: Quantum Execution

Run all Quanta.  Stage data to and from workers as needed.

Inputs:

  • QuantumGraph
  • (worker) Butler configuration
  • results of step 3 (execution framework dependent)

Outputs:

  • (in data repository) output datasets, including any desired intermediates
  • (in data repository) QuantumGraph-level provenance

Implementation:

While different execution frameworks will do completely different things here, I think two shared components jump out as being generally useful:

  • A QuantumGraph-traversal helper that maintains a list of completed Quanta and can return a sequence of "ready" Quanta (a minimal sketch follows this list).
  • A helper class that uses Python multiprocessing to execute a QuantumGraph on a single node, suitable for use by a fully-featured "laptop" command-line execution framework and multi-node execution frameworks that want to use it to execute per-node subsets of a larger QuantumGraph.  Note that this means it must not do any of the previous steps described on this page (as those may be done differently by the higher-level execution framework), and it should expect to operate on snippets of the resource configuration, not the whole tree.
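
A minimal sketch of the traversal helper, assuming each Quantum exposes its input and output Dataset references (the class name and attribute names are invented):

  class QuantumGraphIterator:
      # Sketch only: tracks completed Quanta and reports which pending Quanta have
      # all of their inputs available.

      def __init__(self, graph, existingDatasets=()):
          self._pending = set(graph)                 # assumed: iterating a QuantumGraph yields Quanta
          self._available = set(existingDatasets)    # Dataset refs already present in the repository

      def markComplete(self, quantum):
          # Outputs of a successful Quantum become available to its consumers.
          self._available.update(quantum.outputs)    # assumed attribute

      def ready(self):
          # Hand out every pending Quantum whose inputs are all available; Quanta
          # downstream of a failed Quantum simply never become ready.
          out = [q for q in self._pending
                 if all(ref in self._available for ref in q.inputs)]   # assumed attribute
          self._pending.difference_update(out)
          return out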



12 Comments

  1. In our earlier meetings on Resource Configuration I think the idea was that resources will be a sort of two-way communication between SuperTasks and the execution framework. For example, some supertask may want to tell the framework:

    • I need an absolute minimum of 2GB of memory and 4 cores to run
    • But optimal resources for me are 8GB and 16 cores

    The framework would check that and, based on what is actually available, can tell the supertask:

    • You can use max of 6GB of memory and 10 cores.

    I guess that sort of implies that resource config should be defined by the supertask very much like a standard Config; it makes sense to keep that default config close to the task itself so that developers can update/extend it when needed.

    I'm a bit worried about multitudes of file formats (pex_config vs YAML) used for data which is sort of logically very similar in meaning. I'd probably prefer pex_config to YAML just because it is already there.

    1. Note that there is also a lot of Resource Configuration (e.g. Butler configuration for file formats) that is not connected to any SuperTask.

      I do agree that it makes sense to define the Resource Configuration that is associated with a SuperTask close to the SuperTask, and that that means pex_config would be a good way to do it.  However, I would like to keep that Config object distinct from the algorithmic Config object (not even nested).  I also think that most SuperTasks will have no Resource Configuration options, or perhaps just use a common Config that provides common options for all SuperTasks.

      Finally, I would ultimately like to have pex_config support YAML as an I/O format, so configuration options can still be defined using the existing Python pex_config syntax, but config files can be saved and nested more flexibly with YAML. That's how I envision saving an entire Pipeline to a single human-readable file in the future, for instance.
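
      A hedged illustration of keeping such a resource Config distinct (the class and field names here are invented for the sketch; only the pex_config usage itself is existing syntax):

        from lsst.pex.config import Config, Field

        class AssembleCoaddResourceConfig(Config):
            # Hypothetical resource Config defined next to the SuperTask but kept
            # separate from (and never nested inside) its algorithmic ConfigClass.
            cores = Field(dtype=int, default=1,
                          doc="number of cores to request per Quantum")
            subregionSize = Field(dtype=int, default=2000,
                                  doc="subregion size used to bound peak memory")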

  2. Andy Salnikov, when does the SuperTask know what its minimums and optimal values are?    Does SuperTask require pre-flight because we expect these values to be dependent upon the number of inputs?    Are the values dependent upon whether running in single pipeline (get 1 done fastest way possible) vs many/campaign (get many done where throughput matters over single done fastest)?     And what about number of cores dependent upon the type of machine being used (throw more cores at slower machine)?

    On the production side, it is expected that these resource requirements will always be set outside of the SuperTask either by a person or using information from prior runs.

    1. My guess is that some of those resource numbers could be guessed or "negotiated" with task authors; I can imagine something like the number of threads could be just a policy setting. But other numbers can be hard to guess right and they certainly depend on the size of the problem. But they could be measured rather than predicted, i.e. before a full campaign we could run a few typical jobs on typical data and get estimates from there.

      I'm not sure what "outside" means in a production case, and I think we agreed a long time ago that production indeed needs to be able to override resource requirements. But when production does that, I guess it still needs to communicate its decision back to the supertask.


  3. We consider that there may be at least 3 types of pipeline activators:   

    • Desktop: minimal/no workload system (e.g., no HTCondor), Butler repository exists locally, etc

    • Multi-Node: workload system (e.g., HTCondor), set of compute nodes, may or may not have the central dataset repository local to compute nodes, etc. (replacement for ctrl_pool)

    • Production: (which may turn out to be usable for “Multi-Node” with some features turned off): workload system (e.g., HTCondor), set of compute nodes, central dataset repository not initially local to compute nodes, etc.


    The Production use case and requirements are being fleshed out.    Do these exist for the other 2 types of pipeline activators?

    It would be best to flesh these all out before figuring out what code can be shared.    We are working through those for Production, focusing on pieces that may need to be done between SuperTask executions.

    One of the items we are wondering about is what state the repository is in after a SuperTask fails in order to understand how to bring information back for debugging purposes as well as how to handle automatic retries (in a limited set of cases).   Just to list a few different types of SuperTask (and Butler) failures at runtime:

    • Incomplete/inconsistent Butler Registry

    • Couldn’t read input files

    • Science error (not sure if different types of science errors look different to execution framework)

    • Could not write (all) output files

    • Problems updating Butler Registry

      • Can’t write sqlite file to disk

      • Missing metadata or provenance?

    • SuperTask killed by external process

    Is there planned behavior in the above cases (and any more not listed) around which we can design the execution framework?    We are currently planning on having to isolate each SuperTask execution from each other with regards to working repository in order to enable clean SuperTask retries.    We haven’t worked through yet how to cherry-pick good/complete information from the repository when bringing home for debugging (i.e., good enough values can be ingested into the Operations Data Backbone).

  4. The Production use case and requirements are being fleshed out.    Do these exist for the other 2 types of pipeline activators?

    I was not planning to write anything separate from the SuperTask WG requirements.


    Incomplete/inconsistent Butler Registry

    I assume this only happens when some previous (possibly internal) error is not recognized by the Butler or the execution framework?  In any case, I imagine it would manifest the same way as...

    Couldn’t read input files

    Could not write (all) output files

    Can’t write sqlite file to disk

    ...which I imagine would result in Python exceptions propagating up from the Butler (through the Tasks) to the execution framework.  I would expect the execution framework to then prune any dependent vertices from the QuantumGraph but continue to run any vertices that are not dependent on the failed Quantum.

    Science error (not sure if different types of science errors look different to execution framework)

    Some of these will also be Python exceptions that propagate up from the Tasks to the execution layer.  As we discussed back in the SuperTask WG, those that represent known failure modes that do not necessarily invalidate subsequent steps (e.g. "this visit doesn't actually have any good pixels that overlap this patch") will be represented by special output datasets that will be recognized by downstream SuperTasks, so they won't look any different to a file-based execution system like Pegasus.

    Missing metadata or provenance?

    This sounds like an exception that would be raised by the execution system or the butler - in any case, common code - that should bring down the execution of this Quantum and prune any dependent nodes from the QuantumGraph.

    SuperTask killed by external process

    I would also expect this to be an exception caught and handled by the execution framework by pruning dependencies from the QuantumGraph.  All that SuperTasks should be responsible for here is not trying to trap and recover from external signals internally.

  5. The Production use case and requirements are being fleshed out. Do these exist for the other 2 types of pipeline activators?

    I was not planning to write anything separate from the SuperTask WG requirements.

    I might have missed it, but I could not find descriptions in LDM-556 about what the Supervisory Framework should do with SuperTask failures or any other behavior other than executing each Quantum, etc.

    re: SuperTask failure discussion...

    The part of my question that didn't get answered was what the state of the repository is when a SuperTask fails. If a SuperTask fails, will we ever get any outputs other than logging? Can a SuperTask contain multiple subtasks which output their own datasets or is a SuperTask all or nothing?

    Should the Supervisory Framework/Pipeline make a repository for each SuperTask to sandbox failures from making bad state in the central repository?
    * Are we guaranteed that there will never be any data discovery inside SuperTask's runQuantum method? (How much does the activator have to guarantee the Butler Registry and DataStore have exactly the same data in order to guarantee reproducibility)

    ...which I imagine would result in Python exceptions propagating up from the Butler (through the Tasks) to the execution framework. I would expect the execution framework to then prune any dependent vertices from the QuantumGraph but continue to run any vertices that are not dependent on the failed Quantum.

    We haven't discussed any runtime pruning of the QuantumGraph, only pruning at pre-flight time.
    * Is SuperTask failure all or nothing (if SuperTask normally produces 2 distinct datasets and 1 is produced successfully, is there a way to tell and if so is it pruning based upon the output datasets individually)?
    * For the pre-flight pruning, we talked about SuperTasks which could take fewer inputs. When pruning during runtime, is the pruning based upon "should have been input", or, as with the pre-flight pruning, is the Quantum run as long as it has enough inputs?


    Production wants the ability to turn on/off automatic retries on a different machine for certain failure cases to work around machine-specific problems. These failure cases are either distinct from pruning cases, or the different machine would be tried first before pruning (I can't come up with a use case where pruning would be attempted before a different machine for the same SuperTask failure.)

  6. I might have missed it, but I could not find descriptions in LDM-556 about what the Supervisory Framework should do with SuperTask failures or any other behavior other than executing each Quantum, etc.

    While I think we discussed failure handling in the WG, I do believe this confluence page may be the first place we're writing anything on that subject down.  I think that reflects that we don't actually have high-level requirements for this; we just need to come up with a sensible design.

    The part of my question that didn't get answered was what the state of the repository is when a SuperTask fails. If a SuperTask fails, will we ever get any outputs other than logging?

    In general, yes, SuperTask failures may still result in some outputs being written.  In practice, unless the error is I/O related, it will be quite rare for any outputs to be written by a failing SuperTask, because most SuperTasks will do all of their output after all of the algorithmic code has been run.  This is not guaranteed, however.

    While they are not an on-disk output, the Python exception raised in the course of the vast majority of SuperTask failures should provide a much more precise and reliable indication of what went wrong than the logs; I would certainly expect any supervisory framework to catch those exceptions (and probably at least translate them into log messages).

    Can a SuperTask contain multiple subtasks which output their own datasets or is a SuperTask all or nothing?

    All I/O should be done by the SuperTask itself, but there is no guarantee it will do this at only one point in the code.

    Should the Supervisory Framework/Pipeline make a repository for each SuperTask to sandbox failures from making bad state in the central repository?

    This should not be necessary, as it's the Butler's responsibility to make sure (via transactions, atomic operations, etc) that it is impossible for callers to put a repository in a bad state.  Perhaps paranoia on this front is merited until we have demonstrated that we can meet those guarantees, but I think I'd characterize per-SuperTask or per-Quantum repositories as something we'd only want to consider as a fallback if we have trouble with corruption that we can't fix on the Butler side.

    Are we guaranteed that there will never be any data discovery inside SuperTask's runQuantum method? (How much does the activator have to guarantee the Butler Registry and DataStore have exactly the same data in order to guarantee reproducibility)

    runQuantum may only access the Datasets contained in the Quantum it is given (or a subset of these).  And it can only call a very limited set of Butler methods on these (get, put, datasetExists, markInputUsed, and the TBD method to create a virtual deferred composite are the only ones I can think of), and those don't involve any data discovery.

    I believe it would be best to support this by having a Registry method that identifies a Data Repository subset from a QuantumGraph; a supervisory framework could then pass the subset of the QuantumGraph (which is itself a QuantumGraph) to be run against a particular staging repository to the Registry and learn what it needs to transfer in terms of Datastore and Registry content.

    We haven't discussed any runtime pruning of the QuantumGraph, only pruning at pre-flight time.

    Maybe "pruning" was the wrong term.  Here's an example:

    1. We run a ProcessCcd-JointCal-MakeWarp Pipeline on all of the Visits that overlap 10 Tracts.  ProcessCcd fails on one Visit for unknown reasons.
    2. The next SuperTask, JointCal, requires all Visits in a Tract to have succeeded.
    3. We want to still run JointCal and MakeWarp on any Quantums that did not use the failed Visit as input (i.e. those corresponding to Tracts that did not overlap the bad Visit)
    4. We do not want to run JointCal and MakeWarp on any Quantums that did use the failed Visit as input.  I was calling these JointCal and MakeWarp Quantums "pruned".

    * Is SuperTask failure all or nothing (if SuperTask normally produces 2 distinct datasets and 1 is produced successfully, is there a way to tell and if so is it pruning based upon the output datasets individually)?

    I think the case where some output Datasets of a SuperTask have been written successfully while others have not will be quite rare, so I don't have a terribly strong opinion.  However, an implementation that just uses the individual output datasets to determine what to prune seems much, much cleaner (because QuantumGraphs only have Dataset→Quantum and Quantum→Dataset edges, not Quantum→Quantum edges).
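
    A hedged sketch of pruning on individual output Datasets, assuming the QuantumGraph can report the consumers of a Dataset (the function, method, and attribute names are invented):

      def pruneAfterFailure(graph, failedQuantum, writtenRefs):
          # Datasets the failed Quantum should have written but did not will never
          # exist; walk the Dataset→Quantum edges to find every Quantum that can
          # therefore never run.
          missing = set(failedQuantum.outputs) - set(writtenRefs)   # assumed attribute
          pruned = set()
          while missing:
              ref = missing.pop()
              for consumer in graph.consumersOf(ref):               # assumed Dataset→Quantum lookup
                  if consumer not in pruned:
                      pruned.add(consumer)
                      missing.update(consumer.outputs)              # these will never exist either
          return pruned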

    * For the pre-flight pruning, we talked about SuperTasks which could take fewer inputs. When pruning during runtime, is the pruning based upon "should have been input", or, as with the pre-flight pruning, is the Quantum run as long as it has enough inputs?

    I'm going to need some examples to understand what you mean here.  My recollection of preflight pruning is that it's an entirely different operation (and really an implementation detail of Preflight), but my memory is fuzzy.

    For SuperTasks that can take fewer inputs, my understanding was that the possibility of a Pegasus implementation imposed a very hard constraint on how we do this, essentially demanding that we write dummy files for the not-to-be-used inputs.  As a result, my plan was for the SuperTasks that are responsible for those optional inputs to write dummy Datasets as well that the downstream SuperTask can ignore internally.  That would essentially hide this behavior entirely from the supervisory framework.  I'm not sure that's the most natural approach for SuperTask developers, but I think it's something we could live with.

  7. Will SuperTasks need to output QC values to be saved/brought home even if the SuperTasks fail?    Is it still planned for the subtasks to output boost files?   If so, would they be written in any types of failure cases (and are they tracked by Butler)?

    All I/O should be done by the SuperTask itself, but there is no guarantee it will do this at only one point in the code.

    This should not be necessary, as it's the Butler's responsibility to make sure (via transactions, atomic operations, etc) that it is impossible for callers to put a repository in a bad state.

    The atomic operations are at the put level of a single dataset or the whole SuperTask?   For example, is the following allowed inside a single SuperTask:

    a = butler.get(A)
    b = subtask1.run(a)      # uses A, produces B
    butler.put(b, B)
    b = butler.get(B)        # (I'm not sure about this get)
    c = subtask2.run(b)      # uses B, produces C
    butler.put(c, C)


    Assuming the above is valid, if Subtask 2 has a science issue, will B exist in the Butler Registry and DataStore after the SuperTask stops running?


    The pruning example and discussion helps.   Still thinking it through.

    1. Will SuperTasks need to output QC values to be saved/brought home even if the SuperTasks fail?

      Probably.  Design for this is still TBD (I'm hoping the QAWG will at least say something about it).  I think the main question in this context is whether QA datasets are written by the SuperTask directly (which would make them indistinguishable from any other datasets written on failure, but make writing on failure more common), or written by the supervisory framework after being passed out of the SuperTask via some other mechanism.

      Is it still planned for the subtasks to output boost files?

      No.  As per RFC-482, Boost persistence will never be written by the Gen3 Butler.


      The atomic operations are at the put level of a single dataset or the whole SuperTask?

      I believe we will be supporting both.  Single dataset put operations are always atomic, but SuperTasks are permitted to wrap multiple puts in a single transaction.  I don't know whether we'll typically use that functionality or not; writing one dataset and not the others is not data repository corruption, and partial writes will probably always be useful for diagnostic purposes.


      Your example is valid (but would probably be rare), and B will exist in the Butler Registry and Datastore after the SuperTask stops running.  The get(B) call is permitted but I can't think of a reason right now why a SuperTask would make that call in practice.

  8. Do I understand RFC-482 correctly:   The subtask metadata will still be written.   It will just not be in boost format, but in a yaml file using normal Butler mechanisms.   Right?

    I agree partial writes will be useful for diagnostic purposes.    And it is not "data repository corruption".   However if we want to (automatically) rerun the SuperTask, don't we need to remove that output information and files from the repository (or start with a saved copy of the repository prior to the failed SuperTask)?

  9. Do I understand RFC-482 correctly:   The subtask metadata will still be written.   It will just not be in boost format, but in a yaml file using normal Butler mechanisms.   Right?

    Correct.  I would also like to move the actual writing of these from the SuperTask to the supervisory framework as well, since it's just boilerplate to have each SuperTask do exactly the same thing, and that could also give the supervisory framework flexibility in what it actually does with these.

    I agree partial writes will be useful for diagnostic purposes.    And it is not "data repository corruption".   However if we want to (automatically) rerun the SuperTask, don't we need to remove that output information and files from the repository (or start with a saved copy of the repository prior to the failed SuperTask)?

    Good point.  I was (perhaps naively) thinking that retries would always use a new Run, and hence there would be no conflict and no need to remove the old outputs, but I'm not sure that's desirable for automatic retries.  Using multiple Runs somehow (perhaps by moving orphaned partial outputs to a special diagnostic Run) seems like it would be better than just removing things, but we didn't previously have a use case for changing the Run of an existing Dataset, so I'm not sure if that would cause any trouble elsewhere.  Things might be simpler if we only support automatic retries when using local scratch output repos (then we could change the Run when bringing the orphaned outputs home), but that may be too limiting.