Andy Salnikov and I talked about the representation of the data collected by the common Activator code from all the define_quanta() calls during pre-flight.

We started with a work plan for the Activator for a particular campaign, roughly like this (omitting many preceding and following steps):

We then considered what to do with the results returned.

Imagine two classes, Dataset and Job.  (We'll probably need different names; Dataset is already taken, and Job leads to confusion with "batch job", one of which may be associated with many "Jobs".)

A Dataset object represents just that, a single dataset, defined by a fully-specified DataId and a Butler dataset type (and implicitly mapped to a concrete external artifact by the Repository configuration in force).

It has two additional attributes: 

A Job object represents a single application of run_quantum() for a single SuperTask in the Pipeline.  It has three relevant attributes:

In addition, the data structures produced by the common Activator code should then include:

We don't currently propose that the graph explicitly include Job-to-Job dependency links; these can be derived by traversing the Job-to-inputs-to-producer links.
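As a concreteness aid only (not a design), the two node types and the derived Job-to-Job dependencies might be sketched in Python roughly as follows.  The class names DatasetNode and JobNode are placeholders (per the naming caveat above), and the attribute names (data_id, dataset_type, producer, task_name, inputs, outputs) are assumptions, since the attribute lists above are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DatasetNode:
    """A single dataset: a fully-specified DataId plus a Butler dataset type."""
    data_id: dict                             # fully-specified DataId
    dataset_type: str                         # Butler dataset type name
    producer: Optional["JobNode"] = None      # the Job that produces this dataset, if any


@dataclass(eq=False)  # identity-based hashing so JobNodes can live in sets
class JobNode:
    """One application of run_quantum() for a single SuperTask in the Pipeline."""
    task_name: str                                             # which SuperTask this Job runs
    inputs: List[DatasetNode] = field(default_factory=list)    # datasets consumed
    outputs: List[DatasetNode] = field(default_factory=list)   # datasets produced


def job_dependencies(job: JobNode) -> Set[JobNode]:
    """Derive Job-to-Job dependencies by traversing Job -> inputs -> producer,
    rather than storing explicit Job-to-Job edges in the graph."""
    return {ds.producer for ds in job.inputs if ds.producer is not None}
```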

This is a concept, not yet a Python design.  It makes sense to look at existing workflow-management packages to see whether their data models can accommodate it.  This is especially relevant because we want to be able to persist the graph and reconstruct it, so that the Pre-flight phase can be separated from the "submit units of work" phase.
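A minimal illustration of that separation, assuming (purely for the sketch) that pickling the node objects is an acceptable persistence mechanism; the real format is TBD and might instead come from an adopted workflow-management package:

```python
import pickle


def save_execution_graph(jobs, path):
    """Persist the execution-plan graph at the end of Pre-flight."""
    with open(path, "wb") as f:
        pickle.dump(jobs, f)


def load_execution_graph(path):
    """Reconstruct the graph later, in a separate 'submit units of work' phase."""
    with open(path, "rb") as f:
        return pickle.load(f)
```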

The generation of this graph is done by common code shared by all Activators.

The intent is for this graph then to be usable by all concrete Activators (e.g., the Level 2 / DRP production system) as the input for their construction of a concrete execution plan.  It is at that stage, then, that a concrete Activator could determine, e.g., that it was going to wrap up 100 CCDs' worth of ISR-SuperTask "Jobs" into a single batch job.  The concrete Activator would then be responsible for building the resulting batch-job-level aggregate DAG.
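A rough sketch of what that aggregation could look like, assuming the JobNode/DatasetNode shapes from the earlier sketch and an illustrative fixed-size chunking policy (the real grouping rules would be Activator-specific):

```python
from collections import defaultdict


def chunk_jobs(jobs, chunk_size=100):
    """Bundle fine-grained Jobs into batch jobs, e.g. 100 CCDs' worth of
    ISR-SuperTask Jobs per batch job.  The fixed-size policy is illustrative only."""
    return [jobs[i:i + chunk_size] for i in range(0, len(jobs), chunk_size)]


def batch_level_dag(batches):
    """Build the batch-job-level aggregate DAG: batch A depends on batch B when
    some Job in A consumes a dataset whose producer Job belongs to B."""
    owner = {id(job): idx for idx, batch in enumerate(batches) for job in batch}
    edges = defaultdict(set)   # batch index -> set of batch indices it depends on
    for idx, batch in enumerate(batches):
        for job in batch:
            for ds in job.inputs:
                if ds.producer is None:
                    continue               # raw/external input; no upstream Job
                dep_idx = owner.get(id(ds.producer))
                if dep_idx is not None and dep_idx != idx:
                    edges[idx].add(dep_idx)
    return edges
```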

Concrete Activators might not end up using the persisted form of the common execution-plan graph, but might instead restate the graph, after some post-processing, in an Activator-specific form and persist it that way.  This is TBD.  It is not a requirement that the output of one Activator's Pre-flight phase be usable by a different concrete Activator's Run phase!

The MVP version of CmdLineActivator will do both Pre-flight and Run together and is not required to be able to persist its execution plan.  It would be useful, though, if a later version added that capability, allowing the plan to be persisted as an ancillary output, or even allowing a user to optionally perform the two phases separately; this should facilitate testing, among other things.  The default behavior should continue to be to run both phases under a single command.
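One possible command-line shape for such a later version, shown only as a sketch; the option names and the phase functions below are hypothetical placeholders, not a proposed interface:

```python
import argparse
import pickle


def run_preflight():
    """Placeholder for the common Pre-flight code that builds the execution-plan graph."""
    return []


def run_execution(graph):
    """Placeholder for the Run phase that executes the plan."""
    pass


def main(argv=None):
    parser = argparse.ArgumentParser(description="CmdLineActivator phase control (sketch)")
    parser.add_argument("--preflight-only", action="store_true",
                        help="stop after Pre-flight, persisting the execution plan")
    parser.add_argument("--plan", metavar="FILE",
                        help="skip Pre-flight and run from a previously persisted plan")
    args = parser.parse_args(argv)

    if args.plan:                               # Run phase only, from a saved plan
        with open(args.plan, "rb") as f:
            graph = pickle.load(f)
    else:
        graph = run_preflight()                 # default: Pre-flight ...
        if args.preflight_only:
            with open("execution_plan.pickle", "wb") as f:
                pickle.dump(graph, f)           # ... persisted as an ancillary output
            return
    run_execution(graph)                        # ... followed immediately by Run


if __name__ == "__main__":
    main()
```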