Usage

The following is a conceptual list of the steps involved in using SuperTask (in general, a group of SuperTasks) to run a production. This may be a small- or medium-scale production run by Science Pipelines developers testing new functionality, an automated job run by the continuous integration system, a large-scale testing run, or even a piece of a full data release production in operations. This may not completely capture the SUIT use case for SuperTask, but I believe it captures all others. Code sketches of the end-to-end flow and of step 4 appear after the list.

  1. A developer (typically Science Pipelines) defines a Pipeline, which is conceptually an execution-ordered list of (SuperTask class, config instance) tuples.
  2. Other developers (typically Science Pipelines or SQuaRE) modify the Pipeline by inserting additional SuperTasks and/or modifying configuration. This may include applying camera-specific configuration overrides from obs_* packages. The Pipeline should be editable in both of these ways via a Python API and via a human-readable text format (e.g. YAML or pex_config-like Python).
  3. An operator (typically Science Pipelines, SQuaRE, or NCSA) passes the Pipeline to a PreFlightActivator (e.g. by providing a filename for a serialized form of the Pipeline). The operator passes a DataRepository and input and output DatasetExpressions to the PreFlightActivator at the same time. A PreFlightActivator *may* provide a way for the operator to supply additional Task configuration overrides at this time, but this is conceptually just a convenience that mostly delegates to the interfaces in step (2). The PreFlightActivator then instantiates each SuperTask with its configuration, and the SuperTask configuration is frozen from this point forward.
  4. The PreFlightActivator processes the DatasetExpressions by passing them to the defineQuanta methods of the SuperTasks in the Pipeline in either forward order (starting from the input expression) or reverse order (starting from the output expression). The outputs of each call to defineQuanta are added to the inputs (if going forward) or the inputs of each call are added to the outputs (if going in reverse), yielding a superset of all Quanta to be processed. This is a superset because each interior (not first or last) call to defineQuanta is constrained only by its inputs or its outputs, not both. The PreFlightActivator then trims out the unnecessary Quanta by calling the defineQuanta methods again in the opposite order (see the second sketch after this list).
  5. The PreFlightActivator uses the set of Quanta to be processed to build a ScienceDAG that contains runQuantum invocations and datasets as nodes. The ScienceDAG may represent multiple processing stages, and we assume that any failures that are not necessarily fatal to downstream jobs will be represented via incomplete or placeholder outputs. Only unexpected and fatal failures may cause a section of the ScienceDAG to be invalidated. We will also assume that any runtime decisions about which datasets will be used as input (e.g. the PVIs used as input to a coadd) will be a small delta on the set of possible inputs determined by defineQuanta (e.g. so staging unneeded inputs is not a significant source of inefficiency). Together, this means the ScienceDAG will be sufficient to define the workflow for running all SuperTasks in the Pipeline without additional human intervention or sequence points.
  6. The ScienceDAG and JobConfiguration (separate from the SuperTasks' own configurations) are passed to a BatchActivator to launch the jobs the ScienceDAG represents. This will generally involve translating the DAG to a form specific to the BatchActivator (and possibly adding intermediate nodes for staging data) and interacting with a scheduler. If the PreFlightActivator and BatchActivator are the same object or are closely related, passing the ScienceDAG may just be a function call, but it should also be possible for the PreFlightActivator to serialize the DAG and have a human operator pass it to a different BatchActivator. The BatchActivator may record data-repository-wide provenance at this time.
  7. WorkerActivators launched by the BatchActivator instantiate SuperTasks in worker processes (if needed), and call runQuantum one or more times.
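
The first sketch below shows how steps 3-7 might fit together in code. Everything here - the PreFlightActivator constructor and makeScienceDAG method, the preflight.tasks mapping, repo.butler, and SingleNodeBatchActivator - is a hypothetical rendering of the concepts defined later in this document, not a settled API.

    def run_production(pipeline, repo, input_expr, output_expr, job_config):
        # Step 3: instantiate each SuperTask with its configuration;
        # the configurations are frozen from this point forward.
        preflight = PreFlightActivator(pipeline, repo)
        # Steps 4-5: two-pass defineQuanta (see the next sketch) followed
        # by construction of the ScienceDAG.
        science_dag = preflight.makeScienceDAG(input_expr, output_expr)
        # Step 6: hand the DAG and the job-level (non-science) configuration
        # to a BatchActivator; this handoff could equally cross a
        # serialize/operator/deserialize boundary.
        batch = SingleNodeBatchActivator(job_config)
        # Step 7: worker processes call runQuantum for each quantum node.
        # preflight.tasks (a label -> SuperTask instance mapping) and
        # repo.butler are assumed attributes.
        batch.execute(science_dag, tasks=preflight.tasks, butler=repo.butler)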
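
And a minimal sketch of step 4's two-pass quantum definition. For simplicity, DatasetExpressions are modeled as plain sets of concrete dataset references, defineQuanta is assumed to accept either an inputs or an outputs constraint, and the Quantum attributes are likewise assumptions. Note that each defineQuanta method is called twice, exactly the behavior criticized under Design Problems below.

    def define_all_quanta(tasks, input_expr, output_expr):
        """tasks: a list of SuperTask instances in execution order."""
        # Forward pass: each stage expands the datasets available so far
        # into candidate Quanta; its predicted outputs feed the next stage.
        # Interior stages are constrained from one side only, so this
        # yields a superset of the Quanta actually needed.
        candidates = []
        available = set(input_expr)
        for task in tasks:
            quanta = task.defineQuanta(inputs=available)
            available |= {d for q in quanta for d in q.outputs}
            candidates.append(quanta)

        # Reverse pass: call defineQuanta again, now constrained by the
        # outputs actually needed, keeping only quanta produced by both
        # passes.  (This second round of calls is one of the design
        # problems noted below.)
        needed = set(output_expr)
        trimmed = []
        for task, forward in zip(reversed(tasks), reversed(candidates)):
            backward = task.defineQuanta(outputs=needed)
            kept = [q for q in forward if q in backward]
            needed |= {d for q in kept for d in q.inputs}
            trimmed.append(kept)
        trimmed.reverse()
        return trimmed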

Concepts

SuperTask

Abstract base class. Represents a configurable stage in a production as a Task that can operate solely on inputs and outputs from the Butler, with a consistent interface for defining units of work (defineQuanta), processing a unit of work (runQuantum), and providing access to intrusive provenance (TBD, or not significantly different from CmdLineTask).
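
A minimal sketch of that interface. The signatures are illustrative assumptions (the arguments to defineQuanta in particular are one of the open design problems below, and a real implementation would likely also need Butler/DataIdGenerator access):

    import abc

    class SuperTask(metaclass=abc.ABCMeta):
        """A configurable production stage doing all I/O through the Butler."""

        ConfigClass = None  # each concrete SuperTask sets its pex_config type

        def __init__(self, config):
            # Configuration is frozen by the PreFlightActivator in step (3).
            self.config = config

        @abc.abstractmethod
        def defineQuanta(self, inputs=None, outputs=None):
            """Expand a DatasetExpression into Quanta (units of work).
            Called with either an input or an output expression (step 4),
            never both; each Quantum lists the concrete datasets it reads
            and writes."""

        @abc.abstractmethod
        def runQuantum(self, quantum, butler):
            """Process a single Quantum, reading its inputs from and
            writing its outputs to the given Butler."""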

Pipeline

Concrete class. Conceptually simple: just a sequence of SuperTask classes and their configurations. Should permit editing of both the set of SuperTasks and overrides of configuration via both a Python API and a human-readable text format. May need some functionality for linking or maintaining constraints between configuration parameters in different SuperTasks.
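
A sketch of what the Python editing API might look like; the stage labels and method names are invented for illustration:

    class Pipeline:
        """An execution-ordered sequence of (label, SuperTask class, config)."""

        def __init__(self):
            self._stages = []

        def append(self, label, task_class, config=None):
            self._stages.append(
                (label, task_class, config or task_class.ConfigClass()))

        def insertBefore(self, anchor, label, task_class, config=None):
            i = next(i for i, (lbl, _, _) in enumerate(self._stages)
                     if lbl == anchor)
            self._stages.insert(
                i, (label, task_class, config or task_class.ConfigClass()))

        def getConfig(self, label):
            return next(cfg for lbl, _, cfg in self._stages if lbl == label)

        def __iter__(self):
            return iter(self._stages)

    # e.g. inserting a QA stage and applying a camera-specific override:
    #   pipeline.insertBefore("coadd", "qa", QaSuperTask)
    #   pipeline.getConfig("calibrate").doAstrometry = False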

DataRepository

Concept. Same as the current Butler Data Repository concept; this aggregates:

  • A Mapper that defines what datasets exist and how they relate to each other (and possibly how to map them to concrete files).
  • Actual datasets, possibly held in "parent" Data Repositories.

DatasetExpression

Concept. A description of the datasets that should be used (as input) or generated (as output), possibly containing wildcards ("all CCDs") and explicit values. Bounds constraints ("dateobs between X and Y" or "patches=[4:6, 5:7]") would also be desirable but are not clearly necessary and are not in the baseline design.
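
For concreteness, a toy stand-in showing what expressions with explicit values, wildcards, and (hypothetically) bounds constraints might look like; none of this syntax is settled:

    class DatasetExpression:
        """Toy stand-in: a dataset name plus per-key constraints."""
        ALL = object()  # wildcard, e.g. "all CCDs"

        def __init__(self, name, **constraints):
            self.name = name
            self.constraints = constraints

    inputs = DatasetExpression("raw", visit=[903334, 903336],
                               ccd=DatasetExpression.ALL)
    outputs = DatasetExpression("deepCoadd", tract=0,
                                patch=DatasetExpression.ALL)
    # A bounds constraint (desirable but not in the baseline design) might be:
    #   DatasetExpression("raw", dateobs=("2013-06-17", "2013-06-18"))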

PreFlightActivator

Concrete class. A class responsible for combining a Pipeline, a DatasetExpression, and a DataRepository to generate a ScienceDAG. Our baseline design assumes DataIdGenerators will solve the problem of mapping different DatasetExpressions to different DataRepositories, and hence we should actually need only one concrete PreFlightActivator implementation.

ScienceDAG

Concrete class. A directed acyclic graph with each node either

  • a single invocation of a specific SuperTask's runQuantum method, or
  • a concrete dataset (name and data ID).

The class representing this graph should be serializable (does not need to be human readable), and should permit iteration over all jobs involving a given SuperTask and all nodes representing different IDs for a given named dataset. Each runQuantum node should have links to the dataset nodes it uses as input and output and each dataset node should have links to the runQuantum nodes that produce and consume it. It may need to support extracting subgraphs.
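
A sketch of that structure, with the bipartite links and the two iteration patterns described above (all names are assumptions):

    class DatasetNode:
        """A concrete dataset: name plus data ID."""
        def __init__(self, name, data_id):
            self.name, self.data_id = name, data_id
            self.producer = None      # the QuantumNode that writes it
            self.consumers = []       # QuantumNodes that read it

    class QuantumNode:
        """One invocation of a specific SuperTask's runQuantum."""
        def __init__(self, task_label, quantum):
            self.task_label, self.quantum = task_label, quantum
            self.inputs, self.outputs = [], []   # DatasetNodes

    class ScienceDAG:
        def __init__(self):
            self._quanta = []         # all QuantumNodes
            self._datasets = {}       # (name, data_id) -> DatasetNode

        def quantaFor(self, task_label):
            # Iterate over all jobs involving a given SuperTask.
            return (q for q in self._quanta if q.task_label == task_label)

        def datasetsNamed(self, name):
            # Iterate over all data IDs of a given named dataset.
            return (d for d in self._datasets.values() if d.name == name)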

JobConfiguration

Concept. A set of parameters that control how jobs are run but that should not affect the scientific results at all. This includes information about how to pack jobs onto nodes (for SuperTasks that are flexible in the number of cores they use) and which intermediate datasets may be elided from the set of final outputs, and probably many other things outside of my domain of expertise.
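
As a value object this could be very simple; the fields below are just the two examples from the text, rendered as a guess:

    import dataclasses

    @dataclasses.dataclass
    class JobConfiguration:
        # How to pack jobs onto nodes, for SuperTasks with flexible core counts.
        cores_per_node: int = 16
        # Intermediate dataset names to drop from the set of final outputs.
        elided_datasets: frozenset = frozenset()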

BatchActivator

Class. An object that executes all jobs represented by a ScienceDAG. We anticipate at least two distinct BatchActivator implementations:

  • one that runs on a single node using Python multiprocessing (sketched at the end of this section);
  • one used for operations at NCSA, probably using Pegasus.

We also need to run in two other contexts:

  • traditional HPC resources with shared filesystems and no operations database (e.g. per-institution batch clusters, today's lsst-dev)
  • continuous integration.

These two could probably use the same BatchActivator if CI is run on traditional HPC, and it could be a generalization of either of the two planned implementations above. At present it is not clear how best to support these two additional contexts.

It is not clear whether different BatchActivators need to have anything in common, or whether they would benefit from having a common base class.
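
A minimal sketch of the single-node variant, which (as noted under WorkerActivator below) would double as its own WorkerActivator. The generations() helper, the contents of each DAG node, and the execute signature are assumptions, and real code would need error handling and provenance hooks:

    import multiprocessing

    def _run_quantum(args):
        # Module-level helper so multiprocessing can pickle it.  In a more
        # complex setting this is the WorkerActivator's job, including
        # SuperTask instantiation and data staging.
        task, quantum, butler = args
        task.runQuantum(quantum, butler)

    class SingleNodeBatchActivator:
        """Runs every quantum in a ScienceDAG on one node.  Assumes tasks,
        quanta, and the butler are picklable, and that the DAG can yield
        "generations": batches of mutually independent quanta in
        topological order."""

        def __init__(self, job_config):
            self.job_config = job_config

        def execute(self, science_dag, tasks, butler):
            with multiprocessing.Pool(self.job_config.cores_per_node) as pool:
                for generation in science_dag.generations():  # assumed helper
                    pool.map(_run_quantum,
                             [(tasks[n.task_label], n.quantum, butler)
                              for n in generation])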

WorkerActivator

Class. A harness object that runs one or more invocations of runQuantum from one or more SuperTasks. For the simple single-node BatchActivator, the WorkerActivator and BatchActivator are probably the same class, and the WorkerActivator may be able to reuse existing SuperTask instances. For a more complex BatchActivator, the associated WorkerActivators will be responsible for instantiating SuperTasks and may be responsible for staging input and output data as well.


Requirements Not Addressed

Reprocessing/Retries

If an output dataset (including intermediates) already exists in the Data Repository (or a parent repository), how do we specify whether to regenerate it (let's assume for now that it was originally generated in a way that should be identical to what we're about to produce)?

If we include this choice at step (3), we can avoid putting processing we'll ultimately skip into the ScienceDAG, but it might make more sense to include it in the JobConfiguration at step (6), since, like the rest of the JobConfiguration, it doesn't affect the results.

The best approach may be to eliminate any desire to reprocess when the output dataset already exists in the repo, by making sure we make a new Data Repository in the chain for each concrete SuperTask that is ever run (the repos-like-git concept), while generally making sure the user only sees the (always-updated) tip of the chain. With that in place, if you want to restart processing at a particular step, you could always "branch" from the step before, and never worry about having instances of the dataset you're producing already present. If you then needed to combine processing from different branches - representing the same SuperTasks run with (generally) the same configurations on different data IDs - that would be like a git merge commit, in which we produce a new repository with multiple parents. I'm not totally sure this model holds up to close scrutiny, but it's at least a useful straw man.
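
A toy illustration of the straw man, with repositories as nodes in a chain (purely illustrative, not a proposed data model):

    class Repo:
        """Toy stand-in for a Data Repository in the chain."""
        def __init__(self, label, parents=()):
            self.label = label
            self.parents = list(parents)

    raw = Repo("raw")
    calexp = Repo("calexp", [raw])            # each SuperTask run adds a repo
    coadd_a = Repo("coadd-a", [calexp])       # first attempt
    coadd_b = Repo("coadd-b", [calexp])       # "branch": restart from the step before
    merged = Repo("coadd", [coadd_a, coadd_b])  # "merge": multiple parents
    # The user would normally see only the always-updated tip ("merged").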

I'm sure the NCSA team already has something in mind for how to handle provenance of retries in operations, and we should definitely learn from that (which I assume is derived from DES experience). However, like all other provenance and I/O concepts, I think it's important to map it onto the Butler and Data Repository concepts so we can use it in other contexts.

Configuration Consistency

Even with regular Tasks, the current configuration system does not have a way to manage multiple related parameters scattered throughout the tree. With SuperTask allowing groups of Tasks to be related more dynamically, this problem will only get worse.

Avoiding Camera-Specific Data ID Usage

Passing data IDs as plain dicts to Task code allows algorithmic code to use camera-specific quantities it should not have access to. Data IDs passed to SuperTasks should probably instead be opaque objects that algorithmic code can compare and pass back to the Butler, but not inspect.
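
One possible shape for such an opaque ID (an assumption, not a settled design; Python cannot truly hide attributes, so name mangling here only signals intent):

    class DataId:
        """Opaque data ID: supports equality and hashing, hides its keys."""

        def __init__(self, **keys):
            self.__keys = dict(keys)  # mangled to discourage direct access

        def __eq__(self, other):
            return isinstance(other, DataId) and self.__keys == other.__keys

        def __hash__(self):
            return hash(frozenset(self.__keys.items()))

    # Only the Butler/Mapper layer would be permitted to unpack the keys,
    # keeping camera-specific quantities like "ccd" out of algorithm code.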

Design Problems

From a Science Pipelines perspective, the design problems all seem localized to Step 4 in the Usage section above:

  • defineQuanta is constrained from only one side, leading to overexpanded wildcards and unneeded quanta.
  • Trimming unneeded quanta may be hard or impossible. Needing to call each defineQuanta method twice is a strong indication that this is not the right interface.
  • Making defineQuanta responsible for both expanding wildcards and defining units of execution seems like too much for a method that must be implemented by every SuperTask (even if these are largely delegated to DataIdGenerator/Butler). We should try to move all wildcard expansion into the PreFlightActivator or some other class with fewer concrete implementations.
  • Relationship information from DataIdGenerators is not passed between SuperTasks, because the dict of lists of dicts used to pass data IDs to defineQuanta doesn't provide a place to put it. Without a way to pass that relationship information through, we will have to make multiple calls to the same or very similar DataIdGenerators with essentially the same inputs.
  • The DataIdGenerator concept is too incomplete to see how it would help; assuming defineQuanta delegates to it just punts all of the problems with defineQuanta to another piece of code. While there may be fewer DataIdGenerator implementations than defineQuanta implementations, we still seem to at least have a double-dispatch problem between relationships and butler back-ends; it may be a triple-dispatch problem if camera-specific data IDs allow those relationships to be defined differently for different cameras.



2 Comments

  1. Sorry for so many inline comments, but I think I'm done with commenting. BatchActivator is a new thing and I wanted to understand how it relates to other pieces, but as a concept there is not much abstraction in it to me; it is entirely specific to a workflow system (and may not even be needed in some cases).

    I think we are on the right track with decomposing the whole thing into concepts. I'd like to see better names for some of those concepts, here are some suggestions from me, and I'm sure there are better names:

    • composite - I'd call this a Pipeline
    • PreFlightActivator - it does not activate anything; if its sole responsibility is to build the DAG, maybe call it DAGFactory or something similar?
    • BatchActivator - again, it's not activating anything directly, but it takes the DAG and schedules things for execution, so something like DAGScheduler?
    • WorkerActivator - that is the only true activator, so maybe just call it TaskActivator or SuperTaskActivator?

    I am worried about this doubly-constrained two-pass quanta splitting and I have no idea how it can be implemented in general but I'm thinking about it.

    1. No need to apologize at all, thanks for all the comments!  I think I basically agree with you on all of these comments:

       - I agree there may not be any need for a common interface or common code for BatchActivator.

       - I like all of your proposed names better than the (lazy) ones I was using, though we may want to avoid "DAG" since we'd have to write it as "Dag" according to our coding standards and that's ugly.  That's perhaps a superficial reason, but it's enough to make me prefer "Graph" instead.