DRAFT

Work in progress

As part of resuming active development work on SuperTask, and to support work on the workflow system(s) that will take advantage of it, I am beginning the work of collecting the various design writeups and diagrams from the last year onto Confluence.  

Several more pages will follow soon.  In the spirit of openness I'm posting them one at a time, but it will help me manage my time better if readers hold their questions and comments until a few more have come out.  You will have many questions, I am sure, if you only read this high-level summary.

Thanks. - Gregory Dubois-Felsmann 

 

A SuperTask is run by an Activator.

Basic definition of SuperTask:

A SuperTask may be composite. The compositeness of the SuperTask is visible to all Activators. This means that composite SuperTasks are not a pure implementation of the Composite pattern, though that is the inspiration. This is believed to be required in order to permit Activators to support composite SuperTasks that involve changing the axis of parallelization between steps (e.g., shift from looping over raw exposures to looping over sky tiles).

A SuperTask represents a unit of (generally transformational) work to be performed on data. As found in a release, a SuperTask knows the *types* of the inputs and outputs it consumes and produces. The specific data items it will process (identified by dataIds) are supplied through the Activator-SuperTask interface.
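
To make the distinction concrete, here is a minimal illustration in plain Python. The key names (visit, ccd, filter) and dataset type names are examples only; the real keys depend on the camera mapper.

    # Illustrative only: a dataset type is a name the Butler understands
    # ("raw", "calexp", ...), while a dataId is a dictionary of key-value
    # pairs that selects particular data items of that type.
    dataset_type = "raw"
    data_ids = [
        {"visit": 1234, "ccd": 7, "filter": "r"},
        {"visit": 1234, "ccd": 8, "filter": "r"},
    ]
    # A SuperTask, as released, declares only the dataset types it reads and
    # writes; the Activator supplies concrete dataIds like these at run time.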

In general a SuperTask receives the content of its inputs and produces its outputs by invoking the get() and put() methods of the Butler. The final design may require it to do so through a shim that assists in maintaining provenance and in enforcing guarantees relevant to a production system; this is still under review. The shim will provide an interface nearly identical to Butler.get() and Butler.put(). It may be implemented following the Decorator pattern, in which case it will be completely invisible to the SuperTask, or its interface may differ slightly in how dataset types are supplied. The "Butler shim" idea will be described elsewhere in this page tree.
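
As a rough illustration of the Decorator idea (the class name ButlerShim and the bookkeeping it does are assumptions made for this sketch, not the settled design), such a shim could look like this:

    class ButlerShim(object):
        """Decorator-style wrapper presenting the Butler's get()/put() interface
        while recording which datasets pass through it (e.g., for provenance)."""

        def __init__(self, butler):
            self._butler = butler
            self.inputs = []    # (datasetType, dataId) pairs read via get()
            self.outputs = []   # (datasetType, dataId) pairs written via put()

        def get(self, dataset_type, data_id=None, **kwargs):
            self.inputs.append((dataset_type, dict(data_id or {})))
            return self._butler.get(dataset_type, data_id, **kwargs)

        def put(self, obj, dataset_type, data_id=None, **kwargs):
            self.outputs.append((dataset_type, dict(data_id or {})))
            return self._butler.put(obj, dataset_type, data_id, **kwargs)

Because the wrapper presents the same interface, a SuperTask written against Butler.get() and Butler.put() need not know whether it has been handed the real Butler or the shim.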

The design supports SuperTasks both inferring their output dataIds from their inputs and inferring their input dataIds from specified output dataIds. Model use cases for these are “co-add these specified inputs” versus “produce a co-add for sky tile X from all the data available through your supplied Butler”, respectively. Which of these a specific concrete SuperTask supports (possibly both) is up to that SuperTask. The Butler's ability to map between dataIds is at the core of this, but specific concrete SuperTasks are permitted to do more or different work of this kind than the Butler alone supports.
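
A toy example of the two directions of inference, for a coaddition-like case. All names here are placeholders, and the real mapping between dataIds is expected to lean on the Butler as described above:

    def output_ids_from_inputs(input_ids):
        """'Co-add these specified inputs': derive the coadd dataIds from the inputs."""
        tiles = sorted({i["tile"] for i in input_ids})
        return [{"tile": t} for t in tiles]

    def input_ids_from_outputs(output_id, available_input_ids):
        """'Produce a co-add for sky tile X': select every available input on that tile."""
        return [i for i in available_input_ids if i["tile"] == output_id["tile"]]

    inputs = [{"visit": 1, "tile": 42}, {"visit": 2, "tile": 42}, {"visit": 3, "tile": 43}]
    print(output_ids_from_inputs(inputs))                 # [{'tile': 42}, {'tile': 43}]
    print(input_ids_from_outputs({"tile": 42}, inputs))   # the two visits on tile 42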

The SuperTask base class is a subclass of Task. This is so that SuperTask can take advantage of the configuration mechanism for Tasks. The hierarchy of Tasks in a specific application therefore extends all the way up to the top-level SuperTask, and each level is addressable for, e.g., configuration discovery and overrides.
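
For concreteness, here is a sketch of the hierarchical configuration mechanism using the existing Task machinery (lsst.pex.config and lsst.pipe.base). The class names are illustrative, and in the design described on this page the top-level class would derive from SuperTask rather than directly from Task.

    import lsst.pex.config as pexConfig
    import lsst.pipe.base as pipeBase

    class ExampleDetectionConfig(pexConfig.Config):
        threshold = pexConfig.Field(dtype=float, doc="detection threshold (sigma)", default=5.0)

    class ExampleDetectionTask(pipeBase.Task):
        ConfigClass = ExampleDetectionConfig
        _DefaultName = "detection"

    class ExampleTopLevelConfig(pexConfig.Config):
        detection = pexConfig.ConfigurableField(target=ExampleDetectionTask, doc="detection subtask")

    class ExampleTopLevelTask(pipeBase.Task):      # in this design, would derive from SuperTask
        ConfigClass = ExampleTopLevelConfig
        _DefaultName = "exampleTopLevel"

        def __init__(self, **kwargs):
            pipeBase.Task.__init__(self, **kwargs)
            self.makeSubtask("detection")

    # every level of the Task hierarchy is addressable for configuration overrides
    config = ExampleTopLevelTask.ConfigClass()
    config.detection.threshold = 10.0
    task = ExampleTopLevelTask(config=config)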

Leaf-node SuperTasks should in general be usable as ordinary Tasks - that is, they should provide a data-transformation interface through the Task.run() method in terms of explicit Python-domain, in-memory objects - and should implement the SuperTask’s execution interface, run_quantum(), as a retrieval of the input data objects via Butler.get(), an invocation of Task.run(), and the writing of the outputs from the returned Struct via Butler.put().
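
A minimal, standard-library-only sketch of that pattern follows. SuperTask itself, the exact run_quantum() signature, and the "raw"/"calexp" dataset type names are placeholders here, not the settled interface.

    from types import SimpleNamespace

    class ExampleLeafSuperTask(object):                 # stands in for SuperTask(Task)
        def run(self, exposure):
            """Task interface: explicit in-memory objects in, a Struct-like object out."""
            calexp = {"pixels": exposure["pixels"], "calibrated": True}   # toy "algorithm"
            return SimpleNamespace(calexp=calexp)

        def run_quantum(self, butler, data_id):
            """SuperTask execution interface: get inputs, call run(), put outputs."""
            exposure = butler.get("raw", data_id)           # Butler.get()
            result = self.run(exposure)                     # the in-memory work
            butler.put(result.calexp, "calexp", data_id)    # Butler.put() from the returned Struct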

Basic definition of Activator:

The Activator is responsible for providing the Butler (and, if present, the shim) for the SuperTask’s use. It is also responsible for instantiating the SuperTask to be run and for providing the necessary inputs to the LSST configuration parameter mechanism for the SuperTask. For example, the “command line Activator” identifies the SuperTask to be run by name, locates and instantiates it, and provides for command-line overrides of its config parameters. It also creates a Butler based on a provided or defaulted repository (or repositories).
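
By way of a sketch only (the option names, the use of importlib to locate the SuperTask by name, and the make_butler() helper are all assumptions made for illustration, not the actual command-line interface), a command-line Activator might proceed roughly as follows:

    import argparse
    import importlib

    def main(argv=None):
        parser = argparse.ArgumentParser(description="toy command-line Activator")
        parser.add_argument("supertask", help="fully qualified SuperTask class, e.g. pkg.mod.MySuperTask")
        parser.add_argument("--repo", required=True, help="data repository for the Butler")
        parser.add_argument("--config", action="append", default=[],
                            help="config override of the form name=value (may be repeated)")
        parser.add_argument("--id", action="append", default=[],
                            help="dataId of the form key=value[,key=value...] (may be repeated)")
        args = parser.parse_args(argv)

        # locate and instantiate the named SuperTask
        module_name, class_name = args.supertask.rsplit(".", 1)
        cls = getattr(importlib.import_module(module_name), class_name)
        config = cls.ConfigClass()
        for override in args.config:           # command-line overrides of config parameters
            name, value = override.split("=", 1)
            setattr(config, name, value)       # toy parsing; the real mechanism handles types
        task = cls(config=config)

        butler = make_butler(args.repo)        # placeholder for Butler construction
        for raw_id in args.id:                 # one quantum per supplied dataId
            data_id = dict(pair.split("=", 1) for pair in raw_id.split(","))
            task.run_quantum(butler, data_id)

    if __name__ == "__main__":
        main()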

An Activator is responsible for arranging for the execution of a SuperTask’s run_quantum() method one or more times over a set of dataIds. Via collaboration with the SuperTask interfaces, the Activator is able to determine the parallelization and scatter-gather behavior that is permissible and/or required to implement the workflow defined by the SuperTask on the provided dataIds. The details of this are described elsewhere. By way of example, the basic “command line Activator” will support “-j”-style parallelization of simple one-to-one transformation workflows (e.g., generation of calexps from raw images).
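
A sketch of what “-j”-style parallelization could amount to for such a one-to-one workflow, using a plain multiprocessing pool. The helper names, and the assumption that each worker constructs its own Butler, are illustrative only.

    import multiprocessing

    def _run_one_quantum(args):
        task_factory, repo, data_id = args
        task = task_factory()               # re-create the SuperTask in the worker process
        butler = make_butler(repo)          # placeholder: each worker gets its own Butler
        task.run_quantum(butler, data_id)

    def run_parallel(task_factory, repo, data_ids, num_processes):
        """Map independent quanta (one per dataId) over a pool of worker processes."""
        pool = multiprocessing.Pool(processes=num_processes)
        try:
            pool.map(_run_one_quantum, [(task_factory, repo, d) for d in data_ids])
        finally:
            pool.close()
            pool.join()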

Extensibility:

The SuperTask abstraction’s axes of extensibility are

  • the definition of an open-ended set of “leaf-node” (non-composite) SuperTasks that do actual algorithmic work
  • the definition of concrete SuperTasks that represent a variety of types of generic composition of sub-SuperTasks. All such concrete SuperTasks must inherit from WorkflowTask, a subclass of SuperTask (the name is provisional)
  • concrete SuperTasks that represent a specific composition of specific concrete sub-SuperTasks; such a composition should be thought of as representing a “pipeline” (“run ISR on raw images, co-add them, and run detection on the co-adds” is one example; see the sketch just after this list)
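
As a toy illustration of the last kind of extension: WorkflowTask exists only as a name at this point, and the steps attribute shown here is an assumption made for the sketch, not a defined interface.

    class ExamplePipeline(object):                    # would derive from WorkflowTask
        """A 'pipeline': a specific composition of specific concrete sub-SuperTasks."""
        # each step: (sub-SuperTask, unit over which that step parallelizes)
        steps = [
            ("IsrSuperTask",        "raw exposure"),
            ("CoadditionSuperTask", "sky tile"),      # the parallelization axis changes here
            ("DetectionSuperTask",  "sky tile"),
        ]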

Subclasses of SuperTask should not depend on specific running environments or perform any bulk data I/O other than through the Butler. (Other specific types of I/O, such as logging, are of course permitted, as is the use of ephemeral I/O such as that provided by afw.display.) Detailed policies for this will be constructed as the deployment and use of SuperTask develops. The basic goal of the design is that any SuperTask can be run in any environment in the LSST project: from the command line on a laptop, behind the automated QC harness, as a back-end extension to the SUIT, on a large scale in the Level 1 or Level 2 systems at the LSST production facilities, or on a large scale by a user in Level 3 processing.

One unanswered question in the design is whether these restrictions are a policy (“administrative control”) that can be violated in order to preserve other useful features of SuperTask in certain narrow cases, or whether they should be treated as an unbreakable, essential design feature. For example, if we wish to be able to package a database-ingest operation as a SuperTask (because of the configuration and controllability features of SuperTask), it may still be necessary or desirable to permit it to write directly to the database without Butler mediation.

The Activator abstraction’s axes of extensibility are

  • the support of new environments in which to run LSST code — workflow systems, big-data frameworks, remote-execution systems such as container execution control systems, etc.
  • the variety of types of parallelization and scatter-gather workflows that can be supported. Some may support only multicore parallelization on a single host, others may support execution across clusters or using dynamic cloud services. Some may support just-in-time scatter-gather workflows (like those definable as DAGs in batch systems) while others may provide only fully serialized execution of steps in the workflow.

Activators may be complex and composite. For instance, an Activator providing support for running on a batch production system may involve one Activator subclass that runs on the head node and harvests information about the SuperTask to be run, and a different Activator subclass that runs on each worker node and actually invokes the SuperTask on the chunk of data to be processed on that node.

Specific concrete Activators may be designed to operate as a service and may:

  • maintain a single Python interpreter context across multiple activations
  • support running more than one SuperTask, under some external control specific to the Activator (e.g., the Firefly back-end extension Activator may be configured with a list of SuperTasks known and available to a particular Firefly session, which it invokes under control of inputs from the client UI).

The Activator-SuperTask interface permits concrete Activators to record coarse-grained provenance at the level of leaf-node SuperTasks. These records would be of the form “SuperTask X generated outputs {(outDataId_n, outDatasetType_n), etc.} via Butler.put() after reading inputs {(inDataId_j, inDatasetType_j), etc.} via Butler.get()”. Concrete Activator implementations are not required to do this, but any Activator used in production should have this capability.
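
For illustration, such a record might be captured as something like the following; the field names are not a defined schema.

    import collections

    ProvenanceRecord = collections.namedtuple(
        "ProvenanceRecord", ["supertask", "inputs", "outputs"])

    record = ProvenanceRecord(
        supertask="ExampleLeafSuperTask",
        inputs=[({"visit": 1234, "ccd": 7}, "raw")],      # (dataId, datasetType) read via Butler.get()
        outputs=[({"visit": 1234, "ccd": 7}, "calexp")],  # (dataId, datasetType) written via Butler.put()
    )

An Activator that hands the SuperTask a recording shim like the one sketched earlier on this page could assemble these records without any changes to the SuperTask itself.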

This is not the only provenance mechanism that LSST DM will have, but it does have the advantage of being completely non-intrusive — it will work for any SuperTask.