Table of Contents

  • Initial Meeting 2020-12-14
  • Interface Discussion 2021-01-26
  • Strategy/Database/Tools Discussion 2021-02-05

Initial Meeting 2020-12-14

Date

Attendees

Goals

  • Define the campaign management problem and understand what is available and what is missing to solve it

Discussion items


Overall problem: How we manage processing

What happens if data or code is not good?
Use cases:
  • Ad hoc campaigns in Commissioning
  • Reprocess same data in multiple ways, comparing
  • Data good for specific purposes but not general processing
  • Randomized samplings from the seeing distribution
  • 20-year or deeper stacks, and selections from those
  • What data is good or bad and how much so
    • Many different quality metrics
  • Tying V&V back to data: QA results need to be connected to the input data used to measure them

Not sure the Registry is the best place to put quality information
e-logging has its own database

Processing results may change what data is to be taken, not just processed

Use QA results to decide what data and what code to run

Dataset definition falls into two cases:
  1. All information used to select datasets (whether from systems or humans) is known up front, before any campaign execution
  2. Information used to select datasets for downstream pipelines depends on outputs of upstream pipelines; two approaches:
    1. Break the pipelines into two campaigns, one upstream and one downstream; this reduces to the first case (see the sketch below)
    2. Provide a mechanism for pipeline outputs to control the downstream workflow
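
As a rough illustration of the first approach, a downstream campaign's input selection could be derived from what the upstream campaign actually produced. This is only a sketch: the repository path, collection name, dataset type, and instrument below are assumptions, not prescriptions.

  from lsst.daf.butler import Butler

  # Assumed names: repository at /repo/main, upstream outputs in "u/campaign/upstream".
  butler = Butler("/repo/main")
  upstream_refs = butler.registry.queryDatasets(
      "calexp", collections="u/campaign/upstream"
  )

  # Keep only visits that actually produced an upstream output, and turn them
  # into a data query constraint for the downstream campaign.
  good_visits = sorted({ref.dataId["visit"] for ref in upstream_refs})
  visit_list = ", ".join(str(v) for v in good_visits)
  downstream_query = f"instrument = 'LSSTCam' AND visit IN ({visit_list})"
  print(downstream_query)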


Systems needed:
  • Systems to control what data is taken (not scheduler, but ScriptQueue may be adequate)
  • Systems to control what data is used as input (major missing piece)
    • Dataset with a list of dataIds or datasets (store these, as well as who generated them and why)
    • Queries to the Registry or other databases -- but more than SQL is needed: complex programs and human filtering may also be required (see the sketch after this list)
  • Systems to control what code is used to process that data (Gen3 PipelineTasks should be OK)
  • Systems to generate metrics based on inputs/outputs (faro is the basis; should be OK, need to demonstrate scalability)
  • Systems to assist users in translating metrics into data taken/input or processing (including visualization, dashboards, etc.)
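
To make the "more than SQL" point concrete, the sketch below combines a Registry query with arbitrary Python logic and a human-maintained exclusion list. The repository path, instrument, exposure-time threshold, and CSV file name are all assumptions for illustration.

  import csv

  from lsst.daf.butler import Butler

  butler = Butler("/repo/main")  # assumed repository path

  # Human filtering: an exclusion list maintained by hand, one exposure ID per row.
  with open("human_exclusions.csv") as f:
      excluded = {int(row["exposure"]) for row in csv.DictReader(f)}

  # Start from a Registry query...
  records = butler.registry.queryDimensionRecords(
      "exposure",
      where="instrument = 'LSSTComCam' AND exposure.observation_type = 'science'",
  )

  # ...then apply selection logic that would be awkward to express in SQL alone.
  selected = [rec.id for rec in records
              if rec.exposure_time >= 30.0 and rec.id not in excluded]
  print(f"{len(selected)} exposures selected")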

Databases needed:
  • Registry (may contain some but not all information, including information in Collections and visit definitions)
  • e-logging database
  • QA result metrics
  • Human-generated "tag" database -- always about images? Could be about tiles (a minimal schema is sketched below)
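
A minimal sketch of what such a flexible, human-timescale tag table might look like; the SQLite backend, column names, and example values are assumptions, not a proposed design.

  import sqlite3

  conn = sqlite3.connect("human_tags.sqlite3")  # assumed location/backend
  conn.execute(
      """
      CREATE TABLE IF NOT EXISTS tag (
          tag_id     INTEGER PRIMARY KEY,
          instrument TEXT NOT NULL,
          exposure   INTEGER,        -- NULL if the tag is about a tile/tract
          tract      INTEGER,        -- NULL if the tag is about an image
          label      TEXT NOT NULL,  -- e.g. 'wonky_guiding', 'good_for_templates'
          author     TEXT NOT NULL,
          reason     TEXT,
          created    TEXT DEFAULT (datetime('now'))
      )
      """
  )
  conn.execute(
      "INSERT INTO tag (instrument, exposure, label, author, reason) VALUES (?, ?, ?, ?, ?)",
      ("LATISS", 2021012500123, "wonky_guiding", "jdoe",
       "elongated PSF noticed during observing"),
  )
  conn.commit()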

Current Butler/BPS:
  • Don't want to tell it to run an entire DR because it takes a long time to build the entire QGraph
    • Humans currently have to intervene to simplify campaigns to avoid scaling problems (see the sketch after this list)
    • Need to make sure that failures don't cause all processing to fail
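
One illustration of the kind of manual intervention in question: enumerate tracts up front and build a separate, smaller QuantumGraph per tract, limiting the blast radius of any single failure. The repository, skymap, dataset type, and collection names below are assumed.

  from lsst.daf.butler import Butler

  butler = Butler("/repo/main")  # assumed repository path
  tracts = sorted({
      data_id["tract"]
      for data_id in butler.registry.queryDataIds(
          ["tract"],
          datasets="calexp",
          collections="HSC/runs/RC2",
          where="skymap = 'hsc_rings_v1'",
      )
  })

  for tract in tracts:
      # Each per-tract constraint would go to a separate `pipetask qgraph` / BPS
      # submission instead of one giant graph build.
      print(f"skymap = 'hsc_rings_v1' AND tract = {tract}")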

PanDA does not do campaign management, but it can observe campaign progress as it executes a workflow graph
ctrl_bps can generate the workflow graph for PanDA
"Random" campaigns:
  • Routine processing campaigns (like CPP, template generation)
  • Ad hoc processing (need flexible campaign description)


Fermi-GLAST had people looking at quality metrics to add quality flags

Campaign database (a minimal record sketch follows this list):
  • Campaign definition (including input data and pipelines)
  • Campaign definer
  • Outputs/metrics
  • Reason for campaign
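
A minimal sketch of what one campaign-database record covering these fields might look like; the field names and example values are illustrative only, not a proposed schema.

  from dataclasses import dataclass, field
  from typing import Dict, List

  @dataclass
  class CampaignRecord:
      name: str                      # campaign identifier
      definer: str                   # who defined the campaign
      reason: str                    # why the campaign exists
      pipeline: str                  # pipeline definition (e.g. a YAML URI)
      input_collections: List[str]   # input data selection
      data_query: str                # Butler data query expression
      output_collection: str = ""
      metrics: Dict[str, float] = field(default_factory=dict)

  example = CampaignRecord(
      name="RC2-reprocessing-w06",
      definer="jdoe",
      reason="bi-weekly RC2 reprocessing",
      pipeline="pipelines/DRP.yaml",
      input_collections=["HSC/raw/RC2", "HSC/calib"],
      data_query="skymap = 'hsc_rings_v1' AND tract IN (9615, 9697, 9813)",
  )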

Humans loading information into some Butler database to control inputs to campaign processing is OK
That information is distinct from automatically generated metrics because it has to be human-loaded on human timescales
Registry is not suitable for storing this because its schema is supposed to be well-controlled, while this information needs a very flexible, frequently-changing schema
Querying "user tables" is a mechanism for doing input selection as well as QA
  • Need to distinguish between staff and general science users

Action items

Interface Discussion 2021-01-26

Date

Attendees

Goals

  • Define the human input interface for dataset selection for pipeline execution

Discussion items

Desire for human-entered dataset criteria
  • Something is "wonky" for some particular purpose (negative)
  • Or "I want to select these particular datasets to work on" (positive)

Exposure log (LSE-490 3.2.4.2, DMTN-173, Russell Owen prototype) does not fulfill all these needs
  • Writable by any Rubin staff member, but not by science users
  • Readable by any data rights holder, perhaps by anyone worldwide
  • Does not include QA metrics
    • Scalar pipeline-generated values could be added into a Registry table

Registry?
  • If the criterion is boolean, it can be expressed as a collection
    • Persistent and sharable, but only for datasets (may need to invent a concept for a collection of DataIDs)
    • In initial "friendly user" mode, everyone can see everyone's collections
    • But later we will need sharing permissions on these, as well as to figure out deriving collections and quota issues
    • Unfortunately, no provenance information is associated; it needs to be recorded externally
  • Otherwise the information must live outside the Registry, because of the flexible schema and user entry involved
  • Could allow people to upload temporary lists of datasets or DataIDs as a pipetask extension
    • Would prefer a single mechanism if suitable, but may need both
    • Not persistent, not sharable
    • Start with input of DataIDs as JSON since that serialization already exists (an example upload file is sketched below)
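
An assumed shape for such a JSON upload file (a list of DataIDs keyed by primary dimension keys), written here with the standard library; the exact format was still to be settled, so treat the layout as illustrative.

  import json

  data_ids = [
      {"instrument": "LATISS", "exposure": 2021012500123, "detector": 0},
      {"instrument": "LATISS", "exposure": 2021012500124, "detector": 0},
  ]

  with open("wonky_exposures.json", "w") as f:
      json.dump(data_ids, f, indent=2)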

Passing knowledge of "badness" through execution of quantum graph — not strictly part of this topic
  • Sentinel datasets currently understood only by the pair of tasks involved
  • Need something more official in Registry
  • Bad file database for eventual investigation

Conclusions

The Middleware Team will write a tool to upload sets of DataIDs in a file in JSON format to a Butler Registry.  This will require creating a new table in the Registry schema to hold this new "DataID Set" concept; the resulting sets will be persistent and sharable.  The DataIDs must use primary dimension keys in their JSON representation; these can be obtained from the Registry or astro_metadata_translator.  The Middleware Team may create a batch API to aid in performing this conversion to primary keys, but otherwise it is the user's responsibility.
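
As a sketch of the astro_metadata_translator route to primary keys (the raw file path below is hypothetical, and the dimension names assume the usual instrument/exposure/detector keys):

  from astro_metadata_translator import ObservationInfo
  from astropy.io import fits

  header = fits.getheader("AT_O_20210125_000123.fits")  # hypothetical raw file
  info = ObservationInfo(header)

  data_id = {
      "instrument": info.instrument,
      "exposure": info.exposure_id,
      "detector": info.detector_num,
  }
  print(data_id)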

At a later time, the Middleware Team may support an additional, more human-friendly upload file format, perhaps based on CSV.

The Middleware Team will create an API to enable such a "DataID Set" to be resolved into a set of datasets that can be persisted as a user-defined TAGGED Collection.
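
The Middleware Team's API does not exist yet; the sketch below shows the equivalent operation done by hand with current Butler registry calls, under assumed repository, dataset-type, and collection names.

  import json

  from lsst.daf.butler import Butler, CollectionType

  butler = Butler("/repo/main", writeable=True)  # assumed repository path

  with open("wonky_exposures.json") as f:
      data_ids = json.load(f)

  # Resolve each DataID into its raw datasets...
  refs = []
  for data_id in data_ids:
      refs.extend(
          butler.registry.queryDatasets("raw", collections="LATISS/raw/all", dataId=data_id)
      )

  # ...and tag them into a user-defined TAGGED collection.
  tag = "u/jdoe/wonky-exposures"
  butler.registry.registerCollection(tag, CollectionType.TAGGED)
  butler.registry.associate(tag, refs)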

The Middleware Team will modify the pipetask command-line execution tool to allow the upload of sets of DataIDs in upload file format into temporary tables that can be used in the data query expression as inclusion or exclusion conditions.

It is expected that users will develop their own tools/scripts/notebooks for querying all relevant available data sources, including the EFD, the Exposure Log, the Registry, SQuaSH and other QA metrics, and external references, in order to generate DataID sets in upload file format.  The Middleware Team may develop libraries or frameworks to simplify writing these, particularly emphasizing VO query integration.
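
A sketch of the kind of user-written glue script anticipated here. The Exposure Log lookup is represented by a clearly hypothetical helper, fetch_exposure_log_flags(), standing in for whatever client or REST call becomes available; the repository path and instrument are also assumptions.

  import json

  from lsst.daf.butler import Butler

  def fetch_exposure_log_flags(day_obs):
      """Hypothetical: return exposure IDs flagged as bad in the Exposure Log."""
      return set()

  butler = Butler("/repo/main")
  records = butler.registry.queryDimensionRecords(
      "exposure", where="instrument = 'LATISS' AND exposure.day_obs = 20210125"
  )
  flagged = fetch_exposure_log_flags(20210125)

  data_ids = [
      {"instrument": "LATISS", "exposure": rec.id, "detector": 0}
      for rec in records
      if rec.id not in flagged
  ]
  with open("upload.json", "w") as f:
      json.dump(data_ids, f, indent=2)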


Strategy/Database/Tools Discussion 2021-02-05

Date

Attendees

Goals

  • Define the campaign execution strategy including campaign database and tools

Discussion items

  • Types of campaigns
    • DRP
    • Mini-DRP
    • Other periodic (CPP, templates)
    • Bi-weekly production
    • Developer
    • Commissioning
  • Campaign scope (versus workflow/job)
    • Multiple tracts, resource limitations, "blast radius"
    • Global synchronization points
    • User intervention points
  • Campaign database
    • Campaign definition (including input data and pipelines)
    • Campaign definer
    • Reason for campaign
    • Outputs/metrics, progress/status
    • Centralized vs. per-user?
    • Documentation and "attachments"/"supplementary material" in GitHub, Confluence, DMTNs?
    • Anything else?
  • Tools (a minimal command-line sketch follows this list)
    • Create campaign
    • Edit campaign
    • View campaign
    • List campaigns
    • Web-based or command line?
    • Repurpose existing tool/database?
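
A minimal command-line sketch of the create/view/list tools discussed above, backed by one JSON file per campaign; every name, flag, and storage choice here is illustrative rather than a proposal.

  import argparse
  import json
  import pathlib

  CAMPAIGN_DIR = pathlib.Path("campaigns")  # assumed storage location

  def main():
      parser = argparse.ArgumentParser(prog="campaign")
      sub = parser.add_subparsers(dest="command", required=True)

      create = sub.add_parser("create", help="define a new campaign")
      create.add_argument("name")
      create.add_argument("--definer", required=True)
      create.add_argument("--reason", required=True)
      create.add_argument("--pipeline", required=True)
      create.add_argument("--data-query", default="")

      view = sub.add_parser("view", help="show one campaign definition")
      view.add_argument("name")

      sub.add_parser("list", help="list all campaigns")

      args = parser.parse_args()
      CAMPAIGN_DIR.mkdir(exist_ok=True)

      if args.command == "create":
          record = {
              "name": args.name,
              "definer": args.definer,
              "reason": args.reason,
              "pipeline": args.pipeline,
              "data_query": args.data_query,
              "status": "defined",
          }
          (CAMPAIGN_DIR / f"{args.name}.json").write_text(json.dumps(record, indent=2))
      elif args.command == "view":
          print((CAMPAIGN_DIR / f"{args.name}.json").read_text())
      else:  # list
          for path in sorted(CAMPAIGN_DIR.glob("*.json")):
              print(path.stem)

  if __name__ == "__main__":
      main()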