Table of Contents

  • Initial Meeting 2020-12-14
  • Interface Discussion 2021-01-26
  • Strategy/Database/Tools Discussion 2021-02-05

Initial Meeting 2020-12-14

Date

Attendees

Goals

  • Define the campaign management problem and understand what is available and what is missing to solve it

Discussion items


Overall problem: How we manage processing

What happens if data or code is not good?
Use cases:
  • Ad hoc campaigns in Commissioning
  • Reprocess same data in multiple ways, comparing
  • Data good for specific purposes but not general processing
  • Randomized samplings from the seeing distribution
  • 20-year or deeper stacks, and selections from those
  • What data is good or bad and how much so
    • Many different quality metrics
  • Tying V&V back to data: QA results need to be connected to the input data used to measure them

Not sure the Registry is the best place to put quality information
e-logging has its own database

Processing results may change what data is to be taken, not just processed

Use QA results to decide what data and what code to run

Dataset definition falls into two cases:
  1. All information used to select datasets (whether from systems or humans) is known up front, before any campaign execution
  2. Information used to select datasets for downstream pipelines depends on outputs of upstream pipelines; two approaches:
    1. Break the pipelines into two campaigns, one upstream and one downstream; this reduces to the first case (see the sketch below)
    2. Provide a mechanism for pipeline outputs to control the downstream workflow
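
As a rough illustration of the first approach, a downstream campaign's input selection could be derived from what the upstream campaign actually produced. This is only a sketch: the repository path, collection name, dataset type, and instrument below are assumptions, not prescriptions.

  from lsst.daf.butler import Butler

  # Assumed names: repository at /repo/main, upstream outputs in "u/campaign/upstream".
  butler = Butler("/repo/main")
  upstream_refs = butler.registry.queryDatasets(
      "calexp", collections="u/campaign/upstream"
  )

  # Keep only visits that actually produced an upstream output, and turn them
  # into a data query constraint for the downstream campaign.
  good_visits = sorted({ref.dataId["visit"] for ref in upstream_refs})
  visit_list = ", ".join(str(v) for v in good_visits)
  downstream_query = f"instrument = 'LSSTCam' AND visit IN ({visit_list})"
  print(downstream_query)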


Systems needed:
  • Systems to control what data is taken (not scheduler, but ScriptQueue may be adequate)
  • Systems to control what data is used as input (major missing piece)
    • Dataset with a list of dataIds or datasets (store these, as well as who generated them and why)
    • Queries to the Registry or other databases -- but more than SQL is needed: complex programs and human filtering may also be required (see the sketch after this list)
  • Systems to control what code is used to process that data (Gen3 PipelineTasks should be OK)
  • Systems to generate metrics based on inputs/outputs (faro is the basis; should be OK, need to demonstrate scalability)
  • Systems to assist users in translating metrics into data taken/input or processing (including visualization, dashboards, etc.)
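
To make the "more than SQL" point concrete, the sketch below combines a Registry query with arbitrary Python logic and a human-maintained exclusion list. The repository path, instrument, exposure-time threshold, and CSV file name are all assumptions for illustration.

  import csv

  from lsst.daf.butler import Butler

  butler = Butler("/repo/main")  # assumed repository path

  # Human filtering: an exclusion list maintained by hand, one exposure ID per row.
  with open("human_exclusions.csv") as f:
      excluded = {int(row["exposure"]) for row in csv.DictReader(f)}

  # Start from a Registry query...
  records = butler.registry.queryDimensionRecords(
      "exposure",
      where="instrument = 'LSSTComCam' AND exposure.observation_type = 'science'",
  )

  # ...then apply selection logic that would be awkward to express in SQL alone.
  selected = [rec.id for rec in records
              if rec.exposure_time >= 30.0 and rec.id not in excluded]
  print(f"{len(selected)} exposures selected")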

Databases needed:
  • Registry (may contain some but not all information, including information in Collections and visit definitions)
  • e-logging database
  • QA result metrics
  • Human-generated "tag" database -- always about images? Could be about tiles (a minimal schema is sketched below)
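
A minimal sketch of what such a flexible, human-timescale tag table might look like; the SQLite backend, column names, and example values are assumptions, not a proposed design.

  import sqlite3

  conn = sqlite3.connect("human_tags.sqlite3")  # assumed location/backend
  conn.execute(
      """
      CREATE TABLE IF NOT EXISTS tag (
          tag_id     INTEGER PRIMARY KEY,
          instrument TEXT NOT NULL,
          exposure   INTEGER,        -- NULL if the tag is about a tile/tract
          tract      INTEGER,        -- NULL if the tag is about an image
          label      TEXT NOT NULL,  -- e.g. 'wonky_guiding', 'good_for_templates'
          author     TEXT NOT NULL,
          reason     TEXT,
          created    TEXT DEFAULT (datetime('now'))
      )
      """
  )
  conn.execute(
      "INSERT INTO tag (instrument, exposure, label, author, reason) VALUES (?, ?, ?, ?, ?)",
      ("LATISS", 2021012500123, "wonky_guiding", "jdoe",
       "elongated PSF noticed during observing"),
  )
  conn.commit()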

Current Butler/BPS:
  • Don't want to tell it to run an entire DR because it takes a long time to build the entire QGraph
    • Humans currently have to intervene to simplify campaigns to avoid scaling problems (see the sketch after this list)
    • Need to make sure that failures don't cause all processing to fail
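
One illustration of the kind of manual intervention in question: enumerate tracts up front and build a separate, smaller QuantumGraph per tract, limiting the blast radius of any single failure. The repository, skymap, dataset type, and collection names below are assumed.

  from lsst.daf.butler import Butler

  butler = Butler("/repo/main")  # assumed repository path
  tracts = sorted({
      data_id["tract"]
      for data_id in butler.registry.queryDataIds(
          ["tract"],
          datasets="calexp",
          collections="HSC/runs/RC2",
          where="skymap = 'hsc_rings_v1'",
      )
  })

  for tract in tracts:
      # Each per-tract constraint would go to a separate `pipetask qgraph` / BPS
      # submission instead of one giant graph build.
      print(f"skymap = 'hsc_rings_v1' AND tract = {tract}")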

PanDA does not do campaign management, but it can observe campaign progress as it executes a workflow graph
ctrl_bps can generate the workflow graph for PanDA
"Random" campaigns:
  • Routine processing campaigns (like CPP, template generation)
  • Ad hoc processing (need flexible campaign description)


Fermi-GLAST had people looking at quality metrics to add quality flags

Campaign database (a minimal record sketch follows this list):
  • Campaign definition (including input data and pipelines)
  • Campaign definer
  • Outputs/metrics
  • Reason for campaign
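
A minimal sketch of what one campaign-database record covering these fields might look like; the field names and example values are illustrative only, not a proposed schema.

  from dataclasses import dataclass, field
  from typing import Dict, List

  @dataclass
  class CampaignRecord:
      name: str                      # campaign identifier
      definer: str                   # who defined the campaign
      reason: str                    # why the campaign exists
      pipeline: str                  # pipeline definition (e.g. a YAML URI)
      input_collections: List[str]   # input data selection
      data_query: str                # Butler data query expression
      output_collection: str = ""
      metrics: Dict[str, float] = field(default_factory=dict)

  example = CampaignRecord(
      name="RC2-reprocessing-w06",
      definer="jdoe",
      reason="bi-weekly RC2 reprocessing",
      pipeline="pipelines/DRP.yaml",
      input_collections=["HSC/raw/RC2", "HSC/calib"],
      data_query="skymap = 'hsc_rings_v1' AND tract IN (9615, 9697, 9813)",
  )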

Humans loading information into some Butler database to control inputs to campaign processing is OK
That information is distinct from automatically generated metrics because it has to be human-loaded on human timescales
Registry is not suitable for storing this because its schema is supposed to be well-controlled, while this information needs a very flexible, frequently-changing schema
Querying "user tables" is a mechanism for doing input selection as well as QA
  • Need to distinguish between staff and general science users

Action items

Interface Discussion 2021-01-26

Date

Attendees

Goals

  • Define the human input interface for dataset selection for pipeline execution

Discussion items

Desire for human-entered dataset criteria
  • Something is "wonky" for some particular purpose (negative)
  • Or "I want to select these particular datasets to work on" (positive)

Exposure log (LSE-490 3.2.4.2, DMTN-173, Russell Owen prototype) does not fulfill all these needs
  • Writable by any Rubin staff member, but not by science users
  • Readable by any data rights holder, perhaps by anyone worldwide
  • Does not include QA metrics
    • Scalar pipeline-generated values could be added into a Registry table

Registry?
  • If the criterion is boolean, it can be expressed as a collection
    • Persistent and sharable, but only for datasets (may need to invent a concept for a collection of DataIDs)
    • In initial "friendly user" mode, everyone can see everyone's collections
    • But later we will need sharing permissions on these, as well as to figure out deriving collections and quota issues
    • Unfortunately, no provenance information is associated; it needs to be recorded externally
  • Otherwise the information must live outside the Registry, because of the flexible schema and user entry involved
  • Could allow people to upload temporary lists of datasets or DataIDs as a pipetask extension
    • Would prefer a single mechanism if suitable, but may need both
    • Not persistent, not sharable
    • Start with input of DataIDs as JSON since that serialization already exists (an example upload file is sketched below)
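
An assumed shape for such a JSON upload file (a list of DataIDs keyed by primary dimension keys), written here with the standard library; the exact format was still to be settled, so treat the layout as illustrative.

  import json

  data_ids = [
      {"instrument": "LATISS", "exposure": 2021012500123, "detector": 0},
      {"instrument": "LATISS", "exposure": 2021012500124, "detector": 0},
  ]

  with open("wonky_exposures.json", "w") as f:
      json.dump(data_ids, f, indent=2)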

Passing knowledge of "badness" through execution of quantum graph — not strictly part of this topic
  • Sentinel datasets currently understood only by the pair of tasks involved
  • Need something more official in Registry
  • Bad file database for eventual investigation

Conclusions

The Middleware Team will write a tool to upload sets of DataIDs in a file in JSON format to a Butler Registry.  This will require creating a new table in the Registry schema to hold this new "DataID Set" concept; the resulting sets will be persistent and sharable.  The DataIDs must use primary dimension keys in their JSON representation; these can be obtained from the Registry or astro_metadata_translator.  The Middleware Team may create a batch API to aid in performing this conversion to primary keys, but otherwise it is the user's responsibility.
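
As a sketch of the astro_metadata_translator route to primary keys (the raw file path below is hypothetical, and the dimension names assume the usual instrument/exposure/detector keys):

  from astro_metadata_translator import ObservationInfo
  from astropy.io import fits

  header = fits.getheader("AT_O_20210125_000123.fits")  # hypothetical raw file
  info = ObservationInfo(header)

  data_id = {
      "instrument": info.instrument,
      "exposure": info.exposure_id,
      "detector": info.detector_num,
  }
  print(data_id)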

At a later time, the Middleware Team may support an additional, more human-friendly upload file format, perhaps based on CSV.

The Middleware Team will create an API to enable such a "DataID Set" to be resolved into a set of datasets that can be persisted as a user-defined TAGGED Collection.
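
The Middleware Team's API does not exist yet; the sketch below shows the equivalent operation done by hand with current Butler registry calls, under assumed repository, dataset-type, and collection names.

  import json

  from lsst.daf.butler import Butler, CollectionType

  butler = Butler("/repo/main", writeable=True)  # assumed repository path

  with open("wonky_exposures.json") as f:
      data_ids = json.load(f)

  # Resolve each DataID into its raw datasets...
  refs = []
  for data_id in data_ids:
      refs.extend(
          butler.registry.queryDatasets("raw", collections="LATISS/raw/all", dataId=data_id)
      )

  # ...and tag them into a user-defined TAGGED collection.
  tag = "u/jdoe/wonky-exposures"
  butler.registry.registerCollection(tag, CollectionType.TAGGED)
  butler.registry.associate(tag, refs)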

The Middleware Team will modify the pipetask command-line execution tool to allow the upload of sets of DataIDs in upload file format into temporary tables that can be used in the data query expression as inclusion or exclusion conditions.

It is expected that users will develop their own tools/scripts/notebooks for querying all relevant available data sources, including the EFD, the Exposure Log, the Registry, SQuaSH and other QA metrics, and external references, in order to generate DataID sets in upload file format.  The Middleware Team may develop libraries or frameworks to simplify writing these, particularly emphasizing VO query integration.
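
A sketch of the kind of user-written glue script anticipated here. The Exposure Log lookup is represented by a clearly hypothetical helper, fetch_exposure_log_flags(), standing in for whatever client or REST call becomes available; the repository path and instrument are also assumptions.

  import json

  from lsst.daf.butler import Butler

  def fetch_exposure_log_flags(day_obs):
      """Hypothetical: return exposure IDs flagged as bad in the Exposure Log."""
      return set()

  butler = Butler("/repo/main")
  records = butler.registry.queryDimensionRecords(
      "exposure", where="instrument = 'LATISS' AND exposure.day_obs = 20210125"
  )
  flagged = fetch_exposure_log_flags(20210125)

  data_ids = [
      {"instrument": "LATISS", "exposure": rec.id, "detector": 0}
      for rec in records
      if rec.id not in flagged
  ]
  with open("upload.json", "w") as f:
      json.dump(data_ids, f, indent=2)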


Strategy/Database/Tools Discussion 2021-02-05

Date

Attendees

Goals

  • Define the campaign execution strategy including campaign database and tools

Discussion items

  • Types of campaigns
    • DRP
    • Mini-DRP
    • Other periodic (CPP, templates)
    • Bi-weekly production
    • Developer
    • Commissioning
  • Campaign scope (versus workflow/job)
    • Multiple tracts, resource limitations, "blast radius"
    • Global synchronization points
    • User intervention points
  • Campaign database
    • Campaign definition (including input data and pipelines)
    • Campaign definer
    • Reason for campaign
    • Outputs/metrics, progress/status
    • Centralized vs. per-user?
    • Documentation and "attachments"/"supplementary material" in GitHub, Confluence, DMTNs?
    • Anything else?
  • Tools (a minimal command-line sketch follows this list)
    • Create campaign
    • Edit campaign
    • View campaign
    • List campaigns
    • Web-based or command line?
    • Repurpose existing tool/database?
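
A minimal command-line sketch of the create/view/list tools discussed above, backed by one JSON file per campaign; every name, flag, and storage choice here is illustrative rather than a proposal.

  import argparse
  import json
  import pathlib

  CAMPAIGN_DIR = pathlib.Path("campaigns")  # assumed storage location

  def main():
      parser = argparse.ArgumentParser(prog="campaign")
      sub = parser.add_subparsers(dest="command", required=True)

      create = sub.add_parser("create", help="define a new campaign")
      create.add_argument("name")
      create.add_argument("--definer", required=True)
      create.add_argument("--reason", required=True)
      create.add_argument("--pipeline", required=True)
      create.add_argument("--data-query", default="")

      view = sub.add_parser("view", help="show one campaign definition")
      view.add_argument("name")

      sub.add_parser("list", help="list all campaigns")

      args = parser.parse_args()
      CAMPAIGN_DIR.mkdir(exist_ok=True)

      if args.command == "create":
          record = {
              "name": args.name,
              "definer": args.definer,
              "reason": args.reason,
              "pipeline": args.pipeline,
              "data_query": args.data_query,
              "status": "defined",
          }
          (CAMPAIGN_DIR / f"{args.name}.json").write_text(json.dumps(record, indent=2))
      elif args.command == "view":
          print((CAMPAIGN_DIR / f"{args.name}.json").read_text())
      else:  # list
          for path in sorted(CAMPAIGN_DIR.glob("*.json")):
              print(path.stem)

  if __name__ == "__main__":
      main()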