Initial Meeting 2020-12-14
Attendees
- Kian-Tat Lim (scribe)
- Robert Lupton (convener)
- Robert Gruendl
- Hsin-Fang Chiang
- Richard Dubois
- Leanne Guy
Goals
- Define the campaign management problem and understand what is available and what is missing to solve it
Discussion items
- Ad hoc campaigns in Commissioning
- Reprocess same data in multiple ways, comparing
- Data good for specific purposes but not general processing
- Randomized samplings from seeing distribution
- 20 year or deeper stacks, selections from those
- What data is good or bad and how much so
- Many different quality metrics
- Tying V&V back to data: QA results need to be connected to the input data used to measure them
- All information used to select datasets (whether from systems or humans) is known up front before any campaign execution
- Information used to select datasets for downstream pipelines depends on outputs of upstream pipelines
- Break the pipelines into two campaigns, one for the upstream and one for the downstream pipelines; this reduces to the first problem
- Have a mechanism for pipeline outputs to control downstream workflow
- Systems to control what data is taken (not scheduler, but ScriptQueue may be adequate)
- Systems to control what data is used as input (major missing piece)
- A dataset containing a list of DataIDs or datasets (store these, along with who generated them and why)
- Queries to the Registry or other databases -- but more than SQL is needed; complex programs may be required, and human filtering as well
- Systems to control what code is used to process that data (Gen3 PipelineTasks should be OK)
- Systems to generate metrics based on inputs/outputs (faro is the basis; should be OK, need to demonstrate scalability)
- Systems to assist users in translating metrics into data taken/input or processing (including visualization, dashboards, etc.)
- Registry (may contain some but not all information, including information in Collections and visit definitions)
- e-logging database
- QA result metrics
- Human-generated "tag" database -- always about images? Could be about tiles
- Don't want to tell it to run an entire DR, because building the entire QGraph takes a long time
- Humans currently have to intervene to simplify campaigns to avoid scaling problems
- Need to make sure that failures don't cause all processing to fail
- Routine processings (like CPP, template generation)
- Ad hoc processing (need flexible campaign description)
- Campaign definition (including input data and pipelines)
- Campaign definer
- Outputs/metrics
- Reason for campaign
- Need to distinguish between staff and general science users
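The "more than SQL" dataset selection discussed above can be sketched in a few lines. This is a hypothetical illustration with stubbed candidate records; in practice the candidates would come from a Butler Registry query, and the field names here are not a real Registry schema.

```python
# Stub candidate visit records; field names are illustrative only.
candidates = [
    {"visit": 101, "seeing": 0.7, "band": "r"},
    {"visit": 102, "seeing": 1.9, "band": "r"},
    {"visit": 103, "seeing": 0.9, "band": "i"},
]

# A programmatic criterion that would be awkward to express in pure SQL.
def acceptable(rec):
    return rec["seeing"] < 1.2

# Human-curated exclusions (e.g. from an e-log or a "tag" database).
human_excluded = {103}

selected = [
    rec["visit"]
    for rec in candidates
    if acceptable(rec) and rec["visit"] not in human_excluded
]
print(selected)  # -> [101]
```

The point is that selection combines a machine-evaluated predicate with human judgment, so both inputs need to be recorded with the campaign.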
Action items
- Define campaign execution strategy, including the campaign database and tools (Leanne will help; mid-Feb for write-up)
- Define the interface for human input to dataset selection for pipeline execution outside the Registry (mid-Jan to discuss the Gen3 direction, end-Jan for write-up)
Interface Discussion 2021-01-26
Goals
- Define the human input interface for dataset selection for pipeline execution
Discussion items
- Something is "wonky" for some particular purpose (negative)
- Or "I want to select these particular datasets to work on" (positive)
- Writable by any Rubin staff member, but not by science users
- Readable by any data rights holder, maybe by anyone worldwide
- Does not include QA metrics
- Scalar pipeline-generated values could be added into a Registry table
- If boolean, can be expressed as collection
- Persistent and sharable, but only for datasets (may need to invent a concept for a collection of DataIDs)
- In initial "friendly user" mode, everyone can see everyone's collections
- But later, we need to have sharing permissions on these as well as figure out deriving collections and quota issues
- No provenance information associated, unfortunately; need to record externally
- Otherwise it must live outside the Registry, due to the flexible schema and user-entered content
- Could allow people to upload temporary lists of datasets or DataIDs as a pipetask extension
- Would prefer one if suitable, but may need both
- Not persistent, not sharable
- Start with input of DataIDs as JSON since that serialization already exists
- Sentinel datasets currently understood only by the pair of tasks involved
- Need something more official in Registry
- Bad file database for eventual investigation
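A human-generated "tag" as discussed above might look like the record below. This is a minimal sketch; every field name is an assumption made for illustration, and the provenance (author, reason) is kept alongside the tag precisely because the Registry will not store it.

```python
# Hypothetical negative tag: a dataset flagged as "wonky" for some purpose.
wonky_tag = {
    "data_id": {"instrument": "LSSTCam", "exposure": 2021012600123},
    "polarity": "negative",            # "wonky" for some particular purpose
    "purpose": "template generation",  # what the data is bad (or good) for
    "author": "example-staff-user",    # writable by Rubin staff only
    "reason": "scattered-light artifact near bright star",
}

# Positive tags ("select these particular datasets") share the same shape.
good_tag = {**wonky_tag, "polarity": "positive", "reason": "clean deep field"}
print(wonky_tag["polarity"], good_tag["polarity"])  # -> negative positive
```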
Conclusions
The Middleware Team will write a tool to upload sets of DataIDs from a JSON-format file to a Butler Registry. This will require creating a new table in the Registry schema to hold this new "DataID Set" concept; the resulting sets will be persistent and sharable. The DataIDs must use primary dimension keys in their JSON representation; these can be obtained from the Registry or astro_metadata_translator. The Middleware Team may create a batch API to aid in performing this conversion to primary keys, but otherwise it is the user's responsibility.
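A "DataID Set" upload file along these lines can be sketched with the standard library. The exact schema below (the `name` and `reason` fields, and the set-of-records layout) is an assumption for illustration; the meeting only settled on JSON as the first serialization and on primary dimension keys for the DataIDs.

```python
import json

# Hypothetical "DataID Set" in upload file format, using primary dimension
# keys (instrument, exposure, detector) as required by the conclusion above.
dataid_set = {
    "name": "u/example/wonky-ccds",  # illustrative set name
    "reason": "flagged during visual inspection",
    "data_ids": [
        {"instrument": "LSSTCam", "exposure": 2021012600123, "detector": 42},
        {"instrument": "LSSTCam", "exposure": 2021012600124, "detector": 42},
    ],
}

# Serialize for upload and verify the round trip.
serialized = json.dumps(dataid_set, indent=2)
roundtrip = json.loads(serialized)
print(len(roundtrip["data_ids"]))  # -> 2
```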
At a later time, the Middleware Team may support an additional, more human-friendly upload file format, perhaps based on CSV.
The Middleware Team will create an API to enable such a "DataID Set" to be resolved into a set of datasets that can be persisted as a user-defined TAGGED Collection.
The Middleware Team will modify the pipetask command-line execution tool to allow the upload of sets of DataIDs in the upload file format into temporary tables that can be used in the data query expression as inclusion or exclusion conditions.
It is expected that users will develop their own tools/scripts/notebooks for querying all relevant available data sources, including the EFD, the Exposure Log, the Registry, SQuaSH and other QA metrics, and external references, in order to generate DataID sets in upload file format. The Middleware Team may develop libraries or frameworks to simplify writing these, particularly emphasizing VO query integration.
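The kind of user script envisioned above can be sketched as simple set algebra over query results. The sources here are stubbed; in practice they would be queries against the EFD, the Exposure Log, the Registry, and SQuaSH, and all names are illustrative.

```python
import json

# Stub query results from three sources (illustrative values only).
registry_exposures = {123, 124, 125, 126}  # e.g. a Registry dimension query
exposure_log_bad = {125}                   # e.g. flagged in the Exposure Log
squash_passing = {123, 124, 125}           # e.g. passed a SQuaSH metric cut

# Keep exposures known to the Registry that pass the metric cut and are
# not flagged bad, then emit them in a JSON upload-style format.
selected = sorted((registry_exposures & squash_passing) - exposure_log_bad)
upload = [{"instrument": "LSSTCam", "exposure": e} for e in selected]
print(json.dumps(upload))
```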
Strategy/Database/Tools Discussion 2021-02-05
Goals
- Define the campaign execution strategy including campaign database and tools
Discussion items
- Types of campaigns
- DRP
- Mini-DRP
- Other periodic (CPP, templates)
- Bi-weekly production
- Developer
- Commissioning
- Campaign scope (versus workflow/job)
- Multiple tracts, resource limitations, "blast radius"
- Global synchronization points
- User intervention points
- Campaign database
- Campaign definition (including input data and pipelines)
- Campaign definer
- Reason for campaign
- Outputs/metrics, progress/status
- Centralized vs. per-user?
- Documentation and "attachments"/"supplementary material" in GitHub, Confluence, DMTNs?
- Anything else?
- Tools
- Create campaign
- Edit campaign
- View campaign
- List campaigns
- Web-based or command line?
- Repurpose existing tool/database?
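The campaign-database fields listed above (definition, definer, reason, outputs/status) could be gathered into a record like the sketch below. This schema is an assumption for discussion, not a settled design; every field name is illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class CampaignRecord:
    """Hypothetical campaign-database record per the agenda items above."""
    name: str
    pipelines: list[str]          # campaign definition: pipelines used
    input_collections: list[str]  # campaign definition: input data
    definer: str                  # who defined the campaign
    reason: str                   # why the campaign was run
    status: str = "defined"       # progress/status, e.g. defined/running/done
    metrics: dict[str, float] = field(default_factory=dict)  # outputs/metrics

# Example record for an ad hoc mini-DRP-style campaign (all values invented).
camp = CampaignRecord(
    name="mini-drp-2021-02",
    pipelines=["DRP.yaml#step1"],
    input_collections=["HSC/raw/RC2"],
    definer="example-user",
    reason="bi-weekly production test",
)
print(camp.status)  # -> defined
```

A centralized database would store many such records; a per-user approach would scatter them, which bears on the "centralized vs. per-user" question above.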