Butler Use Case and Requirements review 2017-12-15

The Data Butler component of DM will have a use case and requirements review on 2017-12-15 at 12 PM Project Time (PST).

Attendees:

Presenter: Tim Jenness (20171214-Butler-Requirements-Review.pdf)

DM Systems Engineering: Donald Petravick, John Swinbank, Kian-Tat Lim, Gregory Dubois-Felsmann, Robert Lupton, Zeljko Ivezic

Members of the Butler Working Group: Brian Van Klaveren, Russell Owen, Jim Bosch, Unknown User (pschella), Simon Krughoff, Michelle Gower

Others: Andy Salnikov, Fritz Mueller, Unknown User (npease), Margaret Gelman

BlueJeans:

Connecting directly from a room system?

1) Dial: 199.48.152.152 or bjn.vc
2) Enter Meeting ID: 293724745 -or- use the pairing code

Just want to dial in? (all numbers)
1) Direct-dial with my iPhone or
+1 408 740 7256
+1 888 240 2560 (US Toll Free)
+1 408 317 9253 (Alternate Number)

2) Enter Meeting ID: 293724745

Objectives:

The review will check that the use cases and requirements are reasonably complete, correct, and verifiable prior to reviewing the associated design.

The primary documents to be reviewed are LDM-592 for the use cases and LDM-556 Section 1 "Data Abstraction Layer" for the requirements.

Notes:

Use Cases

Selection of use cases based on importance, difficulty

ARCH3:

Calibration will be a driver for getting enhanced information from the EFD to process pixels
Explicitly includes the case of overriding information that was already present with newer, better information

COMM4:

May involve low latency access

Tim Jenness needs to add use case for comparison of pre-release data products with previous Data Release

Use cases may overlap substantially; the extracted requirements may expose commonalities for particular systems, but other systems may have differences in handling the same use cases, so the use cases should be preserved

Overlap of use cases and ConOps (LDM-230) needs to be resolved

SQR14:

This and related use cases/requirements are "remote Butler"
Slides summarize the use cases; they are not word-for-word

Systems Engineering Comments

Don:

Pleased that this has advanced over spreadsheet
ARCH4: alerts may still be useful in batch
Driving use cases are covered

Donald Petravick to submit minor comments

How to do this?
- Send marked-up PDF to Tim
- Pull request on GitHub for LaTeX from master

Gregory:

Only proof-reading quibbles now, hard to determine completeness on one pass
Science Platform workshop showed we have underspecified Python environment for analyzing released data, which are often not simple files; Butler interface to tabular data needs to have a better story

Gregory Dubois-Felsmann to propose a set of models for how to analyze released data in Python notebooks (DM-13560)

Robert:

Calibrations may be strange, need to be retrieved by unusual lookups
Don't see an explicit use case for "reruns" — a name for a set of processing — though this may be satisfied by merging Data Repositories

Robert Lupton will write up a use case for "reruns"

Remote Butler may not be totally captured

K-T:

LDF1/SCIVAL1: can this be optimized if there is too much overhead? If the data is already accessible, staging can be turned off selectively; the use cases do cover running this outside a batch system
Strictness of notebook portability: expected to be no-change from Commissioning Cluster to NCSA
- RHL would consider changing a URI to be acceptable
- GPDF would accept changing a single cell at the top of the notebook
- But either is probably OK with the LSP setting an environment variable
- Portability from LSST-managed environment to another is a goal
- Portability to a non-LSST-managed environment may not be
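
A minimal sketch of the environment-variable option discussed above, in which the platform sets the repository location so that the notebook cells need no edits between sites. The variable name BUTLER_REPO_URI and the helper resolve_repo_uri are hypothetical and are used only for illustration; the review did not specify either.

```python
import os

# Hypothetical variable name; the review did not specify one.
REPO_ENV_VAR = "BUTLER_REPO_URI"


def resolve_repo_uri(environ=os.environ, default=None):
    """Return the repository location configured by the environment.

    The notebook calls this instead of hard-coding a site-specific URI,
    so the same cells can run unchanged wherever the variable is set.
    """
    uri = environ.get(REPO_ENV_VAR, default)
    if uri is None:
        raise RuntimeError(f"{REPO_ENV_VAR} is not set and no default was given")
    return uri


# Example: fall back to an illustrative local path when the variable is unset.
repo_uri = resolve_repo_uri(default="file:///tmp/example_repo")
```

With this pattern the only site-specific setting lives outside the notebook, which matches the "no-change" expectation between LSST-managed environments.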

Remote Butler

Minimum level: Butler datastore plugin that understands TAP and SIA and can retrieve data but not put it
- But do need to query registry; this is via TAP currently; we are not expecting to expose a SQL server directly to the Internet
GPDF: Portability to laptop is a test case:
- Running outside a DAC but with full Internet accessibility to a DAC: even if we can't actually make that work, we should be designing it so that it could work
- Functionality related to batch job pre-flight to take a data subset and clone it to your laptop and then use that with the Butler
  - BrianVK: Subsetting is not actually remote Butler; it is a sandboxed Butler which is different; import/export vs. dynamic interface
- No requirement to do file-oriented put operations via a remote Butler
RHL: Freeze-drying of pipelines is a related use case
Jim Bosch: If things are stored on filesystems, do we need to go through VOSpace?
- GPDF: presumption in the DAC is that we are accessing the underlying POSIX filesystems
- Could potentially use a FUSE-mounted VOSpace if such can be done
- Do not need to work with non-LSST VOSpaces
Fritz: pushed back on it in the past because of worries that implementation would have unforeseen difficulties, but fine with ensuring architecture can support it
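
The minimum level described at the start of this section (retrieve but not put) could be sketched as a read-only datastore. All names here are hypothetical: RemoteReadOnlyDatastore is not an existing Butler class, and the fetch callable merely stands in for whatever TAP or SIA client would perform the retrieval.

```python
class ReadOnlyDatastoreError(RuntimeError):
    """Raised when a write is attempted on a read-only datastore."""


class RemoteReadOnlyDatastore:
    """Illustrative datastore that can retrieve datasets but never store them."""

    def __init__(self, fetch):
        # fetch: callable taking a dataset reference and returning the data.
        # It is a placeholder for a remote TAP or SIA request.
        self._fetch = fetch

    def get(self, ref):
        """Retrieve the dataset identified by ref from the remote service."""
        return self._fetch(ref)

    def put(self, ref, obj):
        """Always refuse: file-oriented puts are not required remotely."""
        raise ReadOnlyDatastoreError("remote datastore does not support put")


# Example with an in-memory dictionary standing in for the remote service.
fake_remote = {"calexp-1": "pixels for calexp-1"}
datastore = RemoteReadOnlyDatastore(fake_remote.__getitem__)
image = datastore.get("calexp-1")
```

Registry queries are deliberately left out of this sketch; per the notes they would go through TAP rather than a directly exposed SQL server.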

Requirements

Priorities: 1a = first version, 1b = by end of 2018, 2 = by Commissioning, 3 = later; although it is already unlikely that all 1a will be done by the end of S18; confusing that these priorities are different from LSE-61

Tim Jenness to come up with a better way of indicating priority and schedule

0083:

This is what SuperTask needs for preflight: taking a specification for units of work and dividing it up by requirements for SuperTask algorithms into executable chunks; this site-independent determination of work is used by the workflow system to prepare and/or stage data and to harvest outputs when job is complete
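
As a rough illustration of the preflight step described above, the sketch below divides a list of units of work into executable chunks according to the keys an algorithm needs grouped together. The function divide_work and the keys "visit" and "detector" are assumptions made for this example and are not taken from the requirements document.

```python
from collections import defaultdict


def divide_work(data_ids, group_keys):
    """Group data IDs into chunks that share the same values for group_keys.

    Each chunk is one site-independent unit that a workflow system could
    stage inputs for and later harvest outputs from.
    """
    chunks = defaultdict(list)
    for data_id in data_ids:
        key = tuple(data_id[k] for k in group_keys)
        chunks[key].append(data_id)
    return dict(chunks)


# Example: an algorithm that must see all detectors of a visit together.
data_ids = [
    {"visit": 1, "detector": 10},
    {"visit": 1, "detector": 11},
    {"visit": 2, "detector": 10},
]
per_visit = divide_work(data_ids, ["visit"])
```

Nothing in the grouping depends on where the data lives, which is the property the requirement relies on.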

0053:

ConcreteDataset is really an InMemoryDataset

0081:

Butler can be asked for provenance for InMemoryDatasets, but not necessarily within InMemoryDataset itself
Provenance might be a composite part of certain Dataset Types but not all

What are the components that the requirements apply to?

Data Discovery System
Data Input System
Data Output System
Data Repository Creation System
- Also other tools related to Data Repositories
- Data Repository is metadata that is independent of exactly how files are stored

DataRepository == Data Repository; formatting will be fixed

0012:

Implicitly includes rerun concept

0011:

Could apply to both workflow preflight and remote laptop, but also for creating test datasets
- Although Butler not expected to be used for workflow, especially transfer of files
- Transferring files poses potential problems (where from and to? by what mechanism?)

0088:

"not normally known" should actually be "must be known"

0003:

Version is of implementation of the repository and its data model, not versions of the data itself
Virtual data from old datasets and old repository versions? Schema evolution of database?
- Migration process may be complicated; doing this for a local database might be different than the Consolidated DB
- Schema evolution is in fact the primary thing here
RHL: Have to worry about code that works against a given repository but code changes; need to be able to use old Butlers to continue to read old Repositories
- Code that calls the Butler shouldn't have to change, but keeping an old Butler working against migrated Repositories is difficult
- Generally have to keep non-migrated versions of Repositories (for read-only frozen data); probably means keeping metadata in original form or enabling both upgrade and downgrade; shouldn't require copying data, only revising the metadata
- Need to make sure use cases cover this
- Gregory Dubois-Felsmann will communicate with DMSST to get use cases for preservation of old data and code (DM-13559)
- Brian Van Klaveren will add a use case for being able to round-trip a Repository through an upgrade and back (DM-13558)
Acceptable to require migration from pre-Gen3 to Gen3 Repositories
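
The round-trip idea above (upgrade a Repository's metadata and then downgrade it again without copying data) can be sketched as follows. The metadata layout, the version numbers, and both function names are hypothetical; they only illustrate that migration touches metadata and leaves the referenced files alone.

```python
import copy


def upgrade_v1_to_v2(meta):
    """Return version-2 metadata; dataset files themselves are not touched."""
    new = copy.deepcopy(meta)
    new["schema_version"] = 2
    # Version 2 (hypothetically) stores a record per dataset instead of a bare path.
    new["datasets"] = [{"path": p, "tags": []} for p in meta["datasets"]]
    return new


def downgrade_v2_to_v1(meta):
    """Return version-1 metadata; any version-2-only fields are dropped."""
    new = copy.deepcopy(meta)
    new["schema_version"] = 1
    new["datasets"] = [d["path"] for d in meta["datasets"]]
    return new


# Round trip: upgrading and then downgrading restores the original metadata.
original = {"schema_version": 1, "datasets": ["raw/0001.fits", "raw/0002.fits"]}
round_tripped = downgrade_v2_to_v1(upgrade_v1_to_v2(original))
assert round_tripped == original
```

The round trip is lossless here only because nothing was added while at version 2; information recorded in version-2-only fields would be lost on downgrade, which is one of the difficulties a real use case would need to address.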

Systems Engineering Comments

Don: namespace for Repositories and evolution of names

Data model provides for tags, but no naming scheme for them; provides mechanism but no policy has been defined and the software does not enforce any policy

Outcome:

Systems Engineering is comfortable with moving forward with design and prototyping given above actions

A design review will follow to confirm that the design meets these use cases and requirements, hopefully in early January

Kian-Tat Lim to schedule Butler Design Review

Action Items:

Incomplete tasks recorded on this page appear below. As usual: if a task is too time-consuming to finish quickly, please create a JIRA story for it, add the JIRA story macro for it to the Confluence task above, and mark the task complete.

Description	Assignee	Task appears on
Brian Van Klaveren will add a use case for being able to round-trip a Repository through an upgrade and back (DM-13558)	Brian Van Klaveren	Butler Use Case and Requirements review 2017-12-15
Gregory Dubois-Felsmann to propose a set of models for how to analyze released data in Python notebooks (DM-13560)	Gregory Dubois-Felsmann	Butler Use Case and Requirements review 2017-12-15
Gregory Dubois-Felsmann will communicate with DMSST to get use cases for preservation of old data and code (DM-13559)	Gregory Dubois-Felsmann	Butler Use Case and Requirements review 2017-12-15
