The Data Butler component of DM will have a use case and requirements review on 2017-12-15 at 12 PM Project Time (PST).

Attendees:

Presenter: Tim Jenness (20171214-Butler-Requirements-Review.pdf)

DM Systems Engineering: Donald Petravick, John Swinbank, Kian-Tat Lim, Gregory Dubois-Felsmann, Robert Lupton, Zeljko Ivezic

Members of the Butler Working Group: Brian Van Klaveren, Russell Owen, Jim Bosch, Unknown User (pschella), Simon Krughoff, Michelle Gower

Others: Andy Salnikov, Fritz Mueller, Unknown User (npease), Margaret Gelman

BlueJeans:

Connecting directly from a room system?

1) Dial: 199.48.152.152 or bjn.vc
2) Enter Meeting ID: 293724745 -or- use the pairing code


Just want to dial in? (all numbers)
1) Direct-dial with my iPhone or
+1 408 740 7256
+1 888 240 2560 (US Toll Free)
+1 408 317 9253 (Alternate Number)

2) Enter Meeting ID: 293724745

Objectives:

The review will check that the use cases and requirements are reasonably complete, correct, and verifiable prior to reviewing the associated design.

The primary documents to be reviewed are LDM-592 for the use cases and LDM-556 Section 1 "Data Abstraction Layer" for the requirements.

Notes:

Use Cases

Selection of use cases based on importance, difficulty

ARCH3:

  • Calibration will be a driver for getting enhanced information from the EFD to process pixels
  • Explicitly includes the case of overriding information that was already present with newer, better information

COMM4:

  • May involve low latency access
  • Tim Jenness needs to add use case for comparison of pre-release data products with previous Data Release

Use cases may overlap substantially; the extracted requirements may expose commonalities for particular systems, but other systems may have differences in handling the same use cases, so the use cases should be preserved

Overlap of use cases and ConOps (LDM-230) needs to be resolved

SQR14:

  • This and related use cases/requirements are "remote Butler"
  • The slides summarize the use cases; they are not word-for-word

Systems Engineering Comments

Don:

  • Pleased that this has advanced over spreadsheet
  • ARCH4: alerts may still be useful in batch
  • Driving use cases are covered
  • How to submit comments?
    • Send marked-up PDF to Tim
    • Pull request on GitHub for LaTeX from master

Gregory:

  • Only proof-reading quibbles now, hard to determine completeness on one pass
  • Science Platform workshop showed we have underspecified Python environment for analyzing released data, which are often not simple files; Butler interface to tabular data needs to have a better story

Robert:

  • Calibrations may be strange, need to be retrieved by unusual lookups
  • Don't see an explicit use case for "reruns" — a name for a set of processing — though this may be satisfied by merging Data Repositories
  • Remote Butler may not be totally captured

K-T:

  • LDF1/SCIVAL1: can this be optimized if there is too much overhead? If data is accessible, staging can be turned off selectively; the use cases do cover running outside the batch system
  • Strictness of notebook portability: expected to be no-change from Commissioning Cluster to NCSA
    • RHL would consider changing a URI to be acceptable
    • GPDF would accept changing a single cell at the top of the notebook
    • But either is probably OK with the LSP setting an environment variable
    • Portability from LSST-managed environment to another is a goal
    • Portability to a non-LSST-managed environment may not be
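
As an illustration of the environment-variable approach discussed above, a minimal sketch follows. The variable name `BUTLER_REPO_URI` and the helper `resolve_repo_uri` are hypothetical; the review did not specify either. The idea is only that the notebook body stays unchanged and the site (e.g. the LSP) supplies the repository location.

```python
import os

# Hypothetical variable name; the notes do not specify one.
REPO_ENV_VAR = "BUTLER_REPO_URI"


def resolve_repo_uri(default=None, environ=None):
    """Return the repository URI for the current environment.

    A value set by the environment takes precedence over the default, so
    the notebook itself does not need editing when it moves between sites.
    """
    environ = os.environ if environ is None else environ
    uri = environ.get(REPO_ENV_VAR, default)
    if uri is None:
        raise RuntimeError(
            f"Set {REPO_ENV_VAR} or pass a default repository URI"
        )
    return uri
```

A notebook would call `resolve_repo_uri()` once in its first cell and pass the result to whatever opens the repository.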

Remote Butler

  • Minimum level: Butler datastore plugin that understands TAP and SIA and can retrieve data but not put it
    • But do need to query registry; this is via TAP currently; we are not expecting to expose a SQL server directly to the Internet
  • GPDF: Portability to laptop is a test case:
    • Running outside a DAC but with full Internet accessibility to a DAC: even if we can't actually make that work, we should be designing it so that it could work
    • Functionality related to batch job pre-flight to take a data subset and clone it to your laptop and then use that with the Butler
      • BrianVK: Subsetting is not actually remote Butler; it is a sandboxed Butler which is different; import/export vs. dynamic interface
    • No requirement to do file-oriented put operations via a remote Butler
  • RHL: Freeze-drying of pipelines is a related use case
  • Jim Bosch: If things are stored on filesystems, do we need to go through VOSpace?
    • GPDF: presumption in the DAC is that we are accessing the underlying POSIX filesystems
    • Could potentially use a FUSE-mounted VOSpace if such can be done
    • Do not need to work with non-LSST VOSpaces
  • Fritz: pushed back on it in the past because of worries that implementation would have unforeseen difficulties, but fine with ensuring architecture can support it
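
The "minimum level" remote Butler described above (retrieve but not put) could be sketched roughly as follows. This is not the real Butler API: the `Datastore` interface, the class names, and the `locate`/`fetch` callables are all assumptions made for illustration. The callables stand in for a TAP registry query and an SIA-style retrieval so that no real service calls are invented here.

```python
from abc import ABC, abstractmethod


class Datastore(ABC):
    """Hypothetical minimal datastore interface (not the real Butler API)."""

    @abstractmethod
    def get(self, dataset_ref):
        """Return the dataset identified by dataset_ref."""

    @abstractmethod
    def put(self, dataset_ref, obj):
        """Store obj under dataset_ref."""


class ReadOnlyRemoteDatastore(Datastore):
    """Datastore that can retrieve remote data but refuses to write."""

    def __init__(self, locate, fetch):
        # locate: dataset_ref -> URL or None (stand-in for a TAP query)
        # fetch:  URL -> bytes (stand-in for an SIA-style retrieval)
        self._locate = locate
        self._fetch = fetch

    def get(self, dataset_ref):
        url = self._locate(dataset_ref)
        if url is None:
            raise LookupError(f"No dataset found for {dataset_ref!r}")
        return self._fetch(url)

    def put(self, dataset_ref, obj):
        # The review recorded no requirement for file-oriented put
        # operations via a remote Butler.
        raise NotImplementedError("Remote datastore is read-only")
```

Injecting `locate` and `fetch` keeps the sketch testable offline and leaves the choice of transport to the caller.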

Requirements

Priorities: 1a = first version, 1b = by end of 2018, 2 = by Commissioning, 3 = later. It is already unlikely that all 1a items will be done by the end of S18. It is confusing that these priorities differ from those in LSE-61.

  • Tim Jenness to come up with a better way of indicating priority and schedule

0083:

  • This is what SuperTask needs for preflight: taking a specification for units of work and dividing it up by requirements for SuperTask algorithms into executable chunks; this site-independent determination of work is used by the workflow system to prepare and/or stage data and to harvest outputs when job is complete
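
A toy version of the preflight step in 0083 might look like the sketch below. Everything here is an assumption for illustration: the `preflight` function, the dictionary-based data IDs, and the grouping rule are not SuperTask or Butler interfaces. It only shows the shape of the idea: a site-independent function turns a specification of work into chunks, each listing the inputs to stage and the outputs to harvest.

```python
from collections import defaultdict


def preflight(data_ids, group_key, output_type):
    """Split data IDs into independent, executable chunks of work.

    data_ids:    list of dicts describing input datasets
    group_key:   the key whose value defines one unit of work
    output_type: hypothetical name of the dataset type each chunk produces
    """
    groups = defaultdict(list)
    for data_id in data_ids:
        groups[data_id[group_key]].append(data_id)

    # Each chunk names what to stage beforehand and what to harvest after.
    return [
        {
            "inputs": members,
            "outputs": [{"type": output_type, group_key: key}],
        }
        for key, members in sorted(groups.items())
    ]
```

A workflow system could iterate over the returned chunks without knowing anything about where the files live.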

0053:

  • ConcreteDataset is really an InMemoryDataset

0081:

  • Butler can be asked for provenance for InMemoryDatasets, but not necessarily within InMemoryDataset itself
  • Provenance might be a composite part of certain Dataset Types but not all

What are the components that the requirements apply to?

  • Data Discovery System
  • Data Input System
  • Data Output System
  • Data Repository Creation System
    • Also other tools related to Data Repositories
    • Data Repository is metadata that is independent of exactly how files are stored

DataRepository == Data Repository; formatting will be fixed

0012:

  • Implicitly includes rerun concept

0011:

  • Could apply to both workflow preflight and remote laptop, but also for creating test datasets
    • Although Butler not expected to be used for workflow, especially transfer of files
    • Transferring files poses potential problems (where from and to? by what mechanism?)

0088:

  • "not normally known" should actually be "must be known"

0003:

  • Version is of implementation of the repository and its data model, not versions of the data itself
  • Virtual data from old datasets and old repository versions? Schema evolution of database?
    • Migration process may be complicated; doing this for a local database might be different than the Consolidated DB
    • Schema evolution is in fact the primary thing here
  • RHL: Have to worry about code that works against a given repository when the code later changes; need to be able to use old Butlers to continue to read old Repositories
    • Code that calls the Butler shouldn't have to change, but keeping an old Butler working against migrated Repositories is difficult
    • Generally have to keep non-migrated versions of Repositories (for read-only frozen data); probably means keeping metadata in original form or enabling both upgrade and downgrade; shouldn't require copying data, only revising the metadata
    • Need to make sure use cases cover this
    • Gregory Dubois-Felsmann will communicate with DMSST to get use cases for preservation of old data and code (DM-13559)
    • Brian Van Klaveren will add a use case for being able to round-trip a Repository through an upgrade and back (DM-13558)
  • Acceptable to require migration from pre-Gen3 to Gen3 Repositories
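
The round-trip idea under 0003 (upgrade a Repository and bring it back by revising only the metadata) can be illustrated with a small sketch. The metadata layout, the version numbers, and the `path`/`uri` field names are invented for this example and do not describe any actual repository schema.

```python
import copy

URI_PREFIX = "file://"


def upgrade(metadata):
    """Hypothetical version 1 -> 2 migration: 'path' becomes 'uri'.

    Only the metadata is rewritten; the data files are never copied.
    """
    if metadata["version"] != 1:
        raise ValueError("expected a version 1 repository")
    new = copy.deepcopy(metadata)
    for entry in new["datasets"]:
        entry["uri"] = URI_PREFIX + entry.pop("path")
    new["version"] = 2
    return new


def downgrade(metadata):
    """Inverse of upgrade: version 2 -> 1, restoring 'path'."""
    if metadata["version"] != 2:
        raise ValueError("expected a version 2 repository")
    new = copy.deepcopy(metadata)
    for entry in new["datasets"]:
        uri = entry.pop("uri")
        if not uri.startswith(URI_PREFIX):
            raise ValueError(f"cannot downgrade non-file URI {uri!r}")
        entry["path"] = uri[len(URI_PREFIX):]
    new["version"] = 1
    return new
```

A round-trip check is then `downgrade(upgrade(m)) == m` for representative metadata `m`. Note that the downgrade refuses entries it cannot represent in the old form, which is one way a real migration could fail to be reversible.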

Systems Engineering Comments

Don: namespace for Repositories and evolution of names

  • Data model provides for tags, but no naming scheme for them; provides mechanism but no policy has been defined and the software does not enforce any policy
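
To make the mechanism-versus-policy distinction concrete, here is a hedged sketch. The `TagRegistry` class, the assumption that tags are attached to repositories, and the `u/<user>/<label>` naming scheme are all hypothetical; the review noted precisely that no such policy has been defined.

```python
import re


class TagRegistry:
    """Tagging mechanism with an optional, pluggable naming policy."""

    def __init__(self, policy=None):
        # policy: callable taking a tag name and returning True if allowed.
        # With no policy, any name is accepted (mechanism without policy).
        self._policy = policy
        self._tags = {}

    def tag(self, name, repository):
        if self._policy is not None and not self._policy(name):
            raise ValueError(f"Tag name {name!r} violates the naming policy")
        self._tags.setdefault(name, set()).add(repository)

    def lookup(self, name):
        return sorted(self._tags.get(name, ()))


def user_prefixed(name):
    """Example policy (an assumption): names look like 'u/<user>/<label>'."""
    return re.fullmatch(r"u/[a-z0-9]+/[\w.-]+", name) is not None
```

`TagRegistry()` accepts any name, while `TagRegistry(policy=user_prefixed)` rejects names outside the scheme.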

Outcome:

Systems Engineering is comfortable with moving forward with design and prototyping given above actions

A design review will follow to confirm that the design meets these use cases and requirements, hopefully in early January

Action Items:

Incomplete tasks recorded on this page appear below.  As usual: if a task is too time-consuming to finish quickly, please create a JIRA story for it, add the JIRA story macro for it to the Confluence task above, and mark the task complete.

Description | Due date | Assignee | Task appears on
Brian Van Klaveren will add a use case for being able to round-trip a Repository through an upgrade and back (DM-13558) | | Brian Van Klaveren | Butler Use Case and Requirements review 2017-12-15
| | Gregory Dubois-Felsmann | Butler Use Case and Requirements review 2017-12-15
| | Gregory Dubois-Felsmann | Butler Use Case and Requirements review 2017-12-15
