Meeting convened at 11am Project Time.

Attending


Next Meeting: Thursday August 10th @ 10am


Agenda

  • Overview of charge: LDM-563.
  • What is the butler?
  • Use Cases from each person (first drafts due next meeting)


What is the butler?

From LDM-152:


This component is the framework by which applications retrieve datasets from and persist datasets to file and database storage. It provides a flexible way of identifying datasets, a pluggable mechanism for discovering and locating them, and a separate pluggable mechanism for reading and writing them.


Q: What is an application?
Q: What is a dataset? What is a repository? What is a data ID?


Different aspects:


  • Serialization vs. deserialization of Python objects.
  • Data discovery (database, file, object store, calibrations)
  • Application APIs (get vs put vs Dataset “type”?)
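The get/put aspect above could be sketched roughly as follows. This is a toy in-memory model for discussion, not the actual butler API; all names (Butler, the dict-based store) are illustrative.

```python
class Butler:
    """Toy butler: maps (dataset type, frozen data ID) -> Python object."""

    def __init__(self):
        self._store = {}

    def put(self, obj, dataset_type, data_id):
        # Data IDs are dicts like {"visit": 123, "ccd": 4}; freeze them
        # so they can be used as part of a hashable key.
        self._store[(dataset_type, frozenset(data_id.items()))] = obj

    def get(self, dataset_type, data_id):
        return self._store[(dataset_type, frozenset(data_id.items()))]


butler = Butler()
butler.put({"pixels": "..."}, "calexp", {"visit": 123, "ccd": 4})
image = butler.get("calexp", {"visit": 123, "ccd": 4})  # same object back
```

The point of the sketch is that callers name a dataset type plus a data ID and never see file paths or formats; discovery and serialization would plug in behind this interface.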

Use Cases

Jim Bosch:

  • Running SuperTask/CmdLineTask requires being able to get and put files within the environment in which the process is running (e.g., a batch environment).
  • Prior to launching a SuperTask, need to ask for a data structure that contains the metadata and relationships: a graph-like structure capturing input-to-output relationships. [Which visits/CCDs overlap which tracts? Which PVIs already exist?] (This implicitly covers determination of the relevant calibrations.)
    • How do you override the calibration files rather than using the default best ones? (An override might affect all jobs.)
    • Override options in the DAG query?
    • Repository layering? Have multiple views into the DBB that are virtual repositories.
    • Does the DBB understand butler repositories, or does the butler support the DBB as a plugin?
  • Have to stage data from the DBB to the processing environment in which a SuperTask runs. Is this the butler's job? Putting the data back as well?
  • Open datasets from Jupyter notebooks in the LSP.
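The pre-flight "graph-like data structure" could look something like the sketch below: from known visit/tract overlaps, build input-to-output edges and skip outputs that already exist. The visit numbers, tract names, and dataset labels here are invented for illustration.

```python
# Assumed known from metadata: which tracts each visit overlaps.
overlaps = {
    101: ["t1", "t2"],
    102: ["t2"],
}
# Outputs already present in the repository (so no work to schedule).
existing_outputs = {("coadd", "t2")}

graph = []  # list of (inputs, output) edges, one per tract to produce
tracts = sorted({t for ts in overlaps.values() for t in ts})
for tract in tracts:
    output = ("coadd", tract)
    if output in existing_outputs:
        continue  # PVI/coadd already exists; skip it
    inputs = [("calexp", v) for v, ts in overlaps.items() if tract in ts]
    graph.append((inputs, output))

# graph == [([("calexp", 101)], ("coadd", "t1"))]
```

A real implementation would answer the bracketed questions (overlaps, existing PVIs) with database queries rather than literal dicts, but the shape of the answer is this input-to-output graph.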

Russell Owen:

  • UW: responsible for template access, reference catalogs, ISR, and publishing of alerts.
  • Use case: publishing of alerts. How do we populate the L1DB?
  • SuperTask in the L1 environment is slightly different. Crosstalk file support.
  • Programmatically generated dataset types: what about in productions? Register them ahead of time? A single DRP should be frozen, but there may be lots of churn leading up to a DRP freeze. Can a SuperTask dynamically come up with a dataset type that was not expected during the pre-flight phase?
  • Multi-threaded config I/O? Possibly a SuperTask execution framework requirement (add to the SuperTask requirements?).
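The "register ahead of time vs. dynamic" trade-off raised above could be modeled as a registry that accepts new dataset types during development churn but rejects them once a production (e.g., a DRP) is frozen. The class and names below are invented to make the trade-off concrete.

```python
class DatasetTypeRegistry:
    """Toy registry: dynamic until frozen, then declarations are fixed."""

    def __init__(self):
        self._types = set()
        self.frozen = False

    def register(self, name):
        if self.frozen:
            raise RuntimeError(
                f"registry frozen; {name!r} must be declared pre-flight"
            )
        self._types.add(name)


reg = DatasetTypeRegistry()
reg.register("deepDiff_differenceExp")  # fine during pre-freeze churn

reg.frozen = True  # e.g., at the DRP freeze
try:
    reg.register("surpriseDataset")  # a type not expected pre-flight
    blocked = False
except RuntimeError:
    blocked = True  # the frozen production rejects it
```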

Brian Van Klaveren:

  • DAX services: data staging for the cutout service, and virtual product generation?
    • Have to be able to stitch multiple files together, so we need to query metadata and stage multiple files.
  • Talk about random access vs stream access.
    • Data models supported by the butler. Different contexts of I/O.
    • Tables vs images
    • FITS vs HDF5 vs JSON vs byte streaming vs database record streaming
  • Relocatable Jupyter notebooks. That needs a remote butler.
    • TAP/ADQL/SIA could “just work”.
    • Or else use a VOSpace clone.
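The "FITS vs HDF5 vs JSON" point above amounts to pluggable formatters: the butler would pick a serialization per dataset type rather than hard-coding one. A minimal sketch, with the format table and dataset-type names invented here (only JSON is wired up; FITS/HDF5 would plug in the same way):

```python
import json

# dataset type -> (format name, writer, reader); illustrative only.
FORMATTERS = {
    "sourceTable": ("json", json.dumps, json.loads),
    # "calexp": ("fits", write_fits, read_fits),  # would register similarly
}


def serialize(dataset_type, obj):
    fmt, write, _ = FORMATTERS[dataset_type]
    return fmt, write(obj)


fmt, payload = serialize("sourceTable", [{"id": 1, "flux": 2.5}])
round_trip = FORMATTERS["sourceTable"][2](payload)
# fmt == "json"; round_trip == [{"id": 1, "flux": 2.5}]
```

This is also where the random-access vs. stream-access distinction would live: a formatter for byte streaming would expose a different read path than one for whole-file FITS I/O.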

Simon Krughoff:

  • Subsetting repositories based on datasets and data IDs.
    • Given a list of data IDs, I need a coherent, self-consistent, standalone repo of PVIs and deep coadds.
      • Infer the coadds based on the datasets.
  • We need a mechanism for discovering data along multiple axes (good seeing, bad seeing, time-based, area of sky). The image characterization pipeline publishes relevant results to the DBB; later pipelines should then be able to query based on them.
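The multi-axis discovery idea could be sketched as filtering published characterization metadata along independent axes (seeing, time, and so on). The records, field names, and thresholds below are invented; in practice this would be a query against the DBB rather than a Python filter.

```python
# Metadata as the image characterization pipeline might publish it.
records = [
    {"visit": 1, "seeing": 0.6, "mjd": 59000.1},
    {"visit": 2, "seeing": 1.4, "mjd": 59000.2},
    {"visit": 3, "seeing": 0.7, "mjd": 59001.0},
]


def discover(records, max_seeing=None, mjd_range=None):
    """Return visits matching every axis constraint that was supplied."""
    out = []
    for r in records:
        if max_seeing is not None and r["seeing"] > max_seeing:
            continue  # fails the seeing axis
        if mjd_range is not None and not (mjd_range[0] <= r["mjd"] <= mjd_range[1]):
            continue  # fails the time axis
        out.append(r["visit"])
    return out


good_seeing = discover(records, max_seeing=0.8)                 # [1, 3]
first_night = discover(records, mjd_range=(59000.0, 59000.5))   # [1, 2]
```

Each keyword argument stands in for one discovery axis; an area-of-sky axis would be a further constraint of the same shape.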

