Meeting began @ 10am; ended @ 11:45am Project Time.
Attending
- Tim Jenness
- Te-Wei Tsai
- Michelle Gower
- Dominique Boutigny
- Unknown User (xiuqin)
- Unknown User (pschella)
- Simon Krughoff
- Jim Bosch
- Brian Van Klaveren
- Russell Owen
Next meeting: Wednesday during unconference session of LSST2017. Starting 4pm (possibly 3:30pm). Room TBD.
Use Cases
Draft use cases were written by the WG members. Many of the use cases read as requirements rather than use cases. People were actioned to adjust their use cases before next week's meeting. We started by going through those from Simon Krughoff: Use cases for Science Platform. This resulted in a 50 minute discussion on defining terms, and in particular, what a "dataset" really means in this context.
What is a dataset?
A calexp was used as a concrete example. The concepts that we need to name are:
- The name "calexp" itself (currently "dataset type").
- A calexp with a particular data ID (a specific instance of a calexp that can be uniquely identified by its metadata). A "butler dataset"?
- A calexp in its in-memory form (for example an afw Exposure object).
- A "collection of calexps" (all the files generated for a data release? "The Gaia dataset").
- A persisted calexp (a FITS file; database table entries; a "blob" in git-speak).
- Russell Owen also noted a need for a "prototype dataset".
The main outcome was that "dataset" as a term can cover large collections of data files but also individual data arrays within a single file (as for HDF5).
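The distinctions above can be sketched with a few hypothetical Python classes. The names DatasetType and DatasetRef are illustrative only, not terminology agreed by the WG:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetType:
    """The *name* of a kind of dataset, e.g. "calexp"."""
    name: str


@dataclass(frozen=True)
class DatasetRef:
    """A specific instance of a dataset type, uniquely identified
    by its data ID metadata (one candidate meaning of "butler dataset")."""
    dataset_type: DatasetType
    data_id: tuple  # e.g. (("visit", 903334), ("ccd", 16))


# A "collection" is then simply a named set of such references,
# e.g. all the calexps generated for one data release.
calexp_type = DatasetType("calexp")
ref = DatasetRef(calexp_type, (("visit", 903334), ("ccd", 16)))
collection = {ref}
```

The in-memory form (an afw Exposure) and the persisted form (a FITS file, database rows, a blob) would then be two different materializations of the same DatasetRef.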
New use cases
During the discussion new Use Cases surfaced:
- Tim Jenness noted the need for a butler to read in persisted data in one format and write out a processed version in a different format. This might be needed for the DAX services layer if the internal format differs from the format requested by the end user astronomer.
- Simon Krughoff noted that it would be useful if plots (in PNG format) and, possibly, YAML files containing metrics could be handled by the butler. Michelle Gower indicated that DES does do this for PNG, although they do not read metadata from headers inside the files. The DBB would have to understand each supported format (and corresponding data model) if it was required to extract metadata from them.
- Brian Van Klaveren noted that he would like the option of the butler being able to retrieve different object types associated with a specific dataset type. Some users might want an astropy view of a calexp and others might want an afw view. Jim Bosch worried that this was dangerous if driven by a configuration file, and that it would be safer if the class of the retrieved object was specified in the get() call.
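Jim Bosch's suggestion could look something like the following sketch, where the caller passes the desired class to get() instead of a configuration file choosing it. The Butler class and converter registry here are hypothetical, not the real API:

```python
class AfwExposure:
    """Stand-in for an afw-style view of a calexp."""
    def __init__(self, data):
        self.data = data


class AstropyView:
    """Stand-in for an astropy-style view of the same data."""
    def __init__(self, data):
        self.data = data


class Butler:
    # Map (dataset type, requested class) -> converter from the stored form.
    _converters = {
        ("calexp", AfwExposure): AfwExposure,
        ("calexp", AstropyView): AstropyView,
    }

    def __init__(self, store):
        self._store = store

    def get(self, dataset_type, data_id, as_type=AfwExposure):
        """Retrieve a dataset, converted to the class the caller asked for."""
        raw = self._store[(dataset_type, data_id)]
        return self._converters[(dataset_type, as_type)](raw)


butler = Butler({("calexp", "visit=903334"): b"pixels"})
exp = butler.get("calexp", "visit=903334", as_type=AstropyView)
```

Because the target class is explicit at each call site, no global configuration can silently change what type a given get() returns.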
Operations
We discussed the interaction of the butler with the operational system. Tim Jenness had assumed that there could be a butler instance that would understand the DBB and would be able to retrieve files from the DBB to a local file system based on the data IDs retrieved from the SuperTask, creating a local butler repository that could be used by the associated processing job.

Michelle Gower noted that with Pegasus we are not in charge of file transfers at all, and NCSA are not expecting the butler to be involved in retrieving from the DBB. This needs further discussion, as there was some confusion as to why this is the case and what logic may have to be duplicated in DBB tools and the butler. For example, why can't Pegasus transport a tar file of the directory tree to the node? Would the SuperTask running on each node have to ingest the files into a butler from the flat filesystem?

Once a job has been run and files have been created, it was stated that the ingestion phase can be straightforward without a butler, because every filename created during processing has to be known in advance, so there can be no surprises or uncertainty. This does require that all relevant metadata is stored within the file being ingested and cannot reside in separate files known only to the butler.
- Tim Jenness stated that he thought there was a requirement in LSE-61 for data files retrieved from the archive to include the most up-to-date versions of WCS and PSF information. This would require a raw file, say, to be modified as part of data retrieval. It might be that this has to then be a DBB feature if the butler is not being used to seed the processing nodes. We have previously assumed that the butler would retrieve the raw file, then retrieve the PSF and WCS definitions, and then combine them when forming the in-memory object.
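The previously assumed behaviour, in which the butler fetches the raw file plus the latest WCS and PSF and combines them into one in-memory object, can be sketched as follows. All class and function names here are illustrative stand-ins, not real pipeline API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exposure:
    """Stand-in for an in-memory exposure object."""
    pixels: list
    wcs: Optional[str] = None
    psf: Optional[str] = None


# Hypothetical retrieval functions; in reality these would query the archive.
def get_raw(data_id):
    return Exposure(pixels=[0] * 4)


def get_latest_wcs(data_id):
    return "wcs-v3"  # the most up-to-date WCS solution


def get_latest_psf(data_id):
    return "psf-v3"  # the most up-to-date PSF model


def get_exposure(data_id):
    """Assemble an up-to-date Exposure from separately persisted pieces."""
    exp = get_raw(data_id)
    exp.wcs = get_latest_wcs(data_id)
    exp.psf = get_latest_psf(data_id)
    return exp


exp = get_exposure({"visit": 903334})
```

If the butler is not the component seeding the processing nodes, this combination step would instead have to happen inside the DBB as part of data retrieval.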
We noted that a Data Back Bone requirements document has been requested by Wil O'Mullane, and this WG will be able to generate requirements for the DBB.
Alert Generation
Russell Owen wondered whether alerts being generated in the L1 system go through a butler. Does a butler publish to the Kafka-like queue? Does some independent software do that? Does the L1DB itself get published via Kafka, or does the L1 system write the alerts to disk via a butler for later ingestion into the L1DB? Are L1DB ingests asynchronous?
- Russell Owen / Brian Van Klaveren to talk to their teams to try to get clarity on how the L1 pipeline interacts with L1DB.
Jim Bosch wanted to ensure that the use cases would not self-censor. If it turns out that a use case is valid but puts requirements on some other part of the system, then the WG has to explicitly state that and ensure that the requirement is flowed correctly. Tim Jenness endorsed this view.