ButlerWG Meeting 2017-08-24

Meeting began 24 Aug 2017 @ 10am; ended 11:58am Project Time.

Attending

Next meeting: 31 Aug 2017 @ 10am.

SuperTask: Jim Bosch described his concept for the interaction of SuperTask with the batch processing system. During pre-flight the butler would determine the files to use and would return a data structure that could be serialized to disk (say in YAML format) and sent to the processing node along with those data files. When the SuperTask starts the butler would be initialized with the YAML file so that it could easily locate the files in the specified directory without having to do an ingest phase involving reading the data headers.

Data Back Bone: There was a discussion of the interaction of Butler with DBB. It was stated that the DBB is likely to be a GPFS file system with some SQL databases to allow a location in the file system to be located. There is not necessarily planned to be an API for querying the DBB (either via REST call or by code written by DBB team that can be imported). It was noted that there are many reasonable use cases for a Butler to be configured to understand how to query the DBB. There was a strong indication from Michelle Gower that a butler should never be directly writing files to the DBB (or corresponding rows to database tables) and that DBB ingestion code would handle that.

Catalog Files: Unknown User (pschella) asked how catalogs would be handled by the butler for batch processing if everything has to be local and a job can not query Qserv. The answer was that the butler would request the catalog files not contents from a database table. The job would then have to understand that it will be handling multiple catalog files (retrieved by multiple butler .get() calls) and will have to join them and filter them to extract the relevant subset.

Large Files: Metadata from large files was discussed. Should the whole file be staged or if a job only requires the metadata can a subset be staged? This is effectively a use case for composite datasets and processing jobs should result in multiple output files to enable this.

Butler Abstraction API: Everyone agreed that the current API abstraction for I/O is something we want. You call get() and you get a Python object given a data ID; you call put() with a python object and it is persisted.

Partial IDs: There was some discussion of partial data IDs and possible confusion over "partial" implying a subset of a dataset versus "partial" meaning a data ID expression that can resolve to multiple data IDs. There was some preference for the term "data ID expression".

Seeing cuts: Simon Krughoff asked how the system should handle a data ID expression of "seeing < 0.5 arcsec". This triggered a discussion on metadata querying.

Metadata queries: Jim Bosch argued that the object passed to the butler should be opaque. The code that queries metadata services for relevant data IDs is distinct from the code that retrieves a particular data ID. The resulting opaque object returned from a metadata query should follow a specific interface but the implementation details should not be known to the user. It could internally be a direct reference to a local filename.

Use Case Spreadsheet: We are working on a shared Google spreadsheet to put the use cases into a common form with labels. The @NCSA columns are currently "Development", "Release" , "Personal", "Operations (Batch)" and "Operations (L1)". Jim Bosch wondered if it wouldn't be clearer from a use case perspective for "Personal" and "Release" to be more related to the "Science Platform".

Commissioning: It was noted that we don't really have any use cases on commissioning activities and the "Development" section does not really cover that.

Simon Krughoff to talk to Keith Bechtol about possible commissioning use cases (in particular using the commissioning cluster before data are in the DBB).

Quality Assessment: Michelle Gower asked what the general QA use cases were. The response was that we have QA for a data release, QA of a standard dataset as the software evolves, and QA of reference datasets but using different configurations.

Software QA: Simon Krughoff reported on the general QA software use case where each week the software is tested against a reference set of input data. The output results should be stored and should not be automatically deleted in case further investigation is required. It was not clear that the Development area was suitable for this. The question was raised as to whether these results should go into the DBB at all given that they are transient, even if on a long time scale. In that case where should they go and what mechanism should be used to access them? Does the QA job need to access catalog results from different outputs from the same inputs or are they ingested into databases and queried there?

Jim Bosch and Simon Krughoff to write some use cases for development activities, including details of data retention policies and backups.

Arbitrary files: It was asked if there is a use case for a task/supertask to be run on a set of files specified using a glob from the command line? This is something that is wanted by an OPS use case and a case could be made for this ti be a generally useful use case for science pipelines, although possibly at a lower priority. This is similar to processFile and obs_file functionality.

HDF5: HDF5 I/O was mentioned again but no use case exists.

Tim Jenness to write a use case for HDF5 I/O via the butler.

DataID Aliases: Brian Van Klaveren asked whether data IDs can be aliased. Are they unique? Can two different data IDs result in the same look up key? Jim Bosch responded that metadata query look ups convert data ID to an object that can be passed to get(). Simon Krughoff reiterated that data discovery and data retrieval are distinct operations that should not be conflated. Michelle Gower noted that a data ID is not unique because you need more information, in particular, you need to know what enclave you are operating in as Development and Release might have items that correspond to the same data ID but have different contents. Jim Bosch stated that DataID does not currently include the dataset type but it should. Russell Owen felt that would currently lead to more complicated code (currently the data ID can be fixed as the dataset type changes during processing) but that SuperTask may make the complexity disappear.

Plugability: There was a long discussion on plugability of the butler such that get() for a "calexp" returns an afw Exposure one time or an astropy NDData another. There is no easy way for a reader or writer method configured into a butler to ensure that a "calexp" always returns an afw object. This has to be done by convention. We want to have plug-ins, as that is a key part of the butler abstraction (you need to be able to say in a config file that a particular data set type is instantiated by a particular method).

Lazy Loading: This was mentioned but it was felt that it is no longer an important use case and it adds complication. If subsetting of a dataset (eg for a footprint cutout described above) could be done then there would be no need to worry about a large image being read in when only a small part of it is needed. Jim Bosch also noted that lazy loading is less important if supertasks can use a version of a dataset cached in memory.

Use Cases: The missing use cases should be added to the spreadsheet for next week. If there are duplicate use cases, indicate that in the ID column.

Russell Owen Add L1 use cases to spreadsheet.
Unknown User (xiuqin) Add SUIT use cases to spreadsheet.
Tim Jenness to look at OPS use cases from Michelle Gower and rewrite them for use case spreadsheet.

Requirements: Next week we need to move to requirements.

Tim Jenness create requirements table in the spreadsheet.

Space shortcuts

Page tree

Attending

1 Comment

Unknown User (xiuqin)