Draft - Many folks at NCSA have been on vacation, as it is use-them-or-lose-them time for us, so not everyone has had a chance to provide input (e.g., Don, Margaret, Felipe).
For this version of the document, we started writing down representative cases where either files or databases are used. For the majority of the operations cases outside the actual pipeline execution, it is not clear that these are actually Butler use cases. That doesn't mean that some lower-level code couldn't be shared (e.g., afwImage). (This does not include cases where science platform, end-user, etc. use cases involve accessing the Data Backbone, etc.)
Off-line Batch Processing Service:
In order to run a task/SuperTask, the Batch Processing Service's compute job must set up input files, previously transferred to an empty local disk via some non-Butler mechanism, in a way accessible to the Butler being used by the task/SuperTask. (Currently the initial execution would be of the command-line task/activator.)
During operations, a SuperTask executing inside a pipeline requests an input file that doesn't exist in the local repository. This should be a failure (as opposed to the Butler helpfully fetching the file for the SuperTask).
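The strict-lookup behavior above can be sketched as follows. This is a minimal illustration with hypothetical names (`get_local_input` is not a real Butler API): a lookup that resolves only against the locally staged repository and fails rather than fetching.

```python
import os

# Minimal sketch (hypothetical function, not the actual Butler API):
# resolve an input strictly from the local job-scratch repository.
def get_local_input(repo_root, relative_path):
    """Return the path to an input file already staged in the local repo.

    Raises FileNotFoundError rather than attempting any remote fetch,
    matching the Operations expectation that a missing input is a failure.
    """
    path = os.path.join(repo_root, relative_path)
    if not os.path.exists(path):
        # In production this should abort the SuperTask, not trigger a transfer.
        raise FileNotFoundError(f"input not staged to job scratch: {path}")
    return path
```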
The Batch Processing Service will transfer the output files out of the job scratch disk either to other working areas or back to the Data Backbone (where the filename must be unique).
When we do production runs, yes, the Butler could name the files arbitrarily, but then we would have to wrap every execution with a wrapper (executor) to rename the files appropriately. This obscures things on several levels (the workflow system thinks it is running only one executable; digging around a filesystem while debugging an active job means encountering different filenames), and we strongly prefer not to do so.
This does NOT mean that the Butler itself has to be able to generate unique filenames. If it allows the user of the Butler (i.e., the production framework) to tell it how to name output files (as it currently does, though this needs to be easier), that would be good enough for Operations.
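One way this could look, sketched with hypothetical names (the real interface would be whatever hook the Butler exposes for output naming): the production framework supplies a naming function that maps a dataset type and data ID to a unique filename, so no post-run renaming wrapper is needed.

```python
# Sketch of the Operations requirement (hypothetical interface): the
# production framework, not the Butler, decides output filenames by
# supplying a callback that maps a data ID to a unique name.
def ops_output_name(dataset_type, data_id, attempt_id):
    """Build a unique, human-readable filename from a data ID.

    attempt_id makes names unique across reprocessing attempts so files
    can land in the Data Backbone without collisions.
    """
    parts = [dataset_type] + [f"{k}{v}" for k, v in sorted(data_id.items())]
    return "_".join(parts) + f"_a{attempt_id}.fits"
```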
The operator wants to execute a pipeline via the Batch Processing Service using a different calibration than was used previously. They edit some configuration, changing the calibration lookup strategy, and submit the processing attempt. Examples:
Use a different set of calibration files
The Operations consolidated DB will contain many files/data which should not be used as inputs for a particular execution. An easy way to exclude them is needed.
Many data management tasks will need consolidated database access. It is not clear what is to be gained by going through the Butler; most of these would never be used by anyone other than a few Operations staff. If these should be using the Butler, many more use cases would belong here.
A file registration function/program (initiated by an operator, the batch processing service, raw file delivery, etc.) will insert data about specific files into the database: file catalog, file metadata, and provenance.
A data management program will check file information stored in the database against actual physical file information, both to ensure DB correctness and to detect bit-rot.
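The consistency check above can be sketched as follows, assuming (hypothetically) that each catalog row carries a path, an expected size, and an MD5 checksum; size and checksum are re-derived from the physical file.

```python
import hashlib
import os

# Sketch of a DB-vs-disk consistency check (the record layout is an
# assumption: each catalog row holds (path, size_bytes, md5)).
def verify_file(record):
    """Return a list of problems found for one catalog record."""
    path, expected_size, expected_md5 = record
    if not os.path.exists(path):
        return [f"missing: {path}"]
    problems = []
    if os.path.getsize(path) != expected_size:
        problems.append(f"size mismatch: {path}")
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large image files don't exhaust memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_md5:
        problems.append(f"checksum mismatch (possible bit-rot): {path}")
    return problems
```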
An operator or data management tool needs to query any table in the DB. (Yes, this could/should be broken up into exact use cases; if needed I could start listing them individually.)
An operator will need to delete DB rows from bad or test runs that are not needed for historical provenance (e.g., objects).
Data ingestion codes need to insert non-file science metadata into the consolidated DB: QC values, objects from catalogs, etc.
Note: Ingesting a large number of objects in Operations will need a different manner of insertion than, say, a single row of metadata.
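The single-row vs. bulk distinction can be illustrated as below, using SQLite as a stand-in for the consolidated DB; the table and column names are illustrative only, and a real Operations loader would likely use the database's native bulk-load path rather than row-by-row SQL.

```python
import sqlite3

# Illustrative schema only: a metadata row goes in one at a time, while
# object catalogs are batched via executemany.
def insert_metadata_row(conn, key, value):
    conn.execute("INSERT INTO file_metadata (key, value) VALUES (?, ?)",
                 (key, value))

def bulk_insert_objects(conn, rows):
    # executemany batches the inserts; a production loader for millions
    # of objects would use the DB's dedicated bulk-load facility instead.
    conn.executemany("INSERT INTO objects (ra, dec, mag) VALUES (?, ?, ?)",
                     rows)
```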
A file registration function/program needs to read metadata from a file to save in the Data Backbone (this needs the ability to be different from, or more than, what is needed for internal Butler use).
It should not have to do anything other than call a function that reads the file. This does not seem to be a Butler use case, but rather one for the afwImage level.
At least in the Operations schema, we will probably also want to save metadata for intermediate files (files generated during pipeline execution but not brought back to the Data Backbone).
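The "just call a function that reads the file" level can be sketched as below. A real implementation would use afwImage or astropy.io.fits; this stdlib-only stand-in parses FITS header cards directly (80-character cards in 2880-byte blocks) to show what the registration program needs back.

```python
# Stdlib-only sketch of header-card extraction; a real implementation
# would use afwImage or astropy.io.fits rather than parsing by hand.
def read_fits_header(path, keys):
    """Return {keyword: raw value string} for the requested header keywords."""
    wanted = {}
    with open(path, "rb") as f:
        while True:
            block = f.read(2880)  # FITS headers come in 2880-byte blocks
            if len(block) < 2880:
                break
            for i in range(0, 2880, 80):  # 80-character cards
                card = block[i:i + 80].decode("ascii", "replace")
                name = card[:8].strip()
                if name == "END":
                    return wanted
                if name in keys and card[8:10] == "= ":
                    # Drop any trailing "/ comment" and whitespace.
                    wanted[name] = card[10:].split("/")[0].strip()
    return wanted
```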
The Batch Processing Service needs to gather file provenance for the inputs/outputs of a running pipeline (which process used this file, and which files did that process generate).
At least in the Operations schema, we will probably also want to save file provenance for intermediate files.
Is this a use case for Butler?
The Butler could report which objects(?) have been accessed via get/put.
The above should exactly match the pre-flight information for production executions.
Can’t access files not staged to job scratch
Processing will be aborted if expected output files are not created
Extra output files may either be left behind or bundled in a junk-file tarball, and can't be used in downstream processing (especially true if not in the same compute job).
So the case of interest is when pre-flight said to use X files, but the code eliminated an input for science reasons after opening the file.
If code eliminates a given input after reading it, the Butler won't be able to tell. It would be better if code that can eliminate an input explicitly reports what it did use.
Operations would prefer to have Butler report actual filenames used:
Filenames are what will be used in saving provenance
It helps illuminate problems/differing expectations when translating data IDs to files on disk.
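The filename-reporting preference above can be sketched as a small recorder (a hypothetical wrapper, not a real Butler API): every get/put logs the actual filename touched, and the result is compared against the pre-flight file list to spot inputs the code eliminated.

```python
# Hypothetical wrapper sketch: record the actual filename touched by
# every get/put so provenance can be saved and compared against the
# pre-flight list for a production execution.
class ProvenanceRecorder:
    def __init__(self):
        self.inputs_used = set()
        self.outputs_written = set()

    def record_get(self, filename):
        self.inputs_used.add(filename)

    def record_put(self, filename):
        self.outputs_written.add(filename)

    def compare_with_preflight(self, preflight_inputs):
        """Return files pre-flight listed but the code never actually read."""
        return set(preflight_inputs) - self.inputs_used
```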
The Batch Processing Service wants to save execution provenance (times, execution host, software stack, memory usage, etc.). Other than being a DB insertion/update, this does not seem to be a Butler use case.
The Batch Processing Service wants to save log file contents to something like LogStash. (Doesn't sound like a Butler use case.)
Debugging processing issues:
An Operations staff member (e.g., Robert Gruendl) manually gets/creates input files of interest. They want to easily run a couple of pipeline steps on their own workstation or the LSST development cluster, and need to easily set up the environment in order to run. (e.g., Do files need to be in specific folder structures? Currently, yes.)
An Operations staff member (e.g., Robert Gruendl) needs to run queries against the operational DB to perform quality checks across data. This must be flexible: able to write/try various queries that could be complicated, use global temp tables, etc. It is not clear that folks would not simply use regular DB clients like sqlplus, or, in the case of a Python program, whether it needs to go through the Butler.
An Operations staff member (e.g., Michelle Gower) needs to run queries against the operational DB to check for non-science issues such as a "bad" machine, a dramatic increase in run times, failures, etc. It is not clear that folks would not simply use regular DB clients like sqlplus, or, in the case of a Python program, whether it needs to go through the Butler.
L1 Enclave:
Raw files will be written to a FIFO filesystem in the Observatory Operational System (OOS) by (Jim's) forwarder for use by folks at the base. Observatory Operations staff need to use these files via SuperTasks and other tools (an observatory portal?).
The contents on disk change as fast as the camera can cycle. This filesystem will be NFS-mounted read-only on the commissioning cluster.
An Observatory Operations staff member needs to extract a non-raw file (e.g., a calibration) from the Data Backbone and store it on their human-managed filesystem in a manner that is usable by SuperTasks.
An Observatory Operations staff member needs to delete a file from their human-managed filesystem in a manner that corrects the information used by SuperTasks.
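The delete-and-correct step above can be sketched as below, under an illustrative assumption that SuperTask lookups consult a small `local_registry` table of paths: the registry row must go away together with the file, or later lookups will point at a file that no longer exists.

```python
import os

# Sketch with an illustrative schema (a local_registry table of paths);
# conn is any DB-API connection holding the registry SuperTasks consult.
def delete_managed_file(conn, path):
    """Remove both the registry row and the physical file."""
    conn.execute("DELETE FROM local_registry WHERE path = ?", (path,))
    conn.commit()
    if os.path.exists(path):
        os.remove(path)
```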
Data Backbone (not directly involved with processing):
A raw file ingestion service will ingest raw files into the Data Backbone (the base comes online later than NCSA).
This includes transfer of the files plus saving data into the file catalog, metadata tables, and provenance tables (here provenance means the file was introduced to the Data Backbone via the raw-file path, not whatever processing was done to it at the base prior to this).
An operator needs to manually ingest into the Data Backbone files created externally to the camera or Batch Processing Service, to be used by pipelines.
This includes (probably manual) renaming of files to unique names, moving the files into place, and saving data into the file catalog, metadata tables, and provenance tables (here provenance means only that an externally created file was introduced to the system, NOT whatever provenance the external process had).
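The rename-and-register steps can be sketched as below. Paths, the uniquifying scheme, and the returned catalog row are all illustrative assumptions, not the actual Data Backbone layout.

```python
import hashlib
import os
import shutil

# Sketch of manual ingestion into the Data Backbone (illustrative paths
# and row layout): rename to a unique name, move into place, and return
# the catalog row to insert.
def ingest_external_file(src_path, backbone_root, label):
    with open(src_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()[:8]
    base = os.path.basename(src_path)
    unique_name = f"{label}_{digest}_{base}"  # digest guards against collisions
    dest = os.path.join(backbone_root, unique_name)
    shutil.copy2(src_path, dest)  # a real service would move and fsync
    # Catalog row: (name, path, origin); provenance records only that the
    # file was externally introduced, not how it was produced.
    return (unique_name, dest, "external")
```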
Alert Prompt Processing (temporary placeholders until Felipe can chime in; many of these are directly related to the Alert Prompt Processing pipeline):
Generically, this is all decoupled from the Data Backbone, consolidated DB, and alert distribution system so that it can continue processing despite outages in those other systems. At some independent cadence, information/files flow between them.
Forwarders deliver the crosstalk-corrected files to NCSA distributors, which will make these files visible on the Prompt Processing cluster.
A separate setup process (name TBD) will populate Prompt Processing filesystems with input files needed for the night's processing.
The Alert Prompt Processing pipeline will read template files from a template filesystem.
The Alert Prompt Processing pipeline will write its output files to an output filesystem.
The Alert Prompt Processing pipeline will read from and write to a separate real-time AP database.
The Alert Prompt Processing pipeline will feed image information such as the PSF, WCS, etc. back to the telemetry gateway service on the NCSA foreman, which sends the information back to the base. (If the feedback is via a file, this is a normal Butler use case; if it is via direct messaging with the service, it is not clear this is a Butler use case.)
A separate post-process (name TBD) will copy output data from the L1 enclave into the Operations database and filesystem.
An EFD ETL process will extract data from the base EFD, translate it, and load tables in the NCSA consolidated database.
Will be working with databases, but this process shouldn’t have to use the Butler.
The database admin will initiate a data release process which extracts, translates, and loads the data from the operations database into the release database.
This is mostly database work transforming data from the operational consolidated database to a release consolidated database. (Probably not a Butler use case, as these are normally specialized/optimized.)
Loading Qserv (again, probably not a Butler use case if done database-to-database; but if done on a file basis closer to the actual production, it could be, if the Qserv folks want to do so).
Disaster Recovery/Backup services will save data to other locations or to tape (not a Butler use case).
The Data Backbone will store files on tape and must retrieve them for use by Batch Production Service (not a Butler use case for production uses)
Certain files will need to be stored on tape in a certain manner for efficient retrieval.
May be managed by 3rd party software (like Rucio)
It is not clear how much end-users would access files on tape (if allowed, this may be a Butler use case).
The Data Backbone will deliver files to LSST DAC Chile (not a Butler use case)
Delivering files between NCSA and IN2P3 (not a Butler use case)
The Data Backbone will deliver raw files from NCSA to IN2P3
The Batch Processing Service will deliver other inputs for processing from NCSA to IN2P3
The Batch Processing Service will deliver outputs from processing at IN2P3 to NCSA
The Data Backbone will deliver release files from NCSA to IN2P3 (backup, IN2P3 DAC)
Bulk file delivery service/program will bulk copy files between NCSA and a non-LSST DAC (i.e., not Chile, NCSA, or IN2P3). Not a Butler use case.