DRAFT: Comments and criticism expected!
Current Status
Now that crosstalk removal has been descoped from the Camera DAQ, there is only one version of the pixel image that can be and needs to be retrieved, rather than two. There are multiple destinations for this image:
- Camera Diagnostic Cluster for automated visualization and rapid analysis
- Active Optics System (AOS) for wavefront sensors only
- Observatory Operations Data Service (OODS) for automated and ad hoc (including human-driven) rapid analysis
- Data Backbone (DBB) for long-term reliable archival
- For most science and possibly calibration images, Prompt Processing for execution of automated processing pipelines (primarily the Alert Production).
The first four of these are located in Chile, at either the Summit or the Base. The Prompt Processing systems are located at NCSA, requiring transfer of the pixels over the international network. The desired latency for all of these is generally "as low as possible", with the exception of the DBB, which allows up to 24 hours. The DBB also transfers data over the international network, but more slowly.
The first four of these are expecting to persist the pixel data as FITS files in RAM disk or persistent file storage (e.g. GPFS); as a result, it makes sense to do the same for Prompt Processing as well, though care should be taken to minimize latencies in order to avoid delaying alert generation. All of these need to ingest the image files into a Data Butler repository to enable pipeline code to access the images using `butler.get()`. It is currently expected that these would be separate Butler repos. The data ID used for the Butler should include either the image name or group ID and image ("snap") number as well as the raft and detector/CCD identifiers. Each system needs to send an event to indicate that the image has been ingested and is thus available to pipelines.
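The shape of the data ID described above can be sketched as follows. This is a hedged illustration only: the field names (`groupId`, `snap`, `raft`, `detector`) are assumptions for exposition, not necessarily the actual Butler dimension names.

```python
# A hedged sketch of the Butler data ID described above. The field
# names ("groupId", "snap", "raft", "detector") are illustrative
# assumptions, not necessarily the actual Butler dimension names.

def make_raw_data_id(group_id: str, snap: int, raft: str, detector: int) -> dict:
    """Identify one CCD image: group (or image name) + snap + raft + detector."""
    return {
        "groupId": group_id,   # or an image name, per the text above
        "snap": snap,          # image ("snap") number within the group
        "raft": raft,          # raft identifier, e.g. "R22"
        "detector": detector,  # detector/CCD identifier within the raft
    }

data_id = make_raw_data_id("2020-03-15T10:21:30.123", 0, "R22", 4)
# A pipeline would then fetch the image with something like:
#   exposure = butler.get("raw", dataId=data_id)
```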
The current baseline, as implemented for LATISS, has an image writer component of the Camera Control System (CCS) writing to the Camera Diagnostic Cluster. Another instance of this image writer is intended to be configured and deployed for the AOS. The OODS and DBB are fed by the DM Archiver and Forwarders, which are a separate image retrieval and writing system. Prompt Processing has not yet been implemented, but it was supposed to use another instance of the Forwarders. In addition, a Catch-Up Archiver is meant to feed the DBB with images that were otherwise missed, including those taken during a Summit-to-Base network outage.
An independent Header Service retrieves image metadata from events and telemetry, writing a single metadata object for each image that can be used by any image writer.
Simplifications
Can some of these be combined?
Comparison of CCS with Archiver
The CCS image writer has advantages compared to the Archiver/Forwarder:
- It is written by personnel who are physically and administratively close to the DAQ team, as opposed to NCSA.
- The application of which it is a part is the source of truth for many metadata and timing items, as opposed to an independent CSC.
- It has the shortest latency requirements.
- It has been extensively tested on test stands at SLAC and Tucson, and more recently has undergone 9-raft focal-plane testing and ComCam testing.
But it also has disadvantages:
- It is written in Java using a JNI wrapping of the DAQ C++ API; Java FITS APIs do not have feature sets identical to the Stack's Python/C++ APIs. The Forwarder is written using the DAQ C++ API alone. (In both cases, workarounds for DAQ API misfeatures may have to be pushed upstream.)
- It is intended for "streaming" use, where the prime goal is to capture and analyze (only) the most recent image taken. The Forwarder is intended for "commanded" use, where the prime goal is to capture a specific image by name. (There are two components: the (JNI-wrapped) DAQ driver API and the CCS image-handling subsystem. The DAQ driver API is a complete mapping of the DAQ image API and contains methods for retrieving any image from the 2-day store. The image-handling subsystem is currently designed only to stream the most recent event, but we anticipate extending it to retrieve any event for image visualization.)
- It will fail to capture data if a node fails during or between image captures; it must be manually reconfigured to restore normal operation, or else it continues to fail to capture data. The Forwarders automatically tolerate node failures between image captures and automatically recover (though losing data) if a node fails during image capture.
- It does not yet have Butler ingest or pipeline execution capability or cache management capability for a Butler repo (only a cron/find-based cleanup mechanism). The Archiver is already integrated with the OODS.
- It does not yet retrieve metadata written by the Header Service, although this is on the roadmap.
Catch-Up Archiver
An independent Catch-Up Archiver will be needed in any case. Neither the DM Archiver/Forwarder nor the CCS image writer can be considered 100% reliable in terms of capturing all science images. The Catch-Up Archiver could potentially reuse code from the image writer or Forwarder for pixel manipulation and file output as well as transfer to the DBB.
The Catch-Up Archiver can potentially live at the Summit. If 3 machines with 1 GB/sec inbound and outbound network bandwidth are allocated to the Catch-Up Archiver, it should be possible to copy data to the Base at the rate of one 12 GB image per 4 seconds, 4X the normal image capture rate. This is sufficient to empty the buffer after even a long outage. The Catch-Up Archiver does need to ingest to a Base-resident DBB and contact that DBB to know which images have already been archived.
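The sizing above can be checked with simple arithmetic, assuming "1 GB/sec" means one gigabyte per second per machine:

```python
# Back-of-envelope check of the Catch-Up Archiver sizing above,
# assuming "1 GB/sec" means one gigabyte per second per machine.
machines = 3
per_machine_gb_per_s = 1.0   # assumption from the text
image_size_gb = 12.0         # one full focal-plane image

aggregate_gb_per_s = machines * per_machine_gb_per_s
seconds_per_image = image_size_gb / aggregate_gb_per_s  # 4.0 s per 12 GB image
```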
Design Proposal
The primary simplification of the design that the descoping of crosstalk images allows is to remove the DM Archiver/Forwarder and use the CCS image writer in its place. This would remove a client from the DAQ, reducing the number of code bases that need to be supported. It would remove the need for any clients to live at the Base, also removing the need for DAQ networking to extend over the DWDM. Since the CCS is absolutely necessary for image-taking, it would remove the possibility of images being taken with the Archiver or Prompt Processing disabled. (Such images would eventually be retrieved by the Catch-Up Archiver, but they would be at least delayed for Prompt Processing purposes.) It eliminates the current duplication of images and resultant confusion. It ensures that engineering images taken under CCS control are captured the same way as images taken under OCS control (though possibly with less metadata).
The following additions would need to be made:
- Sufficient Summit-located compute resources, including hot-spare nodes and network bandwidth, would need to be devoted to the Camera Diagnostic Cluster in order for it to also serve as the source of OODS, DBB, and Prompt Processing data. (An alternative would be to use the same client codebase but run two instances: one on the Diagnostic Cluster and one elsewhere for writing images to be archived. The pros and cons of such a scheme would need some thought.)
- The CCS image writer code would need to be enhanced to add robustness and fault tolerance, to provide focal-plane-level telemetry, and to interface with the OODS, DBB, and Prompt Processing. The mechanisms used by the current DM Archiver should serve as a reference, but they would have to be ported to the Java environment of the CCS. Changing from "command" mode to "stream" mode may require adjustment.
- Locating the entire OODS or DBB at the Summit is considered impossible at LSSTCam scale. Either the images would have to be copied from the Camera Diagnostic Cluster to the Base for ingest into those systems or direct ingest from the Summit to the Base would need to be arranged (skipped in the event of network outage). One possibility is to have the CCS image writer trigger a network copy to the Base upon successful image capture and then use the current "hand-off" mechanism to the OODS and DBB. This may require extending the CCS image writer to send messages or write to a shared database.
- Prompt Processing should be fed directly by an international network copy from the Camera Diagnostic Cluster, rather than having an extra hop through the Base, in order to minimize latency.
- A separate instance of the OODS code could be used to manage a historical image cache for visualization.
The first steps in a transition to this design would be:
- Have the image writer get metadata from the Header Service. This is already planned, but it would be critical to get this in place ASAP.
- After successful image capture, copy the image (with metadata header) to a hand-off machine. Send any messages or update any databases required to use the current OODS/DBB ingest code. At this point, minimal functionality would be available for LATISS and test stands, including ComCam.
- Implement the current Archiver telemetry and "successful ingest" events.
- Upgrade the CCS image writer with Archiver-based robustness.
While Tony Johnson (the prime CCS author) is quite busy with LATISS commissioning, ComCam testing in Tucson, and LSSTCam integration and testing at SLAC, at least Steve Pietrowicz from NCSA could help with the Java-based aspects of this transition.
Header Service
Another possible simplification is to integrate the Header Service with the CCS image writer code. This has potential difficulties:
- There will be a separate instance of the CCS image writer for the AOS. It may be difficult to keep these instances in sync or to keep multiple metadata objects separate.
- Porting the current SAL-heavy Python code to Java may not be easy.
Nevertheless, this should be considered down the road, again because having the CCS perform this function helps ensure that it happens for every image and moves the metadata capture point closer to the authoritative source for most of it.
13 Comments
Brian Stalder
IMHO this refactoring of the archiver is a bit late, since it must be working with ComCam in less than a month. Also, the impact on CCS work at this point in the project sounds like a non-starter.
Kian-Tat Lim
As I understand it, both Tony on the CCS side and the Archiver team are working to update their code for ComCam. It might be more efficient to have them work together than in parallel. Alternatively, the cost of integrating possibly already-working CCS image capture with the OODS might be less than revamping the Archiver/Forwarder for the 9-CCD ComCam. And a final alternative is to postpone this integration until after ComCam is working with the dual paths.
Kian-Tat Lim
In addition, for the Camera Diagnostic Cluster to perform one of its design functions, analyzing images, it must obtain reasonable header metadata and integrate with Butler ingest in order to run DM pipelines. So this isn't really new work, plus additional resources would be available to work on it.
Tony Johnson
In my opinion, if we work together on a single solution for ComCam and eliminate code duplication, we will get something working faster than by continuing on our current route.
Kian-Tat Lim
(My public apologies for being overly harsh in my characterization of the CCS and much thanks to Tony for corrections.)
Brian Stalder
DM and CCS seem to be in favor of this. Do we need any more study of the impacts? If not, can we make a decision on whether to proceed?
Robert Gruendl
The first criticism I have of the above outline is that it should be acknowledged that this presents a new set of risks for Prompt Processing (specifically in timing), since it would no longer use a "tested" method to ensure data transfer and job management. I would not claim it is a severe risk, but the methodology using the Forwarder architecture did have a developed base and more than a simple proof of concept behind it.
Kian-Tat Lim
I wasn't aware that any method had been selected, let alone tested, for Prompt Processing job management. I don't see any difference between the Forwarder triggering a `bbcp` copy (the baseline for Forwarder-to-Distributor data transfer, as I understand it) and the CCS image writer doing the same. So I don't think this increases the risk.
Steve Pietrowicz
I've been out for a couple of days for a personal matter, and haven't had a chance to make comments on this yet. I will do so soon.
Steve Pietrowicz
These are comments on the original page K-T wrote… There have been some changes to the page while I was writing this.
Re: Advantages
Re: Disadvantages
The following is a slightly simplified version of what happens once the forwarder writes the file:
Once the forwarder delivers a file, it messages the Archiver that it has done so. The Archiver then hard-links the file into a staging directory for the OODS and a staging directory for the DBB, and deletes the original file to clean up after itself. The Archiver then messages the OODS to ingest the data. Hard-linking eliminates the time needed to create an additional copy of the file, and it instantly presents a complete file in each staging area, avoiding having the DBB rsync transfer a partial file (and, in version 1.0.0-rc2 of the OODS, avoiding a butler ingest attempted before the forwarder finished writing the file). Note that neither the OODS nor the DBB should modify this file, because it is the same file in each directory: changing the data through one link changes it in both.
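The hard-link hand-off described above can be sketched in Python. The paths and file names below are illustrative, and `os.link` simply demonstrates the shared-inode behavior that makes the hand-off instantaneous (and makes in-place modification unsafe):

```python
# Minimal sketch of the hard-link hand-off described above: link the
# delivered file into the OODS and DBB staging directories, then remove
# the original. All paths and names here are illustrative.
import os
import tempfile

def hand_off(delivered: str, oods_dir: str, dbb_dir: str) -> None:
    """Hard-link a delivered file into both staging areas, then delete
    the original. Both links point at the same inode, so neither the
    OODS nor the DBB may modify the file in place."""
    name = os.path.basename(delivered)
    os.link(delivered, os.path.join(oods_dir, name))
    os.link(delivered, os.path.join(dbb_dir, name))
    os.remove(delivered)  # clean up after ourselves

# Demonstration with temporary directories:
with tempfile.TemporaryDirectory() as root:
    oods = os.path.join(root, "oods"); os.mkdir(oods)
    dbb = os.path.join(root, "dbb"); os.mkdir(dbb)
    src = os.path.join(root, "image.fits")
    with open(src, "wb") as f:
        f.write(b"SIMPLE")
    hand_off(src, oods, dbb)
    # Same inode in both staging areas; the original is gone.
    a = os.stat(os.path.join(oods, "image.fits"))
    b = os.stat(os.path.join(dbb, "image.fits"))
    assert a.st_ino == b.st_ino and not os.path.exists(src)
```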
There is also the issue of messaging to and from the OODS. The current messaging mechanism to trigger an ingest uses RabbitMQ.
The OODS sends success/failure back to the Archiver using RabbitMQ as well. It might be better to have the OODS directly issue SAL messages. It would need complete information (image origin: LATISS, ComCam, Catch-Up Archiver) in the ingest request message in order to send out the right type of message.
It’s possible that if files were hard-linked into image-origin-specific directories, a UNIX notify method (lsst-dm/ctrl_notify) could let us avoid RabbitMQ messaging entirely. However, the OODS would still need to implement SAL messages.
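As a hedged stand-in for that notify trigger (the lsst-dm/ctrl_notify API is not reproduced here), a simple scan of an image-origin-specific staging directory illustrates the behavior being discussed. Because a hard link appears in the directory atomically, any file the scan reports is already completely written:

```python
# Hedged stand-in for the inotify-style trigger discussed above. This
# does not reproduce the lsst-dm/ctrl_notify API; it just scans a
# staging directory for newly appeared files. Hard links appear
# atomically, so a file reported here is complete.
import os
import tempfile

def new_files(staging_dir: str, seen: set) -> list:
    """Return files in staging_dir not previously seen; update `seen`."""
    current = set(os.listdir(staging_dir))
    fresh = sorted(current - seen)
    seen |= current
    return fresh

# Demonstration with a temporary staging directory:
with tempfile.TemporaryDirectory() as staging:
    seen = set()
    open(os.path.join(staging, "img_R22_S11.fits"), "w").close()
    first = new_files(staging, seen)   # newly appeared file, reported once
    second = new_files(staging, seen)  # nothing new on the next scan
```

A real inotify-based mechanism would avoid the polling loop entirely; this sketch only shows why atomically linked files make such a trigger safe.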
Catch-Up Archiver Comments
Don’t we need more than one Catch-Up Archiver (LATISS, ComCam, MT), or do we expect that this CSC will be able to deal with each device simultaneously?
Design Proposal Comments/Questions
Header Service integration with CCS
I don’t see the advantage of integrating the HeaderService with the CCS. It would further complicate the CCS.
Overall comments:
As I stated above, I think we need to step back and think about what mechanisms we’re using for triggering the OODS. I’m a little concerned the straightforward mechanism we first used (and are using as of right now) is getting more complicated and less robust on restarts.
One of my other concerns is what type of messaging mechanism we use to indicate that a file has been entirely written so it can be acted upon (hard-linking, etc.). If we continue to use RabbitMQ, we’d have to get a Java library to integrate it. Since we already have to add a messaging library, this might be a good time to reconsider using Kafka, so that LSST uses two messaging systems (Kafka and DDS) rather than three.
Kian-Tat Lim
I'm mostly thinking about the ability to rapidly iterate and accurately communicate if changes are needed to the interface. The past practice has been to do this communication in person (along with hardware installation), which has involved travel arrangements.
I would foresee the CCS-based writer substituting for the Forwarder in this case, copying the newly-written file to the Archiver/hand-off location and allowing the hard-link/removal process to occur as it currently does.
I was reluctant to add another CSC, particularly one at the Base, but the use of salobj makes this a good suggestion. The OODS can presumably also act as a central gathering point for focal-plane-wide information.
Since incoming files are not expected to build up at the hand-off, the number of directories needing to be watched might not be too bad. Would this also work for Catch-Up Archiver, though?
We need one per DAQ partition, yes. I would hope that all instances could use identical code.
Distributors remain and can remain at the raft level.
The advantages I see are:
The first is the most important here. I agree there is an increase in complexity, especially with the large number of SAL events and telemetry messages that need to be watched.
I'm open to rethinking a message-triggering mechanism as long as the latency is sufficient for all users. I'm worried that 1 sec is not low enough.
Using Kafka to provide information about archiving status and even potentially Prompt Processing status from the LDF, where latency is not a serious concern, could be reasonable. (Also I believe that the CCS uses Java messaging internally so there is yet another system.)
Htut Khine
Sorry, I just found out about this document today. Even though I like the idea of having one source of truth for the data, there are questions we should address about what this decision entails:
Prompt Processing
HeaderService
OODS
Catchup Archiver
Network bandwidth is 1 Gbit/second per machine, which can move 125 MBytes/second, so three such machines can move about 1.5 GBytes per 4 seconds. The forwarder machines are attached to 10 Gbit/second bandwidth, so this would be roughly 10 times slower. This also creates another problem: whether we buy new machines for this or allocate some of the already existing 10 Gbit forwarder machines.
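The arithmetic behind this concern, for three 1 Gbit/s machines versus one 12 GByte image per 4 seconds:

```python
# Checking the arithmetic above: 1 Gbit/s per machine is 125 MB/s,
# so three such machines move 1.5 GB in 4 seconds, far short of
# one 12 GB image per 4 seconds.
bytes_per_s_at_1gbit = 1e9 / 8          # 1 Gbit/s expressed in bytes/second
mb_per_s = bytes_per_s_at_1gbit / 1e6   # 125 MB/s per machine
gb_per_4s = 3 * bytes_per_s_at_1gbit * 4 / 1e9  # three machines, 4 seconds
```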
Question
Overall comment
Steve Pietrowicz
Additional comments:
I'm not sure what you mean. Anything written to the OODS could be handled by this, and even with the Catch-Up Archiver, the number of directories involved would be only 30 days' worth.
—
If the inotify mechanism pans out (again, this has to be looked at more carefully with the directory creation and hard-links), there would be much less than 1 second latency.
——
I can't tell by how some of this is worded, but I presume that the copies here are pulls after a message that the file is ready. Is that the intent?
I think the main change we're talking about here is that the Forwarder itself no longer reads from the DAQ, but instead reads (copies) from the CDC when it is told to. This isn't just one system copying all the data from the CDC. This would be a series of raft Forwarders doing this.
We'd still need infrastructure to coordinate Forwarders to assign rafts and to detect when Forwarder systems go down so a new Forwarder could take over, which is what the ATArchiver does now. The main difference is that the old version faults when it detects a Forwarder is gone, while the new version would first try to recover by getting a new Forwarder.
We'd still want the Archive Controller process doing the hard-links for the hand-off to the OODS and DBB. This is something a single process could handle.
—
Again, I presume this a pull after a message from the CCS.
This would require infrastructure similar to the Archiver described above; the main difference is that the "Distributors" would be doing the pull instead of any Forwarders, and instead of having an Archive Controller doing the hard-links, we'd have the rendezvous mechanism for workers.