DRAFT: Comments and criticism expected!

Current Status

Now that crosstalk removal has been descoped from the Camera DAQ, there is only one version of the pixel image that can be and needs to be retrieved, rather than two. There are multiple destinations for this image:

  • Camera Diagnostic Cluster for automated visualization and rapid analysis
  • Active Optics System (AOS) for wavefront sensors only
  • Observatory Operations Data Service (OODS) for automated and ad hoc (including human-driven) rapid analysis
  • Data Backbone (DBB) for long-term reliable archival
  • For most science and possibly calibration images, Prompt Processing for execution of automated processing pipelines (primarily the Alert Production).

The first four of these are located in Chile at either the Summit or the Base. The Prompt Processing systems are located at NCSA, requiring transfer of the pixels over the international network. The desired latency for all of these is generally "as rapid as possible", with the exception of the DBB, which has an allowance of up to 24 hours. The DBB also transfers data over the international network, but more slowly.

The first four of these are expecting to persist the pixel data as FITS files in RAM disk or persistent file storage (e.g. GPFS); as a result, it makes sense to do the same for Prompt Processing as well, though care should be taken to minimize latencies in order to avoid delaying alert generation. All of these need to ingest the image files into a Data Butler repository to enable pipeline code to access the images using `butler.get()`. It is currently expected that these would be separate Butler repos. The data ID used for the Butler should include either the image name or group ID and image ("snap") number as well as the raft and detector/CCD identifiers. Each system needs to send an event to indicate that the image has been ingested and is thus available to pipelines.
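
As a minimal sketch of the pipeline-side access this enables, assuming a Gen3-style Butler; the repo path, collection name, and data ID keys shown are illustrative, not a confirmed schema:

```
from lsst.daf.butler import Butler

# Hypothetical OODS repo and collection; real names will differ per site.
butler = Butler("/repo/oods", collections=["LATISS/raw/all"])

# Data ID built from the identifiers described above (image/exposure name
# or group ID + snap number, plus raft/detector); keys are illustrative.
raw = butler.get("raw", dataId={
    "instrument": "LATISS",
    "exposure": 2020031500123,
    "detector": 0,
})
```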

The current baseline, as implemented for LATISS, has an image writer component of the Camera Control System (CCS) writing to the Camera Diagnostic Cluster. Another instance of this image writer is intended to be configured and deployed for the AOS. The OODS and DBB are fed by the DM Archiver and Forwarders, which are a separate image retrieval and writing system. Prompt Processing has not yet been implemented, but it was supposed to use another instance of the Forwarders. In addition, a Catch-Up Archiver is meant to feed the DBB with images that were otherwise missed, including those taken during a Summit-to-Base network outage.

An independent Header Service retrieves image metadata from events and telemetry, writing a single metadata object for each image that can be used by any image writer.
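
As a sketch of how an image writer might consume that per-image metadata object, assuming (hypothetically) a YAML file per image and astropy for the FITS update; the file name and key structure are placeholders, not the actual Header Service format:

```
import yaml
from astropy.io import fits

# Hypothetical per-image metadata object written by the Header Service.
with open("header_AT_O_20200315_000123.yaml") as f:
    metadata = yaml.safe_load(f)

# Merge the keywords into the primary HDU of the newly written image.
with fits.open("AT_O_20200315_000123.fits", mode="update") as hdul:
    for key, value in metadata["PRIMARY"].items():  # assumed layout
        hdul[0].header[key] = value
```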

Simplifications

Can some of these be combined?

Comparison of CCS with Archiver

The CCS image writer has advantages compared to the Archiver/Forwarder:

  • It is written by personnel who are physically and administratively close to the DAQ team, whereas the Archiver/Forwarder team is based at NCSA.
  • The application of which it is a part (the CCS) is the source of truth for many metadata and timing items, rather than being an independent CSC.
  • It has the shortest latency requirements.
  • It has been extensively tested on test stands at SLAC and Tucson, and more recently has undergone 9-raft focal-plane testing and ComCam testing.

But it also has disadvantages:

  • It is written in Java using a JNI wrapping of the DAQ C++ API; Java FITS APIs do not have identical feature sets compared with the Stack's Python/C++ APIs.  The Forwarder is written using the DAQ C++ API alone. (In both cases, workarounds for DAQ API misfeatures may have to be pushed upstream.)
  • It is intended for "streaming" use, where the prime goal is to capture and analyze (only) the most recent image taken.  The Forwarder is intended for "commanded" use, where the prime goal is to capture a specific image by name.  (There are two components: the JNI-wrapped DAQ driver API and the CCS image handling subsystem.  The DAQ driver API is a complete mapping of the DAQ image API and contains methods for retrieving any image from the 2-day store.  The existing image handling subsystem is currently designed only to stream the most recent event, but we anticipate making it able to retrieve any event for image visualization.)
  • It will fail to capture data if a node fails during or between image captures; it must be manually reconfigured to recover to normal operation or else it continues to fail to capture data.  The Forwarders are automatically tolerant to node failures between image captures and automatically recover (but lose data) if a node fails during image capture.
  • It does not yet have Butler ingest, pipeline execution, or Butler repo cache management capabilities (only a cron/find-based cleanup mechanism).  The Archiver is already integrated with the OODS.
  • It does not yet retrieve metadata written by the Header Service, although this is on the roadmap.

Catch-Up Archiver

An independent Catch-Up Archiver will be needed in any case. Neither the DM Archiver/Forwarder nor the CCS image writer can be considered 100% reliable in terms of capturing all science images. The Catch-Up Archiver could potentially reuse code from the image writer or Forwarder for pixel manipulation and file output as well as transfer to the DBB.

The Catch-Up Archiver can potentially live at the Summit. If 3 machines with 1 GB/sec inbound and outbound network bandwidth are allocated to the Catch-Up Archiver, it should be possible to copy data to the Base at the rate of one 12 GB image per 4 seconds, 4X the normal image capture rate. This is sufficient to empty the buffer after even a long outage. The Catch-Up Archiver does need to ingest to a Base-resident DBB and contact that DBB to know which images have already been archived.
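
A back-of-the-envelope check of that rate, using the figures as stated above (note the correction in the comments below, which reads the per-node bandwidth as 1 Gbit/s rather than 1 GB/s):

```
# Catch-up throughput under the assumptions stated above.
nodes = 3
gb_per_sec_per_node = 1.0          # as written; see correction in comments
image_size_gb = 12.0
seconds_per_image = image_size_gb / (nodes * gb_per_sec_per_node)
print(seconds_per_image)           # 4.0 s per image, 4x the normal rate
```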

Design Proposal

The primary simplification of the design that the descoping of crosstalk images allows is to remove the DM Archiver/Forwarder and use the CCS image writer in its place. This would remove a client from the DAQ, reducing the number of code bases that need to be supported. It would remove the need for any clients to live at the Base, also removing the need for DAQ networking to extend over the DWDM. Since the CCS is absolutely necessary for image-taking, it would remove the possibility of images being taken with the Archiver or Prompt Processing disabled. (Such images would eventually be retrieved by the Catch-Up Archiver, but they would at least be delayed for Prompt Processing purposes.) It eliminates the current duplication of images and the resultant confusion. It ensures that engineering images taken under CCS control are captured the same way as images taken under OCS control (though possibly with less metadata).

The following additions would need to be made:

  • Sufficient Summit-located compute resources, including hot spare nodes and network bandwidth, would need to be devoted to the Camera Diagnostic Cluster in order for it to also serve as the source of OODS, DBB, and Prompt Processing data. (I think an alternative would be to use the same client codebase but run two instances: one on the diagnostic cluster and one elsewhere for writing images to be archived. The pros and cons of such a scheme would need some thought.)
  • The CCS image writer code would need to be enhanced to add robustness and fault tolerance, to provide focal-plane-level telemetry, and to interface with the OODS, DBB, and Prompt Processing.  The mechanisms used by the current DM Archiver should serve as a reference, but they would have to be ported to the Java environment of the CCS.  Changing from "command" mode to "stream" mode may require adjustment.
  • Locating the entire OODS or DBB at the Summit is considered impossible at LSSTCam scale.  Either the images would have to be copied from the Camera Diagnostic Cluster to the Base for ingest into those systems, or direct ingest from the Summit to the Base would need to be arranged (skipped in the event of a network outage).  One possibility is to have the CCS image writer trigger a network copy to the Base upon successful image capture and then use the current "hand-off" mechanism to the OODS and DBB (see the sketch after this list).  This may require extending the CCS image writer to send messages or write to a shared database.
  • Prompt Processing should be fed directly by an international network copy from the Camera Diagnostic Cluster, rather than having an extra hop through the Base, in order to minimize latency.
  • A separate instance of the OODS code could be used to manage a historical image cache for visualization.
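
As a rough sketch of the trigger mentioned in the third bullet above, assuming bbcp for the Summit-to-Base copy (the tool already baselined for Forwarder transfers, per the comments below); the host, paths, and notification mechanism are all placeholders:

```
import json
import subprocess

def notify_handoff(payload: str) -> None:
    # Placeholder for the real mechanism: a RabbitMQ message, a SAL event,
    # or a row written to a shared database.
    print("handoff:", payload)

def on_image_written(fits_path: str, image_name: str) -> None:
    # Copy the finished file to the Base hand-off area.
    subprocess.run(
        ["bbcp", fits_path, f"base-handoff:/data/staging/{image_name}.fits"],
        check=True,
    )
    notify_handoff(json.dumps({"image": image_name, "status": "transferred"}))
```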

The first steps in a transition to this design would be:

  • Have the image writer get metadata from the Header Service.  This is already planned, but it would be critical to get this in place ASAP.
  • After successful image capture, copy the image (with metadata header) to a hand-off machine.  Send any messages or update any databases required to use the current OODS/DBB ingest code.  At this point, minimal functionality would be available for LATISS and test stands, including ComCam.
  • Implement the current Archiver telemetry and "successful ingest" events.
  • Upgrade the CCS image writer with Archiver-based robustness.

While Tony Johnson (the prime CCS author) is quite busy with LATISS commissioning, ComCam testing in Tucson, and LSSTCam integration and testing at SLAC, at least Steve Pietrowicz from NCSA could help with the Java-based aspects of this transition.

Header Service

Another possible simplification is to integrate the Header Service with the CCS image writer code. This has potential difficulties:

  • There will be a separate instance of the CCS image writer for the AOS.  It may be difficult to keep these instances in sync or keep multiple metadata objects separate.
  • Porting the current SAL-heavy Python code to Java may not be easy.

Nevertheless, this should be considered down the road, again because having the CCS perform this function can help ensure that it happens for every image and moves the metadata capture point close to the authoritative source for most of it.


13 Comments

  1. IMHO this refactoring of the archiver is a bit late, since it must be working with ComCam in less than a month.  Also, the impact on CCS work at this point in the project sounds like a non-starter.

  2. As I understand it, both Tony on the CCS side and the Archiver team are working to update their code for ComCam.  It might be more efficient to have them work together than in parallel.  Alternatively, the cost of integrating possibly already-working CCS image capture with the OODS might be less than revamping the Archiver/Forwarder for the 9-CCD ComCam.  And a final alternative is to postpone this integration until after ComCam is working with the dual paths.

  3. In addition, for the Camera Diagnostic Cluster to perform one of its design functions, analyzing images, it must obtain reasonable header metadata and integrate with Butler ingest in order to run DM pipelines.  So this isn't really new work, plus additional resources would be available to work on it.

  4. In my opinion, if we work together on a single solution for ComCam and eliminate code duplication, we will get something working faster than if we continue on our current route.

  5. (My public apologies for being overly harsh in my characterization of the CCS and much thanks to Tony for corrections.)

  6. DM and CCS seem to be in favor of this.  Do we need any more study of the impacts?  If not, can we make a decision on whether to proceed?

  7. The first criticism I had of the above outline is that it should be acknowledged that this presents a new set of risks for Prompt Processing (specifically in timing), since it is no longer using a "tested" method to ensure data transfer and job management.   I would not claim it is a severe risk, but the methodology using the Forwarder architecture did have a developed base and more than a simple proof of concept behind it.


    1. I wasn't aware that any method had been selected, let alone tested, for Prompt Processing job management.  I don't see any difference between the Forwarder triggering a bbcp copy (the baseline for Forwarder-to-Distributor data transfer, as I understand it) and the CCS image writer doing the same. So I don't think this increases the risk.

  8. I've been out for a couple of days for a personal matter, and haven't had a chance to make comments on this yet.  I will do so soon.



  9. These are comments on the original page K-T wrote… There have been some changes to the page while I was writing this.  


    Re: Advantages

    1. I don’t consider the proximity to the DAQ team versus NCSA to be an advantage. Both teams have been able to write images correctly.


    Re: Disadvantages

    1. I can’t comment on whether or not there have been issues with the JNI wrapping of the DAQ C++ API. If it’s a question of the speed at which it would operate, JNI itself isn’t a hindrance, and there are methods of passing data through JNI (Java NIO, if I remember correctly) at high speeds. I do agree that the Java FITS API not having an identical feature set compared to the DM Stack API could be a concern, but can we pinpoint a specific instance where it could be a problem? At one point, the DAQ C++ library (and the Java version, I believe) had issues with multiple threads reading simultaneously. Was this resolved?


    1. Version 1.0.0-rc2 of the OODS is not tightly integrated with the Archiver. It is triggered by a timer at a configurable interval (currently a 1-second maximum) to scan data in a directory tree. Version 1.0.0-rc3 is tightly integrated with the Archiver, triggered by a RabbitMQ message from the ATArchiver. Additionally, OODS 1.0.0-rc3 can send a message back to the Archiver once a file has been ingested into the Butler. The Archiver, in turn, issues a SAL message.


     The following is a slightly simplified version of what happens once the forwarder writes the file:


     Once the forwarder delivers a file, it messages the Archiver that it has done so. The Archiver then hard-links the file to a staging directory for the OODS and a staging directory for the DBB. It then deletes the original file to clean up after itself. The Archiver then messages the OODS to ingest the data. Hard-linking eliminates the time needed to create an additional file. It also instantly presents the file in each staging area, avoiding having the DBB rsync transfer a partial file (and, in version 1.0.0-rc2 of the OODS, attempting to do a Butler ingest before the forwarder completed writing the file). Note that neither the OODS nor the DBB should modify this file, because it is the same file in each directory. Changing the data in one will change it in both.
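
     A minimal sketch of that hard-link hand-off (paths illustrative): os.link creates a second directory entry for the same inode, so the file appears in each staging area at once, with no copy and no partial file:

```
import os

incoming = "/data/incoming/AT_O_20200315_000123.fits"

# One hard link per staging area; both names point at the same data.
for staging in ("/data/staging/oods", "/data/staging/dbb"):
    os.link(incoming, os.path.join(staging, os.path.basename(incoming)))

# Drop the original name to clean up; the data lives on via the links.
os.unlink(incoming)
```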


     There is also the issue of messaging to and from the OODS. The current messaging mechanism to trigger an ingest uses RabbitMQ. 

    The OODS sends out success/failure to the Archiver using RabbitMQ as well. It might be better to have the OODS directly issue SAL messages. It would need to have complete information (image origin - LATISS, ComCam, Catchup Archiver) in the ingest request message to send out the right type of message.


     It’s possible that, if files were hard-linked to image-origin-specific directories, a UNIX notify method (lsst-dm/ctrl_notify) could let us avoid RabbitMQ messaging entirely. However, the OODS would still need to implement SAL messages.
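
     For illustration, a directory-watch trigger of this general shape, using the generic watchdog package rather than lsst-dm/ctrl_notify (whose API is not shown here); the staging path is a placeholder:

```
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class IngestTrigger(FileSystemEventHandler):
    def on_created(self, event):
        # A new hard link appearing in the staging area triggers ingest.
        if not event.is_directory:
            print("ingest candidate:", event.src_path)

observer = Observer()
observer.schedule(IngestTrigger(), "/data/staging/oods/latiss", recursive=True)
observer.start()
observer.join()
```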



    Catch-Up Archiver Comments


    Don’t we need more than one Catch-Up Archiver (LATISS, ComCam, MT), or do we expect that this CSC will be able to deal with each device simultaneously? 


    Design Proposal Comments/Questions


    1. The archiver controller’s functionality needs to be replicated for message receiving/hard-linking and OODS messaging. It is a separate process from the Archiver and doesn’t need to be written in Java.
    2. In the original design, twenty-one Base Forwarders paired with twenty-one LDF Distributors. Is the plan to have each raft handed off to each of the Distributors? Or do distributors go entirely away? If so, what determines where the files are so they can be handed off to Prompt Processing jobs?


    Header Service integration with CCS


    I don’t see the advantage of integrating the HeaderService with the CCS. It would further complicate the CCS.


    Overall comments:


    As I stated above, I think we need to step back and think about what mechanisms we’re using for triggering the OODS. I’m a little concerned the straightforward mechanism we first used (and are using as of right now) is getting more complicated and less robust on restarts.


    One of my other concerns is what type of messaging mechanism we use to indicate a file has been entirely written so it can be acted upon (hard-linking, etc.). If we continue to use RabbitMQ, we’d have to get a Java library to integrate it. Since we already have to add a messaging library, this might be a good time to reconsider using Kafka, so that LSST uses two messaging systems (Kafka and DDS) rather than three.


  10. I don’t consider the proximity to the DAQ team versus NCSA to be an advantage. Both teams have been able to write images correctly.

    I'm mostly thinking about the ability to rapidly iterate and accurately communicate if changes are needed to the interface.  The past practice has been to do this communication in person (along with hardware installation), which has involved travel arrangements.

    The following is a slightly simplified version of what happens once the forwarder writes the file:

     Once the forwarder delivers a file, it messages the Archiver that it has done so. The Archiver then hard-links the file to a staging directory for the OODS and a staging directory for the DBB. It then deletes the original file to clean up after itself. The Archiver then messages the OODS to ingest the data. Hard-linking eliminates the time needed to create an additional file. It also instantly presents the file in each staging area, avoiding having the DBB rsync transfer a partial file (and, in version 1.0.0-rc2 of the OODS, attempting to do a Butler ingest before the forwarder completed writing the file). Note that neither the OODS nor the DBB should modify this file, because it is the same file in each directory. Changing the data in one will change it in both.

    I would foresee the CCS-based writer substituting for the Forwarder in this case, copying the newly-written file to the Archiver/hand-off location and allowing the hard-link/removal process to occur as it currently does.

    There is also the issue of messaging to and from the OODS. The current messaging mechanism to trigger an ingest uses RabbitMQ. 

    The OODS sends out success/failure to the Archiver using RabbitMQ as well. It might be better to have the OODS directly issue SAL messages. It would need to have complete information (image origin - LATISS, ComCam, Catchup Archiver) in the ingest request message to send out the right type of message.

    I was reluctant to add another CSC, particularly one at the Base, but the use of salobj makes this a good suggestion.  The OODS can presumably also act as a central gathering point for focal-plane-wide information.

    It’s possible that, if files were hard-linked to image-origin-specific directories, a UNIX notify method (lsst-dm/ctrl_notify) could let us avoid RabbitMQ messaging entirely. However, the OODS would still need to implement SAL messages.

    Since incoming files are not expected to build up at the hand-off, the number of directories needing to be watched might not be too bad.  Would this also work for Catch-Up Archiver, though?

    Catch-Up Archiver Comments

    Don’t we need more than one Catch-Up Archiver (LATISS, ComCam, MT), or do we expect that this CSC will be able to deal with each device simultaneously? 

    We need one per DAQ partition, yes.  I would hope that all instances could use identical code.

    Design Proposal Comments/Questions

    In the original design, twenty-one Base Forwarders paired with twenty-one LDF Distributors. Is the plan to have each raft handed off to each of the Distributors? Or do distributors go entirely away? If so, what determines where the files are so they can be handed off to Prompt Processing jobs?

    Distributors remain and can remain at the raft level.

    Header Service integration with CCS

    I don’t see the advantage of integrating the HeaderService with the CCS. It would further complicate the CCS.

    The advantages I see are:

    • Having the Header Service integrated into the CCS means that it can never be disabled or offline when an image is being taken.
    • Avoiding a round-trip through the LFA improves latency.
    • Many telemetry items are already coming from the CCS.  Not needing to round-trip them through SAL might be simpler, although they still need to be published to the EFD.

    The first is the most important here.  I agree there is an increase in complexity, especially with the large number of SAL events and telemetry messages that need to be watched.

    As I stated above, I think we need to step back and think about what mechanisms we’re using for triggering the OODS. I’m a little concerned the straightforward mechanism we first used (and are using as of right now) is getting more complicated and less robust on restarts.

    I'm open to rethinking a message-triggering mechanism as long as the latency is sufficient for all users.  I'm worried that 1 sec is not low enough.

    One of my other concerns is what type of messaging mechanism we use to indicate a file has been entirely written so it can be acted upon (hard-linking, etc.). If we continue to use RabbitMQ, we’d have to get a Java library to integrate it. Since we already have to add a messaging library, this might be a good time to reconsider using Kafka, so that LSST uses two messaging systems (Kafka and DDS) rather than three.

    Using Kafka to provide information about archiving status and even potentially Prompt Processing status from the LDF, where latency is not a serious concern, could be reasonable.  (Also I believe that the CCS uses Java messaging internally so there is yet another system.)
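
    A minimal sketch of what that Kafka status reporting could look like, using the kafka-python client; the broker address, topic, and message schema are placeholders:

```
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Hypothetical archiving-status event published from the LDF.
producer.send("archiver-status",
              {"image": "AT_O_20200315_000123", "status": "ingested"})
producer.flush()
```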

  11. Sorry, I just found out about this document today. Even though I like the idea of having one source of truth for the data, there are questions we should address about what this decision entails:

    Prompt Processing

    • TCP packet loss is going to be an issue on the long-haul network. The Prompt Processing system involves much more than file transfer; it needs to at least handle packet loss, load balancing, and job management.
    • Currently, we do not achieve reliable performance with `bbcp`. In the Prompt Processing system, a good question is whether to group files together to send or simply to send one raft at a time.
    • The DAQ produces ~35 MB/CCD (compressed) × 9 CCDs/raft × 22 rafts ≈ 7 GB of data per second for the whole camera. With 100 Gbit/s international connectivity, even if we use all the bandwidth from Chile to the United States, we can push 12 GB of data per second in theory. With enough packet loss, there will be a backlog of data while running the telescope, and that should be accounted for in the design of Prompt Processing.
    • The CCS diagnostic cluster has to be paired with machines at the LDF for file transfer. There has to be code for this machine management, with some kind of messaging system: Kafka, RabbitMQ, or something else?

    HeaderService

    • The current CCS writer does not have to worry about when to write a FITS file because it is not reliant on other services. But when the CCS starts interacting with the HeaderService, it has to wait for and synchronize with the HeaderService finishing the header file. (Since the CCS has to implement this step, it has to be figured out anyway.)
    • Also, integrating or reimplementing the current HeaderService might involve further complexity and time.

    OODS

    • Interfacing with the OODS has to be reimplemented. The easier approach would be scanning the directory. If a messaging topology is used, the CCS has to port the current Archiver messaging code or explore a new messaging interface.

    Catchup Archiver

    • Correction to ```If 3 machines with 1 GB/sec inbound and outbound network bandwidth are allocated to the Catch-Up Archiver, it should be possible to copy data to the Base at the rate of one 12 GB image per 4 seconds, 4X the normal image capture rate.```
      Network bandwidth is 1 Gbit/second, which can move 125 MB/second, so three such machines can move about 1.5 GB per 4 seconds, roughly 10 times slower than claimed. The Forwarder machines are attached to 10 Gbit/second bandwidth. Also, this raises the question of whether we buy new machines for this or allocate from the already existing 10 Gbit Forwarder machines.

    Question

    • "Prompt Processing should be fed directly by an international network copy from the Camera Diagnostic Cluster, rather than having an extra hop through the Base, in order to minimize latency." Isn't 100 GBits network from the Base to the United States? If so, having it at the summit doesn't necessarily make one less hop.

    Overall comment

    • The DAQ has two modes: streaming and the 2-day store. If the CCS currently uses the streaming API, it has to add 2-day-store support for the Catch-Up Archiver.  Per Mike Huffer, this shouldn't be difficult. The Catch-Up Archiver cannot use the streaming API because the data is no longer streaming.
    • One source of truth is always better.
  12. Additional comments:


    Since incoming files are not expected to build up at the hand-off, the number of directories needing to be watched might not be too bad.  Would this also work for Catch-Up Archiver, though?


    I'm not sure what you mean.  Anything written to the OODS could be handled by this, and even with the Catch-Up Archiver, the number of directories involved would be 30 days worth.


    I'm open to rethinking a message-triggering mechanism as long as the latency is sufficient for all users.  I'm worried that 1 sec is not low enough.


    If the inotify mechanism pans out (again, this has to be looked at more carefully with the directory creation and hard-links), there would be much less than 1 second latency.


    ——

    I can't tell from how some of this is worded, but I presume that the copies here are pulls after a message that the file is ready.   Is that the intent?



    One possibility is to have the CCS image writer trigger a network copy to the Base upon successful image capture and then use the current "hand-off" mechanism to the OODS and DBB.  This may require extending the CCS image writer to send messages or write to a shared database.


    I think the main change we're talking about here is that the Forwarder itself no longer reads from the DAQ, but instead reads (copies) from the CDC when it is told to.   This isn't just one system copying all the data from the CDC.  This would be a series of raft Forwarders doing this.


    We'd still need infrastructure to coordinate Forwarders to assign rafts, to detect when Forwarder systems go down so a new Forwarder could take over, which is what the ATArchiver does now.  The main difference is that the old version faults when it detects a Forwarder is gone, and the new version wouldn't fault without trying to recover by getting a new Forwarder.


    We'd still want the Archive Controller process doing the hard-links for the hand-off to the OODS and DBB.  This is something a single process could handle.


    Again, I presume this a pull after a message from the CCS.


    Prompt Processing should be fed directly by an international network copy from the Camera Diagnostic Cluster, rather than having an extra hop through the Base, in order to minimize latency.


    This would require similar infrastructure to the Archiver described above; the main difference would be that the "Distributors" would be doing the pull instead of having any Forwarders, and instead of having an Archive Controller doing the hard-links, we'd have the rendezvous mechanism for workers.