Location

SQuaRE Zoom: https://ls.st/wyp

Time

11:00 am PT

Attendees

Goals

  • Share knowledge of the EFD troubleshooting procedures and operation

Discussion items

Time | Item | Who | Notes
15 min | 1. Review past week EFD events | All

See #com-efd-status 

Wednesday, Aug 23 

  • Missing heartbeats for essentially all connectors at USDF, except MTMount, around 5:40 am (PT). The connectors recovered on their own.
  • We initially suspected a data replication glitch related to the network incident reported by REUNA (https://lsstc.slack.com/archives/CPAAQQB7W/p1692794400139539), but the cause is related to Kafka at USDF (see item 3 below).

Thursday, Aug 24 

  • Missing heartbeats for LATISS, AuxTel, and Calsys connectors at USDF around 11:10 (PT). This time, a manual restart of the connectors was required. 
  • Note that we are still missing Kapacitor notifications from USDF to SLAC (DM-40098).
5 min | 2. USDF data corruption errors and repairer connectors | Angelo

We didn't have any data corruption errors this week. Repairer connectors are still running at USDF.


20 min | 3. Inconsistencies in the EFD data between the Summit and USDF | All
  • We are using DM-39723 to track this problem.
  • Data consistency checks notebook 
    • Are all messages being recorded at the Summit EFD?

    • Is Summit EFD data correctly replicated to USDF?

  • The notebook shows missing data at USDF corresponding to the Wednesday, Aug 23 event. We correlated this with connector restarts caused by timeout errors when connecting to the Kafka brokers at USDF.
  • Similarly, the Thursday, Aug 24 events are associated with errors trying to connect to the Kafka brokers at USDF.
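
As an illustration of the second check (is Summit EFD data correctly replicated to USDF?), the comparison can be sketched as counting messages per time window at each site and reporting the windows where USDF has fewer. This is not the notebook's actual code: the function names are hypothetical, and it assumes the message timestamps for one topic have already been fetched from each EFD as timezone-aware datetimes.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_per_window(timestamps, window=timedelta(minutes=1)):
    """Count messages per fixed-size time window, keyed by window start (UTC)."""
    seconds = window.total_seconds()
    counts = Counter()
    for ts in timestamps:
        # Round the epoch time down to the start of its window.
        start = ts.timestamp() // seconds * seconds
        counts[datetime.fromtimestamp(start, tz=timezone.utc)] += 1
    return counts


def find_replication_gaps(summit, usdf, window=timedelta(minutes=1)):
    """Return {window_start: missing_count} for windows where USDF has fewer messages."""
    source = count_per_window(summit, window)
    replica = count_per_window(usdf, window)
    return {
        start: n - replica.get(start, 0)
        for start, n in sorted(source.items())
        if replica.get(start, 0) < n
    }
```

Windows reported by such a check can then be lined up against connector restart times, as was done for the Aug 23 and Aug 24 events.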

Possible actions:

  • Configure podAntiAffinity so that Kafka pods do not run on the same node, and likewise for ZooKeeper (DM-40510)
  • Dedicate nodes to Kafka at USDF
  • Move Kafka data partitions to local disk
  • Review the connector configuration. Under normal conditions (i.e., no timeouts or connection errors when connecting to the Kafka brokers), we do not expect message loss upon connector restarts.
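
For reference, the podAntiAffinity action could look roughly like the stanza below, written here as a Python dictionary that mirrors the standard Kubernetes affinity fields. This is a minimal sketch only: the `app: kafka` and `app: zookeeper` labels and the helper name are hypothetical, and where the stanza goes (pod spec or chart values) depends on how Kafka is deployed at USDF, which these notes do not specify.

```python
import json


def pod_anti_affinity(labels, topology_key="kubernetes.io/hostname"):
    """Build an affinity stanza that keeps pods matching `labels` on separate nodes."""
    return {
        "podAntiAffinity": {
            # "required" makes this a hard rule: the scheduler will not
            # co-locate two matching pods within the same topology domain.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": dict(labels)},
                    # One topology domain per node.
                    "topologyKey": topology_key,
                }
            ]
        }
    }


# Hypothetical labels; the real selectors depend on how the pods are labeled.
kafka_affinity = pod_anti_affinity({"app": "kafka"})
zookeeper_affinity = pod_anti_affinity({"app": "zookeeper"})
print(json.dumps(kafka_affinity, indent=2))
```

A hard rule like this requires at least as many schedulable nodes as replicas; otherwise some pods stay pending.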
10 min | AOB

Action items