Location
SQuaRE Zoom: https://ls.st/wyp
Time
11:00 am PT
Attendees
Goals
- Share knowledge of the EFD troubleshooting procedures and operation
Discussion items
Time | Item | Who | Notes |
---|
15 min | 1. Review past week EFD events | All | See #com-efd-status
Wednesday, Aug 23 - Missing heartbeats for essentially all connectors at USDF, except MTMount, around 5:40 am (PT). The connectors recovered on their own.
- We initially thought it was a data replication glitch related to the network incident reported by REUNA (https://lsstc.slack.com/archives/CPAAQQB7W/p1692794400139539), but it turned out to be related to Kafka at USDF (see item 3 below)
Thursday, Aug 24 - Missing heartbeats for LATISS, AuxTel, and Calsys connectors at USDF around 11:10 (PT). This time, a manual restart of the connectors was required.
- Note that we are still missing Kapacitor notifications from USDF to SLAC
DM-40098
|
5 min | 2. USDF data corruption errors and repairer connectors | Angelo | We didn't have any data corruption errors this week. Repairer connectors are still running at USDF.
|
20 min | 3. Inconsistencies in the EFD data between Summit and USDF
| All | - We are using DM-39723 to track this problem.
- Data consistency checks notebook
- It shows missing data at USDF related to the Wednesday, Aug 23 event. We correlated that with connector restarts due to timeout errors connecting to the Kafka brokers at USDF.
- Similarly, the Thursday, Aug 24 events are associated with errors trying to connect to the Kafka brokers at USDF.
Possible actions:
- podAntiAffinity configuration to make sure Kafka pods do not run on the same node (same for ZooKeeper)
DM-40510
- Dedicate nodes to Kafka at USDF
- Move Kafka data partitions to local disk
- Review connector configuration. Under normal conditions (i.e., no timeouts or connection errors when connecting to the Kafka brokers), we don't expect message loss upon connector restarts.
|
10 min | AOB |
|
|
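For reference, the kind of comparison discussed under item 3 can be sketched as follows. This is not the data consistency checks notebook itself; it is a minimal illustration that assumes the message timestamps for one topic have already been retrieved from both sites. The function name `find_missing_intervals`, the `max_gap` parameter, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_missing_intervals(summit_times, usdf_times, max_gap):
    """Group Summit timestamps that are absent at USDF into (first, last) intervals.

    Two consecutive missing timestamps fall in the same interval when they
    are at most `max_gap` apart.
    """
    usdf = set(usdf_times)
    missing = sorted(t for t in set(summit_times) if t not in usdf)
    intervals = []
    for t in missing:
        if intervals and t - intervals[-1][1] <= max_gap:
            # Extend the current interval.
            intervals[-1] = (intervals[-1][0], t)
        else:
            # Start a new interval.
            intervals.append((t, t))
    return intervals


# Synthetic data (an assumption for illustration): one message per second
# at the Summit, with samples 3-5 and 8 never arriving at USDF.
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
summit = [start + timedelta(seconds=i) for i in range(10)]
usdf = [t for i, t in enumerate(summit) if i not in (3, 4, 5, 8)]

gaps = find_missing_intervals(summit, usdf, max_gap=timedelta(seconds=1))
for first, last in gaps:
    print(first.isoformat(), "->", last.isoformat())
```

The resulting intervals could then be compared by hand against the times of connector restarts, which is the correlation described in the notes.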
Action items