2023-08-25 EFD Walkthrough Meeting notes

Location

Time

11:00 am PT

Attendees

Goals

Share knowledge of the EFD troubleshooting procedures and operation

Discussion items

Time	Item	Who	Notes
15 min	1. Review past week EFD events	All	See #com-efd-status. Wednesday, Aug 23: Missing heartbeats for essentially all connectors at USDF, except MTMount, around 5:40 am (PT). They recovered without intervention. We initially thought it was a data replication glitch related to the network incident reported by REUNA (https://lsstc.slack.com/archives/CPAAQQB7W/p1692794400139539), but it is related to Kafka at USDF (see item 3). Thursday, Aug 24: Missing heartbeats for the LATISS, AuxTel, and Calsys connectors at USDF around 11:10 (PT). This time, a manual restart of the connectors was required. Note that we are still missing Kapacitor notifications from USDF to SLAC (DM-40098).
5 min	2. USDF data corruption errors and repairer connectors	Angelo	We didn't have any data corruption errors this week. Repairer connectors are still running at USDF.
20 min	3. Inconsistencies in the EFD data between Summit and USDF	All	We are using DM-39723 to track this problem. The data consistency checks notebook addresses two questions: Are all messages being recorded at the Summit EFD? Is Summit EFD data correctly replicated to USDF? It shows missing data at USDF related to the Wednesday, Aug 23 event. We correlated that with connector restarts due to timeout errors connecting to the Kafka brokers at USDF. Similarly, the Thursday, Aug 24 events are associated with errors connecting to the Kafka brokers at USDF. Possible actions: (1) podAntiAffinity configuration to make sure Kafka pods do not run on the same node, and likewise for Zookeeper (DM-40510); (2) dedicated nodes for Kafka at USDF; (3) move Kafka data partitions to local disk; (4) review connector configuration. Under normal conditions (i.e., no timeouts or connection errors when connecting to the Kafka brokers), we don't expect message loss upon connector restarts.
10 min	AOB
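
Illustration for item 3: a minimal sketch of the kind of replication check discussed, not the actual data consistency checks notebook. It assumes message timestamps for one topic have already been fetched from the Summit and USDF EFDs as Python datetime lists. The function names, the 5-minute window, and the sample data are hypothetical and synthetic.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def count_per_window(timestamps, window=timedelta(minutes=5)):
    """Count messages per fixed-size window, keyed by window start time."""
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of its epoch-aligned window.
        index = (ts - EPOCH) // window
        counts[EPOCH + index * window] += 1
    return counts


def find_replication_gaps(summit_ts, usdf_ts, window=timedelta(minutes=5)):
    """Return (window_start, summit_count, usdf_count) where USDF has fewer messages."""
    summit = count_per_window(summit_ts, window)
    usdf = count_per_window(usdf_ts, window)
    return [
        (start, n, usdf.get(start, 0))
        for start, n in sorted(summit.items())
        if usdf.get(start, 0) < n
    ]


# Synthetic data: one message per minute at the Summit for 15 minutes,
# with minutes 5 to 9 missing at USDF (hypothetical, not real EFD data).
base = datetime(2023, 8, 23, 12, 40, tzinfo=timezone.utc)
summit_ts = [base + timedelta(minutes=i) for i in range(15)]
usdf_ts = [ts for i, ts in enumerate(summit_ts) if not 5 <= i < 10]

gaps = find_replication_gaps(summit_ts, usdf_ts)
for start, n_summit, n_usdf in gaps:
    print(f"{start.isoformat()}: Summit={n_summit} USDF={n_usdf}")
```

Comparing counts per window rather than individual messages keeps the check cheap, and the flagged windows can then be lined up against connector restart times.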

Action items

DM-40515: Test CSC is not recording data to the expected InfluxDB measurement (Angelo Fausti)
