Date

Attendees

Goals

  • Share knowledge of the EFD troubleshooting procedures and operation

Discussion items

Time Item Who Notes
5 min 1. Test CSC is not recording data to the expected InfluxDB measurement Michael
  • Angelo mentioned it might be a bug in the regexp that filters the Kafka topics in the connector configuration. Angelo will open a ticket to capture that.
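
A suspected topic-filter regexp can be checked offline by running it against a list of topic names. The following is only a minimal sketch: the pattern and the topic names are hypothetical (they are not taken from the connector configuration), and it assumes the filter must match the whole topic name. Kafka Connect evaluates Java regular expressions, so Python's `re` module may differ in edge cases.

```python
import re


def matching_topics(pattern, topics):
    """Return the topics whose full name matches the pattern."""
    compiled = re.compile(pattern)
    return [topic for topic in topics if compiled.fullmatch(topic)]


# Hypothetical topic names and pattern, for illustration only.
topics = [
    "lsst.sal.Test.logevent_heartbeat",
    "lsst.sal.Test.scalars",
    "lsst.sal.MTMount.azimuth",
]
pattern = r"lsst\.sal\.Test\..*"

for topic in matching_topics(pattern, topics):
    print(topic)
```

If a topic that should be recorded is missing from the printed list, the regexp is a likely suspect.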
10 min 2. Michael's questions on troubleshooting procedures Michael
  • Topics with multiple partitions: hard to recover from data corruption errors as we have to repeat the recovery procedure many times. Hopefully, we get to the bottom of these data corruption errors at USDF.
  • Timeouts in Kafka Connect
    • Sometimes we get a 500 error and need to run the kafkaconnect commands twice.
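
The "try it twice" workaround could be wrapped in a small retry helper. This is a sketch under assumptions, not an agreed procedure: `run_with_retry` is a hypothetical helper, and it assumes the command reports the failure through a non-zero exit code, which was not confirmed in the meeting.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=2, delay=5.0, runner=subprocess.run):
    """Run cmd and retry after a non-zero exit code, up to `attempts` tries.

    Returns the result of the last try, whether it succeeded or not.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Give the service a moment before the next try.
            time.sleep(delay)
    return result
```

The helper would be called with the command as a list, for example `run_with_retry(["kafkaconnect", "<subcommand>"])`, where the subcommand is left as a placeholder.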
15 min 2. Review past EFD events All

See #com-efd-status

Sunday Aug 13 

  • MTMount connector failed at USDF with a data corruption error
  • We believe this triggered a problem in MM2: replication stopped a few hours later.

Monday Aug 14 

  • MTMount connector failure recovered
  • Restarted MM2 to resume replication 

Tuesday Aug 15

  • 5-minute glitch in the LHN around 2:05 am UTC (we didn’t get alerts because of DM-40098)
  • MM2 recovered automatically 

Thursday Aug 17

  • Lost connectivity to the Summit for 20 minutes
  • MM2 recovered automatically
10 min 3. USDF data corruption errors and repairer connectors Angelo

We deployed a second set of connectors at USDF this week. The idea is that these connectors will help to isolate the data corruption events we see.

If both the current and repairer connectors fail to consume the same Kafka message, it indicates that the problem is a corrupted message instead of a bug in the connector.
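
The comparison described above can be sketched as a set intersection over the positions at which each connector failed. This is illustrative only: `classify_failures` is a hypothetical helper, the (topic, partition, offset) tuples are made up, and how the failure positions are extracted from the connector logs is not covered here.

```python
def classify_failures(current_failures, repairer_failures):
    """Compare the message positions at which each connector failed.

    Each failure is a (topic, partition, offset) tuple. A position on which
    both connectors failed points to a corrupted message rather than a
    connector bug.
    """
    current = set(current_failures)
    repairer = set(repairer_failures)
    return {
        "likely_corrupted_message": sorted(current & repairer),
        "only_current": sorted(current - repairer),
        "only_repairer": sorted(repairer - current),
    }


# Made-up failure positions, for illustration only.
current = [("topic-a", 0, 1200), ("topic-b", 1, 57)]
repairer = [("topic-a", 0, 1200)]

for category, positions in classify_failures(current, repairer).items():
    print(category, positions)
```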

10 min 4. Improvements to the Heartbeats dashboard All

In conjunction with the connector alerts, this dashboard has helped identify missing data in the EFD.

  • Make sure we show heartbeats for all CSCs
  • We can group CSCs by connector
  • We can correlate heartbeats with alerts by filtering alerts by the AlertName corresponding to each connector. This will help us identify the timestamps around which data might be missing because of events, and link them to the data consistency checks.
  • We concluded that displaying the CSC state information does not add value. If we are not recording data, the last recorded state does not represent the current state of the CSC in the control system.
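
The correlation between heartbeats and alerts could look roughly like the sketch below: find gaps in a CSC's heartbeat timestamps, then list the alerts that fall inside each gap. The function names, the gap threshold, and all timestamps are assumptions for illustration; this does not describe how the dashboard is implemented.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_interval):
    """Return (start, end) pairs where consecutive heartbeats are too far apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]


def alerts_in_gaps(gaps, alert_times):
    """Map each gap to the alert timestamps that fall inside it."""
    return {gap: [t for t in alert_times if gap[0] <= t <= gap[1]] for gap in gaps}


# Made-up data: heartbeats stop for about five minutes, one alert fires meanwhile.
t0 = datetime(2000, 1, 1)
heartbeats = [t0 + timedelta(seconds=s) for s in (0, 1, 2, 300, 301)]
alerts = [t0 + timedelta(seconds=100), t0 + timedelta(seconds=500)]

gaps = find_gaps(heartbeats, max_interval=timedelta(seconds=10))
for gap, hits in alerts_in_gaps(gaps, alerts).items():
    print(gap[0], gap[1], hits)
```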

Action items
