Location

SQuaRE Zoom: https://ls.st/wyp

Time

11:00 am PT

Attendees

Goals

  • Share knowledge of the EFD troubleshooting procedures and operation

Discussion items

Time | Item | Who | Notes
15 min | 1. Review past week EFD events | All

See #com-efd-status 

Monday, Aug 28

  • Cycle 32 upgrade

Tuesday, Aug 29

  • OSPL daemon on yagan08 crashed at the Summit

Wednesday, Aug 30 

  • Kafka restarted at the Summit for an unknown reason. Self-consistency checks based on the private_seqnum indicate message loss that day, but we need to improve that check and see if the timestamp of missing data correlates with the timestamp of the Kafka restart.
  • Maintenance at USDF: InfluxDB Sink connectors were rescheduled to another k8s node. Consistency checks show no message loss that day between Summit and USDF.
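
One possible way to tighten the private_seqnum check and relate gaps to the Kafka restart, as a minimal sketch only. It assumes each record is a (timestamp, private_seqnum) pair for a single topic, sorted by time, and that the sequence number increases by one per message. The function names, the sample data, and the 5 minute window are hypothetical and are not taken from the existing notebook.

```python
from datetime import datetime, timedelta


def find_seqnum_gaps(records):
    """Return (last_seen_time, next_seen_time, n_missing) for each gap.

    records: iterable of (timestamp, private_seqnum), sorted by timestamp.
    A drop in the sequence number is treated as a producer restart,
    not as message loss (assumption).
    """
    gaps = []
    prev_time, prev_seq = None, None
    for ts, seq in records:
        if prev_seq is not None and seq > prev_seq + 1:
            gaps.append((prev_time, ts, seq - prev_seq - 1))
        prev_time, prev_seq = ts, seq
    return gaps


def gaps_near_event(gaps, event_time, window=timedelta(minutes=5)):
    """Keep gaps whose time span overlaps event_time +/- window."""
    lo, hi = event_time - window, event_time + window
    return [g for g in gaps if g[0] <= hi and g[1] >= lo]


# Made-up sample data: messages 4, 5 and 6 are missing.
t0 = datetime(2000, 1, 1, 12, 0, 0)
records = [
    (t0, 1),
    (t0 + timedelta(seconds=1), 2),
    (t0 + timedelta(seconds=2), 3),
    (t0 + timedelta(seconds=10), 7),
    (t0 + timedelta(seconds=11), 8),
]
restart_time = t0 + timedelta(seconds=5)  # hypothetical restart time

gaps = find_seqnum_gaps(records)
suspect = gaps_near_event(gaps, restart_time)
print(gaps)
print(suspect)
```

Gaps that do not show up in `suspect` would point to a cause other than the restart.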

Thursday, Aug 31

  • !! We have been unable to operate the Observatory (and thus collect any engineering data) since Thursday morning (Yagan maintenance) !!

NOTE: It seems that we are no longer missing Kapacitor notifications from USDF to Slack (DM-40098).

5 min | 2. InfluxDB Enterprise | Angelo
  • Meeting with InfluxData on Thursday. Acceptance Criteria reviewed internally and shared with InfluxData.
  • Meeting again on Thursday, September 14th
  • We need to prepare for the POC phase 


15 min | 3. Inconsistencies in the EFD data between Summit and EFD | All
  • We need to improve the Consistency Checks notebook
  • Improvements:
    • podAntiAffinity configuration applied to all environments (DM-40510)
    • Upgrade Kafka to 3.5.1 (important MM2 fixes)
    • Move Kafka data partitions to local disk at Base
    • Dedicated nodes for Kafka at USDF?
  • We need a "control environment". It could be Base (if we can move the Kafka data partitions to local disk) or Google. To deploy at Google we need to bypass the VPN since MM2 runs on the target cluster and needs to connect to the Summit. Other ideas?
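
As a starting point for improving the consistency checks, a minimal sketch of a per-window count comparison between two environments. It assumes the message timestamps for one topic are available as Unix seconds from both sides. The function names, the 60 second window, and the sample values are hypothetical.

```python
from collections import Counter


def count_per_window(timestamps, window_s=60):
    """Count messages per fixed-size window; timestamps are Unix seconds."""
    return Counter(int(ts // window_s) * window_s for ts in timestamps)


def compare_counts(source_ts, replica_ts, window_s=60):
    """Return {window_start: (n_source, n_replica)} for windows that differ."""
    src = count_per_window(source_ts, window_s)
    rep = count_per_window(replica_ts, window_s)
    return {
        w: (src[w], rep[w])
        for w in sorted(set(src) | set(rep))
        if src[w] != rep[w]
    }


# Made-up sample data: the replica is missing the message at t=70.
summit = [0, 10, 20, 65, 70, 130]
usdf = [0, 10, 20, 65, 130]

diff = compare_counts(summit, usdf)
print(diff)
```

The same comparison could be run against a control environment to help tell replication problems apart from problems at the source.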
5 min | 2. USDF data corruption errors and repairer connectors | Angelo

We haven't seen any data corruption errors this week. Repairer connectors are still running at USDF to help investigate this issue.


10 min | AOB

(Hsin-Fang): If we have time, I'd like to see what I may do for adding the JDBC connector.  It looks to me that the JDBC sink is already supported in kafka-connect-manager.  Testing before turning it on? 

Action items

  • DM-40515 Test CSC is not recording data to the expected InfluxDB measurement (Angelo Fausti)
  • Fix Kapacitor configuration at Base