Location

SQuaRE Zoom: https://ls.st/wyp

Time

11:00 am PT

Attendees

Goals

  • Share knowledge of the EFD troubleshooting procedures and operation

Discussion items

Time | Item | Who | Notes
30 min | 1. Review past week EFD events | All

See #com-efd-status 

Saturday, September 9th 

  • Heartbeats missing for the EAS CSC at USDF.
  • Data corruption errors in Kafka at USDF affected the lsst.sal.HVAC.manejadoraSblancaP04 topic, causing the EAS connector to fail.

Sunday, September 10th

  • Heartbeats missing for all CSCs at USDF
  • Data corruption errors in Kafka at USDF affected the mirrormaker2-cluster-offsets topic, an MM2 system topic, which stopped replication.

Troubleshooting happened on Monday morning; see DM-40723 for details.

Monday, September 11th

  • Control system reset on the Summit.

Thursday, September 14th

  • Replica pool filled up on the Summit.
  • We wiped out the schemas in an attempt to restart Kafka manually. We need a backup for the schema registry topic.
  • We recovered the Summit by temporarily applying the consumer.auto.offset.reset=latest configuration.
  • This had a ripple effect at USDF, which is not entirely recovered yet.
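As background for the recovery step above, here is a minimal sketch of how Kafka's auto.offset.reset policy picks a starting offset when a consumer has no usable committed offset. It is a simplified model for discussion, not Kafka's implementation; the function name `starting_offset` and the example offsets are hypothetical.

```python
from typing import Optional


def starting_offset(committed: Optional[int], log_start: int, log_end: int,
                    policy: str) -> int:
    """Return the offset a consumer would start from (simplified model).

    log_start is the oldest retained offset and log_end is the offset the
    next record will receive.
    """
    # A committed offset inside the retained range is always honoured.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    # No committed offset, or it is out of range: the reset policy decides.
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    # Any other policy (e.g. "none") is treated as an error in this model.
    raise ValueError(f"no valid offset and reset policy is {policy!r}")


# Hypothetical partition retaining offsets 100..500 with a stale committed offset.
print(starting_offset(42, 100, 500, "earliest"))  # 100
print(starting_offset(42, 100, 500, "latest"))    # 500
```

With `latest`, records between the last good position and the end of the log are skipped, which is one reason to apply the setting only temporarily.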
5 min | Ceph performance degradation on the Base cluster | Angelo
  • Ceph on the Base cluster shows ~10x performance degradation on sequential writes (e.g. EFD database restore)

TODO: report details for Cristian on a ticket

  • Asked IT Chile for a server with local attached SSDs for performance comparison 
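For the performance comparison, a rough sequential-write check could look like the sketch below. It is not the procedure used to measure the degradation reported above, and a dedicated tool such as fio would give more reliable numbers. The function name, block size, and file size are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory: str, total_mb: int = 64,
                              block_kb: int = 1024) -> float:
    """Write total_mb of data sequentially into directory and return MB/s."""
    block = os.urandom(block_kb * 1024)       # one reusable block of random bytes
    n_blocks = (total_mb * 1024) // block_kb  # number of blocks to write
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())              # include the flush to storage in the timing
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)                       # always clean up the test file
    return (n_blocks * block_kb / 1024) / elapsed


# Example: measure the system temp directory with a small 8 MB file.
rate = sequential_write_mb_per_s(tempfile.gettempdir(), total_mb=8)
print(f"{rate:.1f} MB/s")
```

Running the same function against a Ceph-backed path and a local SSD path would give a first-order ratio to attach to the ticket.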
5 min | 2. InfluxDB Enterprise, POC timeline? | Angelo, all
  • Meeting with InfluxData on Thursday
  • The POC is tentatively targeted for November
  • We have to make sure we have a server with locally attached SSDs ready for the POC
  • Apparently we have a single-node Enterprise trial license that we can already use. (Angelo to confirm)
  • Next meeting October 5th.
10 min | 3. Inconsistencies in the EFD data between Summit and EFD | Angelo
  • We are running Strimzi 0.37.0 and Kafka 3.5.1 now
  • Enabled MM2 auto restart feature
  • Reviewed connector configuration (DM-40664)

Next step: deploy the Telegraf connector on the Base, writing to InfluxDB v1, and compare the results.
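One possible way to frame the comparison is to collect per-topic record counts from each database over the same time window and list the topics whose counts differ. The sketch below assumes the counts have already been queried into plain dictionaries; the function name and the example topic names and counts are hypothetical.

```python
from typing import Dict, Tuple


def count_mismatches(source: Dict[str, int],
                     replica: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {topic: (source_count, replica_count)} for topics that differ."""
    mismatches = {}
    for topic in sorted(set(source) | set(replica)):
        s = source.get(topic, 0)   # a topic missing on one side counts as 0
        r = replica.get(topic, 0)
        if s != r:
            mismatches[topic] = (s, r)
    return mismatches


# Hypothetical counts for the same time window on two sites.
summit = {"topic.a": 1000, "topic.b": 250, "topic.c": 75}
base = {"topic.a": 1000, "topic.b": 240}
print(count_mismatches(summit, base))
# {'topic.b': (250, 240), 'topic.c': (75, 0)}
```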

10 min | AOB

Action items

  •