Location
SQuaRE Zoom: https://ls.st/wyp
Time
11:00 am PT
Attendees
Goals
- Share knowledge of the EFD troubleshooting procedures and operation
Discussion items
Time | Item | Who | Notes |
---|---|---|---|
30 min | 1. Review past week EFD events | All | See #com-efd-status
Saturday, September 9th - Heartbeats missing for EAS CSC at USDF.
- Data corruption errors in Kafka at USDF affected the lsst.sal.HVAC.manejadoraSblancaP04 topic, causing the EAS connector to fail.
Sunday, September 10th - Heartbeats missing for all CSCs at USDF
- Data corruption errors in Kafka at USDF affected the mirrormaker2-cluster-offsets topic, an MM2 system topic, which stopped replication.
Troubleshooting happened on Monday morning; see details in DM-40723.
Monday, September 11th - Control system reset on the Summit.
Thursday, September 14th - Replica pool filled up on the Summit.
- We wiped out the schemas in an attempt to restart Kafka manually. We need a backup for the schema registry topic.
- We recovered the Summit by temporarily applying the consumer.auto.offset.reset=latest configuration.
- This had a ripple effect at USDF, which is not entirely recovered yet.
|
5 min | Ceph performance degradation on the Base cluster | Angelo | - Ceph on the Base cluster shows ~10x performance degradation on sequential writes (e.g. EFD database restore).
TODO: report details for Cristian on a ticket.
- Asked IT Chile for a server with locally attached SSDs for performance comparison.
|
5 min | 2. InfluxDB Enterprise, POC timeline? | Angelo, all | - Meeting with InfluxData on Thursday
- POC is looking toward November
- We have to make sure we have a server with locally attached SSDs ready for the POC
- Apparently we have a single-node Enterprise trial license that we can already use. (Angelo to confirm)
- Next meeting October 5th.
|
10 min | 3. Inconsistencies in the EFD data between Summit and EFD | Angelo | - We are running Strimzi 0.37.0 and Kafka 3.5.1 now
- Enabled MM2 auto restart feature
- Reviewed connector configuration (DM-40664)
Next, deploy the Telegraf connector on the Base writing to InfluxDB v1 and compare results. |
10 min | AOB |
|
|
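Background note on the consumer.auto.offset.reset=latest workaround from the Thursday, September 14th event: the sketch below illustrates what that Kafka consumer setting does. It is a simplified model of the documented consumer behavior, not the connector code; the function name `resolve_start_offset` and its arguments are hypothetical. The reset policy is only consulted when a consumer has no committed offset, or when its committed offset falls outside the range still available in the partition.

```python
from typing import Optional


def resolve_start_offset(
    committed: Optional[int],
    log_start: int,
    log_end: int,
    policy: str = "latest",
) -> int:
    """Return the offset a consumer would start reading from.

    Hypothetical helper that mimics the Kafka consumer `auto.offset.reset`
    rule. `log_start` is the earliest offset still in the partition and
    `log_end` is the offset the next produced record will receive.
    """
    # A valid committed offset always wins; the policy is not consulted.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    if policy == "earliest":
        return log_start  # replay everything still retained
    if policy == "latest":
        return log_end  # skip the backlog, read only new records
    # Kafka's third policy, "none", raises an error instead of resetting.
    raise ValueError(f"no valid offset and policy is {policy!r}")


# Committed offset 120 is no longer in the log (available range is 500..900).
print(resolve_start_offset(120, 500, 900, "latest"))    # 900
print(resolve_start_offset(120, 500, 900, "earliest"))  # 500
# Committed offset 650 is still valid, so the policy has no effect.
print(resolve_start_offset(650, 500, 900, "latest"))    # 650
```

The trade-off of `latest` is that any backlog between the invalid position and the end of the log is not consumed.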
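Note on item 3: one possible way to quantify inconsistencies is to compare the record timestamps of the same topic in the two EFD instances over the same time window. The sketch below assumes the timestamps have already been queried from each instance into plain Python lists; the function name `compare_sites` and the sample values are hypothetical, and this is not a procedure agreed in the meeting.

```python
from typing import Dict, Iterable, List


def compare_sites(
    source: Iterable[float], replica: Iterable[float]
) -> Dict[str, List[float]]:
    """Report timestamps present in one instance but absent in the other."""
    source_set, replica_set = set(source), set(replica)
    return {
        "missing_at_replica": sorted(source_set - replica_set),
        "extra_at_replica": sorted(replica_set - source_set),
    }


# Hypothetical timestamps (seconds) for one topic over the same time window.
source_ts = [1.0, 2.0, 3.0, 4.0, 5.0]
replica_ts = [1.0, 2.0, 4.0, 5.0, 6.0]

report = compare_sites(source_ts, replica_ts)
print(report["missing_at_replica"])  # [3.0]
print(report["extra_at_replica"])    # [6.0]
```

Exact matching assumes both instances store an identical timestamp value for a replicated record; if they do not, the comparison would need a tolerance.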
Action items