Location
SQuaRE Zoom: https://ls.st/wyp
Time
10:00 am PT
Attendees
Goals
- Share knowledge of the EFD troubleshooting procedures and operation
Health checks
Check | Summit EFD | USDF EFD |
---|---|---|
Kafka disk usage | 80G | 68G |
InfluxDB disk usage | 550G (11%) | 15TB (50%) |
Status of the connectors | Running | Running |
Last InfluxDB pod restart | Power-up of the computer room at the Summit | k8s upgrade. State: Running |
Last Kafka pods restart | Power-up of the computer room at the Summit. State: Running | k8s upgrade. State: Running |
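
As a rough aid for reading the InfluxDB disk usage row, the sketch below derives the total volume size implied by each "used (percent)" reading. It assumes the percentage is used space divided by total space on the same volume. The helper name `implied_capacity` is illustrative, and the units are kept exactly as written in the table.

```python
def implied_capacity(used, percent_used):
    """Total size implied by a 'used (percent)' reading, in the same unit as `used`."""
    return used * 100 / percent_used


# Readings copied from the health check table: (used, percent used, unit).
readings = {
    "Summit EFD InfluxDB": (550, 11, "G"),
    "USDF EFD InfluxDB": (15, 50, "TB"),
}

for name, (used, pct, unit) in readings.items():
    total = implied_capacity(used, pct)
    free = total - used
    print(f"{name}: ~{total:.0f}{unit} total, ~{free:.0f}{unit} free")
```

Because the percentages in the table are rounded, the derived totals (about 5000G at the Summit and about 30TB at USDF) are approximate.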
Monitoring dashboards
Discussion items
Time | Item | Who | Notes |
---|---|---|---|
10 min | Review the past week's EFD events | All | See #com-efd-status. Data corruption errors at Kafka. USDF K8s upgrade caused an interruption in data replication; to prevent that, we need to run more Mirror Maker replicas and configure Pod Disruption Budgets to guarantee minimum availability when nodes are drained (working on a PR for that). Fiber cut today between Summit and La Serena; operating on the backup wireless link. There was no interruption in the data replication to USDF. |
30 min | EFD presentation at JTM | All | See plans for USDF EFD, Summit EFD, and other activities |
10 min | Should we update EFD system requirements in LSE-30? | All | SQR-085 shows our best estimate for EFD storage requirements. We could use that to update the numbers in LSE-30. Frossie suggested showing LSE-30 requirements in the data rates plot. TODO: still need to update SQR-085 to show a consolidated table with throughput for telemetry, events, and command topics, per Michael's suggestion. |
AOB |
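
The Pod Disruption Budget item in the notes above can be illustrated with a minimal sketch. It builds a Kubernetes `policy/v1` PodDisruptionBudget manifest as a Python dictionary and shows why extra replicas are needed alongside it. The resource name, namespace, and label selector are hypothetical placeholders, not the values used in the actual deployment or in the PR mentioned in the notes.

```python
import json


def pdb_manifest(name, namespace, match_labels, min_available):
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }


def allowed_disruptions(healthy_pods, min_available):
    """Number of pods that may be evicted voluntarily without violating the budget."""
    return max(0, healthy_pods - min_available)


manifest = pdb_manifest(
    name="mirrormaker-pdb",               # hypothetical name
    namespace="efd",                      # hypothetical namespace
    match_labels={"app": "mirrormaker"},  # hypothetical label selector
    min_available=1,
)
print(json.dumps(manifest, indent=2))

# Effect of the replica count on what a node drain is allowed to evict.
for replicas in (1, 2, 3):
    print(f"replicas={replicas}: allowed disruptions = {allowed_disruptions(replicas, 1)}")
```

With a single replica and `minAvailable: 1`, the budget allows no voluntary evictions, so a node drain would be blocked rather than interrupting replication. With two or more replicas, one pod can be evicted while another keeps running, which is why the budget and the additional replicas go together. The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML.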