(back to the list of all Panda meeting minutes)

Time

8 am PT

Attendees

Mikolaj Kowalik, Richard Dubois, Wei Yang, Michelle Gower, Fabio Hernandez, Edward Karavakis, Peter Love, Jhonatan Amado, Wen Guan, James Chiang, Brian Yanny, Jen Adelman-McCarthy, Zhaoyu Yang

Regrets


Agenda:

  1. Keeping track of CM and Panda interaction: feature requests and issues:  https://confluence.lsstcorp.org/x/f9lGDQ
    1. We will open a ticket to summarize the current iDDS database issue
  2. CM news
  3. USDF Panda
    1. prod iDDS db pods monitoring:  https://grafana.slac.stanford.edu/d/z7FCA4Nnk/cloud-native-postgresql?orgId=1&refresh=30s&var-DataSource=Prometheus&var-vcluster=vcluster--usdf-panda&var-cluster=usdf-panda-idds&var-instances=All&var-namespace=All&var-resolution=5m&from=now-30d&to=now 
    2. prod iDDS database issue
      1. issue with the iDDS db backup
        1. caused by a mix-up of the S3 bucket secrets
        2. the -dev instance pods are set up without backup and use direct sync (synchronous replication?)
      2. Issue of the two iDDS db pods going out of sync and losing data
        1. A backup failure prevented WAL segments from being pushed to the S3 bucket.
        2. Because the two iDDS db pods sync via backups, the two pods went out of sync.
        3. When the standby pod was promoted to primary, we lost the data from runs 387 to 484 (some of it was still available in WALs found in the previous primary pod's volume snapshot).
    3. DB restore
      • after consulting with the CM team, we will not try to restore runs 387 to 484 from those WALs
    4. Improvements
      1. change replication from asynchronous to synchronous (Dan Speck suggests we hold off on this pending further discussion)
        • From Dan: synchronous replication requires that commits be written to both the primary and the standby replica, but it also means that if the standby replica is down, the primary is down as well. It will decrease performance.
      2. change max db connections from 100 to 1000 (to be confirmed: monitoring still shows a maximum of 100 connections)
      3. change to transaction mode (releases the server connection after each transaction; less efficient, but should prevent connections from piling up and causing connection errors)
      4. The Panda team can add a few rows to the iDDS db and manually trigger a failover between the two DB pods. Instructions:

        Install the cnpg kubectl plugin. Determine the standby node with the commands below, one per cluster:

        kubectl cnpg status usdf-panda-server -n panda-db
        kubectl cnpg status usdf-panda-idds -n panda-db

        Promote the standby node, replacing the node name below with the actual standby node name. The old primary node takes ~30 seconds to a minute to register as a standby.
        kubectl cnpg promote usdf-panda-server usdf-panda-server-1 -n panda-db
        kubectl cnpg promote usdf-panda-idds usdf-panda-idds-3 -n panda-db
    5. Is there anything else we need before we open USDF Panda for business again?
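
       Since the incident above traced back to WAL pushes silently failing, archiver health can be checked directly on the primary. A sketch, assuming superuser access inside the pod (the pod name is a placeholder; the real primary is reported by `kubectl cnpg status`):

       ```shell
       # Check WAL-archiving health on the current primary.
       # "usdf-panda-idds-1" is a placeholder; use the primary pod reported by
       #   kubectl cnpg status usdf-panda-idds -n panda-db
       kubectl exec -n panda-db usdf-panda-idds-1 -- psql -U postgres -c \
         "SELECT archived_count, failed_count, last_failed_wal, last_failed_time
            FROM pg_stat_archiver;"
       # A growing failed_count means WALs are not reaching the S3 bucket,
       # i.e. the condition that let the two pods drift apart.
       ```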
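
       For reference only (per Dan's caveat in improvement 1, not to be applied before further discussion): in CloudNativePG, synchronous replication is governed by the cluster's minSyncReplicas/maxSyncReplicas settings. A sketch of what the change would look like, assuming the cluster resource is named usdf-panda-idds:

       ```shell
       # Require one synchronous standby (do NOT apply yet; if the standby
       # is down, the primary blocks on commits, and performance decreases).
       kubectl patch cluster usdf-panda-idds -n panda-db --type merge \
         -p '{"spec":{"minSyncReplicas":1,"maxSyncReplicas":1}}'
       ```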
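
       The open question in improvement 2 (monitoring still shows 100 connections) can be settled by reading the live setting from the server; the pod name below is a placeholder:

       ```shell
       # Verify the live max_connections value on the iDDS primary.
       # Pod name is a placeholder; get the real one from kubectl cnpg status.
       kubectl exec -n panda-db usdf-panda-idds-1 -- \
         psql -U postgres -tAc "SHOW max_connections;"
       # "1000" means the change took effect; "100" means it did not
       # (e.g. the parameter change never reached the running pods).
       ```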
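
       The failover test in improvement 4 can be sketched end to end as below. The scratch table and pod names are hypothetical, assuming superuser access inside the pods:

       ```shell
       # 1. Write a marker row on the current primary
       #    (failover_check is a hypothetical scratch table).
       kubectl exec -n panda-db usdf-panda-idds-1 -- psql -U postgres -c \
         "CREATE TABLE IF NOT EXISTS failover_check (note text, ts timestamptz DEFAULT now());
          INSERT INTO failover_check (note) VALUES ('pre-failover marker');"

       # 2. Promote the standby (replace usdf-panda-idds-3 with the actual
       #    standby name reported by kubectl cnpg status).
       kubectl cnpg promote usdf-panda-idds usdf-panda-idds-3 -n panda-db

       # 3. Once the old primary has re-registered as standby (~30-60 s),
       #    confirm the marker row is present on the new primary.
       kubectl exec -n panda-db usdf-panda-idds-3 -- psql -U postgres -c \
         "SELECT note, ts FROM failover_check;"
       ```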

Notes:

  1. The ticket branch of bps_submission_panda (for remote generation of QBB) is deployed in CVMFS; it is not tied to a particular LSST release.
  2. The iDDS db related issues are understood and addressed. We will conduct a failover test on the Panda DB after the meeting (done as of Oct. 19).
  3. We will stress test all 3 DFs during the rest of this week (in progress).