Notes from the last DF meeting are available at:
A few important takeaways from the meeting:
- There was a power upgrade affecting USDF over the Winter break
- Upgrades of the underlying Kubernetes and S3 infrastructure (the hypervisor upgrade) affected the Butler database. The PostgreSQL schema migration attempted by Andy Salnikov before the upgrade was wiped out (2 days' worth of work). Andy may have more to say on this.
- Fritz Mueller: the "cascade" effect of changes in upgraded services on downstream dependencies was not accounted for.
- Andy Salnikov:
- PostgreSQL created 6 TB of write-ahead logs (WAL) when upgrading the schemas of the 0.5 TB database. This needs to be understood. Specifically, this might have happened during the vacuum stage.
- Write-ahead logs are needed for backups in case a restore is needed. The problem is that the logs are huge.
- The first attempt to back up did not work.
- The second full backup worked
- Igor Gaponenko: How does the performance of the operation at USDF compare with that at IDF (both are based on network storage)?
- IDF was somewhat slower to upgrade on the second attempt after expanding memory on the machines
- Some info on WAL size growth: DM-42411
- No progress on the Cassandra nodes.
- (As explained by Yemi) The deployment got stalled since the underlying network infrastructure was not ready.
- A reminder to discuss the acquisition of the next batch of 50 Qserv nodes was made at the meeting.
- ACTION ITEM for Fritz Mueller and Igor Gaponenko:
- Meet at SLAC to discuss this topic
- Fritz Mueller: based on the current experience with slower-than-estimated progress of deployments at USDF, it is imperative to accelerate the purchase order. Realistically, there is typically a 6-month delay in deployment after the hardware arrives at SLAC.
- Colin Slater: there is an impression that the overall planning of the work on the USDF infrastructure may lack clarity
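
For the WAL growth item above, one possible way to quantify how much WAL a migration step generates is to record the server's WAL position before and after the step (for example with `SELECT pg_current_wal_lsn()`) and convert the difference to bytes. The sketch below is only an illustration under that assumption, not the procedure used at USDF; the helper names `lsn_to_bytes`, `wal_bytes_between`, and `wal_amplification` are hypothetical. It relies on the fact that an LSN is printed as two hexadecimal numbers, `high/low`, representing a 64-bit byte offset.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string such as '16/B374D848' to a byte offset.

    An LSN is shown as two hex numbers separated by a slash: the high
    32 bits and the low 32 bits of a 64-bit position in the WAL stream.
    """
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) + int(low, 16)


def wal_bytes_between(start_lsn: str, end_lsn: str) -> int:
    """Bytes of WAL written between two recorded LSNs (hypothetical helper)."""
    return lsn_to_bytes(end_lsn) - lsn_to_bytes(start_lsn)


def wal_amplification(wal_bytes: int, database_bytes: int) -> float:
    """Ratio of WAL volume to database size (hypothetical helper)."""
    return wal_bytes / database_bytes


TB = 1024 ** 4

# Figures quoted in the notes: 6 TB of WAL for a 0.5 TB database.
ratio = wal_amplification(6 * TB, TB // 2)
print(f"WAL amplification: {ratio:.1f}x")  # WAL amplification: 12.0x
```

Recording the LSN around each stage separately (schema change, data rewrite, vacuum) would show which stage contributes most of the volume. The same difference can also be computed on the server with `pg_wal_lsn_diff()`.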