Date

Attendees

Notes from the previous meeting

Discussion items

Discussed | Item | Notes
(tick) Project news

Fritz Mueller:

  • reported on his trip to Chile
  • The SQuaRE team is expanding. A few more people are being hired. Some may be based at SLAC.
  • DM All Hands and System Performance meeting on February 6-8
  • Fritz Mueller is among the local organizers

Travels/vacations:

(tick) USDF

Igor Gaponenko reported news from the last Data Facilities meeting (Joint Data Facilities Meeting - 2023-12-18):

  • FY24 hardware purchasing includes $1M for 50 Qserv nodes (on top of existing 15 + 30 - 2)
  • this is meant to be sufficient to serve DR1 (and DR2?)
  • Fritz Mueller and Igor Gaponenko are expected to discuss specs with Richard in January
  • SLAC IT team is still working on bringing new hardware up, which includes Cassandra nodes and 28 Qserv nodes
    • The previous ETA for Cassandra nodes to be available in January still looks valid

Planned power outage at SLAC on January 3-8, 2024:

  • This may result in the interruption of services at USDF.

A discussion on the potential effect of having a cluster with non-homogeneous nodes (different storage and processing capacities):

  • this will eventually happen as Qserv nodes will be incrementally upgraded/replaced with newer hardware
  • Fritz Mueller: the present implementation of Qserv assumes a homogeneous hardware setup
  • improvements are needed in the following areas to address this challenge (see the sketch after this list):
    • load balancing of query processing by Qserv
    • in the Replication system, to take the differing node capacities into account
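
For illustration, a minimal sketch of capacity-weighted chunk placement on non-homogeneous workers. This is not Qserv or Replication-system code; the worker names, the capacity metric, and the greedy policy are assumptions:

from dataclasses import dataclass, field


@dataclass
class Worker:
    """A Qserv-like worker node with a capacity metric (e.g. usable storage)."""
    name: str
    capacity_tb: float
    chunks: list[int] = field(default_factory=list)

    def load(self) -> float:
        # Number of assigned chunks normalized by capacity, so larger nodes
        # are allowed to hold proportionally more chunks.
        return len(self.chunks) / self.capacity_tb


def place_chunks(chunk_ids: list[int], workers: list[Worker]) -> None:
    """Greedy placement: each chunk goes to the worker that is currently the
    least loaded relative to its capacity."""
    for chunk in chunk_ids:
        min(workers, key=Worker.load).chunks.append(chunk)


if __name__ == "__main__":
    cluster = [Worker("old-1", 10.0), Worker("old-2", 10.0), Worker("new-1", 30.0)]
    place_chunks(list(range(100)), cluster)
    for w in cluster:
        print(f"{w.name}: {len(w.chunks)} chunks for {w.capacity_tb} TB")
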
(tick) Current status of Qserv and Qserv builds

The most recent release:

  • 2023.11.1-rc3:
    • was deployed at -prod (IDF) ~2 weeks ago (an ad hoc method was used to bypass missing operator support)
    • Qserv looks stable (no worker crashes, pod restarts, etc.)
    • no performance issues

Known problems:

  • Still no qserv-operator support for the latest release

A discussion on global alerts on queries that take too long (see the sketch after this list):

  • a few ideas on how to improve the "watcher"
  • set a short limit for "mobu" queries (typically less than 5 minutes)
  • follow query progress
  • take into account the number of queries that are being processed in parallel
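
A minimal sketch of such a watcher check, assuming the list of running queries and their sources can be obtained from Qserv monitoring; the data source, thresholds, and parallelism baseline are hypothetical:

"""Hypothetical long-running-query watcher. The data source, thresholds, and
alert mechanism are assumptions for illustration; they are not Qserv APIs."""

import time
from dataclasses import dataclass

# Hypothetical per-source limits: "mobu" synthetic queries should finish quickly,
# so they get a much shorter threshold than ad hoc user queries.
LIMITS_SEC = {"mobu": 5 * 60, "user": 2 * 60 * 60}


@dataclass
class RunningQuery:
    query_id: int
    source: str          # e.g. "mobu" or "user"
    started_at: float    # UNIX timestamp


def overdue_queries(running: list[RunningQuery], now: float | None = None) -> list[RunningQuery]:
    """Return queries that have exceeded their per-source time limit.
    The limit is stretched when many queries run in parallel, since each
    query then gets a smaller share of the cluster."""
    now = time.time() if now is None else now
    stretch = max(1.0, len(running) / 16)   # assumed baseline of 16 parallel queries
    flagged = []
    for q in running:
        limit = LIMITS_SEC.get(q.source, LIMITS_SEC["user"]) * stretch
        if now - q.started_at > limit:
            flagged.append(q)
    return flagged


if __name__ == "__main__":
    now = time.time()
    demo = [RunningQuery(1, "mobu", now - 600), RunningQuery(2, "user", now - 300)]
    for q in overdue_queries(demo, now):
        print(f"ALERT: query {q.query_id} ({q.source}) has run too long")
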
(tick) Query analysis & processing in Qserv

Igor Gaponenko tested the effect of locking files in memory at USDF using queries against DP02:

  • tested a variety of query types, including
    • unconstrained (no LIMIT) large result SELECT *  with automatic query cancelation on the 5 GB  result set limit
    • unconstrained large result SELECT * LIMIT 100000 with automatic query cancelation after getting enough rows
    • true scan queries w/o subchunks involving columns for which we don't define table indexes
      • large result
      • small result
    • low CPU usage near-neighbor queries
  • running 64 queries to be processed in parallel (see the test-driver sketch after this list)
  • switching back and forth between the following memory management modes
    • MemManReal 
    • MemManNoneRelaxed (a bug-fixed version of the original MemManNone)
  • PRELIMINARY CONCLUSION:
    • Locking files in memory does NOT affect the performance of any query
  • Notes:
    • It's not clear what the effect of memory locking would be with high-latency HDD-based or network-based (IDF) filesystems
    • Multi-way JOIN queries that involve indexes would be another interesting use case
    • locking files in memory doesn't work at IDF (Kubernetes)
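
A minimal sketch of the kind of concurrent test driver described above, assuming direct MySQL-protocol access to the Qserv czar; the connection parameters and the query mix are placeholders, not the actual DP02 test setup:

"""Sketch of a concurrent query driver for timing Qserv under load.
Connection parameters and the query list are placeholders; this is not the
harness actually used for the DP02 tests."""

import time
from concurrent.futures import ThreadPoolExecutor

import mysql.connector  # pip install mysql-connector-python

QSERV = dict(host="qserv-czar.example.org", port=4040,   # placeholder endpoint
             user="qsmaster", database="dp02_dc2_catalogs")

QUERIES = [
    # Placeholder examples of the query classes mentioned above.
    "SELECT * FROM Object LIMIT 100000",
    "SELECT objectId, coord_ra, coord_dec FROM Object WHERE r_psfFlux > 1e-29",
]


def run_one(sql: str) -> float:
    """Run a single query on its own connection and return the wall-clock time."""
    conn = mysql.connector.connect(**QSERV)
    try:
        start = time.monotonic()
        cur = conn.cursor()
        cur.execute(sql)
        cur.fetchall()               # drain the full result set
        return time.monotonic() - start
    finally:
        conn.close()


def run_in_parallel(parallelism: int = 64) -> None:
    """Keep `parallelism` queries in flight, cycling through the query mix."""
    jobs = [QUERIES[i % len(QUERIES)] for i in range(parallelism)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for sql, seconds in zip(jobs, pool.map(run_one, jobs)):
            print(f"{seconds:8.1f} s  {sql[:60]}")


if __name__ == "__main__":
    run_in_parallel()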

Fritz Mueller:

  • the lack of a performance difference might be caused by files already being cached in memory from previous runs
  • suggested using kernel performance tools to study what's in the file system cache (see the sketch below)
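
A sketch of one way to inspect page-cache residency of Qserv data files on Linux, using mincore(2) via ctypes; tools such as vmtouch or eBPF-based cache statistics would serve the same purpose:

"""Rough check of how much of a file is resident in the Linux page cache,
using mincore(2) via ctypes. This is only a sketch of the kind of inspection
suggested above."""

import ctypes
import ctypes.util
import mmap
import os
import sys

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def cached_fraction(path: str) -> float:
    """Return the fraction of the file's pages currently in the page cache."""
    size = os.path.getsize(path)
    if size == 0:
        return 0.0
    with open(path, "rb") as f:
        # A private, copy-on-write mapping keeps the buffer writable for ctypes
        # while unmodified pages remain shared with the page cache, so
        # mincore() still reflects cache residency. Mapping alone does not
        # fault pages in, so the check itself does not perturb the cache.
        mm = mmap.mmap(f.fileno(), size, flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * npages)()
    buf = (ctypes.c_char * size).from_buffer(mm)
    rc = libc.mincore(ctypes.c_void_p(ctypes.addressof(buf)),
                      ctypes.c_size_t(size), vec)
    resident = sum(b & 1 for b in vec) if rc == 0 else 0
    del buf          # release the exported buffer before closing the mapping
    mm.close()
    if rc != 0:
        raise OSError(ctypes.get_errno(), "mincore() failed")
    return resident / npages


if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(f"{name}: {cached_fraction(name):.1%} of pages in the page cache")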

Igor Gaponenko:

  • we might order a few Qserv nodes (of the 50-node batch) to be equipped with HDDs to experiment with the memory-locking options
(tick) New Qserv

Igor Gaponenko:

  • Finished migrating the control plane and monitoring protocol of Qserv workers and Czar to HTTP/REST (based on qhttp)
  • All seems to work fine under stress tests
  • Next steps:
    • Code cleanup and refactoring in the result processing code (in progress, to be finished mid-January 2024)
    • Develop the high-performance result merging mechanism at Czar (to begin working on this in January 2024)
    • HTTP/REST-based Qserv front-end (Czar) (see the hypothetical client sketch after this list)
      • Initially, it will be C++-based (qhttp)
      • The service could be easily set up at USDF
      • Support in the Qserv operator will be needed to add a new pod
      • Colin Slater: this needs to be the high-priority project for January 2024
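
A rough sketch of what a client of such a REST front-end might look like; the base URL, endpoint paths, and JSON fields are invented for illustration and do not describe an existing Qserv API:

"""Hypothetical client for an HTTP/REST Qserv front-end. The base URL,
endpoint paths, and JSON fields are invented for illustration only; they do
not describe an existing Qserv interface."""

import time

import requests  # pip install requests

BASE_URL = "http://qserv-czar.example.org:8080"   # placeholder endpoint


def submit_query(sql: str) -> int:
    """Submit a query and return a server-assigned query ID (hypothetical API)."""
    resp = requests.post(f"{BASE_URL}/query", json={"query": sql}, timeout=30)
    resp.raise_for_status()
    return resp.json()["query_id"]


def wait_for_result(query_id: int, poll_sec: float = 5.0) -> dict:
    """Poll a hypothetical status endpoint until the query finishes."""
    while True:
        resp = requests.get(f"{BASE_URL}/query/{query_id}", timeout=30)
        resp.raise_for_status()
        status = resp.json()
        if status["state"] in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_sec)


if __name__ == "__main__":
    qid = submit_query("SELECT COUNT(*) FROM dp02_dc2_catalogs.Object")
    print(wait_for_result(qid))
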
(tick) UKDF colleagues are still having trouble ingesting DP02 into their cluster using qserv-ingest

The problem was reported at the Database Meeting 2023-12-06.

Investigation:

  • An accidental Qserv restart during ingest was the root cause of the problem. It triggered 5 Qserv workers to go into a Kubernetes "crash loop". This, in turn, might have been caused by excessive memory use by the Qserv Ingest System in an attempt to resume ingest contribution requests posted before the restart.
  • The cluster was repaired by Igor Gaponenko (who was given access to the Kubernetes cluster at UKDF).
  • A workaround of ingesting the catalog in 3 steps (one-third of the catalog per step) was found.
  • The ingest is still in progress (due to the low network bandwidth to the input contributions and the large number of contributions).

Action items

  •