Database Meeting 2023-11-15

Date

15 Nov 2023

Attendees

Igor Gaponenko

Notes from the previous meeting

Database Meeting 2023-11-08

Discussion items

Discussed	Item	Notes
	Project news	Fritz Mueller from DMLT: DM All Hands planning (logistics, agenda) is underway: 2024 Feb 5-8 DM All Hands Meeting. A late-morning start of the sessions is being discussed to help locals commuting to SLAC. Formal requirements for Chile travel have changed: people used to travel on tourist visas (not for work) and stay there for weeks; now visits for engineering work need to be done on work visas. Vacations/travels: Fritz → Chile in the last week of November (after Thanksgiving) for 3 weeks; Colin → Chile in the last week of November (after Thanksgiving) for 2 weeks; Fritz → Tucson in January for a few days.
	USDF	Igor Gaponenko: no news for Qserv. 2 nodes have been sacrificed to EFD. Qserv will get 28 more nodes; ETA is this December. At the last Data Facilities meeting, Fritz Mueller mentioned the new Cassandra infrastructure as a high-priority project for SLAC IT, to unblock Andy Salnikov in January.
	Current status of Qserv and Qserv builds (this section has to be present on each document in this series)	The most recent release: `2023.11.1-rc3`. It supports the file-based results delivery protocol. It still needs to be deployed (with ad-hoc fixes) on `-int` at IDF; further changes are required in `qserv-operator` for automated deployments. A problem with the XROOTD-based file transport reported last week has been investigated. As a reminder: the XROOTD server embedded in the Qserv worker couldn't immediately see result files that were created, flushed, and properly closed by the worker. The problem doesn't exist for the built-in HTTP server. Theories mentioned earlier have been ruled out. It looks like XROOTD's file client API caches stale DNS info of the worker hosts/pods after they get restarted. Still awaiting confirmation of this from Andy Hanushevsky. For now, we should keep using the HTTP-based transport. `qserv-operator` doesn't support the new command-line option of the Replication Controller, `--qserv-czar-proxy`, which is needed to monitor Czar's status on the Web Dashboard. A temporary fix was applied. This release fixes problems observed in `2023.10.1-rc1`: DM-41625, DM-41645. A problem with building/deploying the documentation in GHA CI has been solved by Fritz Mueller. Known issues: (1) `qserv-operator` is not up to date (see release status above); Fritz Mueller has a PR for that, after which a new release of the operator can be deployed at IDF and elsewhere. (2) DNS problem in the XROOTD file client class (no solution for this yet). Andy Hanushevsky: the client caches connections. If a connection gets closed because of a server restart and a possible IP address change, then the cached connection can't be reopened as it's still associated with the older IP address. A fix will be put on the work list for XROOTD. A solution will be to use `VNID` (the Virtual Network ID). No ETA yet. The next release needs to be built to include the above-mentioned fixes. In the long run: Fritz Mueller has plans for merging `qserv-operator` into `qserv` soon. This will eliminate the multi-month latency that exists between these two developments and prevents rapid deployment of new Qserv versions in the cloud-based deployments. Colin Slater is curious about a plan for updating IDF's `-prod` to the latest Qserv release. Fritz Mueller, Igor Gaponenko: the release seems to be stable, it has been tested on `-int`, and it's ready to be deployed.
	User-generated data products (postponed)	Postponed at the previous meeting until a new Qserv release is built and deployed: discuss how to add a new container with dependencies. Fritz Mueller will look at this for the Kubernetes-based deployment; Igor Gaponenko will do it at the USDF. See other notes on this subject in the relevant section of Database Meeting 2023-10-25. This has been postponed till next week, when Fritz Mueller will have time to work on the Docker images.
	"Dark" tasks at workers	Discussed last week at Database Meeting 2023-11-08. There was a proposal to try "booting" tasks to the "Snail" scheduler. This was implemented by Igor Gaponenko in DM-41657. The implementation was tested last week at USDF. Unfortunately, this makes things even worse. Observations on the tests can be found in the above-mentioned Jira ticket. It was discussed with John Gates yesterday at the informal meeting at SLAC. Apparently, more work on the task "booting" code is needed to make tasks "booted" onto "Snail" visible to the Qserv worker monitoring. Fritz Mueller noted that LSST agreed on the 12-hour limit for how long queries are allowed to live in Qserv. John Gates's suggestions for what could be done next: (1) verify that the task-to-scheduler association gets updated for tasks booted to the new scheduler; (2) cancel queries when booting to "Snail". Igor Gaponenko: this may not work for sub-chunk queries in the data extraction phase because of the whole result file (where results from other sub-chunks of the same chunk are being collected). Colin Slater was wondering if it's the right time to reconsider the old approach for optimizing query processing at workers with a switch to SSD-based storage technology. Fritz Mueller: first we get the file-based result protocol to production and get experience with that. After that, the options are: Option 1 (radical): have a test scheduler that wouldn't be the "shared scan". Option 2 (less radical): extend query classification based on the number of chunks; this needs to be done on Czar. Option 3 (conservative): fix bugs in the current code. ACTION ITEM for Igor Gaponenko: make the "dark" queries visible on the Web Dashboard monitoring pages (DM-41534).
	Query analysis & processing in Qserv	The topic was mentioned last week and postponed due to a lack of time. What we have so far: No progress on creating table indexes on the materialized sub-chunk tables (the same indexes that are defined on the "mother" table). Further experiments are still needed to see if this will bring any benefit for overall query performance. A new idea was discussed yesterday at SLAC during an informal meeting between John Gates, Fritz Mueller, and Igor Gaponenko: (1) rebuild one of the chunk tables by reordering rows by sub-chunk; (2) test a typical NN-query that involves sub-chunks against that table without materializing the sub-chunks into temporary tables. This is an ACTION ITEM for Igor Gaponenko. Colin Slater suggested trying a modified/reduced version of the same RefMatch query with fewer columns to see if that has any effect on the run time of the query. This is an ACTION ITEM for Igor Gaponenko.
	New Qserv	Igor Gaponenko is making good progress on DM-41291.

Action items
