Database Meeting 2022-11-02

Date

02 Nov 2022

Attendees

Igor Gaponenko, Andy Salnikov, Fritz Mueller, Andy Hanushevsky

Notes from the previous meeting:

Database Meeting 2022-10-26

Discussion items

Discussed	Item	Who	Notes
	Project news	Fritz Mueller	Project news: no DMLT this week. Gave a Qserv presentation at the ADASS conference. Office spaces at B48 (ROB): Fritz Mueller and John Gates visited both offices. We will continue discussing the topic next week, when John Gates will be able to join the conversation.
	Supporting user-generated data products in Qserv	team	Context: we started discussing the topic at the previous meeting. Igor Gaponenko will begin looking at how to improve `qhttp` to support multi-part form data in the request's body using Boost Beast. Fritz Mueller will talk to Flossie about where to put the new services.
	Code reviews	team	Context: the team has shrunk. Andy Salnikov is now 100% on MW + APDB. Ongoing development efforts in the R-I system (where we still have a lot of work to do before the LSST data releases) and "new" Qserv (soon) require reviewers. Options: External solutions: Igor Gaponenko suggests automated (AI-based) code reviews in GHA to do the basic code analysis (example: https://codeball.ai/). Internal solutions: Fritz Mueller proposes a series of "code tours" to help the remaining reviewers stay up-to-date with the ongoing effort to improve the R-I system and Qserv. Fritz Mueller had a negative experience with the older pre-AI tools. Igor Gaponenko may begin experimenting with the new generation of automated tools to see if there are any benefits. We discussed the following options to help reviewers understand the tickets under review: (1) developers should write good descriptions of the proposed changes in JIRA tickets; (2) specifically for the R-I system, Igor Gaponenko will make a presentation on the design of the system; (3) we may also begin making presentations at the group meeting to cover various topics; (4) having design documents available to reviewers would be helpful as well.
	(Possible) bug in Qserv `czar` when handling failed chunk queries.		Context: it was discussed at the previous meeting. John Gates began looking at the problem using the Qserv cluster at USDF (`slac6`). Issues with the `sudo` privileges of John's account were seen. We also got an official bug report for Qserv that may be relevant in this context: https://community.lsst.org/t/truth-match-and-forcedsourceondiaobject-tables-are-available/7088
Fritz Mueller on what we know so far: mysterious OOM worker restarts need to be investigated; we don't know what causes them. It has been observed that Qserv takes too much time to process sub-chunks. Igor Gaponenko: do we have a problem with materializing those? Fritz Mueller: missing indexes for `subChunkId` on the truth table? The data model of the truth tables (RefMatch) in DP02 is complex. There is a synthetic unique key that users do not know how to use. It's difficult for scientists to write meaningful and efficient queries. BTW, we won't have these tables in the LSST DRs, though we will still have other RefMatch tables.
Igor Gaponenko suggests that improving `czar`'s retry logic for failed chunk queries could help. One of the improvements would be to introduce delays before retries in order to give the restarted workers extra time to recover. Otherwise, we would just be wasting the attempts. Perhaps we could switch from the retry count-based logic to a time-based one? Basically, instead of counting the number of failed attempts, count the amount of time permitted for recovery.
Fritz Mueller provided extra info on the shared scan optimization not working at IDF due to memory locking not working. He had a conversation with John Gates. John Gates thinks that if the memory locking is not working then we should turn it off, since there may be other bad effects. We need to identify the config option to disable the locking and try it at IDF.
Igor Gaponenko: the query that is giving us trouble is known. This is the same query that was tested at SLAC (Qserv at USDF).
Igor Gaponenko: in order to improve the stability of Qserv, we may also experiment with increasing the replication level to 2. This requires increasing the storage capacity of Qserv workers from 10 TB to at least 16 TB. Fritz Mueller will look at options here.
Fritz Mueller has proposed the following plan on how to proceed with the investigation to address both the short-term goal (of making IDF users happy) and the long-term one (improving Qserv):
Step 1: Deploy the table index for `subChunkId` on the truth table (the director table).
Step 2: Experiment with "Igor's crazy" queries at IDF with the memory lock switched on and off to see the effect of the switch on the performance, memory, and stability of Qserv workers.
Step 3: If turning the memory locking off doesn't help, bump the replication level to 2 to see if that helps.
Step 4: Finally, begin looking at the Qserv code to solve the problem in the long run.
(postponed till the next meeting)	`qserv-operator` and `qserv-ingest`	Fabrice Jammes	One topic from the previous meeting: the FrDF team is looking at developing a better `Parquet`-to-`CSV` translator (to be written in C++), with a possible option of integrating it into the partitioning tools. Any news here?
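
Illustration for the `czar` retry discussion above: the idea of replacing a retry count with a time budget, plus a delay before each retry, could look roughly like the sketch below. It is written in Python for brevity and is not Qserv code. The function name `run_with_time_budget`, its parameters, and the doubling of the delay between attempts are all assumptions made for illustration; the notes only say that a delay should be introduced and that the limit should be time-based.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when the time budget runs out before an attempt succeeds."""


def run_with_time_budget(
    attempt: Callable[[], T],
    budget_sec: float,
    initial_delay_sec: float = 1.0,
    max_delay_sec: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `attempt` until it succeeds or `budget_sec` has elapsed.

    Hypothetical helper: the limit is the total time allowed for
    recovery, not the number of failed attempts. The doubling delay
    is an assumption, not something decided at the meeting.
    """
    deadline = clock() + budget_sec
    delay = initial_delay_sec
    last_error: Optional[BaseException] = None
    while True:
        try:
            return attempt()
        except Exception as exc:  # a failed chunk query, in this analogy
            last_error = exc
        remaining = deadline - clock()
        if remaining <= 0:
            raise RetryBudgetExceeded(
                f"no success within {budget_sec} s"
            ) from last_error
        # Wait before retrying so a restarted worker has time to recover,
        # but never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_sec)
```

The `clock` and `sleep` parameters exist only so the behavior can be exercised without real waiting. A caller would normally leave them at their defaults and pass the chunk-query call as `attempt`.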

Action items
