Date

Attendees

Notes from the previous meeting:

Discussion items

Discussed | Item | Who | Notes

(tick)

Project news

Project news:

  • no DMLT this week
  • gave a Qserv presentation at the ADASS conference

Office spaces at B48 (ROB)

(tick)

Supporting user-generated data products in Qserv

team

Context:

  • started discussing the topic at the previous meeting

Igor Gaponenko will begin looking at how to improve qhttp to support multipart form data in the request body using Boost Beast.

Fritz Mueller will talk to Flossie about where to put the new services.

(tick)

Code reviews

team

Context:

  • the team has shrunk
  • Andy Salnikov is now 100% on MW + APDB
  • ongoing development efforts in the R-I system (where we still have a lot of work to do before the LSST data releases) and, soon, the "new" Qserv require reviewers

Options:

  • external solutions: Igor Gaponenko suggests automated (AI-based) code reviews in GHA to do the basic code analysis. Example: https://codeball.ai/
  • internal solutions: Fritz Mueller proposes a series of "code tours" to help the remaining reviewers stay up to date with the ongoing effort to improve the R-I system and Qserv.

Fritz Mueller had a negative experience with the older pre-AI tools.

Igor Gaponenko may begin experimenting with the new generation of automated tools to see whether they bring any benefits.

Discussed the following options to help reviewers understand the reviewed tickets:

  • developers should write good descriptions of the proposed changes in JIRA tickets
  • specifically for the R-I system, Igor Gaponenko will make a presentation on the design of the system
  • we may also begin making presentations at the group meeting to cover various topics
  • having design documents available to reviewers would be helpful as well

(tick)

(Possible) bug in Qserv czar when handling failed chunk queries

Context:

Fritz Mueller on what we know so far:

  • mysterious OOM worker restarts need to be investigated. We don't know what causes them.
  • it's been observed that it takes too much time for Qserv to process sub-chunks.
  • the data model of the truth tables (RefMatch) in DP02 is complex. There is a synthetic unique key that users do not know how to use.
  • it's difficult for scientists to write meaningful and efficient queries.
  • BTW, we won't have these tables in the LSST DRs, though we will still have other RefMatch tables.

Igor Gaponenko suggests that improving the czar's retry logic for failed chunk queries could help. One improvement would be to introduce delays before retries to give the restarted workers extra time to recover; otherwise, we would just be wasting the attempts. Perhaps we could switch from count-based retry logic to time-based logic: instead of limiting the number of failed attempts, limit the total amount of time permitted for recovery.
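
The time-based retry idea could be sketched roughly as follows. This is an illustration only, not the czar's actual code (which is C++): the names `retry_within_budget` and `ChunkQueryFailed`, the default delays, and the choice to double the delay after each failure are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ChunkQueryFailed(Exception):
    """Hypothetical error raised when a chunk query fails on a worker."""


def retry_within_budget(
    attempt: Callable[[], T],
    budget_s: float,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `attempt` until it succeeds or the recovery time budget runs out."""
    deadline = clock() + budget_s
    delay = initial_delay_s
    while True:
        try:
            return attempt()
        except ChunkQueryFailed:
            remaining = deadline - clock()
            if remaining <= 0:
                # Budget exhausted: give up and surface the last failure.
                raise
            # Wait before retrying so a restarted worker has time to recover,
            # but never sleep past the deadline.
            sleep(min(delay, remaining))
            # Assumed policy: double the delay, capped at max_delay_s.
            delay = min(delay * 2, max_delay_s)
```

With a budget rather than a fixed attempt count, a worker that needs a minute to restart costs a few spaced-out attempts instead of exhausting all retries within the first seconds. The `clock` and `sleep` parameters are injectable only so the behavior can be checked without real waiting.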

Fritz Mueller provided extra information on the shared scan optimization, which is not working at IDF because memory locking does not work there:

  • had a conversation with John Gates
  • John Gates thinks that if memory locking is not working, we should turn it off, since it may have other bad effects
  • we need to identify the config option that disables the locking and try it at IDF

Igor Gaponenko: the query that is giving us trouble is known. It is the same query that was tested at SLAC (Qserv at USDF).

Igor Gaponenko: to improve the stability of Qserv, we may also experiment with increasing the replication level to 2. This requires increasing the storage capacity of Qserv workers from 10 TB to at least 16 TB.

Fritz Mueller has proposed the following plan on how to proceed with the investigation to address both the short-term goal (of making IDF users happy) and the long-term one (improving Qserv):

  • Step 1: Deploy the table index for subChunkId on the truth table (the director table)
  • Step 2: Experiment with "Igor's crazy" queries at IDF with the memory lock switched on and off to see the effect of the switch on the performance, memory, and stability of Qserv workers.
  • Step 3: If turning memory locking off does not help, bump the replication level to 2 to see if that helps.
  • Step 4: Finally, begin looking at the Qserv code to solve the problem in the long run.

(minus)

(postponed till the next meeting)

qserv-operator and qserv-ingest

One topic from the previous meeting:

  • the FrDF team is looking at developing a better Parquet-to-CSV translator (to be written in C++), with a possible option of integrating it into the partitioning tools
  • any news here?

Action items

  •