Date

Attendees

Notes from the previous meeting

Discussion items

Discussed | Item | Notes
(tick) Project news

Fritz Mueller from DMLT:

  • DM All Hands planning (logistics, agenda) is underway: 2024 Feb 5-8 DM All Hands Meeting
  • A late-morning start for the sessions is being discussed to help locals commuting to SLAC.
  • Formal requirements for Chile travel have changed. People used to travel on tourist visas (not for work) and stay there for weeks.
    • Visits for engineering work now need to be done on work visas.

Vacations/travels:

  • Fritz → Chile in the last week of November after Thanksgiving for 3 weeks.
  • Colin → Chile in the last week of November after Thanksgiving for 2 weeks.
  • Fritz → Tucson in January for a few days
(tick) USDF

Igor Gaponenko:

  • no news for Qserv
  • 2 nodes have been sacrificed to EFD
  • Qserv will get 28 more nodes
  • ETA is this December

At the last Data Facilities meeting Fritz Mueller mentioned the new Cassandra infrastructure as a high-priority project for SLAC IT to unblock Andy Salnikov in January.

(tick)

Current status of Qserv and Qserv builds

(warning) This section has to be present on each document in this series.

The most recent release:

  • 2023.11.1-rc3 :
    • support the file-based results delivery protocol
    • still needs to be deployed (with ad-hoc fixes) on -int at IDF; further changes are required in qserv-operator for automated deployments
      • A problem with the XROOT-based file transport reported last week has been investigated.
        • As a reminder:
          • XROOTD server planted into Qserv worker couldn't immediately see result files that were created, flushed, and properly closed by the worker.
          • The problem doesn't exist for the built-in HTTP server.
        • Theories mentioned earlier have been ruled out.
        • It looks like XROOTD's file client API caches stale DNS info of the worker hosts/pods after they got restarted.
        • Still awaiting confirmation of this from Andy Hanushevsky 
      • For now, we should keep using the HTTP-based transport.
      • qserv-operator doesn't support the Replication Controller's new command-line option --qserv-czar-proxy, which is needed to monitor Czar's status on the Web Dashboard. A temporary fix was applied.
    • This release fixes problems observed in 2023.10.1-rc1:
      • DM-41625
      • DM-41645
      • A problem with building/deploying the documentation in GHA CI has been solved by Fritz Mueller

Known issues:

  • qserv-operator is not up to date (see release status above)
    • Fritz Mueller has a PR for that.
    • After that, a new release of the operator can be deployed at IDF and elsewhere
  • DNS problem in XROOTD file client class (no solution for this yet)
    • Andy Hanushevsky:
      • The client caches connections. If a connection gets closed because of a server restart and the IP address possibly changes, the cached connection cannot be reopened because it is still associated with the older IP address.
      • A fix will be put on the work list for XROOTD. The solution will be to use VNID (the Virtual Network ID).
      • No ETA yet.
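
The failure mode described above can be illustrated with a minimal Python sketch. This is not XROOTD code: `FakeDns`, `CachingClient`, the host name, and the addresses are all hypothetical, and the remedy shown (re-resolving the host name when the cached address is dead) is a generic one, not the VNID-based solution mentioned above.

```python
class FakeDns:
    """Hypothetical stand-in for DNS: maps a host name to its current IP."""

    def __init__(self):
        self.records = {}

    def resolve(self, host):
        return self.records[host]


class CachingClient:
    """Hypothetical client that caches one connection (an IP) per host."""

    def __init__(self, dns, re_resolve_on_reconnect):
        self.dns = dns
        self.re_resolve_on_reconnect = re_resolve_on_reconnect
        self.cache = {}  # host -> IP address the cached connection points at

    def connect(self, host, live_ips):
        """Return the IP used, or raise ConnectionError if it is dead."""
        if host not in self.cache:
            self.cache[host] = self.dns.resolve(host)
        ip = self.cache[host]
        if ip not in live_ips and self.re_resolve_on_reconnect:
            # Drop the stale entry and look the host up again.
            ip = self.cache[host] = self.dns.resolve(host)
        if ip not in live_ips:
            raise ConnectionError(f"{host}: cached address {ip} is stale")
        return ip


def simulate_restart(re_resolve_on_reconnect):
    """Connect, restart the worker with a new IP, then connect again."""
    dns = FakeDns()
    dns.records["worker-0"] = "10.0.0.1"
    client = CachingClient(dns, re_resolve_on_reconnect)
    client.connect("worker-0", live_ips={"10.0.0.1"})
    # The worker pod restarts and comes back with a new address.
    dns.records["worker-0"] = "10.0.0.2"
    try:
        return client.connect("worker-0", live_ips={"10.0.0.2"})
    except ConnectionError:
        return None
```

With `re_resolve_on_reconnect=False` the second connection attempt fails because the cache still points at the old address; with `True` the client recovers by looking the host up again.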

The next release needs to be built to include the above-mentioned fixes.

In the long run:

  • Fritz Mueller has plans for merging qserv-operator into qserv soon. This will eliminate the multi-month latency that exists between these two developments and prevents rapid deployment of new Qserv versions in cloud-based deployments.

Colin Slater asked about a plan for updating IDF's -prod to the latest Qserv release.

(tick)

User-generated data products

(postponed)

Postponed from the previous meeting until a new Qserv release is built and deployed:

This has been postponed till next week when Fritz Mueller will have time to work on the Docker images.

(tick)

"Dark" tasks at workers

Discussed last week at:

There was a proposal to try "booting" tasks to the "Snail" scheduler. Indeed, this was implemented by Igor Gaponenko in:

The implementation was tested last week at USDF. Unfortunately, this makes things even worse. Observations on the tests can be found in the above-linked Jira ticket. It was discussed with John Gates yesterday at the informal meeting at SLAC. Apparently, more work on the task "booting" code is needed to make tasks "booted" onto "Snail" visible by the Qserv worker monitoring.

Fritz Mueller noted that LSST agreed on the 12-hour limit for how long queries are allowed to live in Qserv.

John Gates' suggestions for what could be done next:

  • verify that the tasks-to-scheduler association gets updated in tasks booted to the new scheduler
  • cancel queries when booting to "Snail"
    • Igor Gaponenko: this may not work for sub-chunk queries in the data extraction phase as a whole result file (where results from other sub-chunks of the same chunk are being collected)
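
The first suggestion (verifying the tasks-to-scheduler association after booting) can be sketched as a consistency check. This is a minimal Python illustration, not the Qserv worker code: `Scheduler`, `Task`, `boot_task`, and `find_dark_tasks` are hypothetical names, and treating a task as "dark" when its scheduler link disagrees with the schedulers' own task sets is an assumption made for the example.

```python
class Scheduler:
    """Hypothetical scheduler that tracks the tasks it owns."""

    def __init__(self, name):
        self.name = name
        self.tasks = set()


class Task:
    """Hypothetical task that remembers which scheduler owns it."""

    def __init__(self, task_id, scheduler):
        self.task_id = task_id
        self.scheduler = scheduler
        scheduler.tasks.add(self)


def boot_task(task, target):
    """Move a task to another scheduler, updating both sides of the link."""
    task.scheduler.tasks.discard(task)
    target.tasks.add(task)
    task.scheduler = target


def find_dark_tasks(all_tasks, schedulers):
    """Return IDs of tasks whose scheduler link and membership disagree."""
    dark = []
    for task in all_tasks:
        owners = [s for s in schedulers if task in s.tasks]
        if owners != [task.scheduler]:
            dark.append(task.task_id)
    return sorted(dark)
```

A check of this kind, run after each boot, would flag a task that was moved into the "Snail" task set without its own scheduler reference being updated.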

Colin Slater was wondering if it's the right time to reconsider the old approach for optimizing query processing at workers with a switch to using the SSD-based storage technology.

  • Fritz Mueller: first we get the file-based result protocol to production and gain experience with it. After that, there are three options:
    • Option 1 (radical): to have a test scheduler that wouldn't be the "shared scan".
    • Option 2 (less radical): extend query classification based on the number of chunks. This needs to be done on Czar.
    • Option 3 (conservative): fix bugs in the current code
    • ACTION ITEM for Igor Gaponenko:
      • make the "dark" queries visible on the Web Dashboard monitoring pages
      • DM-41534
(tick)

Query analysis & processing in Qserv

The topic was mentioned last week. It was postponed due to a lack of time. What we have so far is this:

  • No progress on creating table indexes on the materialized sub-chunk tables (the same ones that are defined on the "mother" table).
  • Further experiments are still needed to see if this will bring any benefits from the overall query performance standpoint.

A new idea of what might be interesting to see was discussed yesterday at SLAC during an informal meeting between John Gates, Fritz Mueller, and Igor Gaponenko:

  • rebuild one of the chunk tables by reordering rows by sub-chunks
  • test a typical NN-query that involves sub-chunks against the table without materializing the sub-chunks into temporary tables
  • This is an ACTION ITEM for Igor Gaponenko 
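
The idea behind the reordering can be illustrated with a small Python sketch. It is only an analogy over in-memory rows, not a test against MySQL: the `subChunkId` key and both function names are assumptions made for the example. Once rows are ordered by sub-chunk, each sub-chunk occupies one contiguous range that can be located by binary search instead of being copied into a temporary table.

```python
from bisect import bisect_left, bisect_right


def reorder_by_subchunk(rows):
    """Return rows sorted by sub-chunk ID (stable within a sub-chunk)."""
    return sorted(rows, key=lambda row: row["subChunkId"])


def subchunk_slice(sorted_rows, sub_chunk_id):
    """Return the contiguous run of rows belonging to one sub-chunk."""
    keys = [row["subChunkId"] for row in sorted_rows]
    lo = bisect_left(keys, sub_chunk_id)
    hi = bisect_right(keys, sub_chunk_id)
    return sorted_rows[lo:hi]
```

Whether the same locality benefit appears in the real chunk tables is exactly what the proposed experiment is meant to find out.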

Colin Slater suggested trying a modified/reduced version of the same RefMatch query with fewer columns to see if that will have any effect on the run-time of the query

(tick)

New Qserv

Igor Gaponenko is making good progress on:

Action items
