Date

Attendees

Notes from the previous meeting

Discussion items

Discussed | Item | Notes
(tick) Project news

Fritz Mueller:

  • reported on his trip to Chile
  • The SQuaRE team is expanding. A few more people are being hired. Some may be based at SLAC.
  • DM All Hands and System Performance meeting on February 6-8
  • Fritz Mueller is among the local organizers

Travels/vacations:

(tick) USDF

Igor Gaponenko reported news from the last Data Facilities meeting (Joint Data Facilities Meeting - 2023-12-18):

  • FY24 hardware purchasing includes $1M for 50 Qserv nodes (on top of existing 15 + 30 - 2)
  • this is meant to be sufficient to serve DR1 (and DR2?)
  • Fritz Mueller and Igor Gaponenko are expected to discuss specs with Richard in January
  • SLAC IT team is still working on bringing new hardware up, which includes Cassandra nodes and 28 Qserv nodes
    • The previous ETA for Cassandra nodes to be available in January still looks valid

Planned power outage at SLAC on January 3-8, 2024:

  • This may result in the interruption of services at USDF.

A discussion on the potential effect of having a cluster with non-homogeneous nodes (different storage and processing capacities):

  • this will eventually happen as Qserv nodes will be incrementally upgraded/replaced with newer hardware
  • Fritz Mueller: the present implementation of Qserv assumes a homogeneous hardware setup
  • improvements are needed in the following areas to address this challenge (see the sketch after this list):
    • load balancing of query processing by Qserv
    • in the Replication system, to take the differing node capacities into account
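
For illustration, a minimal sketch of capacity-weighted chunk placement on non-homogeneous workers. This is not Qserv or Replication-system code; the worker names, the capacity metric, and the greedy policy are assumptions:

from dataclasses import dataclass, field


@dataclass
class Worker:
    """A Qserv-like worker node with a capacity metric (e.g. usable storage)."""
    name: str
    capacity_tb: float
    chunks: list[int] = field(default_factory=list)

    def load(self) -> float:
        # Number of assigned chunks normalized by capacity, so larger nodes
        # are allowed to hold proportionally more chunks.
        return len(self.chunks) / self.capacity_tb


def place_chunks(chunk_ids: list[int], workers: list[Worker]) -> None:
    """Greedy placement: each chunk goes to the worker that is currently the
    least loaded relative to its capacity."""
    for chunk in chunk_ids:
        min(workers, key=Worker.load).chunks.append(chunk)


if __name__ == "__main__":
    cluster = [Worker("old-1", 10.0), Worker("old-2", 10.0), Worker("new-1", 30.0)]
    place_chunks(list(range(100)), cluster)
    for w in cluster:
        print(f"{w.name}: {len(w.chunks)} chunks for {w.capacity_tb} TB")
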
(tick) Current status of Qserv and Qserv builds

The most recent release:

  • 2023.11.1-rc3:
    • was deployed at -prod (IDF) ~2 weeks ago (an ad hoc method was used to bypass missing operator support)
    • Qserv looks stable (no worker crashes, pod restarts, etc.)
    • no performance issues

Known problems:

  • Still no qserv-operator support for the latest release

A discussion on global alerts on queries that take too long (see the sketch after this list):

  • a few ideas on how to improve the "watcher"
  • set a short limit for "mobu" queries (typically less than 5 minutes)
  • follow query progress
  • take into account the number of queries that are being processed in parallel
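
A minimal sketch of such a watcher check, assuming the list of running queries and their sources can be obtained from Qserv monitoring; the data source, thresholds, and parallelism baseline are hypothetical:

"""Hypothetical long-running-query watcher. The data source, thresholds, and
alert mechanism are assumptions for illustration; they are not Qserv APIs."""

import time
from dataclasses import dataclass

# Hypothetical per-source limits: "mobu" synthetic queries should finish quickly,
# so they get a much shorter threshold than ad hoc user queries.
LIMITS_SEC = {"mobu": 5 * 60, "user": 2 * 60 * 60}


@dataclass
class RunningQuery:
    query_id: int
    source: str          # e.g. "mobu" or "user"
    started_at: float    # UNIX timestamp


def overdue_queries(running: list[RunningQuery], now: float | None = None) -> list[RunningQuery]:
    """Return queries that have exceeded their per-source time limit.
    The limit is stretched when many queries run in parallel, since each
    query then gets a smaller share of the cluster."""
    now = time.time() if now is None else now
    stretch = max(1.0, len(running) / 16)   # assumed baseline of 16 parallel queries
    flagged = []
    for q in running:
        limit = LIMITS_SEC.get(q.source, LIMITS_SEC["user"]) * stretch
        if now - q.started_at > limit:
            flagged.append(q)
    return flagged


if __name__ == "__main__":
    now = time.time()
    demo = [RunningQuery(1, "mobu", now - 600), RunningQuery(2, "user", now - 300)]
    for q in overdue_queries(demo, now):
        print(f"ALERT: query {q.query_id} ({q.source}) has run too long")
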
(tick) Query analysis & processing in Qserv

Igor Gaponenko tested the effect of locking files in memory at USDF using queries against DP02:

  • tested a variety of query types, including
    • unconstrained (no LIMIT) large result SELECT *  with automatic query cancelation on the 5 GB  result set limit
    • unconstrained large result SELECT * LIMIT 100000 with automatic query cancelation after getting enough rows
    • true scan queries w/o subchunks involving columns for which we don't define table indexes
      • large result
      • small result
    • low CPU usage near-neighbor queries
  • running 64 queries to be processed in parallel (see the test-driver sketch after this list)
  • switching back and forth between the following memory management modes
    • MemManReal 
    • MemManNoneRelaxed (a bug-fixed version of the original MemManNone)
  • PRELIMINARY CONCLUSION:
    • Locking files in memory does NOT affect the performance of any query
  • Notes:
    • It's not clear what the effect of memory locking would be with high-latency HDD-based or network-based (IDF) filesystems
    • Multi-way JOIN queries that involve indexes would be another interesting use case
    • locking files in memory doesn't work at IDF (Kubernetes)
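
A minimal sketch of the kind of concurrent test driver described above, assuming direct MySQL-protocol access to the Qserv czar; the connection parameters and the query mix are placeholders, not the actual DP02 test setup:

"""Sketch of a concurrent query driver for timing Qserv under load.
Connection parameters and the query list are placeholders; this is not the
harness actually used for the DP02 tests."""

import time
from concurrent.futures import ThreadPoolExecutor

import mysql.connector  # pip install mysql-connector-python

QSERV = dict(host="qserv-czar.example.org", port=4040,   # placeholder endpoint
             user="qsmaster", database="dp02_dc2_catalogs")

QUERIES = [
    # Placeholder examples of the query classes mentioned above.
    "SELECT * FROM Object LIMIT 100000",
    "SELECT objectId, coord_ra, coord_dec FROM Object WHERE r_psfFlux > 1e-29",
]


def run_one(sql: str) -> float:
    """Run a single query on its own connection and return the wall-clock time."""
    conn = mysql.connector.connect(**QSERV)
    try:
        start = time.monotonic()
        cur = conn.cursor()
        cur.execute(sql)
        cur.fetchall()               # drain the full result set
        return time.monotonic() - start
    finally:
        conn.close()


def run_in_parallel(parallelism: int = 64) -> None:
    """Keep `parallelism` queries in flight, cycling through the query mix."""
    jobs = [QUERIES[i % len(QUERIES)] for i in range(parallelism)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for sql, seconds in zip(jobs, pool.map(run_one, jobs)):
            print(f"{seconds:8.1f} s  {sql[:60]}")


if __name__ == "__main__":
    run_in_parallel()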

Fritz Mueller:

  • the lack of a performance difference might be caused by files already being cached in memory from previous runs
  • suggested using kernel performance tools to study what's in the file system cache (see the sketch below)
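
A sketch of one way to inspect page-cache residency of Qserv data files on Linux, using mincore(2) via ctypes; tools such as vmtouch or eBPF-based cache statistics would serve the same purpose:

"""Rough check of how much of a file is resident in the Linux page cache,
using mincore(2) via ctypes. This is only a sketch of the kind of inspection
suggested above."""

import ctypes
import ctypes.util
import mmap
import os
import sys

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def cached_fraction(path: str) -> float:
    """Return the fraction of the file's pages currently in the page cache."""
    size = os.path.getsize(path)
    if size == 0:
        return 0.0
    with open(path, "rb") as f:
        # A private, copy-on-write mapping keeps the buffer writable for ctypes
        # while unmodified pages remain shared with the page cache, so
        # mincore() still reflects cache residency. Mapping alone does not
        # fault pages in, so the check itself does not perturb the cache.
        mm = mmap.mmap(f.fileno(), size, flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * npages)()
    buf = (ctypes.c_char * size).from_buffer(mm)
    rc = libc.mincore(ctypes.c_void_p(ctypes.addressof(buf)),
                      ctypes.c_size_t(size), vec)
    resident = sum(b & 1 for b in vec) if rc == 0 else 0
    del buf          # release the exported buffer before closing the mapping
    mm.close()
    if rc != 0:
        raise OSError(ctypes.get_errno(), "mincore() failed")
    return resident / npages


if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(f"{name}: {cached_fraction(name):.1%} of pages in the page cache")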

Igor Gaponenko:

  • we might order a few Qserv nodes (of the 50-node batch) to be equipped with HDDs to experiment with the memory-locking options
(tick) New Qserv

Igor Gaponenko:

  • Finished migrating the control plane and monitoring protocol of Qserv workers and Czar to HTTP/REST (based on qhttp)
  • All seems to work fine under stress tests
  • Next steps:
    • Code cleanup and refactoring in the result processing code (in progress, to be finished mid-January 2024)
    • Develop the high-performance result merging mechanism at Czar (to begin working on this in January 2024)
    • HTTP/REST-based Qserv front-end (Czar) (see the hypothetical client sketch after this list)
      • Initially, it will be C++-based (qhttp)
      • The service could be easily set up at USDF
      • Support in the Qserv operator will be needed to add a new pod
      • Colin Slater: this needs to be the high-priority project for January 2024
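
A rough sketch of what a client of such a REST front-end might look like; the base URL, endpoint paths, and JSON fields are invented for illustration and do not describe an existing Qserv API:

"""Hypothetical client for an HTTP/REST Qserv front-end. The base URL,
endpoint paths, and JSON fields are invented for illustration only; they do
not describe an existing Qserv interface."""

import time

import requests  # pip install requests

BASE_URL = "http://qserv-czar.example.org:8080"   # placeholder endpoint


def submit_query(sql: str) -> int:
    """Submit a query and return a server-assigned query ID (hypothetical API)."""
    resp = requests.post(f"{BASE_URL}/query", json={"query": sql}, timeout=30)
    resp.raise_for_status()
    return resp.json()["query_id"]


def wait_for_result(query_id: int, poll_sec: float = 5.0) -> dict:
    """Poll a hypothetical status endpoint until the query finishes."""
    while True:
        resp = requests.get(f"{BASE_URL}/query/{query_id}", timeout=30)
        resp.raise_for_status()
        status = resp.json()
        if status["state"] in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_sec)


if __name__ == "__main__":
    qid = submit_query("SELECT COUNT(*) FROM dp02_dc2_catalogs.Object")
    print(wait_for_result(qid))
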
(tick) UKDF colleagues are still having trouble ingesting DP02 into their cluster using qserv-ingest

The problem was reported at the Database Meeting 2023-12-06.

Investigation:

  • An accidental Qserv restart during ingest was the root cause of the problem. It triggered 5 Qserv workers to go into a Kubernetes "crash loop". This, in turn, might have been caused by excessive memory use by the Qserv Ingest System in an attempt to resume ingest contribution requests posted before the restart.
  • The cluster was repaired by Igor Gaponenko (who was given access to the Kubernetes cluster at UKDF).
  • A workaround of ingesting the catalog in 3 steps (one-third of the catalog per step) was found.
  • The ingest is still in progress (due to the low network bandwidth to the input contributions and the large number of contributions).

Action items

  •