Date

Attendees

Notes from the previous meeting

Discussion items

Discussed | Item | Notes
(tick) Project news

Fritz Mueller: 1-month delay on the project schedule. Further details to be provided later. 

(tick) Status of Qserv at USDF

Igor Gaponenko:

  • two identical Qserv instances have been installed on 6 nodes of the permanent cluster:
    • slac6prod - temporary production service running on port 4040, visible to TAP
    • slac6dev - internal development service on port 4047
    • both are managed by the dedicated service account rubinqsv
    • both instances are loaded with the same set of catalogs:
      • full sky (150k chunks): Gaia, Wise
      • small (1.4k chunks): DP01, DP02
    • documentation has been updated: Managing Qserv instances at SLAC
  • the remaining nodes of the 15-node cluster are expected to be delivered to us in about 1 week

Fritz Mueller:

  • there were problems with async queries submitted via TAP
  • both a problem in the configuration settings and a bug in the code were found
  • the bug fix has been deployed and tested
  • it will be deployed at IDF RSP this Thursday
(tick) Status of qserv-operator and readiness to deploy the operator-managed Qserv at USDF

Fritz Mueller:

  • Fabrice Jammes needs to reactivate the UNIX account at SLAC
  • we don't yet have a Kubernetes cluster installed on these nodes
  • vcluster is being considered by Yee
  • storage provisioning is yet to be investigated (custom storage class, etc.)
  • Fabrice Jammes needs to work with Yee on configuring k8s services in the cluster for Qserv

Fritz Mueller: got a ping from Stephen Gueen (CADC IDAC), who is interested in having all catalogs in IDAC

(tick) A bug in the "director" index optimization

It's been discussed at:

  • https://lsstc.slack.com/archives/G2JPZ3GC8/p1680559084249789
  • a summary of the problem reported by Fritz Mueller
    • for about 1 month we had been getting complaints from Frossie and the team about strange behavior of Qserv at IDF -prod: some queries were taking too long to execute
    • the problem was seen only in -prod, not in -int
    • the problem showed up as a significant delay in connecting from TAP to the Qserv proxy port 4040
    • further investigation revealed a strange internal query submitted by Czar to the "director" index of DP01. Such a query should not be allowed.
    • the query was submitted from the Czar pod
    • it turned out that a user was submitting queries containing objectId>=0
    • this caused these queries to fail from the user's perspective for DP01 and caused a Czar crash for DP02
    • the root cause is a loophole in the current implementation of the "director" restrictor plugin, which needs to be fixed
    • another observation: the "director" index lookup was blocking incoming connections to the proxy

Action items? What needs to be investigated?

  • Fritz Mueller proposed the conservative approach:
    • recognize the known "director" lookup clauses and pass them into the plugin
    • the rest should be passed normally to the scans
  • Colin Slater:
    • is concerned about the latency (hundreds of milliseconds) imposed by the "director" index lookups even in the normal cases

Andy Salnikov:

  • inspect if there is any limit on the number of MySQL connections from the proxy to Czar's database

Fritz Mueller:

  • apparently, the query analysis is a blocking operation in Czar 
  • we need to investigate and fix the threading model

Fritz Mueller:

  • we need to revisit the code to see what it does when no object identifiers are found in the index
  • we should proceed to the scan

Colin Slater:

  • BETWEEN makes no sense from the optimization perspective. Such a query should go directly into a scan
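
The routing rules discussed in this item could be sketched roughly as follows. This is only an illustration under stated assumptions, not Qserv's actual code: the names DIRECTOR_KEY, classify_predicate and plan_query are hypothetical, the "director" index is modeled as a plain dictionary, and only single-clause predicates are handled. Equality and IN clauses on the director key go to the index lookup; everything else (including objectId>=0 and BETWEEN) goes to a scan, as does a lookup that finds no identifiers.

```python
import re

# Hypothetical names throughout: nothing here mirrors Qserv's real code.
DIRECTOR_KEY = "objectId"

# Only these two clause shapes are treated as "director" index lookups.
_EQ = re.compile(rf"^\s*{DIRECTOR_KEY}\s*=\s*(\d+)\s*$", re.IGNORECASE)
_IN = re.compile(
    rf"^\s*{DIRECTOR_KEY}\s+IN\s*\(\s*(\d+(?:\s*,\s*\d+)*)\s*\)\s*$",
    re.IGNORECASE,
)


def classify_predicate(predicate):
    """Return ("lookup", ids) for a recognized clause, else ("scan", [])."""
    m = _EQ.match(predicate)
    if m:
        return "lookup", [int(m.group(1))]
    m = _IN.match(predicate)
    if m:
        return "lookup", [int(tok) for tok in m.group(1).split(",")]
    # Range comparisons, BETWEEN, and anything unrecognized fall through.
    return "scan", []


def plan_query(predicate, index):
    """Pick chunks via the index when possible, otherwise scan all chunks.

    `index` is a dict mapping object id -> chunk id, standing in for the
    real "director" index (an assumption made for this sketch).
    """
    kind, ids = classify_predicate(predicate)
    if kind == "lookup":
        chunks = sorted({index[i] for i in ids if i in index})
        if chunks:
            return "lookup", chunks
    # Unrecognized clause, or no identifiers found in the index: scan.
    return "scan", sorted(set(index.values()))
```

The conservative part is the whitelist: a clause reaches the index only if it matches a known shape, so new or unusual predicates default to the scan path rather than to the plugin.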
(tick) Problems with the k8s-based integration test at UKDF

Igor Gaponenko:

  • the following problem was reported by our colleagues: https://github.com/lsst/qserv-operator/issues/58
  • my preliminary conclusion, based on analyzing the worker log file: we may be seeing a mismatch between the test dataset and the Qserv version

Fritz Mueller:

  • need to check the integration test for qserv-ingest in GHA ... the test passes for the latest official Qserv release
  • Igor Gaponenko should ask Greg what version of Qserv they have in there
(tick) Mods proposed for the ObsCore table data

Fritz Mueller to Andy Hanushevsky :

Action items:

Another issue is an additional column proposed by GPDF to be injected into the "live" catalogs:

(tick) Status of DP03

Fritz Mueller: in progress

Action item for Igor Gaponenko:

  • study options for ingesting CSV files into PostgreSQL
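
As a starting point for that study, one well-known option is PostgreSQL's COPY ... FROM STDIN in CSV format. The sketch below only builds the statement text; the helper names quote_ident and build_copy_statement, as well as any table and column names passed to it, are hypothetical. Other options may turn out to fit better.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_statement(table, columns, header=True):
    """Build a COPY ... FROM STDIN statement for CSV input.

    Hypothetical helper: it only produces the SQL text. The statement
    would then be handed to a client together with an open CSV file,
    for example via psycopg2's cursor.copy_expert(sql, file).
    """
    cols = ", ".join(quote_ident(c) for c in columns)
    options = "FORMAT csv, HEADER true" if header else "FORMAT csv"
    return f"COPY {quote_ident(table)} ({cols}) FROM STDIN WITH ({options})"
```

Streaming the file through STDIN keeps the CSV on the client side, so the database server does not need filesystem access to the input files.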
(tick) Progress on modernizing Qserv

Fritz Mueller, John Gates, and Igor Gaponenko are discussing a proposal to introduce "uber-jobs" and "uber-chunks" into Qserv. It will be based on (and will benefit from) the ongoing development of the file-based result delivery from workers to Czar. More details to be provided soon.

Action items
