
Date

Attendees

Last week's meeting notes:

Discussion items

Time | Item | Who | Notes
(tick) Project news

DM all hands in Chile has been moved to March 2023

PCW2022 is coming.

JSRs are coming after PCW2022. People are going to be preoccupied with that.

News from ORA: recognizing/rewarding people for work on DP01 (not DP02). E-mails with invitations have been sent to the relevant folks. This is scheduled for August 16th.

(tick) Status of DP02

Colin Slater has the complete set of Parquet files for ForcedSourceOnDiaObject.

Igor Gaponenko will spend the next 24+ hours preprocessing the Parquet.

Fritz Mueller suggested ingesting into both qserv-int and qserv-prod in parallel, since scientists may be interested in seeing this table in Qserv before PCW2022.

Fritz Mueller on RefMatch tables:

  • the problem has been traced to case sensitivity in the RelationalGraph implementation.
  • this will be fixed and deployed soon.
  • still waiting for DM-35578.
  • in the meantime, the CSS will be modified manually.

Igor Gaponenko will need to tell Fabrice Jammes where to find the new version of ForcedSourceOnDiaObject and the truth tables (as RefMatch), along with the relevant instructions on the schema and CSS configurations.

(tick) Case sensitivity in Qserv (team)

Context:

  • Christine ran into some problems with Qserv in the past
  • Qserv is case-sensitive on database and table names
  • There is a JIRA ticket related to this:

Fritz Mueller thinks he knows how to implement a case-insensitive front end for incoming queries.

Igor Gaponenko noted that some user queries are still case-sensitive with respect to column names. This includes queries mentioning the primary key of the director table.
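One common way to make name lookups tolerant of case differences is to resolve each incoming name against the known catalog names through a lower-cased index. The sketch below illustrates that general idea only; it is not the design discussed in the meeting, and the helper names `build_name_index` and `resolve_name` are hypothetical and not part of Qserv.

```python
def build_name_index(known_names):
    """Map lower-cased names to their canonical spelling.

    Raises ValueError if two known names differ only by case, because
    such names cannot be resolved unambiguously.
    """
    index = {}
    for name in known_names:
        key = name.lower()
        if key in index and index[key] != name:
            raise ValueError(f"ambiguous names: {index[key]!r} and {name!r}")
        index[key] = name
    return index


def resolve_name(index, requested):
    """Return the canonical spelling of `requested`, or None if unknown."""
    return index.get(requested.lower())
```

The same index could in principle be applied to database, table, and column names, so a query written with `forcedsourceondiaobject` would resolve to `ForcedSourceOnDiaObject` before reaching the case-sensitive layers.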

(tick) Load testing of Qserv

Context reported by Fritz Mueller:

  • The JDBC client library does not cancel synchronous queries when the client disconnects from the TAP service.
  • We have to address these issues.
  • In the meantime, the testing needs to be postponed until the problems are understood and fixed.

Andy Salnikov: a connection from a client to the proxy results in another connection to MySQL.

Fritz Mueller: another issue is that disconnects from Qserv have been observed after 8 hours. It turns out this is exactly the timeout set in the czar's MySQL service.

Fritz Mueller: tried to increase that timeout. This didn't help.

Andy Salnikov: this could possibly be fixed by setting a proper interactive timeout (interactive_timeout) or wait timeout (wait_timeout) on the czar's MySQL server. Another idea is to check what mysql-proxy thinks about the timeout:

It could be interesting to run SHOW SESSION VARIABLES LIKE 'wait_timeout' through the proxy.
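The check could be scripted so that the values seen through mysql-proxy and those seen directly on the czar's MySQL server are compared side by side. This is a minimal sketch, assuming a Python DB-API 2.0 cursor is available for each path; the helper names `fetch_timeouts` and `diff_timeouts` are hypothetical and not part of Qserv. For reference, MySQL's default for both `wait_timeout` and `interactive_timeout` is 28800 seconds, which matches the observed 8 hours.

```python
TIMEOUT_QUERY = (
    "SHOW SESSION VARIABLES "
    "WHERE Variable_name IN ('wait_timeout', 'interactive_timeout')"
)


def fetch_timeouts(cursor):
    """Return {variable_name: seconds} for the session behind `cursor`.

    `cursor` is any DB-API 2.0 cursor, connected either to mysql-proxy or
    directly to the czar's MySQL server. Assumption: rows come back as
    (Variable_name, Value) pairs, which is how MySQL reports them.
    """
    cursor.execute(TIMEOUT_QUERY)
    return {name: int(value) for name, value in cursor.fetchall()}


def diff_timeouts(direct, via_proxy):
    """Return the variables whose values differ between the two paths."""
    names = sorted(set(direct) | set(via_proxy))
    return {
        name: (direct.get(name), via_proxy.get(name))
        for name in names
        if direct.get(name) != via_proxy.get(name)
    }
```

An empty result from `diff_timeouts` would suggest the proxy session sees the same settings as a direct session, so the cause of the disconnects would have to be looked for elsewhere.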

Fritz Mueller will further investigate this.

Fritz Mueller: there was some confusion about which queue was used for processing queries at the workers. Discussed it with John Gates.

(tick) News on qserv-ingest

Issues caused by the timeouts have been fixed.

The next step will be to ingest DP02 into qserv-dev using qserv-ingest.

Fritz Mueller would like to fix the LSST Logger configuration in qserv-operator.


Instabilities in the Kubernetes-based CI (team)

Context:

  • CI may fail to ingest catalogs because some of the workers aren't ready (have not reported "for duty" to the Replication Controller).
  • this creates a spectrum of problems.

A readiness probe based on Kubernetes' ability to monitor pods has been in place for many months. The CI blocks until all pods report as ready.

Fritz Mueller: formally this looks good. The problem is in the fidelity of the application's status: the application may need more time to stabilize after its pod is reported as ready.

Igor Gaponenko proposed two solutions:

  • Ask the Replication Controller which workers (how many of those) have connected. There is a REST service for that. Unfortunately, the service has a bug that needs to be fixed. See:
  • Or, add an option to the Controller to wait until a quorum is formed (the required number of workers have connected to the Controller).
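Either proposal amounts to polling the number of connected workers until a required count is reached or a deadline passes. The following is a minimal sketch of that wait loop on the CI side, under stated assumptions: `count_connected` is a hypothetical callable (for example, a thin wrapper around the Controller's REST service, whose endpoint is not specified in these notes), and the default timeout and polling interval are placeholders.

```python
import time


def wait_for_quorum(count_connected, required, timeout_s=300.0, poll_s=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Block until at least `required` workers are connected or time runs out.

    `count_connected` is a hypothetical zero-argument callable returning the
    number of workers currently connected to the Replication Controller.
    Returns True if the quorum was reached, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if count_connected() >= required:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A CI step could call `wait_for_quorum(...)` before starting the catalog ingest and fail fast with a clear message when it returns False, instead of failing later during the ingest itself.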


Action items

  •