...

Attendees

Last week's meeting notes:

...

Time | Item | Who | Notes
(tick) Project news

DM all hands in Chile has been moved to March 2023

PCW2022 is coming up.

JSRs are coming after PCW2022. People are going to be preoccupied with that.

News from ORA: recognizing/rewarding people for work on DP01 (not DP02). E-mail invitations have been sent to the relevant people. This is scheduled for August 16th.

 (tick) Status of DP02

Colin Slater has the complete set of Parquet files for ForcedSourceOnDiaObject.

Igor Gaponenko will spend the next 24+ hours preprocessing the Parquet files.

Fritz Mueller suggested ingesting into both qserv-int and qserv-prod in parallel, since scientists may be interested in seeing this table in Qserv before PCW2022.

Fritz Mueller on RefMatch tables:

  • the problem has been traced to case sensitivity in the RelationalGraph implementation.
  • this will be fixed and deployed soon.
  • still waiting for DM-35578.
  • in the meantime, CSS will be modified manually.

Igor Gaponenko needs to tell Fabrice Jammes where to find the new version of ForcedSourceOnDiaObject and the truth tables (as RefMatch), along with the relevant instructions on the schema and CSS configurations.

(tick) Case sensitivity in Qserv (team)

Context:

  • Christine ran into some problems with Qserv in the past
  • Qserv is case-sensitive on database and table names
  • There is a JIRA ticket related to this: DM-16709

Fritz Mueller thinks he knows how to implement a case-insensitive front end for incoming queries.

Igor Gaponenko noted that some user queries are still case-sensitive with respect to column names. This includes queries that mention the primary key of the director table.
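To make the idea of a case-insensitive front end concrete, here is a minimal sketch of one possible approach: resolving a user-supplied database, table, or column name against the known names, ignoring case. This is not the design discussed in the meeting. The function names and the sample table list are hypothetical, and a real front end would have to apply this inside query parsing.

```python
def build_name_index(known_names):
    """Map case-folded names to their canonical spelling.

    Raises ValueError if two known names differ only by case, because a
    case-insensitive lookup would then be ambiguous.
    """
    index = {}
    for name in known_names:
        key = name.casefold()
        if key in index and index[key] != name:
            raise ValueError(
                f"names differ only by case: {index[key]!r} vs {name!r}"
            )
        index[key] = name
    return index


def resolve_name(index, user_name):
    """Return the canonical spelling of a user-supplied name."""
    try:
        return index[user_name.casefold()]
    except KeyError:
        raise LookupError(f"unknown name: {user_name!r}") from None


# Hypothetical usage: the table list here is illustrative only.
tables = build_name_index(["Object", "ForcedSourceOnDiaObject"])
canonical = resolve_name(tables, "forcedsourceondiaobject")
```

The same lookup could in principle be applied to column names, including the primary key of the director table, provided the catalog holds no two names that differ only by case.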

(tick)Load testing of Qserv

Context reported by Fritz Mueller:

  • There are issues with the JDBC client library not canceling synchronous queries when the client disconnects from the TAP service.
  • We have to address these issues.
  • In the meantime, the testing needs to be postponed until the problems are understood and fixed.

Andy Salnikov: a connection from a client to the proxy results in another connection to MySQL

  • needs to be investigated.
  • Fritz Mueller: another issue is that disconnects from Qserv after 8-hour timeouts have been observed. It turns out this is exactly the timeout set in czar's MySQL service.

    Fritz Mueller: tried to increase that timeout. This didn't help.

    Andy Salnikov: it's possible this could be fixed by setting a proper interactive timeout (interactive_timeout) or wait timeout (wait_timeout) on the czar's MySQL server. Another idea is to check what mysql-proxy thinks about the timeout:

    For timeouts, it could be interesting to run SHOW SESSION VARIABLES LIKE 'wait_timeout' through the proxy.

    Fritz Mueller will investigate this further.

    Fritz Mueller: there was some confusion about which queue was used for processing queries at the workers. This was discussed with John Gates.

    Open question: what would be the next steps in investigating the disconnects and timeouts?
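One way to follow up on the suggestion to check the timeouts through the proxy is sketched below. It runs the SHOW SESSION VARIABLES query from the notes for both wait_timeout and interactive_timeout and compares the values seen through mysql-proxy with those seen on a direct connection to czar's MySQL. The helper names are hypothetical, and `cursor` is assumed to be any Python DB-API 2.0 cursor from whichever MySQL driver is in use. For reference, 8 hours is 28800 seconds.

```python
TIMEOUT_VARIABLES = ("wait_timeout", "interactive_timeout")


def session_timeouts(cursor):
    """Return {variable_name: seconds} for the session behind `cursor`.

    `cursor` is assumed to be a DB-API 2.0 cursor; only execute() and
    fetchall() are used. The variable names are fixed constants, so they
    are placed into the statement directly.
    """
    result = {}
    for name in TIMEOUT_VARIABLES:
        cursor.execute(f"SHOW SESSION VARIABLES LIKE '{name}'")
        for variable, value in cursor.fetchall():
            result[variable] = int(value)
    return result


def compare_timeouts(via_proxy, direct):
    """Return the variables whose values differ between the two sessions.

    The result maps each differing name to (value_via_proxy, value_direct).
    """
    return {
        name: (via_proxy.get(name), direct.get(name))
        for name in sorted(set(via_proxy) | set(direct))
        if via_proxy.get(name) != direct.get(name)
    }
```

If the two sessions report different values, the session-level setting seen through the proxy, rather than the server-wide setting, would be a natural place to look next.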

    (tick) News on qserv-ingest

    Issues caused by the timeouts have been fixed.

    The next step will be to ingest DP02 into qserv-dev using qserv-ingest.

    Fritz Mueller would like to fix the LSST Logger configuration in qserv-operator.


    Instabilities in the Kubernetes-based CI (team)

    Context:

    • there is a chance CI may fail to ingest catalogs because some of the workers aren't ready (they have not reported "for duty" to the Replication Controller).
    • this creates a spectrum of problems

    A readiness probe based on kubectl's ability to monitor pods has been in place for many months. The CI blocks until all pods report as ready.

    Fritz Mueller: formally this looks good. The problem is in the fidelity of the application's status: a pod can report as ready before the application has stabilized.

    Igor Gaponenko proposed two solutions:

    • Ask the Replication Controller which workers (and how many of them) have connected. There is a REST service for that. Unfortunately, the service has a bug that needs to be fixed. See DM-35774.
    • Or, add an option to the Controller to wait until a quorum is formed (the required number of workers have connected to the Controller).
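Both proposals amount to blocking until enough workers have connected. A minimal sketch of that wait loop on the CI side follows. It is illustrative only: `count_connected` is a hypothetical callable standing in for a query to the Controller's REST service (whose endpoint and response format are not given in these notes), and the default timeout and polling interval are arbitrary.

```python
import time


def wait_for_quorum(count_connected, required, timeout_s=300.0, poll_s=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Block until at least `required` workers have connected.

    `count_connected` is a caller-supplied function returning the number of
    workers currently connected to the Controller (hypothetical; it would
    wrap the REST call). Returns the observed count once the quorum is
    reached, or raises TimeoutError after `timeout_s` seconds.
    """
    deadline = clock() + timeout_s
    while True:
        connected = count_connected()
        if connected >= required:
            return connected
        if clock() >= deadline:
            raise TimeoutError(
                f"only {connected} of {required} workers connected "
                f"after {timeout_s} s"
            )
        sleep(poll_s)
```

Passing the counting function in as a parameter keeps the loop independent of the REST details, so the same logic could be used in the CI scripts or inside the Controller itself.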


    Action items

    •