Date

Attendees

Notes from the previous meeting

Discussion items

Discussed | Item | Notes
(tick) Project news

Fritz Mueller, Colin Slater:

  • No major news from the DMLT meeting
  • More progress on the Summit
  • JTM DM at SLAC is coming. Masks will be required.
(tick) USDF

Notes from the last DF meeting are available at:

A few important takeaways from the meeting:

  • There was a power upgrade affecting USDF over the Winter break
    • Qserv has survived
  • Upgrades of the underlying Kubernetes and S3 infrastructure (the hypervisor upgrade) affected the Butler database. The PostgreSQL schema migration attempted by Andy Salnikov before the upgrade was wiped out (2 days' worth of work). Andy may have more to say on this.
    • Fritz Mueller: the "cascade" effect of changes in the upgraded services on downstream dependencies was not accounted for
    • Andy Salnikov:
      • PostgreSQL created 6 TB of write-ahead logs when upgrading the schemas of the 0.5 TB database. This needs to be understood. Specifically, this might have happened during the vacuum stage.
      • Write-ahead logs are needed for backups in case a restore is needed. The problem is that the logs are huge.
      • The first attempt to back up did not work.
      • The second full backup worked
      • Igor Gaponenko: How does the performance of the operation at USDF compare with IDF (both are based on network storage)?
        • IDF was somewhat slower to upgrade on the second attempt after expanding memory on the machines
      • Some info on WAL size growth:
  • No progress on the Cassandra nodes.
    • (As explained by Yemi) The deployment got stalled since the underlying network infrastructure was not ready.
  • A reminder to discuss an acquisition of the next batch of 50 Qserv nodes was made at the meeting.
    • ACTION ITEM for Fritz Mueller and Igor Gaponenko:
      • Meet at SLAC to discuss this topic
    • Fritz Mueller: based on the current experience with slower-than-estimated progress of deployments at USDF, it's imperative to accelerate the purchase order. Realistically, there is a typical 6-month delay before deployment after the hardware arrives at SLAC.
    • Colin Slater: there is an impression that the overall planning of the work on the USDF infrastructure may lack clarity
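
On the WAL growth item above (Andy Salnikov): one possible way to see which migration stage produces the logs is to sample the WAL position before and after each stage (for example with `SELECT pg_current_wal_lsn();` on PostgreSQL 10 or later) and compare the samples. The sketch below only does the arithmetic on such samples. The helper names are hypothetical, the LSN values are made up for illustration, and TB is treated as TiB to keep the numbers round.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_bytes_between(start_lsn: str, end_lsn: str) -> int:
    """Bytes of WAL written between two LSN samples."""
    return lsn_to_bytes(end_lsn) - lsn_to_bytes(start_lsn)


def wal_amplification(start_lsn: str, end_lsn: str, database_bytes: int) -> float:
    """Ratio of WAL written to the size of the database being migrated."""
    return wal_bytes_between(start_lsn, end_lsn) / database_bytes


if __name__ == "__main__":
    TIB = 2**40
    # Illustrative samples only: 6 TiB of WAL against a 0.5 TiB database.
    ratio = wal_amplification("0/0", "600/0", TIB // 2)
    print(f"WAL amplification: {ratio:.1f}x")
```

Sampling around each stage separately (schema change, data rewrite, vacuum) would show whether the vacuum stage is in fact where most of the growth occurs.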
(tick) Current status of Qserv and Qserv builds

No changes since the previous meeting. Here is the summary:

  • 2023.11.1-rc3:
    • was deployed at -prod (IDF) more than 4 weeks ago (an ad hoc method was used to bypass the missing operator support)
    • Qserv looks stable (no worker crashes, pod restarts, etc.)
    • no performance issues

Known problems:

  • Still no qserv-operator support for the latest release
  • Fritz Mueller: this isn't going to happen before next week
(tick) The RSP folks reported that some queries submitted to Qserv over the TAP service may "last forever"

Fritz Mueller on what's known about the problem:

  • The RSP query "monkey" reports queries that take longer than 30 minutes. These queries are timed out, after which the TAP service gives up on them.
    • These are known to be non-long-running queries that would typically finish within a few minutes
    • However, at some point, these queries would take longer than 30 minutes
    • According to Qserv these queries are done
  • tried to investigate it over the Winter break:
    • it's reproducible when launching queries over the TAP interface
    • it also seems to be reproducible when submitting queries via the MySQL prompt once Qserv gets into this state
  • short-term solution or workaround:
    • we need to instrument the mysql-proxy with more logging statements
    • Igor Gaponenko:
      • it would be helpful to find a way to reproduce the problem, whether it's a specific set of queries or a sequence of preceding queries, etc.
      • it would allow testing this problem at USDF where we have more control over Qserv
  • long-term solution:
    • we should begin working on the REST API for Qserv, which would also involve making changes to the TAP service
    • the first step would be to come up with a proposal for such an API to be discussed with the TAP team
(tick) New Qserv

Igor Gaponenko and Fritz Mueller met before the break to revisit a plan for improving Qserv.

 Projects finished before the Winter break:

On-going work on:

  • DM-40003
    • John Gates suggested restructuring the protocol to return the summary messages in the headers (sent as the XROOTD/SSI metadata).

Action items

  •