Date

Attendees

Notes from the previous meeting

Discussion items

Discussed | Item | Notes
(tick) Project news

Fritz Mueller:

  • There was no DMLT this week due to the next
  • Preparation for the Ops rehearsal is ongoing
  • Visible progress with the main mirror on the Summit
  • Igor Gaponenko: Will the planned commissioning simulation (April 1-5) affect us?
    • Fritz Mueller: it won't. It's going to be a DRP-like run. No plans for ingesting any data products (into Qserv, or elsewhere).

Colin Slater: community workshop (not the All Hands) at SLAC this June. The non-person format.

(tick) USDF

Summary of news from this week's meeting: Joint Data Facilities Meeting - 2024-03-18:

  • Still no progress with the 28 Qserv nodes purchased last year
  • "Skyrocketing" (doubled since January) prices for SSD may affect the next PO
    • Plans for expanding SSD-based/backed filesystems may be scaled down
    • Qserv wasn't directly mentioned in this context
  • Ongoing activities in refining/redefining the cybersecurity model for USDF
    • Not many details yet
    • We should wait until May to see actual changes
  • The hiring process for the DBA candidate has started (as a contractor?)
  • Progress in developing the tape-based backup infrastructure

Andy Salnikov:

  • Butler schema migration: the next window of opportunity is in ~1 month
  • Cassandra, APDB, PPDB:
    • it's still under tests
    • working on replication
      • implemented a simple script to push data from Cassandra into PostgreSQL
      • performance is not great
      • limited by the use of SQLAlchemy in the script implementation (many thousands of rows to be transferred)
      • tables are really wide
      • most time is spent on the client side (CPU) repackaging data
      • Looking at reimplementing the translation tool to dump data into CSV and use an efficient binary loader tool for PostgreSQL
        • Fritz Mueller had a positive experience with the tool in the context of ingesting DP03. However, there are some caveats relating to a version of the tool. Further details can be found in Jira by looking for DP03.
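
The CSV-based approach mentioned above can be illustrated with a minimal sketch. It is not the actual translation tool: the function names, table name, and column names are hypothetical, and the loader is assumed here to be PostgreSQL's standard COPY command rather than the specific tool discussed. The point is only that rows are serialized once into CSV and handed to the server in bulk, instead of being repackaged row by row through an ORM layer.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize an iterable of row dicts into CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        # csv.writer emits None as an empty field; with PostgreSQL's default
        # CSV settings an unquoted empty field is loaded as NULL.
        writer.writerow([row.get(c) for c in columns])
    return buf.getvalue()


def copy_statement(table, columns):
    """Build a PostgreSQL COPY command that reads CSV (with header) from STDIN."""
    cols = ", ".join(columns)
    return f"COPY {table} ({cols}) FROM STDIN WITH (FORMAT csv, HEADER true)"


if __name__ == "__main__":
    # Hypothetical rows standing in for data read from Cassandra.
    sample = [{"id": 1, "ra": 10.5, "flag": None}, {"id": 2, "ra": 11.0, "flag": "x,y"}]
    print(copy_statement("example_table", ["id", "ra", "flag"]))
    print(rows_to_csv(sample, ["id", "ra", "flag"]), end="")
```

With a driver such as psycopg2, the generated statement and a file-like object holding the CSV text could be passed to `cursor.copy_expert(sql, file)`. Whether this is fast enough for the very wide tables in question would still need to be measured.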
(tick) Current status of Qserv and Qserv builds

The current production release is 2024.3.1-rc2.

Status of the ongoing "garbage collection" for the temporary message tables:

  • prod: 2.6 million (out of 3.5 million) tables have been deleted
  • int: 2.3 million (out of 3.7 million) tables have been deleted
  • ETA: ~1 more week

Status of Qserv cluster at IDF:

  • prod:
    • It's stable
    • There have been a couple of occasional worker pod restarts over the last 3 days w/o any traces. This could be the same problem discussed last week (alleged rolling upgrades of GKE).
  • int:
    • Two worker pods restarted similarly to what happened in prod.
    • A mysterious crash of the Replication System's MariaDB 45 hours ago. The Replication Controller was restarted after exceeding the maximum number of allowed reconnects to the service.

The relevant section of the database server's log:

2024-03-18 20:43:13 3 [Warning] IP address '10.137.1.1' could not be resolved: Name or service not known
2024-03-18 20:43:13 3 [Warning] Aborted connection 3 to db: 'unconnected' user: 'unauthenticated' host: '10.137.1.1' (This connection closed normally without authentication)
2024-03-18 20:43:14 0x7f42e81e5700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.6.8/storage/innobase/btr/btr0cur.cc line 324
InnoDB: Failing assertion: btr_page_get_prev(get_block->page.frame) == block->page.id().page_no()
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
240318 20:43:14 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
...

I will keep an eye on the problem.

Fritz Mueller suggested Igor Gaponenko run the integrity test on the Replication system's database tomorrow during the Thursday outage at IDF (3:00 pm).

UPDATE 2024-03-21: Igor Gaponenko: the database analyzer didn't report any problems with the Replication database on -int or -prod:

% kubectl exec -it qserv-repl-db-0 -- mysqlcheck --analyze -uroot -pxxxxxx -h127.0.0.1 --protocol=tcp -P3306 qservReplica
qservReplica.QMetadata                             OK
qservReplica.config_database                       OK
qservReplica.config_database_family                OK
qservReplica.config_database_table                 OK
qservReplica.config_database_table_schema          OK
qservReplica.config_worker                         OK
qservReplica.config_worker_ext                     OK
qservReplica.controller                            OK
qservReplica.controller_log                        OK
qservReplica.controller_log_ext                    OK
qservReplica.database_ingest                       OK
qservReplica.job                                   OK
qservReplica.job_ext                               OK
qservReplica.replica                               OK
qservReplica.replica_file                          OK
qservReplica.request                               OK
qservReplica.request_ext                           OK
qservReplica.stats_table_rows                      OK
qservReplica.transaction                           OK
qservReplica.transaction_contrib                   OK
qservReplica.transaction_contrib_ext               OK
qservReplica.transaction_contrib_retry             OK
qservReplica.transaction_contrib_warn              OK
qservReplica.transaction_log                       OK
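
For repeated checks, a report like the one above could be scanned automatically. The following is a minimal sketch, assuming the two-column "table status" layout shown here; the function name is hypothetical, and anything other than a plain OK status is simply flagged for a human to look at rather than interpreted.

```python
def find_problem_tables(report):
    """Return (table, status) pairs for tables whose status is not exactly 'OK'."""
    problems = []
    for line in report.splitlines():
        parts = line.split(None, 1)
        # Only lines starting with a database-qualified name (db.table) are
        # treated as table entries; prompts and message lines are skipped.
        if not parts or "." not in parts[0]:
            continue
        status = parts[1].strip() if len(parts) > 1 else ""
        if status != "OK":
            problems.append((parts[0], status))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: pipe the mysqlcheck output into this script.
    bad = find_problem_tables(sys.stdin.read())
    for table, status in bad:
        print(f"{table}: {status or 'no status reported'}")
    sys.exit(1 if bad else 0)
```

The non-zero exit code makes the script usable as a simple gate in a maintenance procedure.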
(tick) Qserv query analysis and query processing performance

Context: IDF

A bunch of queries were sent to the "medium" queue

From the relevant discussion in the team channel this week, it sounds like we may have a suboptimal implementation of the following query class:

SELECT DISTINCT <column> FROM <database>.<partitioned-table> LIMIT <N>

Do we clearly understand what's going on here, and is this a problem or a "problem"?

Shall we further discuss this?

Fritz Mueller:

  • the query could be optimized
  • however, this is not a very important query
  • there are more important tasks to be addressed at the moment
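
For reference, one possible shape of such an optimization is sketched below. This is not a description of how Qserv processes the query; it only illustrates the general idea that a DISTINCT ... LIMIT N query over a partitioned table can stop consuming per-chunk results as soon as N distinct values have been collected. The function name and the representation of chunk results as Python iterables are assumptions made for the illustration.

```python
from itertools import islice


def distinct_limit(chunk_results, limit):
    """Collect up to `limit` distinct values, reading chunk results lazily."""
    seen = set()

    def unique():
        for chunk in chunk_results:
            for value in chunk:
                if value not in seen:
                    seen.add(value)
                    yield value

    # islice stops pulling values once `limit` of them have been produced,
    # so chunks that are not needed are never requested.
    return list(islice(unique(), limit))


if __name__ == "__main__":
    requested = []

    def chunks():
        # Hypothetical per-chunk results; records which chunks were requested.
        for i, values in enumerate([[1, 2, 2], [2, 3], [4, 5]]):
            requested.append(i)
            yield values

    print(distinct_limit(chunks(), 3))
    print(requested)
```

In the example the third chunk is never requested, which is the saving an early-termination scheme would aim for.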
(tick) Merging qserv-operator into qserv source tree and changing container builds

Fritz Mueller:

  • no news
  • will continue working on this next week

Igor Gaponenko:

  • can Qserv development and run-time infrastructure be upgraded to Python 3.12?
  • Fritz Mueller: possibly; started looking into upgrading the toolchain, we may have it this week
(tick) Addressing an issue with the "dark" queries

John Gates:

  • all relevant code is in GitHub
  • Fritz Mueller will build a new release to get things rolling
(tick) HTTP-based Qserv frontend

Igor Gaponenko: ongoing work on expanding the integration test

  • DM-42810
  • Had to "dive" into the Python code
  • Fritz Mueller suggested considering the option of comparing query results against the reference (MariaDB) database instead of the MySQL proxy. The reason is that the proxy-based front-end may (will) eventually be decommissioned.
    • Igor Gaponenko will investigate this option as well as a possible selector for the source database.
  • Fritz Mueller still needs to look at IVOA (TODO)
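
The comparison against a reference database suggested above could look roughly like the following minimal sketch. It is not the integration test's code: the function name is hypothetical, and rows are assumed to have already been fetched from both sources as sequences of hashable values. Rows are compared as multisets, so row order is ignored but duplicate rows still count.

```python
from collections import Counter


def same_result_set(rows_a, rows_b):
    """Compare two query results as multisets of rows, ignoring row order."""
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))


if __name__ == "__main__":
    # Hypothetical results from the HTTP frontend and the reference database.
    frontend_rows = [(1, "a"), (2, "b"), (2, "b")]
    reference_rows = [(2, "b"), (1, "a"), (2, "b")]
    print(same_result_set(frontend_rows, reference_rows))
```

Floating-point columns may need rounding or a tolerance before comparison, since the two sources are not guaranteed to format or compute them identically.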

Igor Gaponenko: no progress (has not started working on) on

(tick)New dispatch (new Qserv)

John Gates looking at:

Action items
