Database Meeting 2024-03-20

Date

20 Mar 2024

Attendees

Igor Gaponenko, Colin Slater, Andy Salnikov, John Gates, Fritz Mueller

Notes from the previous meeting

Database Meeting 2024-03-13

Discussion items

Discussed	Item	Notes
	Project news	Fritz Mueller: There was no DMLT meeting this week because the next Preparation for Ops rehearsal is under way. Visible progress with the main mirror on the Summit. Igor Gaponenko: Will the planned commissioning simulation (April 1-5) affect us? Fritz Mueller: It won't. It's going to be a DRP-like run, with no plans for ingesting any data products (into Qserv or elsewhere). Colin Slater: community workshop (not the All Hands) at SLAC this June. The non-person format.
	USDF	Summary of news from this week's meeting (Joint Data Facilities Meeting - 2024-03-18): Still no progress with the 28 Qserv nodes purchased last year. "Skyrocketing" SSD prices (doubled since January) may affect the next PO; plans for expanding SSD-based/backed filesystems may be scaled down, though Qserv wasn't directly mentioned in this context. Ongoing activities in refining/redefining the cybersecurity model for USDF: not many details yet; we should wait until May to see actual changes. The hiring process for the DBA candidate has started (as a contractor?). Progress in developing the tape-based backup infrastructure. Andy Salnikov: Butler schema migration: the next window of opportunity is in ~1 month. Cassandra, APDB, PPDB: still under test; working on replication. Implemented a simple script to push data from Cassandra into PostgreSQL. Performance is not great: it is limited by the use of SQLAlchemy in the script (many thousands of rows to be transferred), the tables are really wide, and most of the time is spent on the client side (CPU) repackaging data. Looking at reimplementing the translation tool to dump data into CSV and use an efficient binary loader tool for PostgreSQL. Fritz Mueller had a positive experience with the tool in the context of ingesting DP03; however, there are some caveats relating to the version of the tool. Further details can be found in Jira by searching for DP03.
	Current status of Qserv and Qserv builds	The current production release is `2024.3.1-rc2`. Status of the ongoing "garbage collection" of the temporary message tables: prod: 2.6 million (out of 3.5 million) tables have been deleted; int: 2.3 million (out of 3.7 million) tables have been deleted; ETA: ~1 more week. Status of the Qserv cluster at IDF: prod: stable. There have been a couple of occasional worker pod restarts over the last 3 days without any traces. This could be the same problem discussed last week (alleged rolling upgrades of GKE). Fritz Mueller will have a look at this tomorrow. int: Two worker pods restarted similarly to what happened in prod. A mysterious crash of the Replication System's MariaDB occurred 45 hours ago. The Replication Controller was restarted after exceeding the maximum number of allowed reconnects to the service. The relevant section of the database server's log:
		2024-03-18 20:43:13 3 [Warning] IP address '10.137.1.1' could not be resolved: Name or service not known
		2024-03-18 20:43:13 3 [Warning] Aborted connection 3 to db: 'unconnected' user: 'unauthenticated' host: '10.137.1.1' (This connection closed normally without authentication)
		2024-03-18 20:43:14 0x7f42e81e5700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.6.8/storage/innobase/btr/btr0cur.cc line 324
		InnoDB: Failing assertion: btr_page_get_prev(get_block->page.frame) == block->page.id().page_no()
		InnoDB: We intentionally generate a memory trap.
		InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
		InnoDB: If you get repeated assertion failures or crashes, even
		InnoDB: immediately after the mariadbd startup, there may be
		InnoDB: corruption in the InnoDB tablespace. Please refer to
		InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
		InnoDB: about forcing recovery.
		240318 20:43:14 [ERROR] mysqld got signal 6 ;
		This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware.
		...
		I will keep an eye on the problem. Fritz Mueller suggested that Igor Gaponenko run the integrity test on the Replication System's database tomorrow during the Thursday outage at IDF (3:00 pm).
		UPDATE 2024-03-21: Igor Gaponenko: the database analyzer didn't report any problems with the Replication database on -int or -prod:
		% kubectl exec -it qserv-repl-db-0 -- mysqlcheck --analyze -uroot -pxxxxxx -h127.0.0.1 --protocol=tcp -P3306 qservReplica
		qservReplica.QMetadata OK
		qservReplica.config_database OK
		qservReplica.config_database_family OK
		qservReplica.config_database_table OK
		qservReplica.config_database_table_schema OK
		qservReplica.config_worker OK
		qservReplica.config_worker_ext OK
		qservReplica.controller OK
		qservReplica.controller_log OK
		qservReplica.controller_log_ext OK
		qservReplica.database_ingest OK
		qservReplica.job OK
		qservReplica.job_ext OK
		qservReplica.replica OK
		qservReplica.replica_file OK
		qservReplica.request OK
		qservReplica.request_ext OK
		qservReplica.stats_table_rows OK
		qservReplica.transaction OK
		qservReplica.transaction_contrib OK
		qservReplica.transaction_contrib_ext OK
		qservReplica.transaction_contrib_retry OK
		qservReplica.transaction_contrib_warn OK
		qservReplica.transaction_log OK
	Qserv query analysis and query processing performance	Context: IDF. A bunch of queries were sent to the "medium" queue. From the relevant discussion in the team channel this week, it sounds like we may have a suboptimal implementation of the following query class: SELECT DISTINCT <column> FROM <database>.<partitioned-table> LIMIT <N>. Do we clearly understand what's going on here, and is this a problem or a "problem"? Shall we discuss this further? Fritz Mueller: the query could be optimized; however, this is not a very important query, and there are more important tasks to be addressed at the moment.
	Merging `qserv-operator` into `qserv` source tree and changing container builds	Fritz Mueller: no news; will continue working on this next week. Igor Gaponenko: can the Qserv development and run-time infrastructure be upgraded to Python 3.12? Fritz Mueller: possibly; started looking at upgrading the toolchain, we may have it this week.
	Addressing an issue with the "dark" queries	John Gates: all relevant code is in GitHub. Fritz Mueller will build a new release to get things rolling.
	HTTP-based Qserv frontend	Igor Gaponenko: ongoing work on expanding the integration test (DM-42810). Had to "dive" into the Python code. Fritz Mueller suggested considering the option of comparing query results against the reference (MariaDB) database instead of the MySQL proxy, because the proxy-based front-end may (will) eventually be decommissioned. Igor Gaponenko will investigate this option as well as a possible selector for the source database. Fritz Mueller still needs to look at IVOA (TODO). Igor Gaponenko: no progress on DM-43282 (work has not started).
	New dispatch (new Qserv)	John Gates is looking at DM-43291.
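
The `mysqlcheck --analyze` result recorded under "Current status of Qserv and Qserv builds" was checked by eye. A minimal sketch of how such output could be scanned automatically is shown below. It assumes each healthy table is reported as a single `<database>.<table> OK` line, as in the output above; the function name `suspicious_lines` and the non-OK sample used to exercise it are hypothetical and are not taken from the meeting notes.

```python
import re

# Assumption: a healthy table appears as "<database>.<table>   OK" on one line,
# matching the output pasted in the notes. Anything else is treated as suspicious.
OK_LINE = re.compile(r"^\S+\.\S+\s+OK$")


def suspicious_lines(mysqlcheck_output: str) -> list[str]:
    """Return every non-empty output line that is not a plain '<db>.<table> OK' line."""
    flagged = []
    for raw in mysqlcheck_output.splitlines():
        line = raw.strip()
        if line and not OK_LINE.match(line):
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    # Two lines copied from the notes; an empty result means nothing to investigate.
    sample = "qservReplica.QMetadata OK\nqservReplica.config_database OK\n"
    print(suspicious_lines(sample))
```

The sketch deliberately reports raw lines rather than trying to interpret them, since the exact layout of warning and error lines may vary between tool versions.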

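The USDF item mentions reimplementing the Cassandra-to-PostgreSQL translation step so that it dumps rows into CSV for a bulk loader instead of pushing them through SQLAlchemy. The following is only a rough sketch of the CSV-repackaging half of that idea, using the Python standard library. The column names, the dict-per-row input shape, and the function name `rows_to_csv` are illustrative assumptions; the actual script and the loader tool discussed in the meeting are not shown here.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize an iterable of dict rows into CSV text with a fixed column order.

    Missing or None values are written as empty fields (csv.writer's default
    for None). Whether the downstream loader treats an empty field as NULL
    depends on how that loader is configured.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row.get(col) for col in columns])
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical wide-table rows, reduced to three columns for brevity.
    rows = [
        {"id": 1, "ra": 10.5, "flag": None},
        {"id": 2, "ra": 11.25, "flag": "a,b"},
    ]
    print(rows_to_csv(rows, ["id", "ra", "flag"]), end="")
```

Writing into an in-memory buffer keeps the example self-contained; for many thousands of wide rows, streaming directly to a file or pipe would avoid holding the whole dump in memory.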
Action items
