Database meeting 2022-03-16

Date

16 Mar 2022

Attendees

Igor Gaponenko, Fritz Mueller, Andy Hanushevsky, John Gates, Unknown User (npease), Fabrice Jammes, Andy Salnikov, Colin Slater

Goals

Please register topics below

Discussion items

Item	Who	Notes
Project news	Fritz Mueller	The three-day data facilities workshop starts on April 5th: 3rd Data Facilities Planning Workshop - 2022-04-05/07. SLAC welcomes everyone back (details in the e-mail sent yesterday by Brian Sheril). Office spaces.
Progress from the previous meeting Database meeting 2022-03-09		Grand Unified Repo: the PR for merging `qserv_testdata` (DM-33618) is still under review. Database init: TBC. Subchunk sizes and overlap: the problem is pertinent to the high-density catalogs; a goal is to decrease the number of rows in each subchunk to improve the performance of cross-joins in the N-N queries; do the query profiling first on the existing catalog (Fritz Mueller). Igor Gaponenko has proposed to use an existing Data Exportation service of the Replication/Ingest system to get a subset of chunks from the existing kpm50 catalog where the problem is seen and reingest these data into the same. Fritz Mueller will further investigate it and schedule it based on existing priorities.
Optimizations in processing results of the N-N queries	John Gates, team	The context: DM-33346
Update on worker load imbalance problem	Fritz Mueller	The context was set in the previous meeting (see the link Database meeting 2022-03-09): the problem seems to be XRootD version dependent (Andy Hanushevsky's help is needed here). Andy Hanushevsky still needs to see the logs from the redirector and from one of the workers to see what's going on. John Gates would do this. Andy Hanushevsky: affinity works fine before an overload happens. After that, XRootD begins shifting chunk requests to further workers. This explains the linear behavior. Resolution: Andy Hanushevsky and John Gates will work on the further investigation based on the log files.
IDF worker crash this morning	Fritz Mueller	Context: This is the second time we're seeing this in IDF. The previous time was reported in Sep/Oct 2021. These are links to the Slack channel: https://lsstc.slack.com/archives/C8EEUGDSA/p1633024306132700 https://lsstc.slack.com/archives/C8EEUGDSA/p1634233691204600 The problem was caused by the rolling update of GKE. Allegedly, it's caused by stale IP addresses cached by XRootD services. It's due to an optimization in the service that won't re-resolve the address at each request. How do we investigate this? Andy Hanushevsky to inspect the log files to see which service has the wrong address. Andy Hanushevsky's theory is that we may have some "rogue" service in Qserv using the wrong IP address. Possible short-term solutions: coordinate GKE upgrades with complete restarts of Qserv. There is a (potentially?) related issue exhibiting itself in the worker logs as follows: `lsst.qserv.wdb.ChunkResource WARN: memLockStatus unexpected results, assuming LOCKED_OTHER. err=Error 0: Expecting one row, found no rows` `lsst.qserv.wdb.ChunkResource WARN: Memory tables were not released cleanly! LockStatus=1` Further investigation shows that these harmless messages are posted by: `wdb/SQLBackend`
Refactoring `qserv-ingest`	Fabrice Jammes	The work on modifying the workflow to begin using the ASYNC ingest service is still in progress. The SYNC mode worked successfully for ingesting a 50 TB catalog.
Refactoring `qserv-operator`	Fabrice Jammes	Context: The new version of the Replication/Ingest system adds the required worker registry service that needs to be properly configured and used by the dependent services of the system. Igor Gaponenko published instructions for that at "Configuring worker registry in the Replication system of Qserv". Fabrice Jammes is still working on the Operator to integrate the change.
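
The two worker-log warnings quoted in the "IDF worker crash" item could be tallied with a small script when going through the log files. The following is only a minimal sketch: the function name `count_chunk_resource_warnings` and the dictionary keys are hypothetical and are not part of Qserv; only the message fragments are taken from the log lines quoted in the notes.

```python
import re

# Message fragments copied from the worker log lines quoted in the notes.
# The dictionary keys are illustrative labels, not Qserv identifiers.
PATTERNS = {
    "memLockStatus": re.compile(
        r"lsst\.qserv\.wdb\.ChunkResource WARN: memLockStatus unexpected results"
    ),
    "not_released": re.compile(
        r"lsst\.qserv\.wdb\.ChunkResource WARN: Memory tables were not released cleanly"
    ),
}


def count_chunk_resource_warnings(lines):
    """Count how many lines contain each of the two ChunkResource warnings."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The function accepts any iterable of strings, so an open log file can be passed directly, for example `count_chunk_resource_warnings(open(path))` where `path` points at a worker log.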

Action items
