Date

Attendees

Notes from the previous meeting:

Discussion items

Discussed | Item | Who | Notes
(tick)Project news
  • the last DMLT was focused on the upcoming reviews
  • another "hot" topic is the DM all-hands meeting in Chile (preliminarily March 13-17, 2023):
    • Yusra sent around the questionnaire for those who may be interested in participating (online versus in-person, arriving the weekend before or staying the weekend after)
    • It's formally for DM "construction" (not for "operation"), which might be an issue for some of us.
(tick)NCSA to SDF (S3DF) migration

Igor Gaponenko on the status of the test Qserv instance:

  • got 6 nodes for Qserv, finalized the configuration
  • unpacked and deployed a snapshot of the Qserv instance large6 that was taken at NCSA around May 17th
  • the snapshot does not include DP02 (the catalog would need to be ingested)
  • Qserv hasn't been started yet; some work on the tooling is needed first

Fritz Mueller: there is interest in setting up the TAP service in the RSP

(tick)Status of the Qserv integration tests (team)

(technical discussion)

Igor Gaponenko on the current status:

  • a collection of tables (5 databases) lives in the Qserv source tree under itest_src
  • the purpose of many queries is not well understood (or documented); the existing documentation links point to TRAC
    • Fritz Mueller: we may still have the original TRAC pages migrated to Confluence
    • Action item: need to document each query
  • some queries are meant to test functionality that does not exist in Qserv (a sort of "wish list" for future improvements?)
    • Fritz Mueller: those might be based on the initial survey of what functionality was expected from Qserv
  • some may exist for testing the home-grown SQL parser (before the migration to ANTLR4)
    • Fritz Mueller: some might have been added as bugs were discovered in the parser, or to test bugs in the query rewriter
  • about 50% of the test queries are presently disabled (marked as FIXME, etc.)
  • some of those (disabled tests) are needed to cover the current functionality of Qserv
  • some might have been disabled when migrating the tests to the new lite container, or because the required functionality wasn't present in the Replication/Ingest system at the time of the migration
  • there is quite a bit of duplication between the tests (and catalogs)
  • some data (CSV) files are compressed while others aren't; it's not clear why, or what that is meant to test
  • only 3 (out of 5) catalogs are presently tested

Conclusions:

  • it's a bit of a mess in there
  • Qserv test coverage is incomplete in some areas and excessive in others
  • the BIGGEST problem (for me) was the use of very specific table names that imply certain semantics in the LSST context ("Object", "Source", etc.). Although the initial motivation behind that decision is clear, this naming convention presents a big obstacle to understanding what is actually being tested in Qserv. The semantics of some LSST tables have changed since the original Data Model.

The proposal to be discussed:

  • revisit the test cases
  • eliminate duplicates and obsolete tests
  • add in the missing tests
  • come up with the Qserv-specific naming convention for the tables to reflect their role within Qserv ("director", "child", "ref-match", "fully-replicated", etc.)
  • refine the table schemas (and data) to leave only the essential columns (those required by Qserv and by the referential integrity of the schemas), and add a few "payload" columns where tests require row selection based on their values (shared scans, testing the WHERE clause, etc.)

Fritz Mueller:

  • some tables should carry the semantics (time series queries, etc.), so we do need a way to keep the semantics for at least some of them
  • we could document those within the source tree using RST
  • we need to do a systematic revisit of the tests to see what's missing

Fritz Mueller: add a micro dataset to be automatically deployed with Qserv before running any integration tests. This dataset could be used for basic (interactive?) testing of Qserv after it gets deployed.

Igor Gaponenko proposed implementing a synthetic dataset generator (driven by YAML) as an alternative to the present collection of static test catalogs. The dataset would be generated by a Python script at the run time of the integration test, or it could be pre-generated if needed. This option allows generating catalogs of any scale (number of databases, tables, columns per table, and rows). Also:

  • the test queries could be generated by the script accordingly
  • this technique could also be used for the small-to-mid term scalability & performance testing of Qserv (a minimal sketch of the generator idea follows below)
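
A minimal sketch of what such a YAML-driven generator might look like, assuming a hypothetical spec layout (the database/table/column keys, value ranges, and file naming below are illustrative, not an agreed design):

    import csv
    import random

    import yaml  # PyYAML

    # Hypothetical spec: one database with a single director-like table.
    SPEC = """\
    database: test101
    tables:
      - name: Director
        rows: 1000
        columns:
          - {name: id,      type: int}
          - {name: ra,      type: double, min: 0.0, max: 360.0}
          - {name: decl,    type: double, min: -90.0, max: 90.0}
          - {name: payload, type: double, min: 0.0, max: 1.0}
    """

    def make_value(column, row_id):
        """Produce one cell; integer columns double as row identifiers."""
        if column["type"] == "int":
            return row_id
        return random.uniform(column.get("min", 0.0), column.get("max", 1.0))

    def generate(spec_text):
        """Write one CSV file per table described in the YAML spec."""
        spec = yaml.safe_load(spec_text)
        for table in spec["tables"]:
            path = f"{spec['database']}_{table['name']}.csv"
            with open(path, "w", newline="") as out:
                writer = csv.writer(out)
                for row_id in range(table["rows"]):
                    writer.writerow([make_value(c, row_id) for c in table["columns"]])
            print(f"wrote {table['rows']} rows to {path}")

    if __name__ == "__main__":
        generate(SPEC)

A real generator would also have to emit the matching ingest configuration and test queries; the sketch covers only the CSV data, and scaling it up is a matter of adding databases, tables, columns, and rows to the spec.
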
(tick)Status of qserv-operator 

Intermittent problems when loading DP02 into qserv-dev have been observed at IDF. The first class of problems was found to be caused by Google's NAT service configuration for the outbound connections. That was causing failures when pulling contributions from IN2P3 into IDF.

The second class of problems might be caused by worker pods being restarted during the ingest. The restarts resulted in changes to the IP addresses of the restarted pods, which confused the ingest workflow that was caching the IP addresses of the workers at the beginning of each transaction.

  • Fritz Mueller proposed extending the worker registration protocol (model) of the Replication/Ingest system so that the DNS entries of the workers are captured by the worker ingest services themselves and reported to the worker registry. The ingest workflow would then be given an option to use either the IP addresses or the DNS entries. Igor Gaponenko has registered the following JIRA ticket addressing the issue:
  • There is still a transient problem during worker pod restarts: the DNS entries of the workers would temporarily disappear from the Kubernetes DNS service. To deal with this, Fabrice Jammes would need to reinforce the implementation of the workflow to resolve the DNS entries (or IP addresses) of the workers before submitting the contributions. Should any problems be seen at this stage, the workflow could take proper action (wait for the DNS entry to show up again, or request a new set of IP addresses from the Replication Controller); a rough sketch of this resolve-and-retry step follows after this list.
  • Igor Gaponenko: an alternative approach would be to extend the Replication Controller to allow sending all ASYNC requests to the Controller and let the Controller take care of distributing these requests among the relevant workers.
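
A rough sketch of the resolve-and-retry step discussed above, assuming a hypothetical registry endpoint and response shape (the actual Replication/Ingest REST API and field names may differ):

    import socket
    import time

    import requests

    # Hypothetical registry endpoint; the real Replication Controller API differs.
    REGISTRY_URL = "http://repl-controller:8080/workers"

    def resolve_worker(worker_name, retries=5, delay=10.0):
        """Look up the worker's address in the registry and confirm it resolves.

        If the worker pod has just been restarted, its DNS entry may be briefly
        missing, so wait and retry instead of failing the contribution outright.
        """
        for _ in range(retries):
            info = requests.get(f"{REGISTRY_URL}/{worker_name}", timeout=30).json()
            host = info.get("host-dns") or info.get("host-ip")  # assumed field names
            if host:
                try:
                    return socket.gethostbyname(host)  # confirms the entry exists
                except socket.gaierror:
                    pass
            time.sleep(delay)  # wait for the DNS entry to reappear
        raise RuntimeError(f"worker {worker_name} did not become resolvable")

    # Usage: resolve right before submitting each contribution instead of caching
    # worker addresses once at the beginning of a transaction, e.g.
    #   addr = resolve_worker("qserv-worker-3")
    #   submit_contribution(addr, chunk_file)  # hypothetical submission call
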

Fabrice Jammes is not seeing these problems in the IN2P3 (k8s-based) Qserv deployments, where the workers run on beefy hardware. He is going to bump (by a factor of 2) the amount of memory available to the workers in IDF (qserv-dev) to see if that helps work around the issue (prevent the restarts).

In the meantime, Igor Gaponenko will be working on an improved version of the worker ingest services to eliminate the root cause of the problem (memory accumulation by the services).

Also discussed: a plan to improve the logger configuration for the services in qserv-operator-based deployments. Fritz Mueller has made the following JIRA ticket in this context:

(tick)Query cancellation


Context:

Fritz Mueller reported the updates:

  • discovered that there is a Lua hook that allows intercepting these events at the lua-proxy level
  • the origin of the 8-hour timeout is still not known; setting timeouts at the level of MariaDB and the proxy didn't help
  • theories: a kernel TCP keep-alive timeout, or something else (see the snippet after this list)
  • will continue investigating
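
For reference, the kernel keep-alive settings mentioned in the theories can be inspected directly; the sysctl names below are standard Linux knobs, and the snippet is illustrative only (not part of the Qserv tooling):

    from pathlib import Path

    # Standard Linux sysctl knobs controlling TCP keep-alive behaviour.
    for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
        value = Path(f"/proc/sys/net/ipv4/{name}").read_text().strip()
        print(f"net.ipv4.{name} = {value}")
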

Action items