another "hot" topic is the DM all-hands meeting in Chile (preliminarily March 13-17, 2023):
Yusra sent around the questionnaire for those who may be interested in participating (online versus in-person, arriving the weekend before or staying the weekend after)
It's formally for DM "construction" (not for "operation") which might be an issue for some of us.
a collection of tables (5 databases) lives in the Qserv source tree, under itest_src
the purpose of many queries is not well understood (or documented), with documentation links pointing to TRAC
Fritz Mueller: we may still have the original TRAC pages, migrated to Confluence
Action item: need to document each query
some queries are meant to test functionality that doesn't yet exist in Qserv (a sort of "wish list" for future improvements?)
Fritz Mueller: those might be based on the initial survey of what functionality was expected from Qserv
some may exist for testing the home-grown SQL parser (before the migration to ANTLR4)
Fritz Mueller: some might have been added as bugs were discovered in the parser, or as tests for bugs in the query rewriter
about 50% of the test queries are presently disabled (marked as FIXME, etc.)
some of those (disabled tests) are needed to cover the current functionality of Qserv
some might be disabled when migrating the tests to the new lite container or because the required functionality wasn't present in the Replication/Ingest system at the time of the migration
there is quite a bit of duplication between the tests (and catalogs)
some data (CSV) files are compressed, while others aren't; it's not clear why, or what this is meant to test
only 3 (out of 5) catalogs are presently tested
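As an aside, the mixed compressed/uncompressed data files need not complicate a loader: detecting the gzip magic number lets one code path read both. A minimal Python sketch (the helper name is illustrative, not an existing Qserv function):

```python
import gzip


def open_csv(path: str):
    """Open a test-data CSV for reading, whether or not it is gzip-compressed.

    Detection is by the gzip magic number (0x1f 0x8b), not the file
    extension, so mixed .csv / .csv.gz inputs go through one code path.
    """
    with open(path, "rb") as probe:
        compressed = probe.read(2) == b"\x1f\x8b"
    if compressed:
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")
```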
Conclusions:
it's a bit of a mess in there
Qserv coverage is incomplete in some areas (and excessive in others)
the BIGGEST problem (for me) was the use of very specific table names that imply certain semantics in the context of LSST ("Object", "Source", etc.). Although the initial motivation behind that decision is clear, this naming convention presents a big obstacle to understanding what is actually being tested in Qserv. The semantics of some LSST tables have changed since the original Data Model.
The proposal to be discussed:
revisit the test cases
eliminate duplicates and obsolete tests
add in the missing tests
come up with the Qserv-specific naming convention for the tables to reflect their role within Qserv ("director", "child", "ref-match", "fully-replicated", etc.)
refine the table schemas (and data) to keep only the essential columns (those required by Qserv and for the referential integrity of the schemas), plus a few "payload" columns where tests need row selection based on column values (shared scans, testing the WHERE clause, etc.)
some tables should carry semantics (time-series queries, etc.), so we do need a way to keep the semantics, at least for some of them
we could document those within the source tree using RST
we need to do a systematic revisit of the tests to see what's missing
Fritz Mueller: add a micro dataset to be automatically deployed with Qserv before running any integration tests. This dataset could be used for basic (interactive?) testing of Qserv after it gets deployed.
Igor Gaponenko proposed implementing a synthetic dataset generator (driven by YAML) as an alternative to the present collection of static test catalogs. The dataset would be generated by a Python script at the run time of the integration test, or it could be pre-generated if needed. This option allows generating catalogs of any scale (number of databases, tables, columns per table, and rows). Also:
the test queries could be generated by the script accordingly
this technique could also be used for small-to-mid-scale scalability & performance testing of Qserv
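To make the proposal concrete, here is a minimal sketch of what the generator's core could look like. The spec layout (shown already parsed, as yaml.safe_load() would return it from a YAML file) and all key names are assumptions, not an existing format:

```python
import random

# Hypothetical spec, as it might look after yaml.safe_load() on a YAML file;
# the key names here are illustrative, not an existing Qserv format.
SPEC = {
    "database": "test_gen",
    "tables": [
        {"name": "director", "rows": 4,
         "columns": [{"name": "objectId", "type": "int"},
                     {"name": "ra", "type": "double"},
                     {"name": "dec", "type": "double"}]},
    ],
}


def generate(spec: dict, seed: int = 0) -> dict:
    """Return one CSV payload (header line + data rows) per table in the spec."""
    rng = random.Random(seed)  # seeded, so generated catalogs are reproducible
    out = {}
    for table in spec["tables"]:
        cols = table["columns"]
        lines = [",".join(c["name"] for c in cols)]
        for i in range(table["rows"]):
            lines.append(",".join(
                str(i) if c["type"] == "int" else f"{rng.uniform(0.0, 360.0):.6f}"
                for c in cols))
        out[table["name"]] = "\n".join(lines) + "\n"
    return out
```

A real driver would also emit the matching Qserv ingest configuration and the generated test queries; scaling up is then just a matter of changing the row and table counts in the spec.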
Intermittent problems when loading DP02 into qserv-dev have been observed at IDF. The first class of problems was found to be caused by Google's NAT service configuration for outbound connections, which was causing failures when pulling contributions from IN2P3 into IDF.
The second class of problems might be caused by worker pods restarting during the ingest. The restarts change the IP addresses of the affected pods, which confuses the ingest workflow, since it caches the workers' IP addresses at the beginning of each transaction.
Fritz Mueller has proposed to extend the worker registration protocol (model) of the Replication/Ingest system so that the worker ingest services capture their own DNS entries and report them to the worker registry. The ingest workflow would then be given an option to use either the IP addresses or the DNS entries. Igor Gaponenko has registered the following JIRA ticket addressing the issue:
Still, there is a transient problem during worker pod restarts: the DNS entries of the workers disappear from the Kubernetes DNS service. To deal with this, Fabrice Jammes would need to reinforce the workflow implementation to resolve the DNS entries (or IP addresses) of the workers before submitting the contributions. Should any problems be seen at this stage, the workflow could take corrective action (wait for the DNS entry to reappear, or request a new set of IP addresses from the Replication Controller).
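The resolve-with-retry step could be sketched as follows (the function name, attempt count, and delay are illustrative assumptions, not actual workflow settings):

```python
import socket
import time


def resolve_worker(host: str, attempts: int = 5, delay: float = 2.0) -> str:
    """Resolve a worker's DNS name to an IPv4 address, retrying while absent.

    During a pod restart the Kubernetes DNS entry can vanish briefly;
    rather than failing the contribution, wait and retry a few times.
    """
    last_err = None
    for _ in range(attempts):
        try:
            # getaddrinfo entries are (family, type, proto, canonname, sockaddr)
            info = socket.getaddrinfo(host, None, socket.AF_INET)
            return info[0][4][0]  # IP address of the first IPv4 entry
        except socket.gaierror as err:
            last_err = err
            time.sleep(delay)
    raise RuntimeError(f"worker {host!r} did not resolve") from last_err
```

On persistent failure the workflow would fall back to requesting a fresh set of worker addresses from the Replication Controller, as described above.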
Igor Gaponenko: an alternative approach would be to extend the Replication Controller to allow sending all ASYNC requests to the Controller and let the Controller take care of distributing these requests among the relevant workers.
Fabrice Jammes is not seeing these problems in the IN2P3 (k8s-based) Qserv deployments, where the workers run on beefy hardware. He is going to double the amount of memory available to the workers in IDF (qserv-dev) to see if that helps work around the issue (prevent the restarts).
In the meantime, Igor Gaponenko will be working on an improved version of the worker ingest services to eliminate the root cause of the problem (memory accumulation by the services).