Database Meeting 2022-04-06

Date

06 Apr 2022

Attendees

Igor Gaponenko, Fritz Mueller, Unknown User (npease), Andy Salnikov, John Gates, Colin Slater, Andy Hanushevsky, Fabrice Jammes

Goals

Add topics to the table below

Discussion items

Item	Who	Notes
Project news	Fritz Mueller	News from DLMT. News from the ongoing 3rd Data Facilities Planning Workshop (2022-04-05/07): meeting recordings and presentations are available (see the link above); the service transition from NCSA is the biggest concern; Fritz Mueller will report on moving HSC reprocessing to USDF. B50 offices: more pressure is expected later this year.
Progress on topics discussed at the last meeting Database Meeting 2022-03-30
Plans for building a new Qserv release to be deployed at IDF	Fritz Mueller	Igor Gaponenko: the latest version of the Replication/Ingest (R-I) system introduced the worker registration service, which requires follow-up changes in the `qserv-operator`. Details in DM-33376. Fabrice Jammes: worked on improving `qserv-ingest` to use the ASYNC protocol; the new version works for ingesting the `DP0.1` catalog; issues with the DNS server have been observed; no work has been done yet on integrating the Replication system's worker registry. Fabrice Jammes will work on upgrading the operator to incorporate the latest improvements made to the R-I system as a high-priority task.
Readiness for ingesting the `DP0.2` products into Qserv		Discussed improvements to the architecture of the catalog ingest system (in a broad sense, including the "inner" Replication/Ingest system and the "outer" ingest workflows): (1) we are expected to take care of translating Parquet files and Felis schema files into intermediate products to be ingested into Qserv; (2) should we consider moving the partitioning stage into the "inner" Ingest system rather than doing it at the level of the ingest workflow? (3) we need to keep thinking about the API to the "outer" ingest workflows for users (to reduce impedance for ingesting user-generated data products).
Problems with creating local table indexes at workers during catalog ingest	Fabrice Jammes	Context: exceptions are thrown by the improved version of the `qserv-ingest` tools (the Python code) when attempting to create indexes. The problem is still being investigated. The very same problem has been observed at the Qserv instance qserv-dev (IDF) when committing transactions. Investigation: occasional crashes of the replication controller are observed with exit code 137. This exit code is attributed to an out-of-memory condition. The controller runs on the utility node, which is shared with many other services. It's possible that the controller's pod was evicted by Kubernetes from the utility node. Unfortunately, GKE monitoring does not keep the full history of the nodes; only the last 30 minutes are kept. Fabrice Jammes proposed signing file contributions (with some hash function) to detect data corruption in the CSV files. Andy Hanushevsky suggested using the hardware-accelerated algorithm `CRC32C`. Igor Gaponenko proposed improving the robustness of the R-I system against infrastructure failures: (1) implement ASYNC versions of the long-running requests for ending transactions, publishing databases, and creating table-level indexes at workers; (2) investigate an option for signing contribution requests sent to the R-I workers by the ingest workflow clients. The signature could be based on a UUID (or any other identifier that is unique within the scope of the ingested catalog). This would improve the bookkeeping and reduce ambiguity in analyzing failures caused by connection losses during ingest submission.
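
Illustration of the two proposals above (checksummed file contributions and UUID-signed contribution requests). This is only a minimal sketch under stated assumptions, not the R-I system's API: the payload field names, `make_contribution_request`, `verify_contribution`, and `WorkerLedger` are all hypothetical. The `CRC32C` routine is a plain software table-driven version written out for self-containment; a production deployment would presumably use a hardware-accelerated implementation, as suggested at the meeting.

```python
import uuid

# Reflected form of the CRC-32C (Castagnoli) polynomial.
_POLY = 0x82F63B78


def _build_table():
    """Precompute the 256-entry lookup table for byte-at-a-time CRC-32C."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Software CRC-32C; pass a previous result as `crc` to checksum in pieces."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def make_contribution_request(csv_bytes: bytes, table: str, chunk: int) -> dict:
    """Client side: build a HYPOTHETICAL request payload (field names are illustrative)."""
    return {
        "request_id": str(uuid.uuid4()),  # unique within the ingested catalog
        "table": table,
        "chunk": chunk,
        "size": len(csv_bytes),
        "crc32c": f"{crc32c(csv_bytes):08x}",
    }


def verify_contribution(request: dict, received: bytes) -> bool:
    """Worker side: check that the received bytes match the declared size and checksum."""
    return (
        len(received) == request["size"]
        and f"{crc32c(received):08x}" == request["crc32c"]
    )


class WorkerLedger:
    """HYPOTHETICAL worker-side bookkeeping keyed by request ID.

    A client that lost its connection can resubmit the same request and get
    the recorded status back instead of loading the contribution twice.
    """

    def __init__(self):
        self._status = {}

    def accept(self, request: dict, received: bytes) -> str:
        rid = request["request_id"]
        if rid in self._status:
            return self._status[rid]  # duplicate submission: report the earlier outcome
        status = "LOADED" if verify_contribution(request, received) else "CORRUPT"
        self._status[rid] = status
        return status
```

In this sketch, a retry with the same `request_id` returns the recorded status rather than loading the data again, which is the kind of bookkeeping that would remove ambiguity after a connection loss during ingest submission.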

Action items
