Time | Item | Who | Notes

Project news (Fritz Mueller)

News from DLMT

News from the ongoing 3rd Data Facilities Planning Workshop - 2022-04-05/07:

  • meeting recordings and presentations are available (see the link above)
  • service transition from NCSA is the biggest concern
  • Fritz Mueller will report on moving HSC reprocessing to USDF

B50 offices: more pressure is expected later this year


Progress on topics discussed at the last meeting (Database Meeting 2022-03-30)


Plans for building a new Qserv release to be deployed at IDF

Igor Gaponenko: the latest version of the Replication/Ingest system introduced the worker registration service, which requires follow-up changes in the qserv-operator. Details in DM-33376.

Fabrice Jammes:

  • worked on improving qserv-ingest to use the ASYNC protocol
  • the new version works for ingesting DP0.1 catalog
  • issues with the DNS server have been observed
  • no work on integrating the Replication system's worker registry has been done yet

Fabrice Jammes will work on upgrading the operator to incorporate the latest improvements made to the Replication/Ingest (R-I) system as a high-priority task.


Readiness for ingesting the DP0.2 products into Qserv

Discussed improvements to the architecture of the catalog ingest system (in a broad sense, including the "inner" Replication/Ingest system and the "outer" ingest workflows):

  • we are expected to take care of translating Parquet files and Felis schema files into intermediate products to be ingested into Qserv
  • should we consider moving the partitioning stage into the "inner" Ingest system rather than doing this at the level of the ingest workflow?
  • we need to keep thinking about the API to the "outer" ingest workflows for users (to reduce friction when ingesting user-generated data products)

Problems with creating local table indexes at workers during catalog ingest

Context:

  • exceptions are thrown by the improved version of the qserv-ingest tools (the Python code) when attempting to create indexes. The problem is still being investigated.
  • the very same problem has been observed on the Qserv instance qserv-dev (IDF) when committing transactions.

Investigation:

  • Occasional crashes of the replication controller are observed with exit code 137 (128 + 9, i.e. the process was killed with SIGKILL). The code is attributed to an out-of-memory condition.
  • The controller uses the utility node that's shared with many other services.
  • It's possible that the controller's pod got evicted by Kubernetes from the utility node.
  • Unfortunately, GKE monitoring does not keep the full history of the nodes: only the last 30 minutes are retained.
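
As an aside on reading the exit code above: by the usual shell/container convention, a status above 128 means the process was terminated by signal (status - 128). The helper below is a minimal sketch of that decoding; the function name is hypothetical and not part of any Qserv tool. Note that SIGKILL alone does not prove an out-of-memory kill or an eviction; it only shows the process was killed from outside.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a process exit status using the 128 + signal-number convention.

    Hypothetical helper for illustration only.
    """
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```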

Fabrice Jammes proposed signing file contributions (with some hash function) to detect data corruption in the CSV files. Andy Hanushevsky suggested using CRC32C, an algorithm that benefits from hardware acceleration.
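
To make the checksum proposal concrete, here is a minimal sketch of computing CRC32C over a CSV contribution. It is a pure-Python, table-driven reference implementation of the Castagnoli polynomial, so it is not hardware-accelerated; a production workflow would presumably rely on an accelerated library instead. The function names and the chunked file reader are assumptions for illustration, not part of qserv-ingest or the R-I system.

```python
def _make_crc32c_table():
    """Build the 256-entry lookup table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32C of data; pass a previous result as crc to continue a stream."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def file_crc32c(path: str, chunk_size: int = 1 << 20) -> int:
    """Checksum a file in chunks so that large CSV files need not fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = crc32c(chunk, crc)
    return crc
```

In such a scheme the client would send the checksum along with the contribution, and the receiving side would recompute it over the bytes it actually read and reject the contribution on a mismatch.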

Igor Gaponenko proposed improving the robustness of the R-I system against infrastructure failures:

  • implement ASYNC versions of the long-running requests for ending transactions, publishing databases, and creating table-level indexes at workers
  • investigate an option for signing contribution requests sent to the R-I workers by the ingest workflow clients. The signature could be based on a UUID (or any other identifier that is unique within the scope of the ingested catalog). This would improve the bookkeeping and reduce ambiguity in analyzing failures caused by connection losses during ingest submission.
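
The two proposals above can be sketched together: a client-generated UUID identifies each contribution request, submission returns immediately (ASYNC style), and a retry after a lost connection is recognized as a duplicate rather than ingested twice. Everything below (class names, states, the in-memory registry) is a hypothetical illustration of the idea and not the actual R-I API.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IN_PROGRESS = "IN_PROGRESS"
    FINISHED = "FINISHED"


@dataclass
class Contribution:
    request_id: str  # client-generated unique identifier
    url: str         # location of the file to ingest
    state: State = State.IN_PROGRESS


class WorkerRegistry:
    """Hypothetical in-memory stand-in for a worker's contribution bookkeeping."""

    def __init__(self):
        self._by_id = {}

    def submit(self, request_id: str, url: str):
        """Register a request and return (contribution, created).

        A resubmission with the same request_id returns the existing record
        with created=False, so a client retry cannot duplicate the data.
        """
        existing = self._by_id.get(request_id)
        if existing is not None:
            return existing, False
        contrib = Contribution(request_id, url)
        self._by_id[request_id] = contrib
        return contrib, True

    def finish(self, request_id: str) -> None:
        """Mark a request as completed (done by the worker's processing loop)."""
        self._by_id[request_id].state = State.FINISHED

    def status(self, request_id: str) -> State:
        """Let the client poll for completion instead of holding a connection."""
        return self._by_id[request_id].state


def new_request_id() -> str:
    """Generate a random UUID to identify one contribution request."""
    return str(uuid.uuid4())
```

A client would call `new_request_id()` once per contribution, reuse the same ID on every retry, and poll `status()` until the request is reported as finished. After a connection loss during submission, the ID makes it unambiguous whether the worker ever registered the request.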


Action items

  •