Date

Attendees

Discussion items

Discussed | Item | Who | Notes
(tick)Project news
  • NSF visit to Tucson went well
  • A virtual visit to the Summit, June 1st
  • Virtual Face-to-Face DM meeting (June 6-on): technical sessions are now open to the public. Participation of team members may be needed for some topics.
  • A development plan for Qserv for the next year is expected to be provided to justify funding of the development effort. No big changes so far.
(tick)(unfinished) Progress on topics discussed at the previous meeting (Database Meeting 2022-05-18)

Generating overlaps when partitioning director and child tables:

  • (minus)  Igor Gaponenko was supposed to run experiments. This hasn't been done yet.

Improving error reporting in Qserv:

  • (minus) Fritz Mueller was going to check whether the code column is used by mysql-proxy. If it's not used, we should eliminate it. This hasn't been done yet.
(tick)DP02

Igor Gaponenko: on the status of processing and ingesting the remaining tables into Qserv.

This catalog represents 1/60th of the LSST DR6. The amount of data (per table) to be ingested is shown below:

Table                   | Status                | Size on disk (MySQL data directories)
------------------------+-----------------------+--------------------------------------------------
Object                  | ingested              |  3 TB
Source                  | ready to be ingested  |  8 TB (estimated)
ForcedSource            | processing input data | 29 TB (estimated based on 33% of data processed)
ForcedSourceOnDiaObject | processing input data |  3 TB (estimated based on 33% of data processed)

The amount of data in other tables doesn't exceed 1 TB.

The Parquet files of the last two tables are now 50% translated. It took 5 days to process the first 50%.

The Parquet-to-CSV translation is the main limiting factor. It's roughly 100x slower than the partitioning phase. I have traced the slowness to pandas.to_csv. The memory footprint of the pandas tables is another (huge) problem: the input data swells in memory by a factor of about 10. Narrow tables such as ForcedSourceOnDiaObject are the biggest offenders for RAM.
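
For reference, a rough sketch of the whole-file translation pattern described above (file names and CSV options are assumptions for illustration, not the actual pipeline code):

    # Reads the entire Parquet file into a pandas DataFrame before writing CSV,
    # which is where the ~10x memory blow-up and the slow to_csv call show up.
    import pandas as pd

    def translate_whole_file(parquet_path: str, csv_path: str) -> None:
        df = pd.read_parquet(parquet_path)            # whole table in memory
        df.to_csv(csv_path, index=False, header=False)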

Igor Gaponenko will need to inspect the status of the Ingest system in IDF to see if it needs to be upgraded to include the latest features. If the upgrade is needed, Fritz Mueller will build and deploy it in IDF (on qserv-int tomorrow, during the usual Thursday patch time).

Fritz Mueller mentioned the RFC that's related to the observed problem: RFC-844.

Andy Salnikov reported on his experience of using pandas and PyArrow for reading the Parquet files and generating CSV. Reading by row groups would greatly improve the performance and reduce resource (memory) utilization.
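
A minimal sketch of what row-group-wise translation with PyArrow could look like (file names and CSV options are assumptions; the real pipeline's schema handling and delimiters may differ):

    # Processes one row group at a time so that only a small slice of the
    # table is ever materialized in memory.
    import pyarrow.parquet as pq

    def translate_by_row_groups(parquet_path: str, csv_path: str) -> None:
        pf = pq.ParquetFile(parquet_path)
        with open(csv_path, "w") as out:
            for rg in range(pf.num_row_groups):
                table = pf.read_row_group(rg)
                table.to_pandas().to_csv(out, index=False, header=False)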

Andy Hanushevsky: a new version of the translation is available:

  • The error reporting has been improved
  • The schema info is now printed when running the translation with the --verbose option

Andy Salnikov: there is a Python tool for exploring the metadata and structure of Parquet files: https://pypi.org/project/parquet-tools/

Fritz Mueller: there is a related discussion at: https://stackoverflow.com/questions/36140264/inspect-parquet-from-command-line
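
For quick checks without installing extra tools, PyArrow itself can also dump Parquet metadata and schema; a small illustrative snippet (the file name is a placeholder):

    # Prints the Arrow schema, the file-level metadata (rows, row groups, size),
    # and the statistics of the first row group.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("ForcedSource.parquet")   # placeholder file name
    print(pf.schema_arrow)
    print(pf.metadata)
    print(pf.metadata.row_group(0))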

Action items for Igor Gaponenko:

  • Make a Confluence page with instructions on ingesting the DP02 tables
  • Get help from colleagues with the ingest while I'm on vacation

Fritz Mueller: on the status of the IVOA tables

  • no final TAP/Felis schema for the IVOA tables
  • only a test slice of the preliminary version of the table has been ingested so far

Fritz Mueller: disk storage in IDF

  • The catalog needs a lot more storage than we currently have in IDF.
  • It was a straightforward operation to expand the worker storage of qserv-int as needed to accommodate the full version of the DP02 catalog. Other clusters will be expanded later as needed.

(tick)Leftover from the previous meeting

Qserv in IDF fails to lock tables in memory

Context:

Action items (no assignee yet):

  • do the research to see what others are doing and how this feature is managed in Kubernetes
  • inspect Google documentation on this subject
  • talk to the Google Cloud support teams


Action items

  •