Date

Attendees

Discussion items

Discussed | Item | Who | Notes
(tick)Project news
  • NSF visit to Tucson went well
  • A virtual visit to the Summit, June 1st
  • Virtual Face-to-Face DM meeting (June 6-on): technical sessions are now open to the public. Participation of team members may be needed for some topics.
  • A development plan for Qserv for the next year is expected to be provided to justify funding of the development effort. No big changes so far.
(tick)(unfinished) Progress on topics discussed at the previous meeting (Database Meeting 2022-05-18)

Generating overlaps when partitioning director and child tables:

  • (minus)  Igor Gaponenko was supposed to run experiments. This hasn't been done yet.

Improving error reporting in Qserv:

  • (minus) Fritz Mueller was going to check whether the code column is used by mysql-proxy. If it's not used, we should eliminate it. This hasn't been done yet.
(tick)DP02

Igor Gaponenko: on the status of processing and ingesting the remaining tables into Qserv.

This catalog represents 1/60th of the LSST DR6. The amount of data (per table) to be ingested is shown below:

Table                   | Status                | Size on disk (MySQL data directories)
------------------------+-----------------------+--------------------------------------------------
Object                  | ingested              |  3 TB
Source                  | ready to be ingested  |  8 TB (estimated)
ForcedSource            | processing input data | 29 TB (estimated based on 33% of data processed)
ForcedSourceOnDiaObject | processing input data |  3 TB (estimated based on 33% of data processed)

The amount of data in other tables doesn't exceed 1 TB.

The Parquet files of the last two tables are now 50% translated. It took 5 days to process the first 50%.

The Parquet-to-CSV translation is the main limiting factor. It's roughly 100x slower than the partitioning phase. I have traced the slowness to pandas.to_csv. The memory footprint of the pandas tables is another (huge) problem: the input data swells in memory by a factor of about 10. Narrow tables such as ForcedSourceOnDiaObject are the biggest offenders for RAM.
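
For reference, a rough sketch of the whole-file translation pattern described above (file names and CSV options are assumptions for illustration, not the actual pipeline code):

    # Reads the entire Parquet file into a pandas DataFrame before writing CSV,
    # which is where the ~10x memory blow-up and the slow to_csv call show up.
    import pandas as pd

    def translate_whole_file(parquet_path: str, csv_path: str) -> None:
        df = pd.read_parquet(parquet_path)            # whole table in memory
        df.to_csv(csv_path, index=False, header=False)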

Igor Gaponenko will need to inspect the status of the Ingest system in IDF to see if it needs to be upgraded to include the latest features. If the upgrade is needed, Fritz Mueller will build and deploy it in IDF (on qserv-int tomorrow, during the usual Thursday patch time).

Fritz Mueller mentioned the RFC that's related to the observed problem: RFC-844.

Andy Salnikov reported on his experience of using pandas and PyArrow for reading the Parquet files and generating CSV. Reading by row groups would greatly improve the performance and reduce resource (memory) utilization.
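
A minimal sketch of what row-group-wise translation with PyArrow could look like (file names and CSV options are assumptions; the real pipeline's schema handling and delimiters may differ):

    # Processes one row group at a time so that only a small slice of the
    # table is ever materialized in memory.
    import pyarrow.parquet as pq

    def translate_by_row_groups(parquet_path: str, csv_path: str) -> None:
        pf = pq.ParquetFile(parquet_path)
        with open(csv_path, "w") as out:
            for rg in range(pf.num_row_groups):
                table = pf.read_row_group(rg)
                table.to_pandas().to_csv(out, index=False, header=False)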

Andy Hanushevsky: a new version of the translation is available:

  • The error reporting has been improved
  • The schema info is now printed when running the translation with the --verbose option

Andy Salnikov: there is a Python tool for exploring the metadata and structure of Parquet files: https://pypi.org/project/parquet-tools/

Fritz Mueller: there is a related discussion at: https://stackoverflow.com/questions/36140264/inspect-parquet-from-command-line
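
For quick checks without installing extra tools, PyArrow itself can also dump Parquet metadata and schema; a small illustrative snippet (the file name is a placeholder):

    # Prints the Arrow schema, the file-level metadata (rows, row groups, size),
    # and the statistics of the first row group.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("ForcedSource.parquet")   # placeholder file name
    print(pf.schema_arrow)
    print(pf.metadata)
    print(pf.metadata.row_group(0))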

Action items for Igor Gaponenko:

  • Make a Confluence page with instructions on ingesting the DP02 tables
  • Get help from colleagues with the ingest while I'm on vacation

Fritz Mueller: on the status of the IVOA tables

  • no final TAP/Felis schema for the IVOA tables
  • only a test slice of the preliminary version of the table has been ingested so far

Fritz Mueller: disk storage in IDF

  • The catalog needs a lot more storage than we currently have in IDF.
  • It was a straightforward operation to expand the worker storage of qserv-int as needed to accommodate the full version of the DP02 catalog. Other clusters will be expanded later as needed.

(tick)Leftover from the previous meeting

Qserv in IDF fails to lock tables in memory

Context:

Action items (no assignee yet):

  • do the research to see what others are doing and how this feature is managed in Kubernetes
  • inspect Google documentation on this subject
  • talk to the Google Cloud support teams


Action items

  •