Date

Attendees

Notes from the previous meeting

Discussion items

Discussed | Item | Who | Notes
(tick)Project news
(tick)DP02 team

Igor Gaponenko: a subset (one tract) of the table ForcedSourceOnDiaObject has been loaded into qserv-int. The Parquet files contain one column, skymap, that is not present in the TAP schema.

Colin Slater: what's the final status of the Parquet files for ForcedSourceOnDiaObject? Are we ready to ingest the rest based on the preliminary tests of one tract ingested into qserv-int? What should we do about the extra column?

Action items for Igor Gaponenko:

  • proceed with translating and ingesting the full version of the table
  • eliminate the column skymap during the table partitioning phase.

A general discussion on how to configure (annotate) Ingest to eliminate unwanted columns and manage type mapping
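One possible shape for such an annotation is sketched below in Python. This is only an illustration under stated assumptions: `INGEST_ANNOTATION`, `drop_columns`, `type_map` and `apply_annotation` are hypothetical names, not existing Qserv Ingest options, and the column names (other than skymap) and type mappings are made up for the example.

```python
# Hypothetical annotation: none of these keys is an existing Qserv Ingest option.
INGEST_ANNOTATION = {
    # Columns present in the input files but unwanted in the loaded table.
    "drop_columns": ["skymap"],
    # Illustrative mapping from source types to SQL column types.
    "type_map": {"int64": "BIGINT", "double": "DOUBLE", "string": "VARCHAR(255)"},
}


def apply_annotation(columns, annotation):
    """Return (name, sql_type) pairs for the columns that survive the annotation.

    `columns` is a list of (name, source_type) pairs describing the input file.
    """
    dropped = set(annotation["drop_columns"])
    type_map = annotation["type_map"]
    result = []
    for name, source_type in columns:
        if name in dropped:
            continue  # eliminated before partitioning
        if source_type not in type_map:
            raise ValueError(
                f"no type mapping for column {name!r} of type {source_type!r}"
            )
        result.append((name, type_map[source_type]))
    return result


# Illustrative input schema (column names other than skymap are assumptions).
columns = [("diaObjectId", "int64"), ("psfFlux", "double"), ("skymap", "string")]
kept = apply_annotation(columns, INGEST_ANNOTATION)
```

Failing loudly on an unmapped type, rather than guessing a default, would surface schema mismatches before any data is partitioned.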

Fritz Mueller on RefMatch columns:

  • not much progress in debugging

Action items:

  • Igor Gaponenko: ingest truth tables into the small Qserv cluster at NCSA
  • Fritz Mueller: instrument RelationGraph with more debugging and work with Igor Gaponenko to deploy the test version of Qserv at NCSA and use it for testing.
(tick)Qserv in IDF

Context (reported by Fritz Mueller):

  • stress testing Qserv (Fritz Mueller was asked to run this test by Frossie)
  • ended up with no more than 20 parallel queries running in Qserv
  • a few mysteries in the TAP query queuing and pooling have been discovered with these tests
  • the load testing is creating some turbulence affecting the users
  • good news: Qserv (in its current version) has been good at digging itself out of the "avalanche"
  • bad news: the digging out does not scale well. Qserv slows down faster than expected (should it?). Perhaps the slowness is caused by some config parameter (number of threads, etc.). John Gates has been tasked to look at that.
  • another observation: some queries were left in a strange state (still EXECUTING) after they got explicitly canceled by Fritz Mueller. Fritz Mueller's theory is that cancelation is a heavy procedure. Asked John Gates to investigate.
  • The dashboard seems to have performance issues when displaying long lists of past queries.
  • Sometimes only the headers are shown on the Dashboard and no content
    • Igor Gaponenko reported his observations on the cross-site complaints made by Chrome browser
    • This needs to be investigated
  • more pressure to set up better monitoring of Qserv

Action items for Fritz Mueller: discuss the next steps with Frossie

(tick)Status of the ObsCore table extractor

A draft version of the technote on extracting the data has been prepared:

(tick)Temporary solutions to replace the development/testing platform being lost at NCSA (August 15th) Fritz Mueller

The development cluster for code development and interactive itest:

  • The short-term solution: Fritz Mueller will commission a big VM in the Google Cloud 

The mid-to-large-scale tests:

  • set up a separate Qserv cluster at IDF (or use qserv-dev)

 (minus)

(postponed till the next week because Fritz Mueller is presently oversubscribed)

Merging three Git packages qserv, qserv-operator and qserv-ingest into qserv

Context:

  • it was discussed last week
  • Fritz Mueller was going to work on integrating/cloning qserv-operator into qserv this week.
  • any progress?
(tick)Status of qserv-operator  and qserv-ingest 

For the new version of qserv-ingest, issues have been observed when ingesting DP02 into qserv-dev (namespace dp02):

  • There are InnoDB locking problems
    • Tried to solve this by explicitly locking the table
    • This problem is not seen in IN2P3
  • MySQL client to the ingest-db timed out
    • This problem is not seen in IN2P3 either
    • the client was keeping the connection for 7 hours and then disconnected
    • a fix to add automatic retries has resulted in another class of problems reported by the Replication Controller during transaction commit time. The Controller was reporting the error "Connection reset by peer" on queries sent to the Replication database
    • Igor Gaponenko has a theory that MySQL server may be misconfigured (the connection limit is too low, or timeouts are too low). This needs to be investigated.
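For reference, a generic retry-with-backoff wrapper could look like the sketch below. It is not the fix that was applied in qserv-ingest; `run_with_retries` and all of its parameters are hypothetical names. As the notes above indicate, retries can mask a server-side misconfiguration rather than cure it, so the retry count is bounded and the last error is re-raised.

```python
import time


def run_with_retries(operation, retriable=(ConnectionError,), attempts=5,
                     base_delay=1.0, sleep=time.sleep):
    """Call `operation()` and retry on connection errors with exponential backoff.

    Hypothetical helper, not part of qserv-ingest. ConnectionResetError
    ("Connection reset by peer") is a subclass of ConnectionError.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable only so that the wrapper can be exercised in tests without real delays.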

Igor Gaponenko has evaluated the connection limit parameters of the Replication database and found that one value is too low: connect_timeout=10, compared with connect_timeout=28800 set for the other MySQL databases of the cluster (czar and workers).

  • action item: locate where in the code this setting is made (the Qserv container's "entry-point"?), fix it, and let Fritz Mueller include the fix in the new Qserv release to be deployed in IDF tomorrow during the Thursday maintenance window.
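A comparison of this kind can be automated with a small check such as the sketch below. The function `find_outliers` and the server labels are hypothetical. The values are the ones reported above and are hard-coded here; in practice they would be collected from each server, for example with `SHOW VARIABLES LIKE 'connect_timeout'`.

```python
from collections import Counter


def find_outliers(settings_by_server, parameter):
    """Return {server: value} for servers whose value differs from the most common one."""
    values = {server: cfg[parameter] for server, cfg in settings_by_server.items()}
    most_common_value, _ = Counter(values.values()).most_common(1)[0]
    return {server: v for server, v in values.items() if v != most_common_value}


# Values as reported in these notes; the server labels are illustrative.
settings = {
    "czar": {"connect_timeout": 28800},
    "worker": {"connect_timeout": 28800},
    "replication": {"connect_timeout": 10},
}
outliers = find_outliers(settings, "connect_timeout")
```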

Fritz Mueller proposed looking at existing off-the-shelf queue managers to support the ingest workflows. The scaling issues are only going to get worse over time.


Action items