
Date

Attendees

Goals

  • Discuss Fritz Mueller's proposal for configuring Qserv containers.  

Discussion items

Time | Item | Who | Notes

A proposal for configuring Qserv containers

Ongoing tickets that are relevant in this context:


Configuring and adding workers to Qserv cluster

Fabrice Jammes raised the topic of updating the configuration of the Replication/Ingest system at run time. This is needed for two reasons:

  • for registering workers at the startup time of Qserv
  • for scaling up an existing cluster

Igor Gaponenko reported that there is an ongoing effort to improve the situation here. The first step is to make the worker services (specifically cmsd and the Replication system's worker) self-configuring, so that they learn their identities from the unique (UUID-generated) dataset identifiers stored in the corresponding Qserv worker databases. For further details and the current status of this development see:

The second step will be to change the Replication system's communication network so that workers can log into a (yet to be implemented) redirector service. This will reverse the dependencies within the system and eliminate the need for explicit configuration of the workers. A preliminary plan for this development was discussed between Igor Gaponenko, Fritz Mueller and Andy Salnikov before the Winter break. This project is still at an early stage. The actual work on it will start after Fabrice Jammes finishes migrating Qserv to the lite containers and their entry points.


Schema initialization and migration

The topic was only briefly mentioned in the context of the Qserv configuration discussion, since the two overlap. It was decided to postpone the discussion until the next meeting.

Status report on testing DM-31537

Lockups are seen in the latest version of the branch when testing mixed query loads on the large Qserv cluster at NCSA. Two types of queries are launched simultaneously in this round of tests:

  • one or two unconditional queries like SELECT * FROM database.table LIMIT 1 
  • 100 or 200 near-neighbor queries, each covering from 1 to 7 chunks

The lockup occurs shortly (within a few minutes) after the queries are launched. The problem is reproducible.
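For anyone wanting to reproduce the shape of this mixed load outside the NCSA setup, a minimal sketch follows. It is not the harness used in these tests. The function names, the `execute` callable (whatever client call submits SQL to Qserv), and the near-neighbor placeholder text are all assumptions made for illustration; the actual near-neighbor SQL is not recorded in these notes.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def build_workload(n_unconditional=2, n_near_neighbor=100, seed=0):
    """Return a list of (kind, sql) pairs describing the mixed load."""
    rng = random.Random(seed)
    workload = [
        ("unconditional", "SELECT * FROM database.table LIMIT 1")
        for _ in range(n_unconditional)
    ]
    for i in range(n_near_neighbor):
        n_chunks = rng.randint(1, 7)  # each query covers 1 to 7 chunks
        # Placeholder only: substitute the real near-neighbor SQL here.
        sql = f"-- near-neighbor query #{i} covering {n_chunks} chunk(s)"
        workload.append(("near_neighbor", sql))
    return workload


def run_mixed_load(workload, execute, max_workers=None):
    """Submit every query concurrently and collect (kind, outcome) pairs.

    `execute` is a caller-supplied function taking one SQL string.
    An outcome is either the return value of `execute` or the
    exception it raised.
    """
    workers = max_workers or max(1, len(workload))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [(kind, pool.submit(execute, sql)) for kind, sql in workload]
        outcomes = []
        for kind, future in futures:
            try:
                outcomes.append((kind, future.result()))
            except Exception as exc:
                outcomes.append((kind, exc))
        return outcomes
```

Note that this sketch waits indefinitely on each query, so a real lockup would hang it as well; a harness meant to detect the lockup would need its own timeout or watchdog.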

Details were posted in the last comment on the ticket DM-31537.

The direct link to the most relevant comment: https://jira.lsstcorp.org/browse/DM-31537?focusedCommentId=443619&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-443619

Action items

  • John Gates will work with Igor Gaponenko (if needed) to investigate the lockups.
  • Fritz Mueller will lead a discussion on initializing and upgrading Qserv schemas at the next meeting. This will be preceded by a discussion among interested members of the group in the team's Slack channel.
  • Igor Gaponenko will look into migrating the configuration of the Replication/Ingest system from database tables to a more conventional technique.
  • Fabrice Jammes will finish migrating the operator-based Qserv deployment tools to the lite containers and the new configuration model.
  • Unknown User (npease) will finish improving the parameter handling in the entry points as per Fritz Mueller's proposal.