1. Introduction
This document explains the basic principles, tools, and procedures for managing existing Qserv deployments at SLAC. The deployments are based on binary Docker containers without relying on any sophisticated container orchestration technology such as Kubernetes or Docker Compose. These are the main features of this design:
- Each Qserv instance has a dedicated master node and a set of corresponding worker nodes.
- Each service runs in a dedicated container.
- Configuration files of each service are placed on a host filesystem of a node where the service is run.
- Log and core files are put onto the host filesystems of the corresponding nodes.
- Containers are managed by shell scripts launched from a special folder at the master node of a setup.
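Putting these conventions together, the management area of a single instance looks roughly like this (a sketch reconstructed from the paths and script names used later in this document; the actual contents may differ):

```
/zfspool/management/qserv-slac/instance/<instance-name>/
├── config/
│   ├── master           # hostname of the master node
│   ├── workers          # hostnames of the worker nodes
│   ├── env              # deployment-specific environment variables
│   └── secrets/         # security-sensitive files (not kept in Git)
├── start, stop, restart, status, pull
├── log_ls, log_rm, log_search
└── config_logger, config_service
```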
1.1. Authorization requirements
In order to perform most management operations mentioned in this document, the following privileges are required on the corresponding nodes:
- the docker privileges (be a member of the docker UNIX group)
- membership in the POSIX group rubin_users

Note that all data of the Qserv instances are owned by the special "service account" rubinqsv. Docker containers are run under that user, and proper ACLs are set on all data owned by the user for members of the POSIX group rubin_users.
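Both prerequisites can be checked before attempting any operations; the following is a hypothetical sketch (not part of the official tooling; the group names are the ones listed above):

```shell
# Pre-flight check: verify membership in the UNIX groups required for
# managing Qserv instances (docker and rubin_users).
check_groups() {
  local grp missing=0
  for grp in docker rubin_users; do
    if id -nG | tr ' ' '\n' | grep -qx "$grp"; then
      echo "OK: member of $grp"
    else
      echo "MISSING: membership in $grp"
      missing=1
    fi
  done
  return "$missing"
}
check_groups || echo "Fix group membership before running the management scripts."
```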
1.2. Management scripts and configurations in Git
All tools for managing Qserv instances mentioned in this document are found in the following Git package:
Exceptions are the special files that contain security-related information. Local copies of these files are located in the following subfolder of each instance:
instance/<instance-name>/config/secrets
More details on this subject are presented in the rest of the document.
2. Deployments
At the time of the last update of this document, there were three instances of Qserv managed by these tools.
2.1.1. slac6prod
| setup | MySQL ports | Replication controller port (HTTP) | XRootD ports | comments |
|---|---|---|---|---|
| Master host: sdfqserv001; Worker hosts: sdfqserv00[2..15] | 4040 (Qserv port) | 25081 | 2131 (redirector), 1094 (redirector), 2132 (worker), 1095 (worker) | The production cluster is used for testing LSST's LSP at SLAC. The cluster has 14 worker nodes. |

Management scripts and configurations: /zfspool/management/qserv-slac/instance/slac6prod

Base path for data, logs, etc. [QSERV_BASE_DIR]: /zfspool/qserv-prod
2.1.2. slac6dev
| setup | MySQL ports | Replication controller port (HTTP) | XRootD ports | comments |
|---|---|---|---|---|
| Master host: sdfqserv001; Worker hosts: sdfqserv00[2..15] | 4047 | 25781 | 2731 (redirector), 1794 (redirector), 2732 (worker), 1795 (worker) | The development cluster is used for testing Qserv. The cluster has 14 worker nodes. |

Management scripts and configurations: /zfspool/management/qserv-slac/instance/slac6dev

Base path for data, logs, etc. [QSERV_BASE_DIR]: /zfspool/qserv-dev
2.1.3. slac6test
| setup | MySQL ports | Replication controller port (HTTP) | XRootD ports | comments |
|---|---|---|---|---|
| Master host: sdfqserv001; Worker hosts: sdfqserv00[2..15] | 4048 | 25881 | 2831 (redirector), 1894 (redirector), 2832 (worker), 1895 (worker) | The development cluster is used for testing Qserv. The cluster has 14 worker nodes. |

Management scripts and configurations: /zfspool/management/qserv-slac/instance/slac6test

Base path for data, logs, etc. [QSERV_BASE_DIR]: /zfspool/qserv-test
TBC...
3. Configuring deployments
Configurations of the Qserv deployment are represented by two groups of parameters:
- general parameters of a particular deployment
- parameters of the Qserv services within the deployment
The first group is rather small. It captures things like container tags and the locations of the base directories (for data, logs, etc.) of an instance. Values of the parameters are set in special files in the config/ subfolder:
```
% cd /zfspool/management/qserv-slac/instance/slac6prod
% ls -1 config/
master
workers
env
```
Where:
| file | role | example |
|---|---|---|
| master | The hostname of the node where the Qserv czar and the Replication system's master services run. | sdfqserv001 |
| workers | The hostnames of the Qserv workers and the Replication system's workers. | sdfqserv002 |
| env | A shell script that sets various deployment-specific environment variables. | |
The file config/env defines the following environment variables (using a config file of slac6prod as an example):
variable | meaning | example |
---|---|---|
CONTAINER_UID | UID of the service account that owns data and Docker containers | 45386 |
CONTAINER_GID | GID of the service account that owns data and Docker containers | 1126 |
QSERV_DB_IMAGE_TAG | The tag of the MariaDB container of both the czar and the worker services. Please do not update this tag unless migrating the instance to a new version of MariaDB. Database tags define the underlying versions of the MariaDB services. A database upgrade for all tables may be needed after such a change. This is a very sensitive operation since it affects existing data. Mistakes may result in damaging or losing valuable information deployed in the clusters. | qserv/lite-mariadb:2022.7.1-rc1-21-g292b0f1ea |
QSERV_IMAGE_TAG | The tag of the binary container implementing Qserv services (except MariaDB). | qserv/lite-qserv:2022.7.1-rc1-39-g572903551 |
QSERV_BASE_DIR | The base path to a folder where Qserv's data, configurations, log files, and core files are found. Each host (the master and all workers) of a given Qserv instance should have this folder. The filesystem tree is owned by the local user rubinqsv. | /zfspool/qserv-prod |
QSERV_CZAR_DB_PORT | The port number of the Qserv czar's MariaDB service. | 3306 |
QSERV_WORKER_DB_PORT | The port number of the Qserv workers' MariaDB services. | 3366 |
REPL_IMAGE_TAG | The tag of the binary container implementing Qserv services (except MariaDB). Note that the Replication/Ingest system's tools are part of the Qserv binary container. This tag allows running a separate version of the Replication/Ingest system if needed. | qserv/lite-qserv:2022.7.1-rc1-39-g572903551 |
REPL_INSTANCE_ID | A unique identifier of the Qserv instance. It's used by the Replication/Ingest system to prevent "crosstalk" between different instances in case of accidental misconfiguration. It also adds additional security to the system. | 23306 |
REPL_DB_PORT | The port number of the Replication system's MariaDB service. | |
REPL_CONTR_HTTP_PORT | The port number of the Master Replication Controller. The Controller provides REST services for managing and monitoring Qserv and data catalogs. | 25081 |
REPL_REG_PORT | The port number of the Workers Registry service. The service is an internal component of the Replication/Ingest system. However, knowing this port may be needed for debugging a setup. | 25082 |
REPL_WORKER_SVC_PORT | The port number at which the Replication system's workers listen for commands from the controllers. | 25000 |
REPL_WORKER_FS_PORT | The port number at which the Replication system's workers serve the replica acquisition requests from other workers. | 25001 |
REPL_WORKER_LOADER_PORT | The port number of the Replication system's workers. The port is used for ingesting table data into Qserv over the proprietary binary protocol. | 25002 |
REPL_WORKER_EXPORTER_PORT | The port number of the Replication system's workers. The port is used for exporting table data from Qserv over the proprietary binary protocol. | 25003 |
REPL_WORKER_HTTP_LOADER_PORT | The port number of the Replication system's workers. The port is used for ingesting table data into Qserv from third-party sources, such as the locally mounted POSIX file system or object store. There is the REST service behind this port. | 25004 |
REPL_REG_PARAMETERS | Parameters of the Workers Registry service. | --debug |
REPL_CONTR_PARAMETERS | Parameters of the Master Replication Controller. | --worker-evict-timeout=600 |
REPL_WORKER_PARAMETERS | Parameters of the system's worker services. | --do-not-create-folders |
REPL_HTTP_ROOT | This optional parameter overrides the default root folder of the HTTP server built into the Master Replication Controller. The Qserv Web Dashboard application is served from this folder. The parameter is used when debugging the application and should normally be commented out. | /zfspool/management/qserv/www |
REPL_MEMORY_LIMIT | Limits the amount of memory (in bytes) available to the Replication system's containers. Larger limits may be required during ingest campaigns when millions of contributions (files) are to be loaded into Qserv. | 42949672960 |
CONTAINER_NAME_PREFIX | A prefix that is used for naming containers. For example, if the prefix is the same as shown in the last column of this table (on the right) then the container names would be: qserv-prod-czar-mariadb qserv-prod-czar-cmsd qserv-prod-czar-xrootd qserv-prod-czar-proxy qserv-prod-czar-debug qserv-prod-worker-mariadb qserv-prod-worker-cmsd qserv-prod-worker-xrootd ... | qserv-prod- |
CONTAINER_RESTART_POLICY | Docker container restart policy to define the desired behavior when the Docker daemon starts, when the applications within containers crash, or when the containers are explicitly stopped. More information on this subject can be found at: https://docs.docker.com/config/containers/start-containers-automatically/ | unless-stopped |
ULIMIT_MEMLOCK | Sets the memory-lock limit (memlock ulimit) of the worker containers. This parameter affects locking files in memory at workers during shared scans. | 10995116277760 |
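For orientation, here is a condensed sketch of what a config/env file might look like, assembled from the example values in the table above (illustrative only; the real file contains more variables and comments):

```shell
# Sketch of config/env for a slac6prod-like instance.
# Values are taken from the examples in the table above.
CONTAINER_UID=45386
CONTAINER_GID=1126
QSERV_DB_IMAGE_TAG="qserv/lite-mariadb:2022.7.1-rc1-21-g292b0f1ea"
QSERV_IMAGE_TAG="qserv/lite-qserv:2022.7.1-rc1-39-g572903551"
QSERV_BASE_DIR="/zfspool/qserv-prod"
QSERV_CZAR_DB_PORT=3306
QSERV_WORKER_DB_PORT=3366
REPL_IMAGE_TAG="qserv/lite-qserv:2022.7.1-rc1-39-g572903551"
REPL_CONTR_HTTP_PORT=25081
REPL_REG_PORT=25082
REPL_WORKER_SVC_PORT=25000
REPL_WORKER_FS_PORT=25001
CONTAINER_NAME_PREFIX="qserv-prod-"
CONTAINER_RESTART_POLICY="unless-stopped"
ULIMIT_MEMLOCK=10995116277760
```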
4. Getting started
This section introduces basic management operations using the slac6prod cluster as a target. All examples shown hereafter assume that the reader is logged onto the master node of the cluster and has changed (cd) into the root folder of the instance:
```
ssh sdfqserv001
cd /zfspool/management/qserv-slac/instance/slac6prod
```
All examples presented in this chapter are based on the assumption that all known services should be affected by the operations. It's possible to manage select services as well without affecting the others. More details on this technique are found in Script options for managing select services.
4.1. Inspecting versions of the containers
Versions of the configured containers are defined in the special variables of the environment definition file:
```
cat config/env
..
QSERV_DB_IMAGE_TAG="qserv/lite-mariadb:2021.10.1-lite-rc2"
QSERV_IMAGE_TAG="qserv/lite-qserv:2021.10.1-lite-rc2"
..
REPLICA_DB_IMAGE_TAG="qserv/lite-mariadb:2021.10.1-lite-rc2"
REPLICA_IMAGE_TAG="qserv/lite-qserv:2021.10.1-rc1-61-gdefd950b0"
```
4.2. Inspecting the current status of the instance
To see which containers (services) are running (note the option --all or its short form -a):
```
./status --all
[sdfqserv001] qserv czar mariadb: aa81b3001d2b qserv/lite-mariadb:2022.7.1-rc1-21-g292b0f1ea "docker-entrypoint.s…" 9 hours ago Up 9 hours qserv-prod-czar-mariadb
[sdfqserv001] qserv czar cmsd: afcf7d4001b4 qserv/lite-qserv:2023.2.1-rc1-8-gc28fa47d8 "bash -c 'cmsd -c /c…" 9 hours ago Up 9 hours qserv-prod-czar-cmsd
[sdfqserv001] qserv czar xrootd: f0854dba627d qserv/lite-qserv:2023.2.1-rc1-8-gc28fa47d8 "bash -c 'xrootd -c …" 9 hours ago Up 9 hours qserv-prod-czar-xrootd
[sdfqserv001] qserv czar proxy: 41cd70bd5329 qserv/lite-qserv:2023.2.1-rc1-8-gc28fa47d8 "bash -c 'mysql-prox…" 9 hours ago Up 9 hours qserv-prod-czar-proxy
[sdfqserv002] qserv worker mariadb: c7299b920838 qserv/lite-mariadb:2022.7.1-rc1-21-g292b0f1ea "docker-entrypoint.s…" 9 hours ago Up 9 hours qserv-prod-worker-mariadb
[sdfqserv002] qserv worker cmsd: a4863feb9494 qserv/lite-qserv:2023.2.1-rc1-8-gc28fa47d8 "bash -c 'cmsd -c /c…" 9 hours ago Up 9 hours qserv-prod-worker-cmsd
[sdfqserv002] qserv worker xrootd: a6b14bf3df8e qserv/lite-qserv:2023.2.1-rc1-8-gc28fa47d8 "bash -c 'xrootd -c …" 9 hours ago Up 9 hours qserv-prod-worker-xrootd
[sdfqserv003] qserv worker mariadb: 2fc9339cb010 qserv/lite-mariadb:2022.7.1-rc1-21-g292b0f1ea "docker-entrypoint.s…" 9 hours ago Up 9 hours qserv-prod-worker-mariadb
[sdfqserv003] qserv worker cmsd: 69bd76bcf36f qserv/lite-qserv:2023.2.1-rc1-8-gc28fa47d8 "bash -c 'cmsd -c /c…" 9 hours ago Up 9 hours qserv-prod-worker-cmsd
[sdfqserv003] qserv worker xrootd: 163df642a8d8 qserv/lite-qserv:2023.2.1-rc1-8-gc28fa47d8 "bash -c 'xrootd -c …" 9 hours ago Up 9 hours qserv-prod-worker-xrootd
...
[sdfqserv001] repl mariadb: 51d9b5ad82e8 qserv/lite-mariadb:2022.1.1-rc1 "docker-entrypoint.s…" 9 hours ago Up 9 hours qserv-prod-repl-mariadb
[sdfqserv001] repl reg: d841ef33499b qserv/lite-qserv:2022.12.1-rc2-47-g982eac7c4 "bash -c 'qserv-repl…" 9 hours ago Up 9 hours qserv-prod-repl-reg
[sdfqserv001] repl contr: 0dd0bd5b983e qserv/lite-qserv:2022.12.1-rc2-47-g982eac7c4 "bash -c 'qserv-repl…" 9 hours ago Up 9 hours qserv-prod-repl-contr
[sdfqserv001] repl worker: ffb3118ab0b7 qserv/lite-qserv:2022.12.1-rc2-47-g982eac7c4 "bash -c 'qserv-repl…" 9 hours ago Up 9 hours qserv-prod-repl-worker
[sdfqserv002] repl worker: c0b2f84d686b qserv/lite-qserv:2022.12.1-rc2-47-g982eac7c4 "bash -c 'qserv-repl…" 9 hours ago Up 9 hours qserv-prod-repl-worker
..
```
According to this report, all services are running.
4.3. Stopping the services
All services (including the ones of the Replication/Ingest system) are stopped by the following script:
./stop --all
4.4. Pulling containers to the master and worker nodes
This step may be needed if the container image of some existing tag got updated. Sometimes this operation may also be needed to ensure the desired containers exist on the relevant nodes before starting Qserv and/or the Replication system.
./pull --all
4.5. Starting the services
Services won't start if they were not explicitly stopped as explained in the previous section. This also applies to scenarios in which containers crashed or died for other reasons (e.g., due to reboots of the nodes). This behavior is intentional: it allows inspecting the container logs in case of any abnormalities with the services.
Services are started by this script:
./start --all
Alternatively, one may invoke this script:
./restart --all
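The scripts above are often combined into one maintenance cycle, e.g., when picking up a rebuilt image of the same tag. A sketch of such a cycle (wrapped in a shell function here only so that it reads as one unit; the script names are the ones described in this section):

```shell
# Full maintenance cycle using the management scripts described above.
# Run from the root folder of the instance, e.g. .../instance/slac6prod.
full_restart() {
  ./stop --all    # stop Qserv and the Replication/Ingest services
  ./pull --all    # optional: refresh container images on all nodes
  ./start --all   # start everything back up
}
```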
5. Script options for managing select services
5.1. Service selectors
All scripts mentioned in this document allow selectors for the services. Any combination of the selectors shown in the table below can be used:
selector | service |
---|---|
--czar-db | MariaDB service of Qserv czar. |
--czar-cmsd | Redirector service of Qserv czar. |
--czar-xrootd | Redirector service of Qserv czar. |
--czar-proxy | The mysql-proxy service of Qserv czar (the czar itself). |
--czar-debug | The side-cart container with GDB and other tools for debugging Qserv services that run on the master node of a setup. |
--worker-db | MariaDB service of the select Qserv worker(s). Each worker host runs a dedicated instance of the service. |
--worker-cmsd | Redirector service of the select worker(s). |
--worker-xrootd | Qserv worker service itself. |
--worker-debug | The side-cart container with GDB and other tools for debugging Qserv services that run on the select worker node(s) of a setup. |
--repl-db | MariaDB service of the Replication/Ingest system. |
--repl-contr | The Master Replication Controller of the Replication/Ingest system. |
--repl-reg | The Workers Registry of the Replication/Ingest system. |
--repl-contr-debug | The side-cart container with GDB and other tools for debugging the Replication system's services that run on the master node of a setup. |
--repl-worker | The worker service of the Replication/Ingest system. Each worker host runs a dedicated instance of the service accompanying the corresponding Qserv worker. Note that like in the case of the Qserv worker services, workers affected by the scripts are specified via the option --worker=<worker-list> . |
--repl-worker-debug | The side-cart container with GDB and other tools for debugging the Replication system's services that run on the select worker node(s) of a setup. |
Alternatively (or in addition if needed), one may also specify the group selectors:
selector | service |
---|---|
--all | Affects all services of czar and the select worker(s). Note that this option won't include the special debug containers. The latter need to be managed using the options: --all-debug --czar-debug --worker-debug --repl-contr-debug --repl-worker-debug |
--all-debug | Affects side-cart containers for debugging services. These include both Qserv and the Replication/Ingest system's containers at the master and select worker nodes. This option complements the above-explained option --all. |
--czar-all | All services of Qserv czar. Note that this option won't include the special debug container. For the latter use options: --all-debug --czar-debug |
--worker-all | All services of the select Qserv worker(s). Note that this option won't include the special debug container. For the latter use options: --all-debug --worker-debug |
--repl-all | All services of the Replication/Ingest system at the master node and the select worker nodes. Note that this option won't include the special debug container. For the latter use options: --all-debug --repl-contr-debug --repl-worker-debug |
Targeted service selectors and group selectors can be mixed in any combination; the result is the logical sum (union) of all options found on the command line. At least one option must be provided. Otherwise, the scripts will behave as if the following option was specified:
./<script-name> --help
5.2. Selector of the worker hosts
Finally, it's possible to limit the scope of an operation to a subset of workers using the following option:
| selector | description | examples |
|---|---|---|
| --worker=<worker-list> | Select a subset of workers affected by the operation. The value of the parameter carries a list of space-separated worker hostnames. If '*' is specified in place of the worker names, then the select services of all workers will be assumed. Note that single quotes are required here in order to prevent the shell from expanding the wildcard symbol into a list of local files in the current working directory. | --worker='*' |
5.3. Examples
In the following example, Qserv's proxy service at the master host and the xrootd services at all worker hosts will get restarted:
./restart --czar-proxy --worker-xrootd
In this example, all services (both Qserv and the Replication/Ingest system) only at workers sdfqserv001 and sdfqserv002 will get stopped:

./stop --worker-all --repl-worker --worker="sdfqserv001 sdfqserv002"
The next example illustrates using the complementary selectors --all and --all-debug to include all containers in an operation:
```
./status --all --all-debug
[sdfqserv001] qserv czar mariadb:
[sdfqserv001] qserv czar cmsd:
[sdfqserv001] qserv czar xrootd:
[sdfqserv001] qserv czar proxy:
[sdfqserv001] qserv czar debug:
[sdfqserv002] qserv worker mariadb:
[sdfqserv002] qserv worker cmsd:
[sdfqserv002] qserv worker xrootd:
[sdfqserv002] qserv worker debug:
...
[sdfqserv006] repl mariadb:
[sdfqserv006] repl reg:
[sdfqserv006] repl contr:
[sdfqserv006] repl debug:
[sdfqserv006] repl worker:
[sdfqserv006] repl worker debug:
```
The very same combination of selectors can be used with other operations.
6. Working with the log files
The log files of the services that run inside the containers are put onto the host filesystems of the corresponding nodes. Locations of the files are relative to the folder specified by the environment variable QSERV_BASE_DIR. The corresponding folder should be writable by the local user rubinqsv. For example, the log files of all master services (of the Qserv czar and the Replication/Ingest system) of the Qserv instance slac6prod will be placed at:
/zfspool/qserv-prod/master/log
And the log files of the worker services will be placed at:
/zfspool/qserv-prod/worker/log
6.1. Locating and inspecting
Based on the value of this variable, one can see the locations of all log files with:
```
cd /zfspool/management/qserv-slac/instance/slac6prod
./log_ls --all
[sdfqserv001]
-rw-rwxr--+ 1 rubinqsv gu   2615309 Mar 30 19:06 /zfspool/qserv-prod/master/log/qserv-prod-czar-cmsd.log
-rw-rwx---+ 1 rubinqsv gu      1853 Mar 30 09:43 /zfspool/qserv-prod/master/log/qserv-prod-czar-mariadb.error.log
-rw-rwx---+ 1 rubinqsv gu       195 Mar 30 09:40 /zfspool/qserv-prod/master/log/qserv-prod-czar-mariadb.slow-query.log
-rw-rwxr--+ 1 rubinqsv gu   2918488 Mar 30 19:08 /zfspool/qserv-prod/master/log/qserv-prod-czar-proxy.log
-rw-rwx---+ 1 rubinqsv gu       270 Mar 30 09:40 /zfspool/qserv-prod/master/log/qserv-prod-czar-proxy.mysql-proxy.log
-rw-rwxr--+ 1 rubinqsv gu   3449471 Mar 30 19:06 /zfspool/qserv-prod/master/log/qserv-prod-czar-xrootd.log
-rw-rwxr--+ 1 rubinqsv gu    258636 Mar 30 19:06 /zfspool/qserv-prod/master/log/qserv-prod-repl-contr.log
-rw-rwx---+ 1 rubinqsv gu      1716 Mar 30 09:40 /zfspool/qserv-prod/master/log/qserv-prod-repl-mariadb.error.log
-rw-rwx---+ 1 rubinqsv gu       190 Mar 30 09:40 /zfspool/qserv-prod/master/log/qserv-prod-repl-mariadb.slow-query.log
-rw-rwxr--+ 1 rubinqsv gu  41533739 Mar 30 19:07 /zfspool/qserv-prod/master/log/qserv-prod-repl-reg.log
[sdfqserv002]
-rw-rwxr--+ 1 rubinqsv gu 276490433 Mar 30 19:08 /zfspool/qserv-prod/worker/log/qserv-prod-repl-worker.log
-rw-rwxr--+ 1 rubinqsv gu   2392065 Mar 30 19:07 /zfspool/qserv-prod/worker/log/qserv-prod-worker-cmsd.log
-rw-rwx---+ 1 rubinqsv gu      1698 Mar 30 09:40 /zfspool/qserv-prod/worker/log/qserv-prod-worker-mariadb.error.log
-rw-rwx---+ 1 rubinqsv gu       191 Mar 30 09:40 /zfspool/qserv-prod/worker/log/qserv-prod-worker-mariadb.slow-query.log
-rw-rwxr--+ 1 rubinqsv gu  15874204 Mar 30 19:07 /zfspool/qserv-prod/worker/log/qserv-prod-worker-xrootd.log
[sdfqserv003]
-rw-rwxr--+ 1 rubinqsv gu 277487701 Mar 30 19:08 /zfspool/qserv-prod/worker/log/qserv-prod-repl-worker.log
-rw-rwxr--+ 1 rubinqsv gu   2394846 Mar 30 19:07 /zfspool/qserv-prod/worker/log/qserv-prod-worker-cmsd.log
-rw-rwx---+ 1 rubinqsv gu      1698 Mar 30 09:40 /zfspool/qserv-prod/worker/log/qserv-prod-worker-mariadb.error.log
-rw-rwx---+ 1 rubinqsv gu       191 Mar 30 09:40 /zfspool/qserv-prod/worker/log/qserv-prod-worker-mariadb.slow-query.log
-rw-rwxr--+ 1 rubinqsv gu  17516856 Mar 30 19:07 /zfspool/qserv-prod/worker/log/qserv-prod-worker-xrootd.log
...
```
Note that the log files are named after the corresponding containers where the services are run.
To inspect the log files of the workers one would need to log onto the desired worker node and check the same folder as shown in this example:
ssh -n sdfqserv002 tail /zfspool/qserv-prod/worker/log/qserv-prod-worker-xrootd.log
Log files of the MariaDB services are not readable by all users (the UNIX permissions exclude "others"). In this case, one would need to use this pattern:
ssh -n sdfqserv002 sudo -u rubinqsv tail /zfspool/qserv-prod/worker/log/qserv-prod-worker-mariadb.error.log
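A small helper along these lines could tail a chosen log on every worker in one go. This is a hypothetical sketch, not part of the official scripts; it assumes passwordless ssh to the workers and the config/workers file described earlier (SSH_CMD may be overridden, e.g. with "echo ssh -n", for a dry run):

```shell
# Tail a given log file on every worker host listed in a workers file.
# Usage: tail_worker_log <workers-file> <log-path-on-worker>
tail_worker_log() {
  local workers_file="$1" logfile="$2" host
  local ssh_cmd="${SSH_CMD:-ssh -n}"
  while read -r host; do
    [ -n "$host" ] || continue          # skip blank lines
    echo "==== ${host} ===="
    $ssh_cmd "$host" tail -n 20 "$logfile"
  done < "$workers_file"
}
# Example (dry run):
# SSH_CMD="echo ssh -n" tail_worker_log config/workers \
#   /zfspool/qserv-prod/worker/log/qserv-prod-worker-xrootd.log
```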
6.2. Removing
Log files won't be automatically cleared (truncated) during restarts of the services; new entries will be appended to the existing files. If the files grow too big, one would have to truncate them manually or use the following automation script:
./log_rm --all
Normally, it's done when Qserv is down:
```
./stop --all
./log_rm --all
./start --all
```
It's also possible to remove the log files of the select services. For example, if the worker's xrootd service needs to be restarted with a clean log, one would do this:
```
./stop --worker-xrootd
./log_rm --worker-xrootd
./start --worker-xrootd
```
This operation won't affect other services.
6.3. Parallel search
The script log_search may be handy when multiple log files need to be searched and (optionally) merged by timestamps. The tool adds 3 more parameters to the service and worker selectors explained earlier in the section Script options for managing select services:
```
Usage: [--lines=<num>] [--regexp=<pattern>] [<service specification>]
Where:
  --lines=<number>
      The number of lines at the end of each log file to search for the given
      pattern. Default: '100'.
  --regexp=<pattern>
      A regular expression. If one is provided, the content of the log stream
      will be filtered using 'egrep <pattern>'. Otherwise, no additional
      processing will be made on the stream.
  --merge
      Sort and merge results read from all log files based on the timestamps.
      Otherwise, group results by the log files.
```
The general idea is to specify the services whose logs need to be searched, the number of lines from the tail of each file to include in the search (option --lines), and an optional filter to apply to the result (--regexp); all findings are reported via a single output stream. Results can optionally be merged (option --merge) based on the timestamps of the entries.
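The --merge behavior works because the ISO-8601 UTC timestamp leading each log entry sorts correctly as a plain string, so a lexicographic sort on the first field interleaves all streams in time order. A minimal stand-in sketch (not the actual implementation):

```shell
# Minimal stand-in for the --merge mode: lexicographic sort on the leading
# ISO-8601 timestamp field orders entries chronologically.
merge_by_timestamp() {
  sort -k1,1
}
# Demo with two out-of-order entries:
printf '%s\n' \
  '2021-12-15T00:06:19.107Z qserv-db02 INFO wdb.QueryRunner - rowCount=28620' \
  '2021-12-15T00:06:18.148Z qserv-db01 INFO xrdssi.msgs - Recycling request...' \
  | merge_by_timestamp
```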
This is an example of searching the last 2 lines of the log files of the proxy service at czar and the xrootd services at all workers, without merging. Note that the hostname is injected into the second column of each line of the report:
```
./log_tail --czar-proxy --worker-xrootd --lines=2
2021-12-15T00:06:18.148Z qserv-db01 LWP:3278 QID: INFO xrdssi.msgs - qserv.7:23@141.142.181.128 Ssi_Dispose: 0:/worker/db01 [done odRsp] Recycling request...
2021-12-15T00:06:18.158Z qserv-db01 LWP:3278 QID: INFO xrdssi.msgs - qserv.7:23@141.142.181.128 Ssi_close: /worker/db01 del=False
2021-12-15T00:06:19.107Z qserv-db02 LWP:6498 QID:265345#3 INFO wdb.QueryRunner - QueryRunner rowCount=28620 tSize=2000062
2021-12-15T00:06:19.111Z qserv-db02 LWP:6498 QID:265345#3 DEBUG wdb.QueryRunner - _buildHeaderThis
2021-12-15T00:06:18.171Z qserv-db03 LWP:269 QID: INFO xrdssi.msgs - 2.7:25@141.142.181.128 Ssi_Dispose: 0:/worker/db03 [done odRsp] Recycling request...
2021-12-15T00:06:18.186Z qserv-db03 LWP:269 QID: INFO xrdssi.msgs - 2.7:25@141.142.181.128 Ssi_close: /worker/db03 del=False
2021-12-15T00:06:19.337Z qserv-db04 LWP:6268 QID:265345#1 INFO wdb.QueryRunner - QueryRunner rowCount=28616 tSize=2000040
2021-12-15T00:06:19.343Z qserv-db04 LWP:6268 QID:265345#1 DEBUG wdb.QueryRunner - _buildHeaderThis
2021-12-15T00:06:18.211Z qserv-db05 LWP:17 QID: INFO xrdssi.msgs - qserv.7:24@141.142.181.128 Ssi_Dispose: 0:/worker/db05 [done odRsp] Recycling request...
2021-12-15T00:06:18.221Z qserv-db05 LWP:17 QID: INFO xrdssi.msgs - qserv.7:24@141.142.181.128 Ssi_close: /worker/db05 del=False
2021-12-15T00:06:18.222Z qserv-db06 LWP:131 QID: INFO xrdssi.msgs - 1.7:25@141.142.181.128 Ssi_Dispose: 0:/worker/db06 [done odRsp] Recycling request...
2021-12-15T00:06:18.228Z qserv-db06 LWP:131 QID: INFO xrdssi.msgs - 1.7:25@141.142.181.128 Ssi_close: /worker/db06 del=False
2021-12-15T00:03:39.613Z qserv-master01 LWP:51 QID:265345#3 WARN qdisp.Executive - Executive: error executing [0] (status: 0)
2021-12-15T00:03:39.613Z qserv-master01 LWP:51 QID:265345#3 ERROR qdisp.Executive - Executive: requesting squash, cause: failed (code=0 )
```
The next result illustrates the use of the option --merge:
```
./log_tail --czar-proxy --worker-xrootd --lines=2 --merge
2021-12-15T00:03:39.613Z qserv-master01 LWP:51 QID:265345#3 ERROR qdisp.Executive - Executive: requesting squash, cause: failed (code=0 )
2021-12-15T00:03:39.613Z qserv-master01 LWP:51 QID:265345#3 WARN qdisp.Executive - Executive: error executing [0] (status: 0)
2021-12-15T00:07:30.252Z qserv-db01 LWP:5199 QID: INFO xrdssi.msgs - qserv.7:23@141.142.181.128 Ssi_Dispose: 0:/worker/db01 [done odRsp] Recycling request...
2021-12-15T00:07:30.277Z qserv-db01 LWP:5199 QID: INFO xrdssi.msgs - qserv.7:23@141.142.181.128 Ssi_close: /worker/db01 del=False
2021-12-15T00:07:30.292Z qserv-db03 LWP:269 QID: INFO xrdssi.msgs - 2.7:25@141.142.181.128 Ssi_Dispose: 0:/worker/db03 [done odRsp] Recycling request...
2021-12-15T00:07:30.301Z qserv-db03 LWP:269 QID: INFO xrdssi.msgs - 2.7:25@141.142.181.128 Ssi_close: /worker/db03 del=False
2021-12-15T00:07:30.320Z qserv-db05 LWP:102 QID: INFO xrdssi.msgs - qserv.7:24@141.142.181.128 Ssi_Dispose: 0:/worker/db05 [done odRsp] Recycling request...
2021-12-15T00:07:30.327Z qserv-db05 LWP:102 QID: INFO xrdssi.msgs - qserv.7:24@141.142.181.128 Ssi_close: /worker/db05 del=False
2021-12-15T00:07:30.328Z qserv-db06 LWP:131 QID: INFO xrdssi.msgs - 1.7:25@141.142.181.128 Ssi_Dispose: 0:/worker/db06 [done odRsp] Recycling request...
2021-12-15T00:07:30.349Z qserv-db06 LWP:131 QID: INFO xrdssi.msgs - 1.7:25@141.142.181.128 Ssi_close: /worker/db06 del=False
2021-12-15T00:07:31.124Z qserv-db02 LWP:6531 QID:265345#3 INFO wdb.QueryRunner - QueryRunner rowCount=28617 tSize=2000029
2021-12-15T00:07:31.128Z qserv-db02 LWP:6531 QID:265345#3 DEBUG wdb.QueryRunner - _buildHeaderThis
2021-12-15T00:07:31.261Z qserv-db04 LWP:6273 QID:265345#1 INFO wdb.QueryRunner - QueryRunner rowCount=28614 tSize=2000031
2021-12-15T00:07:31.266Z qserv-db04 LWP:6273 QID:265345#1 DEBUG wdb.QueryRunner - _buildHeaderThis
```
Note that the lines are now sorted by the timestamps, which may be handy for tracking dependencies between events.
In the next example, the search is made for events posted in the context of a specific query identifier QID:265345:
```
./log_tail --czar-proxy --worker-xrootd --lines=200000 --merge --regexp='QID:265345' | tail -n 10
2021-12-15T00:08:22.877Z qserv-db04 LWP:6268 QID:265345#1 DEBUG wdb.QueryRunner - _buildHeaderThis
2021-12-15T00:08:23.443Z qserv-db04 LWP:6268 QID:265345#1 INFO wdb.QueryRunner - QueryRunner rowCount=28614 tSize=2000030
2021-12-15T00:08:23.448Z qserv-db04 LWP:6268 QID:265345#1 DEBUG wdb.QueryRunner - _buildHeaderThis
2021-12-15T00:08:23.829Z qserv-db04 LWP:6268 QID:265345#1 INFO sql.MySqlConnection - &&& MySqlConnection::runQuery SELECT uid FROM q_memoryLockDb.memoryLockTbl WHERE keyId = 1
2021-12-15T00:08:23.836Z qserv-db04 LWP:6268 QID:265345#1 DEBUG sql.MySqlConnection - connectToDb trying to connect hostN=127.0.0.1 sock= uname=qsmaster dbN= port=3306
2021-12-15T00:08:23.836Z qserv-db04 LWP:6268 QID:265345#1 ERROR sql.MySqlConnection - connectToDb failed to connect! hostN=127.0.0.1 sock= uname=qsmaster dbN= port=3306
2021-12-15T00:08:23.836Z qserv-db04 LWP:6268 QID:265345#1 ERROR sql.MySqlConnection - runQuery failed connectToDb: SELECT uid FROM q_memoryLockDb.memoryLockTbl WHERE keyId = 1
2021-12-15T00:08:23.836Z qserv-db04 LWP:6268 QID:265345#1 ERROR wdb.ChunkResource - memLockRequireOwnership could not verify this program owned the memory table lock, Exiting.
2021-12-15T00:08:23.836Z qserv-db04 LWP:6268 QID:265345#1 WARN sql.MySqlConnection - connectToDb ping=1
2021-12-15T00:08:23.836Z qserv-db04 LWP:6268 QID:265345#1 WARN wdb.ChunkResource - memLockStatus query failed, assuming UNLOCKED. SELECT uid FROM q_memoryLockDb.memoryLockTbl WHERE keyId = 1 err=Error -999: Error connecting to mysql with config:[host=127.0.0.1, port=3306, user=qsmaster, password=XXXXXX, db=, socket=]
```
7. Configuring services
7.1. Configuring the LSST logger of the services
The script config_logger is available for reconfiguring the logger of Qserv and the Replication system's services. The script performs the following sequence of actions:
- reads the existing configuration files
- for each file of a service specified via selectors on the command line, opens an editor (vim) to allow viewing and/or modifying the configuration of the select service
- deploys the updated version of the configuration (regardless of whether it was modified)
This example illustrates how to reconfigure both the Master Replication Controller and the Replication system's worker services on all worker hosts of a setup:
```
./config_logger --repl-contr --repl-worker
<editor session for the Master Controller pops up>
<editor session for the Replication worker pops up>
[sdfqserv001] updating configuration at /zfspool/qserv-prod/master/config/log/repl-contr.cfg
[sdfqserv002] updating configuration at /zfspool/qserv-prod/worker/config/log/repl-worker.cfg
[sdfqserv003] updating configuration at /zfspool/qserv-prod/worker/config/log/repl-worker.cfg
[sdfqserv004] updating configuration at /zfspool/qserv-prod/worker/config/log/repl-worker.cfg
[sdfqserv005] updating configuration at /zfspool/qserv-prod/worker/config/log/repl-worker.cfg
[sdfqserv006] updating configuration at /zfspool/qserv-prod/worker/config/log/repl-worker.cfg
```
Note that the script will open two editing sessions: the first one for the Master Replication Controller and the second one for the Replication system's worker services. All workers will get the same configuration. Be advised that the affected services need to be restarted in order to enact the updated configurations.
This example shows how to reconfigure czar. In this example, two configuration files will be offered for editing: one for the mysql-proxy and the other one for the czar itself.
```
./config_logger --czar-proxy
```
Here is how one may reconfigure workers on all worker nodes:
```
./config_logger --worker-xrootd
<editor session pops up>
[sdfqserv001] updating configuration at /zfspool/qserv-prod/worker/config/log/xrootd.cfg
[sdfqserv002] updating configuration at /zfspool/qserv-prod/worker/config/log/xrootd.cfg
[sdfqserv003] updating configuration at /zfspool/qserv-prod/worker/config/log/xrootd.cfg
[sdfqserv004] updating configuration at /zfspool/qserv-prod/worker/config/log/xrootd.cfg
[sdfqserv005] updating configuration at /zfspool/qserv-prod/worker/config/log/xrootd.cfg
[sdfqserv006] updating configuration at /zfspool/qserv-prod/worker/config/log/xrootd.cfg
```
7.2. Configuring parameters of the services
The script config_service is available for reconfiguring Qserv services on either czar or worker nodes. Actions performed by the script are exactly the same as in the case of the above-explained script config_logger.
This example illustrates how to reconfigure the Qserv worker service (xrootd) on all worker hosts of a setup:
```
./config_service --worker-xrootd
<editor session pops up>
[sdfqserv001] updating configuration at /zfspool/qserv-prod/worker/config/xrdssi.cfg
[sdfqserv002] updating configuration at /zfspool/qserv-prod/worker/config/xrdssi.cfg
[sdfqserv003] updating configuration at /zfspool/qserv-prod/worker/config/xrdssi.cfg
[sdfqserv004] updating configuration at /zfspool/qserv-prod/worker/config/xrdssi.cfg
[sdfqserv005] updating configuration at /zfspool/qserv-prod/worker/config/xrdssi.cfg
[sdfqserv006] updating configuration at /zfspool/qserv-prod/worker/config/xrdssi.cfg
```
Notes:
- this mechanism can't be used to configure the MariaDB services or the Replication system's services
- for the new parameters to take effect, one needs to start/restart the services affected by the reconfiguration, as shown in the next example
- when modifying the configuration of a worker service, the same configuration will be deployed across all workers; please avoid using the worker selection option --worker=<worker-list> in this scenario
Here is the typical sequence of actions for reconfiguring Qserv workers:
```
./config_service --worker-xrootd
./restart --worker-xrootd
```
8. Schema upgrades
Schema upgrades are made using the following script and options:
```
./schema_upgrade --czar-db --worker-db --repl-db
```
It's possible to select any one of the services, or all of them, if needed. For example:
```
./schema_upgrade --repl-db
[sdfqserv001] upgrading repl mariadb
2021-12-08 04:48:49 INFO Current replica schema version: 6
2021-12-08 04:48:49 INFO Latest replica schema version: 6
2021-12-08 04:48:49 INFO Known migrations for replica: 0 -> 1 : migrate-0-to-1.sql, 1 -> 2 : migrate-1-to-2.sql, 2 -> 3 : migrate-2-to-3.sql, 3 -> 4 : migrate-3-to-4.sql, 4 -> 5 : migrate-4-to-5.sql, 5 -> 6 : migrate-5-to-6.sql, Uninitialized -> 6 : migrate-None-to-6.sql
2021-12-08 04:48:49 INFO No migration was needed
```
It's recommended to upgrade schemas when all services (but the MariaDB ones) are down. Here is the easiest way to do this:
```
./stop --all
./start --czar-db --worker-db --repl-db
./schema_upgrade --czar-db --worker-db --repl-db
./stop --czar-db --worker-db --repl-db
./start --all
```
8.1. Checking if a schema upgrade is needed
This feature is not implemented as it needs support from the binary container's "entry points".
9. Upgrading MariaDB versions
In MySQL/MariaDB one can only upgrade an existing instance to a higher version. Be sure to know which version was running and which version is going to be run in this instance, and follow the specific instructions for the desired upgrade path at https://mariadb.com/kb/en/upgrading/. Note that a database version upgrade is a very sensitive operation, as it involves modifying the internal data structures of MySQL/MariaDB. Some failures during this operation may be hard to recover from. Besides, the duration of the upgrade will depend on the amount of data in the database data directories; potentially, it may take many hours to complete. The general advice is: do not upgrade the database service unless it's strictly required!
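Before planning an upgrade path, it helps to record the server version currently running in each database container. The sketch below is a hedged example: only the container name qserv-prod-repl-mariadb is confirmed in this document; the czar and worker container names are assumptions based on the qserv-prod-* naming convention used elsewhere.

```shell
# Print the MariaDB server version bundled in each database container.
# Names other than qserv-prod-repl-mariadb are assumptions; adjust as needed.
for c in qserv-prod-czar-mariadb qserv-prod-worker-mariadb qserv-prod-repl-mariadb; do
    echo "$c: $(docker exec "$c" mysqld --version 2>/dev/null || echo 'not reachable')"
done
```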
The first step, before attempting to upgrade the database, is to shut down all containers:
```
./stop --all --all-debug
```
After that, one should start the desired database service that will be upgraded (or all database services, depending on intent):
```
./start --czar-db --repl-db --worker-db
```
The next step is to proceed to the upgrade itself. Note that this operation requires knowing the password of the root account of the corresponding service. The passwords are found in the following local files of the managed instance:
instance/<instance-name>/config/secrets
For example:
```
% ssh sdfqserv001
% cd /zfspool/management/qserv-slac/instance/slac6prod
% ls -al config/secrets/
total 52
drwxr-xr-x+ 2 gapon ec  8 Sep 16 13:29 .
drwxrwxr-x+ 3 gapon ec  6 Sep 16 16:50 ..
-rw-------+ 1 gapon ec  9 Sep 16 13:29 qserv_czar_db_root_password
-rw-------+ 1 gapon ec  9 Sep 16 13:29 qserv_worker_db_root_password
-rw-------+ 1 gapon ec 21 Sep 16 13:29 repl_admin_auth_key
-rw-------+ 1 gapon ec 15 Sep 16 13:29 repl_auth_key
-rw-------+ 1 gapon ec  9 Sep 16 13:29 repl_db_qsreplica_password
-rw-------+ 1 gapon ec  9 Sep 16 13:29 repl_db_root_password
```
Assuming one is planning to upgrade the MariaDB service of the Replication/Ingest system and the password is XYZ, the upgrade should be done like this:
```
% ssh sdfqserv001
% docker exec -it qserv-prod-repl-mariadb mysql_upgrade -uroot -pXYZ
```
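To avoid typing the password on the command line (where it would land in shell history), it can be read straight from the secrets file. A minimal sketch, assuming the secrets layout shown above; the helper name read_secret and the default SECRETS_DIR path are illustrative and should be adjusted per instance:

```shell
# Print the content of one secrets file. SECRETS_DIR defaults to the
# slac6prod instance path; override it for other instances.
SECRETS_DIR=${SECRETS_DIR:-/zfspool/management/qserv-slac/instance/slac6prod/config/secrets}
read_secret() { cat "$SECRETS_DIR/$1"; }

# Usage (on the master host; the password never appears on the command line):
#   docker exec -it qserv-prod-repl-mariadb mysql_upgrade -uroot -p"$(read_secret repl_db_root_password)"
```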
The command will begin printing messages like the following:
```
Phase 1/7: Checking and upgrading mysql database
Processing databases
mysql
mysql.column_stats OK
mysql.columns_priv OK
mysql.db OK
mysql.event OK
mysql.func OK
mysql.global_priv OK
mysql.gtid_slave_pos OK
mysql.help_category OK
mysql.help_keyword OK
mysql.help_relation OK
mysql.help_topic OK
mysql.index_stats OK
mysql.innodb_index_stats OK
mysql.innodb_table_stats OK
mysql.plugin OK
mysql.proc OK
mysql.procs_priv OK
mysql.proxies_priv OK
mysql.roles_mapping OK
mysql.servers OK
mysql.table_stats OK
mysql.tables_priv OK
mysql.time_zone OK
mysql.time_zone_leap_second OK
mysql.time_zone_name OK
mysql.time_zone_transition OK
mysql.time_zone_transition_type OK
mysql.transaction_registry OK
Phase 2/7: Installing used storage engines... Skipped
Phase 3/7: Fixing views
mysql.user OK
Phase 4/7: Running 'mysql_fix_privilege_tables'
Phase 5/7: Fixing table and database names
Phase 6/7: Checking and upgrading tables
Processing databases
information_schema
performance_schema
qservReplica
qservReplica.QMetadata OK
qservReplica.config OK
qservReplica.config_database OK
qservReplica.config_database_family OK
qservReplica.config_database_table OK
qservReplica.config_database_table_schema OK
qservReplica.config_worker OK
qservReplica.config_worker_ext OK
qservReplica.controller OK
qservReplica.controller_log OK
qservReplica.controller_log_ext OK
qservReplica.database_ingest OK
qservReplica.job OK
qservReplica.job_ext OK
qservReplica.replica OK
qservReplica.replica_file OK
qservReplica.request OK
qservReplica.request_ext OK
qservReplica.stats_table_rows OK
qservReplica.transaction OK
qservReplica.transaction_contrib OK
qservReplica.transaction_contrib_ext OK
qservReplica.transaction_log OK
sys
sys.sys_config OK
...
```
A similar command should be executed for each service that's going to be upgraded.
When all required upgrades have finished, and if no problems were discovered during the upgrades, Qserv should be restarted normally:
```
./stop --czar-db --repl-db --worker-db
./start --all
```
It's always a good idea to inspect the log files of the upgraded database services after restarting them to be sure the services are healthy.
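One quick way to perform such an inspection is to grep the recent container logs for errors and warnings. A hedged sketch: only qserv-prod-repl-mariadb is a confirmed container name in this document; the czar and worker names are assumptions following the same convention.

```shell
# Scan the last lines of each database container's log for MariaDB-style
# [ERROR]/[Warning] markers. Czar/worker container names are assumptions.
for c in qserv-prod-czar-mariadb qserv-prod-worker-mariadb qserv-prod-repl-mariadb; do
    echo "=== $c ==="
    docker logs --tail 100 "$c" 2>&1 | grep -iE '\[(error|warning)\]' || true
done
```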
10. Debugging
Please note that the service containers do not include tools (such as gdb) needed for debugging core dumps. The tools are available within special debug containers. These are needed because service containers disappear after the enclosed services crash. The debug containers are not started by the command line option --all; one has to use the corresponding options to manage them, as explained in one of the prior sections.
The following example illustrates the basic management procedures for the debug containers. The status of all containers can be inspected by:
```
./status --all-debug
[sdfqserv001] qserv czar debug:
[sdfqserv002] qserv worker debug:
[sdfqserv003] qserv worker debug:
...
[sdfqserv001] repl contr debug:
[sdfqserv001] repl worker debug:
[sdfqserv002] repl worker debug:
...
```
The containers can be started by:
```
./start --all-debug
[sdfqserv001] starting qserv czar debug
...
```
The tag names of the Qserv and the Replication containers can differ, as these are configured in the corresponding variables of the file config/env:
```
./status --all-debug
[sdfqserv001] qserv czar debug:
7063e02e500a qserv/lite-qserv:2022.7.1-rc1-39-g572903551 "bash -c 'yum -y ins…" 55 seconds ago Up 54 seconds qserv-czar-debug
[sdfqserv002] qserv worker debug:
ea029cba5225 qserv/lite-qserv:2022.7.1-rc1-39-g572903551 "bash -c 'yum -y ins…" 55 seconds ago Up 54 seconds qserv-worker-debug
[sdfqserv003] qserv worker debug:
3dc36e38a209 qserv/lite-qserv:2022.7.1-rc1-39-g572903551 "bash -c 'yum -y ins…" 55 seconds ago Up 54 seconds qserv-worker-debug
...
[sdfqserv001] repl contr debug:
5c93438fba55 qserv/lite-qserv:2022.7.1-rc1-39-g572903551 "bash -c 'yum -y ins…" 54 seconds ago Up 53 seconds qserv-repl-debug
[sdfqserv001] repl worker debug:
5c93438fba55 qserv/lite-qserv:2022.7.1-rc1-39-g572903551 "bash -c 'yum -y ins…" 54 seconds ago Up 54 seconds qserv-repl-debug
[sdfqserv002] repl worker debug:
813f64a347f9 qserv/lite-qserv:2022.7.1-rc1-39-g572903551 "bash -c 'yum -y ins…" 54 seconds ago Up 54 seconds qserv-repl-debug
...
```
10.1. Core files
Locations of the core dumps on all nodes of the clusters are configured at the OS level:
```
% cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
```
This means the core files are processed and managed by the system service specified in the configuration. The location of the core files under the current configuration is determined by:
```
% cat /usr/lib/tmpfiles.d/systemd.conf | grep core
d /var/lib/systemd/coredump 0755 root root 3d
```
Note the default retention period of 3 days (3d in the configuration shown above). After that period the files get automatically deleted.
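If a core file needs to be kept beyond the retention window, it has to be copied elsewhere before systemd-tmpfiles removes it. A minimal sketch (the helper name and the archive destination in the usage example are hypothetical; run as root):

```shell
# Copy core files younger than the 3-day retention window into an archive
# directory so they survive the automatic cleanup.
archive_recent_cores() {
    src=$1    # e.g. /var/lib/systemd/coredump
    dst=$2    # hypothetical archive location
    mkdir -p "$dst"
    find "$src" -name 'core.*' -mtime -3 -exec cp -p {} "$dst"/ \;
}

# Usage (as root on the affected node; destination path is an assumption):
#   archive_recent_cores /var/lib/systemd/coredump /zfspool/qserv-prod/core_archive
```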
Files in this folder can only be inspected by the user root:
```
% ls -al /var/lib/systemd/coredump
total 833704
drwxr-xr-x. 2 root root     4096 Mar 29 10:40 .
drwxr-xr-x. 5 root root       70 Mar  9 16:49 ..
-rw-r-----+ 1 root root  1304087 Mar 28 12:34 core.cmsd.4367.14031c86fbe543f8b4a583e681cb765c.128528.1680032058000000.lz4
-rw-r-----+ 1 root root  1304316 Mar 28 12:42 core.cmsd.4367.14031c86fbe543f8b4a583e681cb765c.140523.1680032546000000.lz4
-rw-r-----+ 1 root root  1301490 Mar 28 13:52 core.cmsd.4367.14031c86fbe543f8b4a583e681cb765c.173596.1680036747000000.lz4
-rw-r-----+ 1 root root 41856897 Mar 29 10:40 core.mysql-proxy.4367.14031c86fbe543f8b4a583e681cb765c.1991942.1680111646000000.lz4
-rw-r-----+ 1 root root 20278531 Mar 28 18:16 core.mysql-proxy.4367.14031c86fbe543f8b4a583e681cb765c.3593039.1680052562000000.lz4
-rw-r-----+ 1 root root 18726543 Mar 28 18:52 core.mysql-proxy.4367.14031c86fbe543f8b4a583e681cb765c.494831.1680054740000000.lz4
-rw-r-----+ 1 root root  1106592 Mar 28 12:34 core.qserv-replica-m.4367.14031c86fbe543f8b4a583e681cb765c.129492.1680032061000000.lz4
-rw-r-----+ 1 root root  1106701 Mar 28 12:42 core.qserv-replica-m.4367.14031c86fbe543f8b4a583e681cb765c.141485.1680032548000000.lz4
-rw-r-----+ 1 root root  1106167 Mar 28 13:52 core.qserv-replica-m.4367.14031c86fbe543f8b4a583e681cb765c.174406.1680036750000000.lz4
...
```
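Since the dumps are handled by systemd-coredump, the companion coredumpctl tool can list recorded crashes and extract a dump already deflated (it handles the lz4 decompression itself). Whether coredumpctl is installed on these hosts is an assumption, hence the guard:

```shell
# List recorded crashes of mysql-proxy; the commented line shows how to
# extract the newest matching dump, already decompressed. Run as root.
if command -v coredumpctl >/dev/null 2>&1; then
    coredumpctl list mysql-proxy || echo "no mysql-proxy core dumps recorded"
    # coredumpctl dump mysql-proxy -o /tmp/mysql-proxy.core
else
    echo "coredumpctl is not installed on this host"
fi
```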
Note that the files are compressed (observe the .lz4 extension). One would have to deflate the desired file before inspecting it with gdb.
This determines the core file inspection technique presented later in this section.
10.2. Inspecting core dumps
To support inspecting the core files, each host of a given Qserv setup runs a dedicated debug container. The containers have gdb installed and also mount the local folder of the host where the core files are located. Whether for the master services or the worker ones, the container mount option looks like this:
```
-v /var/lib/systemd/coredump:/tmp/core_files:ro
```
All core files that exist on the host will be seen inside the container as:
```
% ls -al /tmp/core_files
total 833704
drwxr-xr-x. 2 root root     4096 Mar 29 10:40 .
drwxr-xr-x. 5 root root       70 Mar  9 16:49 ..
-rw-r-----+ 1 root root  1304087 Mar 28 12:34 core.cmsd.4367.14031c86fbe543f8b4a583e681cb765c.128528.1680032058000000.lz4
-rw-r-----+ 1 root root  1304316 Mar 28 12:42 core.cmsd.4367.14031c86fbe543f8b4a583e681cb765c.140523.1680032546000000.lz4
-rw-r-----+ 1 root root  1301490 Mar 28 13:52 core.cmsd.4367.14031c86fbe543f8b4a583e681cb765c.173596.1680036747000000.lz4
-rw-r-----+ 1 root root 41856897 Mar 29 10:40 core.mysql-proxy.4367.14031c86fbe543f8b4a583e681cb765c.1991942.1680111646000000.lz4
-rw-r-----+ 1 root root 20278531 Mar 28 18:16 core.mysql-proxy.4367.14031c86fbe543f8b4a583e681cb765c.3593039.1680052562000000.lz4
-rw-r-----+ 1 root root 18726543 Mar 28 18:52 core.mysql-proxy.4367.14031c86fbe543f8b4a583e681cb765c.494831.1680054740000000.lz4
-rw-r-----+ 1 root root  1106592 Mar 28 12:34 core.qserv-replica-m.4367.14031c86fbe543f8b4a583e681cb765c.129492.1680032061000000.lz4
-rw-r-----+ 1 root root  1106701 Mar 28 12:42 core.qserv-replica-m.4367.14031c86fbe543f8b4a583e681cb765c.141485.1680032548000000.lz4
-rw-r-----+ 1 root root  1106167 Mar 28 13:52 core.qserv-replica-m.4367.14031c86fbe543f8b4a583e681cb765c.174406.1680036750000000.lz4
...
```
To inspect the core files, one needs to log into the special debug container as user rubinqsv. The container needs to be running on the same node where the core file is located. For example:
```
# Log into the container
docker exec -it qserv-prod-czar-debug bash

# Make a copy of the core file at some writeable location
cp /tmp/core_files/core.mysql-proxy.4367.14031c86fbe543f8b4a583e681cb765c.1991942.1680111646000000.lz4 /tmp/mysql-proxy.core.lz4

# Deflate the file before running GDB
lz4 -d /tmp/mysql-proxy.core.lz4

gdb /usr/local/bin/mysql-proxy /tmp/mysql-proxy.core
```
This will bring the gdb prompt to allow further debugging.
10.3. Attaching gdb to processes
Another option for debugging is to attach gdb to an existing process. The simplest (though not the most efficient) way to do this is to log into the target container as user root, install gdb, quit, log into the container again as user rubinqsv, locate the pid of the process in question, and attach the debugger to the process. Here is an example:
```
docker exec -u root qserv-prod-repl-contr bash -c 'yum -y install gdb'
docker exec qserv-prod-repl-contr bash -c 'ps'
  PID TTY          TIME CMD
    1 ?        00:00:00 bash
    7 ?        00:06:43 qserv-replica-m
  212 ?        00:00:00 ps
docker exec -it qserv-prod-repl-contr bash -c 'gdb -p 7'
(gdb) where
#0  0x00007f8246296fb0 in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f82481b9c6a in std::this_thread::sleep_for<long, std::ratio<1l, 1000l> > (__rtime=...) at /usr/include/c++/8/thread:379
#2  lsst::qserv::util::BlockPost::wait (milliseconds=milliseconds@entry=1220) at /home/gapon/code/qserv/src/util/BlockPost.cc:57
#3  0x00007f82481b9d97 in lsst::qserv::util::BlockPost::wait (this=this@entry=0x7ffdce10c220) at /home/gapon/code/qserv/src/util/BlockPost.cc:49
#4  0x00007f8248f92556 in lsst::qserv::replica::MasterControllerHttpApp::runImpl (this=0x1a8e510) at /home/gapon/code/qserv/src/replica/MasterControllerHttpApp.cc:280
#5  0x00007f8248c71b48 in lsst::qserv::replica::Application::run (this=0x1a8e510) at /home/gapon/code/qserv/src/replica/Application.cc:182
#6  0x00000000004044da in main (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1307
(gdb)
```
TBC...