1. Introduction
Put initial observations on the problem and initial theories to be tested.
TBC...
2. Tests
Explain the testing environment, versions of the code, etc.
The list of worker names is at:
% cat /qserv/replication/work/workers
db01
db02
..
db30
The list of chunk resources for database wise_00:
% head -2 /qserv/replication/work/resources
/chk/wise_00/0
/chk/wise_00/680
% tail -2 /qserv/replication/work/resources
/chk/wise_00/229844
/chk/wise_00/230520
% wc -l /qserv/replication/work/resources
146332 /qserv/replication/work/resources
NOTE: these files are seen from inside the runnable container at:
/qserv/work/workers
/qserv/work/resources
2.1. Simple stress tests
2.1.1. Performance of the worker status requests
These requests return a JSON object with the status of the worker's schedulers as well as a list of resources that are in flight.
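For illustration only, the response is expected to have roughly the following shape. The layout and field names below are an assumption, except for the scheduler counters num_tasks_in_queue and num_tasks_in_flight, which appear in the method discussed in Section 3:
{
  "schedulers": [
    {"name": "...", "num_tasks_in_queue": 0, "num_tasks_in_flight": 0}
  ],
  "resources": ["/chk/wise_00/0", "..."]
}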
2.1.1.1. Single stream of requests
Restart Qserv. Then launch:
date; run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null; date
Mon Apr 22 18:17:19 CDT 2019
Mon Apr 22 18:20:05 CDT 2019
Performance: 10000 * 30 = 300k requests in 2 min 46 sec (166 sec) => 300000 / 166 ≈ 1807 requests/sec.
Monitoring network traffic (packets per second) while running the application:
% netstat -i | grep ens4f0; date
ens4f0 1500 495760516 0 324 0 469759742 0 0 0 BMRU
Mon Apr 22 18:24:36 CDT 2019
% netstat -i | grep ens4f0; date
ens4f0 1500 496095182 0 324 0 470095334 0 0 0 BMRU
Mon Apr 22 18:25:06 CDT 2019
Network:
- received: 11155 packets/sec
- sent: 11186 packets/sec
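These rates follow from the two netstat snapshots above, which were taken 30 seconds apart:
- received: (496095182 - 495760516) / 30 sec = 334666 / 30 ≈ 11155 packets/sec
- sent: (470095334 - 469759742) / 30 sec = 335592 / 30 ≈ 11186 packets/sec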
Resource utilization:
% top
   PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM   TIME+ COMMAND
591551 qserv 20  0 7254820 560112 10960 R 482.4  0.2 8:38.58 qserv-worker-st
Check if all workers are alive:
% run.sh qserv-replica-job-health
ClusterHealth::runImpl  job finished: FINISHED::SUCCESS
STATUS
worker qserv replication
------ ----- -----------
db01   UP    UP
...
db30   UP    UP

real 0m1.622s
user 0m0.083s
sys  0m0.077s
2.1.1.2. Multiple streams of requests
Launch in parallel (detach from each container using the Control-P Control-Q sequence after launching it):
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
Network:
- received: 71465 packets/sec
- sent: 71489 packets/sec
Monitoring resources:
% top
   PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM   TIME+ COMMAND
595213 qserv 20  0   16.0g 437324 10900 R 418.8  0.2 0:11.52 qserv-worker-st
594841 qserv 20  0   16.2g 552768 10908 S 412.5  0.2 0:43.81 qserv-worker-st
594529 qserv 20  0 4722948 195448 10960 R 400.0  0.1 1:15.97 qserv-worker-st
594670 qserv 20  0 5516404 176636 10904 R 387.5  0.1 0:57.91 qserv-worker-st
Checked that all workers were still alive. And, indeed, they were!
The duration of each operation was measured by extracting the first and last timestamps (head -2 and tail -2) from the containers' log files. Each stream finished within 2 minutes and 10 seconds, resulting in 4 * 300,000 / 130 sec ≈ 9230 requests/sec.
2.1.2. Performance of requests probing chunk resources /chk/wise_00/<chunk>
These requests send a TEST_ECHO message to the resources. The message is intercepted by the workers and reported back as the following error:
% run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=1
connected to service provider at: localhost:1094
2019-04-23T00:54:25.979Z LWP 1  DEBUG QservRequest  constructed instances: 1
2019-04-23T00:54:26.010Z LWP 17 DEBUG QservRequest::ProcessResponse  eInfo.rType: 4(isStream), eInfo.blen: 0
2019-04-23T00:54:26.011Z LWP 17 DEBUG QservRequest::ProcessResponse  ** REQUESTING RESPONSE DATA **
2019-04-23T00:54:26.011Z LWP 18 DEBUG QservRequest::ProcessResponseData  eInfo.isOK: 0
2019-04-23T00:54:26.012Z LWP 18 ERROR QservRequest::ProcessResponseData  ** FAILED **  eInfo.Get(): Failed to decode TaskMsg on resource db=wise_00 chunkId=0, eInfo.GetArg(): 0
status: ERROR
error:  Failed to decode TaskMsg on resource db=wise_00 chunkId=0
2019-04-23T00:54:26.182Z LWP 1  DEBUG TestEchoQservRequest  ** DELETED **
2.1.2.1. Single stream of requests
Launch:
date; run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null; date
Mon Apr 22 19:56:52 CDT 2019
Mon Apr 22 19:57:57 CDT 2019
Performance: 146332 requests in 65 sec => 146332 / 65 ≈ 2251 requests/sec.
And, according to a report from qserv-replica-job-health, all workers were still alive.
Network performance wasn't measured because it's expected to be the same as for the previously tested worker-specific requests.
2.1.2.2. Multiple streams of requests
Launch:
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null &
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null &
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null &
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null &
Performance: 4 * 146332 requests in 60 sec => 4 * 146332 / 60 ≈ 9755 requests/sec.
And all workers were still up after all tests finished.
2.1.3. Performance of the mixed types of requests
These tests included 3 types of probes:
- /worker/<id> TEST_ECHO
- /worker/<id> GET_STATUS
- /chk/wise_00/<chunk> TEST_ECHO
Launched the following tests:
% run-dettached.sh qserv-worker-status /qserv/work/workers 100000 --num-workers=30 >& /dev/null &
% run-dettached.sh qserv-worker-perf /qserv/work/workers 100000 --num-workers=30 >& /dev/null &
% run-dettached.sh qserv-worker-perf-chunks /qserv/work/resources 10 123 --num-resources=146332 >& /dev/null &
Then ran qserv-replica-job-health to check that all workers were up. Worker db14 didn't respond. The probe was repeated to make sure it returned a consistent result.
Inspected the redirector's log and found:
% tail /qserv/log/xrootd.log
[2019-04-23T01:20:45.315Z] [LWP:568]  INFO  xrdssi.msgs (cmsd:0) - Node: 141.142.181.145 service suspended
[2019-04-23T01:20:45.321Z] [LWP:1839] INFO  xrdssi.msgs (cmsd:0) - Record: client defered; eligible servers suspended for /worker/db14
..
Then logged onto qserv-db14 and found that xrootd was dead.
3. Resolution
The origin of the problem was a deadlock introduced in the method:
nlohmann::json SchedulerBase::statusToJson() {
    std::lock_guard<std::mutex> lock(_countsMutex);
    ...
    status["num_tasks_in_queue"] = getSize();
    status["num_tasks_in_flight"] = getInFlight();
    ...
}
The method getSize() acquires another lock, and a caller of SchedulerBase::statusToJson may already be holding that very same (second) lock, so the two locks could end up being taken in opposite orders.
The solution was to call the virtual methods getSize and getInFlight outside of the locked section.
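For reference, a minimal sketch of the reworked method is shown below. The surrounding class layout is assumed for illustration; only the two counters, the mutex name, and the method names come from the fragment above:

#include <cstddef>
#include <mutex>
#include <nlohmann/json.hpp>

// Sketch of the fix (class layout assumed for illustration). The virtual
// accessors lock their own internal mutexes, so they are evaluated before
// _countsMutex is taken, which removes the possibility of acquiring the
// two locks in opposite orders.
class SchedulerBase {
public:
    virtual ~SchedulerBase() = default;

    nlohmann::json statusToJson() {
        // Call the virtual methods outside of the lock area.
        std::size_t const numTasksInQueue = getSize();
        std::size_t const numTasksInFlight = getInFlight();

        nlohmann::json status;
        std::lock_guard<std::mutex> lock(_countsMutex);
        // ... other counters protected by _countsMutex are reported here ...
        status["num_tasks_in_queue"] = numTasksInQueue;
        status["num_tasks_in_flight"] = numTasksInFlight;
        return status;
    }

protected:
    virtual std::size_t getSize() const = 0;      // acquires its own lock
    virtual std::size_t getInFlight() const = 0;  // acquires its own lock

private:
    std::mutex _countsMutex;
};

With this ordering a thread never holds _countsMutex while requesting a scheduler's internal lock, so the status probes from the tests above can no longer wedge the worker.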