...
Restart Qserv. Then launch:
```
date; run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null; date
Mon Apr 22 18:17:19 CDT 2019
Mon Apr 22 18:20:05 CDT 2019
```
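The elapsed time between the two date stamps gives the throughput directly. A minimal sketch of that arithmetic (assumes GNU date; 30 workers x 10,000 requests = 300,000 total):

```bash
# Compute requests/sec from the two timestamps printed above (GNU date assumed).
start=$(date -d 'Mon Apr 22 18:17:19 CDT 2019' +%s)
end=$(date -d 'Mon Apr 22 18:20:05 CDT 2019' +%s)
# 30 workers x 10000 requests = 300000 total requests in (end - start) = 166 s
echo "$(( 300000 / (end - start) )) requests/sec"   # ~1807 requests/sec
```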
...
Check if all workers are alive:
```
% time run.sh qserv-replica-job-health
ClusterHealth::runImpl job finished: FINISHED::SUCCESS
STATUS
worker qserv replication
------ ----- -----------
db01   UP    UP
...
db30   UP    UP

real 0m1.622s
user 0m0.083s
sys  0m0.077s
```
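For longer test campaigns this check is easy to script. A sketch that flags any worker not fully UP, assuming the three-column table format shown above:

```bash
# Scan the health report for workers that are not fully UP.
# Rows are matched by the dbNN worker-name pattern, which skips the headers.
run.sh qserv-replica-job-health \
  | awk '$1 ~ /^db[0-9]+$/ && ($2 != "UP" || $3 != "UP") { print "not healthy:", $0; bad=1 }
         END { exit bad }'
```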
Multiple streams of requests
Restart Qserv. Then launch in parallel (detaching each container with the Control-P Control-Q sequence after launching it):
```
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
```
...
Performance of the operations was measured by extracting the first and last two lines (head -2 and tail -2) of each container's log file. Each stream finished within 2 minutes and 10 seconds, resulting in 4 * 300,000 / 130 sec = 9230 requests/second.
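A sketch of that measurement, assuming the per-container logs live in a directory such as /qserv/log/containers (hypothetical path) with timestamps on the first and last lines:

```bash
# Pull the first and last two lines of each detached container's log to get
# per-stream start/end timestamps. LOGDIR is an assumption; adjust to the real path.
LOGDIR=/qserv/log/containers
for f in "$LOGDIR"/*.log; do
    echo "== $f"; head -2 "$f"; tail -2 "$f"
done
# Aggregate rate for 4 streams of 30 x 10000 requests finishing in ~130 s:
echo "$(( 4 * 300000 / 130 )) requests/sec"   # ~9230
```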
Performance of requests probing chunk resources /chk/wise_00/<chunk>
These requests send a TEST_ECHO message to the resources. The message is intercepted by the workers and reported back as the following error:
```
% run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=1
connected to service provider at: localhost:1094
2019-04-23T00:54:25.979Z LWP 1 DEBUG QservRequest constructed instances: 1
2019-04-23T00:54:26.010Z LWP 17 DEBUG QservRequest::ProcessResponse eInfo.rType: 4(isStream), eInfo.blen: 0
2019-04-23T00:54:26.011Z LWP 17 DEBUG QservRequest::ProcessResponse ** REQUESTING RESPONSE DATA **
2019-04-23T00:54:26.011Z LWP 18 DEBUG QservRequest::ProcessResponseData eInfo.isOK: 0
2019-04-23T00:54:26.012Z LWP 18 ERROR QservRequest::ProcessResponseData ** FAILED ** eInfo.Get(): Failed to decode TaskMsg on resource db=wise_00 chunkId=0, eInfo.GetArg(): 0
status: ERROR
error: Failed to decode TaskMsg on resource db=wise_00 chunkId=0
2019-04-23T00:54:26.182Z LWP 1 DEBUG TestEchoQservRequest ** DELETED **
```
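Since "Failed to decode TaskMsg" is the expected reply for TEST_ECHO on a chunk resource, a probe can be validated by grepping for it. A rough sketch:

```bash
# The decode error above is the expected response here, so treat its presence
# as success and anything else as a real failure.
run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=1 2>&1 \
  | grep -q 'Failed to decode TaskMsg' && echo OK || echo UNEXPECTED
```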
Single stream of requests
Launch:
```
date; run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null; date
Mon Apr 22 19:56:52 CDT 2019
Mon Apr 22 19:57:57 CDT 2019
```
Performance: 146332 / 65 = 2251 requests/sec.
According to a report from qserv-replica-job-health, all workers were still alive.
Network performance wasn't measured because it's expected to be the same as for the previously tested worker-specific requests.
Multiple streams of requests
Launch:
```
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null &
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null &
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null &
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null &
```
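The four identical launches can also be scripted. A sketch, assuming run-dettach.sh returns once its container has been detached:

```bash
# Kick off four parallel streams of chunk probes.
for i in 1 2 3 4; do
    run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 \
        --num-resources=146332 >& /dev/null &
done
wait   # all four launcher processes have exited; the containers keep running detached
```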
Performance: 4 * 146332 / 60 = 9755 requests/sec.
All workers were still up after the tests finished.
Performance of mixed types of requests
These tests included 3 types of probes:
- /worker/<id> TEST_ECHO
- /worker/<id> GET_STATUS
- /chk/wise_00/<chunk> TEST_ECHO
Launched the following tests:
```
% run-dettach.sh qserv-worker-status /qserv/work/workers 100000 --num-workers=30 >& /dev/null &
% run-dettach.sh qserv-worker-perf /qserv/work/workers 100000 --num-workers=30 >& /dev/null &
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 10 123 --num-resources=146332 >& /dev/null &
```
Then ran qserv-replica-job-health to check that all resources were up. Worker db14 didn't respond. The probe was repeated to confirm the result was consistent.
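A sketch of such a repeated probe, assuming worker rows in the report start with the worker name as shown earlier:

```bash
# Re-run the health probe a few times to confirm db14 is consistently down.
for i in 1 2 3; do
    echo "probe $i:"
    run.sh qserv-replica-job-health | grep '^db14'
    sleep 5
done
```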
Inspected the redirector's log and found:
```
% tail /qserv/log/xrootd.log
[2019-04-23T01:20:45.315Z] [LWP:568] INFO xrdssi.msgs (cmsd:0) - Node: 141.142.181.145 service suspended
[2019-04-23T01:20:45.321Z] [LWP:1839] INFO xrdssi.msgs (cmsd:0) - Record: client defered; eligible servers suspended for /worker/db14
..
```
Then logged onto qserv-db14 and confirmed that the xrootd process was down:
```
ps -ef | grep xrootd
```
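The same check can be extended to every worker node to catch other dead xrootd processes. A sketch, assuming the nodes are reachable over ssh as qserv-db01 through qserv-db30 (hypothetical hostnames inferred from the naming above):

```bash
# Probe each worker node for a running xrootd process.
for n in $(seq -w 1 30); do
    ssh "qserv-db$n" 'pgrep -x xrootd > /dev/null' || echo "xrootd down on qserv-db$n"
done
```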