...

Restart Qserv. Then launch:

Code Block
date; run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null; date
Mon Apr 22 18:17:19 CDT 2019
Mon Apr 22 18:20:05 CDT 2019

...

Check if all workers are alive:

Code Block
% run.sh  qserv-replica-job-health
ClusterHealth::runImpl job finished: FINISHED::SUCCESS

  STATUS
  
   worker   qserv   replication
   ------   -----   ----------- 
   db01     UP      UP              
   ...       
   db30     UP      UP         

real	0m1.622s
user	0m0.083s
sys	0m0.077s

Multiple streams of requests

Restart Qserv. Then launch in parallel (detach from each container with the Control-P Control-Q sequence after launching it):

Code Block
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null

...

Performance was measured by extracting the first and last two lines (head -2 and tail -2) of each container's log file. Each stream finished within 2 minutes and 10 seconds, giving 4 * 300,000 / 130 sec = 9230 requests/second.
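
The throughput arithmetic can be reproduced with a small helper. This is a minimal sketch: the function name aggregate_rate is hypothetical and not part of the Qserv tools, and it assumes that each stream issues 10000 requests to each of 30 workers (300,000 requests per stream) and that all streams finish within the same wall-clock window.

```shell
# Hypothetical helper (not part of the Qserv tools): aggregate request rate
# for N parallel streams that all finish within the same wall-clock window.
aggregate_rate() {
    streams=$1        # number of parallel streams
    per_stream=$2     # requests issued by each stream
    seconds=$3        # wall-clock duration in seconds
    # int() truncates the fractional part, matching the figures quoted here.
    awk -v n="$streams" -v r="$per_stream" -v t="$seconds" \
        'BEGIN { printf "%d\n", int(n * r / t) }'
}

# 4 streams x (10000 requests x 30 workers) in 2 min 10 s = 130 s
aggregate_rate 4 300000 130
```

The result is truncated rather than rounded, which is how the request rates on this page appear to have been reported.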

Performance of requests probing chunk resources /chk/wise_00/<chunk>

These requests send a TEST_ECHO message to the resources. The workers intercept this message and report it back as the following error:

Code Block
% run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=1
connected to service provider at: localhost:1094
2019-04-23T00:54:25.979Z  LWP 1     DEBUG  QservRequest  constructed  instances: 1
2019-04-23T00:54:26.010Z  LWP 17    DEBUG  QservRequest::ProcessResponse    eInfo.rType: 4(isStream), eInfo.blen: 0
2019-04-23T00:54:26.011Z  LWP 17    DEBUG  QservRequest::ProcessResponse  ** REQUESTING RESPONSE DATA **
2019-04-23T00:54:26.011Z  LWP 18    DEBUG  QservRequest::ProcessResponseData  eInfo.isOK: 0
2019-04-23T00:54:26.012Z  LWP 18    ERROR  QservRequest::ProcessResponseData  ** FAILED **  eInfo.Get(): Failed to decode TaskMsg on resource db=wise_00 chunkId=0, eInfo.GetArg(): 0
status: ERROR
error:  Failed to decode TaskMsg on resource db=wise_00 chunkId=0
2019-04-23T00:54:26.182Z  LWP 1     DEBUG  TestEchoQservRequest  ** DELETED **

Single stream of requests

Launch:

Code Block
date; run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null; date
Mon Apr 22 19:56:52 CDT 2019
Mon Apr 22 19:57:57 CDT 2019

Performance: 146332 / 65 = 2251 requests/sec.
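
The 65-second duration comes from the two timestamps printed by date around the run. A minimal sketch of that calculation follows; the helper name elapsed_seconds is hypothetical, and it assumes both timestamps fall on the same day.

```shell
# Hypothetical helper: seconds between two HH:MM:SS timestamps of the same day.
elapsed_seconds() {
    echo "$1 $2" | awk '{
        split($1, a, ":"); split($2, b, ":")
        print (b[1] * 3600 + b[2] * 60 + b[3]) - (a[1] * 3600 + a[2] * 60 + a[3])
    }'
}

# Start and end times taken from the `date` output of the single-stream run.
t=$(elapsed_seconds 19:56:52 19:57:57)
echo "elapsed: $t s"

# Request rate, truncated to a whole number of requests per second.
awk -v n=146332 -v t="$t" 'BEGIN { printf "rate: %d requests/sec\n", int(n / t) }'
```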

According to the report from qserv-replica-job-health, all workers were still alive.

Network performance wasn't measured because it's expected to be the same as for the previously tested worker-specific requests.

Multiple streams of requests

Launch:

Code Block
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null&
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null&
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null&
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null&

Performance: 4 * 146332 / 60 = 9755 requests/sec.

And all workers were still up after all tests finished.

Performance of the mixed types of requests

These tests included 3 types of probes:

  • /worker/<id> TEST_ECHO
  • /worker/<id> GET_STATUS 
  • /chk/wise_00/<chunk> TEST_ECHO

Launched the following tests:

Code Block
% run-dettached.sh qserv-worker-status /qserv/work/workers 100000 --num-workers=30 >& /dev/null&
% run-dettached.sh qserv-worker-perf /qserv/work/workers 100000 --num-workers=30 >& /dev/null&
% run-dettached.sh qserv-worker-perf-chunks /qserv/work/resources 10 123 --num-resources=146332 >& /dev/null&

Then ran qserv-replica-job-health to see if all resources were up. Worker db14 didn't respond. The probe was repeated to confirm that the result was consistent.
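
Spotting a non-responding worker in a 30-row report can be automated with a small filter. This is a sketch under stated assumptions: the function name down_workers is hypothetical, the report is assumed to keep the three-column layout (worker, qserv, replication) shown earlier on this page, and any value other than UP in either status column is treated as a failure, since the exact text printed for a non-responding worker is not shown here.

```shell
# Hypothetical filter (not part of the Qserv tools): print the names of workers
# whose qserv or replication column is anything other than "UP".
# Assumes rows of the form:  <worker>  <qserv status>  <replication status>
# and worker names of the form db<NN>, as in the report shown on this page.
down_workers() {
    awk '$1 ~ /^db[0-9]+$/ && ($2 != "UP" || $3 != "UP") { print $1 }'
}
```

It would be used as a pipeline stage, for example: run.sh qserv-replica-job-health | down_workers. Empty output means every listed worker reported UP in both columns.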

Inspected redirector's log and found:

Code Block
% tail /qserv/log/xrootd.log
[2019-04-23T01:20:45.315Z] [LWP:568] INFO  xrdssi.msgs (cmsd:0) - Node: 141.142.181.145 service suspended
[2019-04-23T01:20:45.321Z] [LWP:1839] INFO  xrdssi.msgs (cmsd:0) - Record: client defered; eligible servers suspended for /worker/db14
..

Then logged onto qserv-db14 and found that xrootd was down as per:

Code Block
ps -ef | grep xrootd