1. Introduction

Put initial observations on the problem and initial theories to be tested.

TBC...

2. Tests

Explain the testing environment, versions of the code, etc.

The list of worker names is at:

% cat /qserv/replication/work/workers 
db01
db02

..
db30

The list of chunk resources for database wise_00 is at:

% head -2 /qserv/replication/work/resources 
/chk/wise_00/0
/chk/wise_00/680
% tail -2 /qserv/replication/work/resources 
/chk/wise_00/229844
/chk/wise_00/230520
% wc -l /qserv/replication/work/resources 
146332 /qserv/replication/work/resources

NOTE: these files are seen from inside the running container at:

/qserv/work/workers
/qserv/work/resources


2.1. Simple stress tests

2.1.1. Performance of the worker status requests

These requests return a JSON object with the status of the worker schedulers as well as the list of resources which are currently in flight.
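
For illustration only, here is a minimal sketch of how such a per-worker status object might be assembled with nlohmann::json. The counter names num_tasks_in_queue and num_tasks_in_flight come from SchedulerBase::statusToJson quoted in section 3; the scheduler name, the resources_in_flight array and all sample values are hypothetical.

#include <iostream>
#include <nlohmann/json.hpp>

int main() {
    // One entry per scheduler; the two counters below match the fields set in
    // SchedulerBase::statusToJson (section 3). The name and values are made up.
    nlohmann::json scheduler;
    scheduler["name"] = "ScanSchedulerFast";
    scheduler["num_tasks_in_queue"] = 2;
    scheduler["num_tasks_in_flight"] = 5;

    // The overall worker status: scheduler states plus the list of resources
    // currently in flight (hypothetical shape of the payload).
    nlohmann::json worker;
    worker["schedulers"] = nlohmann::json::array({scheduler});
    worker["resources_in_flight"] =
        nlohmann::json::array({"/chk/wise_00/0", "/chk/wise_00/680"});

    std::cout << worker.dump(2) << std::endl;
}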

2.1.1.1. Single stream of requests

Restart Qserv. Then launch:

date; run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null; date
Mon Apr 22 18:17:19 CDT 2019
Mon Apr 22 18:20:05 CDT 2019

Performance: 10000 * 30 = 300k requests in 2 min 46 sec (166 sec) => 1807 requests/sec.

Monitoring network traffic (packets per second) while running the application:

% netstat -i | grep ens4f0; date
ens4f0           1500 495760516      0    324 0      469759742      0      0      0 BMRU
Mon Apr 22 18:24:36 CDT 2019

% netstat -i | grep ens4f0; date
ens4f0           1500 496095182      0    324 0      470095334      0      0      0 BMRU
Mon Apr 22 18:25:06 CDT 2019

Network (rates derived from the deltas of the RX/TX packet counters between the two netstat snapshots taken 30 seconds apart):

  • received: (496095182 - 495760516) / 30 ≈ 11155 packets/sec
  • sent: (470095334 - 469759742) / 30 ≈ 11186 packets/sec

Resource utilization:

% top

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 591551 qserv     20   0 7254820 560112  10960 R 482.4  0.2   8:38.58 qserv-worker-st

Check if all workers are alive:

% run.sh  qserv-replica-job-health
ClusterHealth::runImpl job finished: FINISHED::SUCCESS

  STATUS
  
   worker   qserv   replication
   ------   -----   ----------- 
   db01     UP      UP              
   ...       
   db30     UP      UP         

real	0m1.622s
user	0m0.083s
sys	0m0.077s

2.1.1.2. Multiple streams of requests

Launch in parallel (detach containers using the Control-P Control-Q sequence after launching each one):

% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null
% run-dettach.sh qserv-worker-status /qserv/work/workers 10000 --num-workers=30 >& /dev/null

Network:

  • received: 71465 packets/sec
  • sent: 71489 packets/sec

Monitoring resources:

% top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 595213 qserv     20   0   16.0g 437324  10900 R 418.8  0.2   0:11.52 qserv-worker-st
 594841 qserv     20   0   16.2g 552768  10908 S 412.5  0.2   0:43.81 qserv-worker-st
 594529 qserv     20   0 4722948 195448  10960 R 400.0  0.1   1:15.97 qserv-worker-st
 594670 qserv     20   0 5516404 176636  10904 R 387.5  0.1   0:57.91 qserv-worker-st

Checked that all workers were still alive. And, indeed, they were!

Performance of the operations was measured by extracting the head -2 and tail -2 timestamps from the containers' log files. Each stream finished within 2 minutes and 10 seconds, resulting in 4 * 300,000 / 130 sec ≈ 9230 requests/second.

2.1.2. Performance of requests probing chunk resources /chk/wise_00/<chunk>

These requests send a TEST_ECHO message to the resources. The message is intercepted by the workers and reported back as the following error:

% run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=1
connected to service provider at: localhost:1094
2019-04-23T00:54:25.979Z  LWP 1     DEBUG  QservRequest  constructed  instances: 1
2019-04-23T00:54:26.010Z  LWP 17    DEBUG  QservRequest::ProcessResponse    eInfo.rType: 4(isStream), eInfo.blen: 0
2019-04-23T00:54:26.011Z  LWP 17    DEBUG  QservRequest::ProcessResponse  ** REQUESTING RESPONSE DATA **
2019-04-23T00:54:26.011Z  LWP 18    DEBUG  QservRequest::ProcessResponseData  eInfo.isOK: 0
2019-04-23T00:54:26.012Z  LWP 18    ERROR  QservRequest::ProcessResponseData  ** FAILED **  eInfo.Get(): Failed to decode TaskMsg on resource db=wise_00 chunkId=0, eInfo.GetArg(): 0
status: ERROR
error:  Failed to decode TaskMsg on resource db=wise_00 chunkId=0
2019-04-23T00:54:26.182Z  LWP 1     DEBUG  TestEchoQservRequest  ** DELETED **

2.1.2.1. Single stream of requests

Launch:

date; run.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null; date
Mon Apr 22 19:56:52 CDT 2019
Mon Apr 22 19:57:57 CDT 2019

Performance: 146332 / 65 = 2251 requests/sec.

And, according to a report from qserv-replica-job-health, all workers were still alive.

Network performance wasn't measured because it's expected to be the same as for the previously tested worker-specific requests.

2.1.2.2. Multiple streams of requests

Launch:

% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null&
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null&
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null&
% run-dettach.sh qserv-worker-perf-chunks /qserv/work/resources 1 123 --num-resources=146332 >& /dev/null&

Performance: 4 * 146332 / 60 = 9755 requests/sec.

And all workers were still up after all tests finished.

2.1.3. Performance of the mixed types of requests

These tests included 3 types of probes:

  • /worker/<id> TEST_ECHO
  • /worker/<id> GET_STATUS 
  • /chk/wise_00/<chunk> TEST_ECHO

Launched the following tests:

% run-dettached.sh qserv-worker-status /qserv/work/workers 100000 --num-workers=30 >& /dev/null&
% run-dettached.sh qserv-worker-perf /qserv/work/workers 100000 --num-workers=30 >& /dev/null&
% run-dettached.sh qserv-worker-perf-chunks /qserv/work/resources 10 123 --num-resources=146332 >& /dev/null&

Then ran qserv-replica-job-health to see if all workers were still up. Worker db14 didn't respond. The probe was repeated to ensure it returned a consistent result.

Inspected the redirector's log and found:

% tail /qserv/log/xrootd.log
[2019-04-23T01:20:45.315Z] [LWP:568] INFO  xrdssi.msgs (cmsd:0) - Node: 141.142.181.145 service suspended
[2019-04-23T01:20:45.321Z] [LWP:1839] INFO  xrdssi.msgs (cmsd:0) - Record: client defered; eligible servers suspended for /worker/db14
..

Then logged onto qserv-db14 and found that xrootd was dead.

3. Resolution

The origin of the problem was a deadlock introduced in the following method:

nlohmann::json SchedulerBase::statusToJson() {
    std::lock_guard<std::mutex> lock(_countsMutex);
    ...
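    // At this point _countsMutex is held. The virtual calls below acquire a
    // second mutex, while a caller of this method may already hold that same
    // mutex (a lock-order inversion that can deadlock, as explained below).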
    status["num_tasks_in_queue"] = getSize();
    status["num_tasks_in_flight"] = getInFlight();
    ...
}

Method getSize() acquires another (second) mutex. A caller of SchedulerBase::statusToJson may already hold that very same second mutex, so two threads can end up acquiring the two mutexes in the opposite order and deadlock.

The solution was to call the virtual methods getSize and getInFlight outside of the section protected by the lock.
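
A minimal sketch of what the reordered method could look like, mirroring the excerpt above (the actual fix in the code base may differ in details):

nlohmann::json SchedulerBase::statusToJson() {
    // Call the virtual methods first, before _countsMutex is acquired, so the
    // second mutex they take is never waited on while _countsMutex is held.
    auto const numTasksInQueue = getSize();
    auto const numTasksInFlight = getInFlight();

    std::lock_guard<std::mutex> lock(_countsMutex);
    ...
    status["num_tasks_in_queue"] = numTasksInQueue;
    status["num_tasks_in_flight"] = numTasksInFlight;
    ...
}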

