The Worker Scheduler


The worker scheduler has complicated requirements, which result in complicated behavior. The primary purpose of the scheduler is the Shared Scan: collect all the queries that need a particular chunk and then run them together when it comes time to read that chunk into memory. This should minimize I/O and increase system efficiency.

Basic requirements:

  • Full table scan queries against the Object table are supposed to complete within 1 hour.

  • Full table scan queries against the Source table are supposed to complete within 8 hours.

  • Full table scan queries against the ForcedSource table are supposed to complete within 12 hours.

  • Simple user queries (a.k.a. interactive queries) that only require a single chunk are supposed to complete within 10 seconds.

The scheduler has been built to meet these requirements under varied operating conditions using a limited thread pool and the wsched scheduler classes. The scheduler works with memman to maximize the use of memory and improve MySQL performance for the queries.


Query Analysis and ScanRating


The czar ranks queries during query analysis to determine if the query is interactive and what its ScanRating should be.

A query that will only hit a few chunks, or has a WHERE clause that limits it to a primary key value, such as “WHERE objectid = 1234”, is considered interactive. The czar gives interactive queries a ScanRating of zero and flags them as interactive.

If a query is not interactive, it needs a ScanRating greater than zero, and the interactive flag cannot be set. The ScanRating is determined by the tables used in the query. The ScanRating of each table is set in the qservCSSData database in an entry that should look something like the following. "lockInMem":"1" indicates this is a table with chunks that should be locked in memory. kvId and parentKvId are internal to the database. The entries in the database need to be set by hand when the database is ingested.

kvId | kvKey                                                             | kvVal                               | parentKvId
29   | /DBS/qservTest_case01_qserv/TABLES/Object/sharedScan/.packed.json | {"lockInMem":"1","scanRating":"1"}  | 28
36   | /DBS/qservTest_case01_qserv/TABLES/Source/sharedScan/.packed.json | {"lockInMem":"1","scanRating":"15"} | 35

For this case, with three main tables (the code allows for an arbitrary number of tables), the values should be:

  • Object Table → ScanRating = 1

  • Source Table → ScanRating = 15

  • ForcedSource Table → ScanRating = 25 

As a rule, it is difficult for the system to handle more than 3 schedulers unless it has lots of memory and lots of cores. Multiple tables can therefore go on the same scheduler, but they should be similar in size. If there are 2 tables about the size of Object, the smaller table should have a ScanRating of 1 and the larger one a ScanRating > 1 and <= 10; a value greater than 10 would put it on the Medium scheduler (see table below). Larger tables should have larger ScanRatings to help with memman memory locking (see Running Tasks with Schedulers below). However, two tables having the same ScanRating won't break anything.

The czar uses the highest rating found to determine the ScanRating for the query and this is passed to the worker with the rest of the Job information.
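
A minimal sketch of that rule (queryScanRating and TableInfo are illustrative names, not the actual czar code):

    #include <algorithm>
    #include <vector>

    struct TableInfo {
        int scanRating;  // from qservCSSData, e.g. Object=1, Source=15
    };

    // An interactive query gets a ScanRating of zero; otherwise the query
    // takes the highest rating among the tables it touches.
    int queryScanRating(std::vector<TableInfo> const& tables, bool interactive) {
        if (interactive) return 0;
        int rating = 1;  // a non-interactive query must be greater than zero
        for (auto const& t : tables) rating = std::max(rating, t.scanRating);
        return rating;
    }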

The worker converts the Job to a Task and places it on a scheduler according to the table below. The worker ScanSchedulers have hard-coded ranges for which they are responsible. Note that new queries are never placed on the Snail scheduler; it is used only for poorly formed queries that are wrecking system performance, and is discussed in its own section later.

Scheduler      | interactive | Min     | Max
GroupScheduler | Yes         | ignored | ignored
Fast           | No          | 0       | 10
Medium         | No          | 11      | 20
Slow           | No          | 21      | 30
Snail          | No          | 31      | 100

This has the effect that all queries that use the ForcedSource table go on the Slow ScanScheduler, while queries that use the Source table, but not the ForcedSource table, go on the Medium ScanScheduler. Queries that only use the Object table go on the Fast ScanScheduler.
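
The mapping itself amounts to a range check against the hard-coded Min/Max values. A sketch, with pickScheduler as a hypothetical name (the real logic lives in the wsched scheduler classes):

    enum class Sched { Group, Fast, Medium, Slow };

    // Snail is deliberately absent: new queries are never placed there.
    Sched pickScheduler(bool interactive, int scanRating) {
        if (interactive) return Sched::Group;        // Min/Max are ignored
        if (scanRating <= 10) return Sched::Fast;    // 0-10
        if (scanRating <= 20) return Sched::Medium;  // 11-20
        return Sched::Slow;                          // 21-30
    }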


Memory use and memman


Before going into scheduling, it’s important to know something about memman. In Linux, it is possible to lock files into memory, preventing their pages from being paged out. Qserv knows which files on disk will be used by MariaDB when running the query and uses memman to lock those files in memory before running the query. This makes a single query run almost twice as fast in MariaDB as it would with the files not locked in memory, and prevents additional I/O for the file.
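
The underlying Linux mechanism is mmap plus mlock. The sketch below shows the general idea only, not memman's actual interface (lockFileInMemory is a hypothetical helper):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map a table file read-only and pin its pages in RAM so the kernel
    // cannot page them out while the query runs.
    bool lockFileInMemory(char const* path, void*& addr, size_t& len) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return false;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return false; }
        len = static_cast<size_t>(st.st_size);
        addr = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);  // the mapping survives closing the descriptor
        if (addr == MAP_FAILED) return false;
        if (mlock(addr, len) != 0) {  // needs RLIMIT_MEMLOCK headroom
            munmap(addr, len);
            return false;
        }
        return true;
    }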

There are limits to this. Only half of the memory on the machine can be locked at a time, and there are multiple schedulers competing for that memory. These factors greatly limit the maximum size of a chunk, to about one quarter of half the memory on a worker (for example, on a worker with 128GB of RAM, 64GB can be locked and a chunk should be no larger than roughly 16GB). This results in a very large number of chunks.

The schedulers will normally only run a Task if all the tables in the query for the chunk can be locked in memory. There is one exception to this. To prevent a ScanScheduler from stalling, if it has no Tasks currently running, it will run the Tasks for one chunk even if there is not enough room to lock the chunks. While it does hurt performance, it keeps schedulers from being starved for memory.
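
A sketch of that rule (okToRunChunk is an illustrative name, not the actual scheduler code):

    // A chunk's Tasks normally run only when memman can lock every table
    // they need. The one exception keeps an idle scheduler from stalling.
    bool okToRunChunk(bool memmanCanLockAllTables, int tasksInFlight) {
        if (memmanCanLockAllTables) return true;
        // Starvation exception: nothing is running, so run one chunk
        // unlocked (slower) rather than wait for memory indefinitely.
        return tasksInFlight == 0;
    }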

memman keeps track of which files are locked in memory. It needs to do this so it can release them when they are no longer needed, but also so that other ScanSchedulers can take advantage of files already locked in memory.


Running Tasks with Schedulers


Once a Task is placed on a scheduler, the schedulers determine when it should run. This is a complicated process with several rules to prevent schedulers from stalling, deadlocking, or monopolizing the thread pool.

The individual schedulers are wrapped in the BlendScheduler. BlendScheduler controls the order that the schedulers are checked for Tasks to run on one of the threads in the limited pool. The order, presented below, is somewhat non-intuitive.

  1. GroupScheduler

  2. SlowScheduler

  3. MediumScheduler

  4. FastScheduler

It’s not surprising that the GroupScheduler goes first. It is the highest priority and its queries are expected to be small and fast. The GroupScheduler never locks tables in memory, so it won’t interfere with memman locking of tables for scans.

The Slow Scheduler is checked next. The chunks in its tables are larger than those of the other tables, which makes them more difficult to lock in memory. By being checked before Medium and Fast, it is more likely to get memman to lock its large chunks into memory. If a Task has a join, it needs to lock all of the tables in the join into memory. The scheduler tries to put Tasks that lock multiple chunks at the head of the queue for the chunk, in hopes of avoiding the situation where a Task can't lock one of the tables it needs after another Task finishes.

For example, say the Slow ScanScheduler has 4 Tasks for chunk 109. The first Task locks chunks from all three tables. The second locks ForcedSource and Source chunks. The third uses ForcedSource and Object chunks. The fourth uses just ForcedSource. The scheduler will run them in that order: first, second, third, fourth. If the scheduler is limited to one thread, or there are a lot of Tasks for the chunk, the Object chunk may be released and then need to be relocked. This is set up intentionally so that relocking is more likely to happen with the smaller tables, to reduce the expense; joins between ForcedSource and Source are also expected to be extremely rare. (The code responsible for this is in wsched::ChunkDisk and its MinHeap class. This code could probably be simplified by building the heap using ScanRating.)
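
The ordering in that example can be expressed as a priority queue keyed on how many tables a Task must lock. This is only a sketch of the idea, not the wsched::ChunkDisk code itself:

    #include <queue>
    #include <vector>

    struct Task {
        int chunkId;
        int tablesToLock;  // 3 for an Object+Source+ForcedSource join, etc.
    };

    // Tasks locking the most tables come out first, so the widest joins
    // run while all of their chunks are still locked in memory.
    struct FewerTablesLater {
        bool operator()(Task const& a, Task const& b) const {
            return a.tablesToLock < b.tablesToLock;
        }
    };

    using ChunkTaskQueue =
        std::priority_queue<Task, std::vector<Task>, FewerTablesLater>;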

The Medium ScanScheduler and then the Fast ScanScheduler are checked last. Their tables are smaller and easier to fit into memory, so they don’t need preference to get their tables locked.

The BlendScheduler goes to each ScanScheduler to determine if it has anything ready to run. A scheduler is ready when it has a Task to run, it is not already at its limit of Tasks currently running (inFlight), and memman says there is memory available for locking the files (or the files are already locked in memory for another running Task). If these conditions are met, the Task is run on a thread from the pool and the associated parameters in the schedulers are updated.
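
A sketch of that readiness test (names are illustrative):

    struct SchedulerState {
        bool hasTaskQueued;  // a Task is waiting to run
        int inFlight;        // Tasks currently running on this scheduler
        int inFlightMax;     // this scheduler's limit
        bool memAvailable;   // memman can lock the files, or already has
    };

    // BlendScheduler hands a pool thread to the first scheduler, checked
    // in the order above, for which all three conditions hold.
    bool readyToRun(SchedulerState const& s) {
        return s.hasTaskQueued && s.inFlight < s.inFlightMax && s.memAvailable;
    }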

Once the Task is running on its thread, it opens a database connection. If it can’t open a database connection right away, it waits until one becomes available. Once it has a database connection, it has memman lock the files in memory (if they are not already) and has the database run the query.

MariaDB and MySQL do not provide any result rows until they have completed the entire SQL query. This has been seen in practice and is evidenced by how the proxy works in both databases. This allows the Task to tell the scheduler it is complete once it can read any result rows from the database. The scheduler can then go on to provide another thread for another Task. The scheduler will then allow memman to release the files from memory, but only if it has no more Tasks ready to run. This prevents ScanSchedulers that only have one running thread from inadvertently releasing files from memory when there are more Tasks to run using those files.

While the scheduler is creating a new thread and dealing with memman, the Task does the slow work of transferring the results to the czar. It has to keep the database connection open until it is done transmitting results. To make sure that it doesn’t run out of database connections, the SqlConnectionMgr class is used to track how many connections are in use, and it makes Tasks wait when no more are available. Tasks on the high priority GroupScheduler have a higher connection limit than those on the ScanSchedulers, so that interactive Tasks can still run and complete even while the ScanSchedulers are stuck waiting. (SqlConnectionMgr allows 800 connections, 20 of which are reserved for interactive queries; when the scan schedulers are stuck waiting, they are waiting at 780.)
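
The counting logic amounts to a gate with a carve-out for interactive Tasks. A sketch in the spirit of SqlConnectionMgr, not its actual interface:

    #include <condition_variable>
    #include <mutex>

    class ConnectionGate {
    public:
        // Scan Tasks block at 780 so 20 connections stay free for
        // interactive Tasks, which may use the full 800.
        void acquire(bool interactive) {
            std::unique_lock<std::mutex> lk(_mtx);
            int const limit = interactive ? _total : _total - _reserved;
            _cv.wait(lk, [&] { return _inUse < limit; });
            ++_inUse;
        }
        void release() {
            std::lock_guard<std::mutex> lk(_mtx);
            --_inUse;
            _cv.notify_one();
        }
    private:
        int const _total = 800;
        int const _reserved = 20;
        int _inUse = 0;
        std::mutex _mtx;
        std::condition_variable _cv;
    };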


Scheduler Complications


The BlendScheduler wants all three ScanSchedulers to be running all of the time while using as many threads from the thread pool as possible. It also needs to have a thread available to run a high priority interactive Task from the GroupScheduler whenever one shows up. Each Task runs on one thread.

To do this, the BlendScheduler reserves threads for schedulers. Each scheduler is allowed a minimum of one thread, out of the 20 threads in the pool. Any scheduler can take threads from the pool as long as it doesn’t encroach on the reserved threads for other schedulers.
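
A sketch of the reservation rule (mayTakeThread is an illustrative name):

    // A scheduler may claim another pool thread only while the remaining
    // free threads still cover the one-thread minimum reserved for each
    // of the other schedulers.
    bool mayTakeThread(int poolSize, int threadsInUse, int otherSchedulers) {
        int const freeThreads = poolSize - threadsInUse;  // pool of 20
        return freeThreads > otherSchedulers;  // 1 reserved thread apiece
    }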

When only a few queries are in the system, several threads may be unused. To improve throughput, a scheduler may work on multiple chunks at the same time. See the Fast scheduler in the diagram, which has chunk 61 as the “active” chunk and also has 79 locked in memory so it can run Tasks on both chunks.

This makes better use of the cores. However, the number of chunks that a single scheduler can use needs to be capped, as each scheduler could otherwise use up all of the memory available to memman. That would cause a serious decline in performance for the other schedulers.

Another issue is that a czar can only handle so many incoming results at any given time. The TransmitMgr class is used to limit the number of transmits being sent to a particular czar. This is largely a function of the number of workers. A small pool of workers has a harder time overwhelming a czar, so it allows more transmits per worker. A large pool of workers may only be able to send one transmit at a time. It’s important to note that this is per czar, to protect the czar from the workers. So a large system may have 5 czars, and a limit of 1 transmit per worker, but it can have one transmit to each czar for 5 concurrent transmits per worker.
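
A per-czar limiter in the spirit of TransmitMgr might look like the sketch below (TransmitGate is a hypothetical name, not the real interface):

    #include <condition_variable>
    #include <map>
    #include <mutex>

    class TransmitGate {
    public:
        explicit TransmitGate(int maxPerCzar) : _max(maxPerCzar) {}
        // Blocks while this worker already has _max transmits in flight
        // to the given czar; other czars are counted independently.
        void beginTransmit(int czarId) {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [&] { return _inFlight[czarId] < _max; });
            ++_inFlight[czarId];
        }
        void endTransmit(int czarId) {
            std::lock_guard<std::mutex> lk(_mtx);
            --_inFlight[czarId];
            _cv.notify_all();
        }
    private:
        int const _max;                // e.g. 1 on a large worker pool
        std::map<int, int> _inFlight;  // czar id -> active transmits
        std::mutex _mtx;
        std::condition_variable _cv;
    };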


ScanScheduler Queues


ScanScheduler is the class used to implement Fast, Medium, and Slow schedulers. Different configuration values affect how they run, but the underlying code is the same for each.

It uses a map keyed by chunk id to store the Tasks with the chunk id they need to work on, and it stores the “active” chunk for that scheduler. When no Tasks are running for a scheduler, it starts with the smallest chunk id that has a Task. That becomes the “active” chunk for that scheduler. Using a map makes it easy to add new Tasks for a chunk and also keeps the chunk ids in order. Having them in order makes it easier for humans to follow what is happening.

Once a scheduler has an “active” chunk, new Tasks for that chunk are blocked until the scheduler moves to the next chunk. This prevents the scheduler from getting stuck on a chunk because it keeps getting new Tasks for that chunk. Once all the pre-existing Tasks for a chunk have finished, the scheduler moves to the next largest chunk id, and that becomes the “active” chunk.

When conditions are favorable, a ScanScheduler may work on multiple chunks at the same time. In the diagram, Fast has 61 as its active chunk and is also working on chunk 79. When there are no more pre-existing Tasks for chunk 61, chunk 79 becomes the active chunk. This keeps happening until there are no more Tasks in the map. Wraparound is also handled, so this works properly with “active” chunk 146 and chunk 1.
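
Because std::map keeps the chunk ids ordered, advancing the active chunk with wraparound is a short lookup. A sketch (nextActiveChunk is an illustrative name):

    #include <map>
    #include <vector>

    struct Task { /* ... */ };

    // Return the smallest chunk id greater than the current active chunk,
    // wrapping around to the smallest id in the map (e.g. 146 -> 1).
    int nextActiveChunk(std::map<int, std::vector<Task>> const& chunks,
                        int currentChunk) {
        if (chunks.empty()) return -1;  // nothing queued
        auto it = chunks.upper_bound(currentChunk);
        if (it == chunks.end()) it = chunks.begin();  // wraparound
        return it->first;
    }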


Snail ScanScheduler

Shared scans have a serious weakness in that one slow query will slow down every other query on that scheduler. If the worker identifies a query that is significantly too slow, it boots that query to the Snail ScanScheduler to try to keep performance acceptable for other queries. The Snail scan has the lowest priority for memory and threads, and queries sent there could take a very long time to complete. Unfortunately, the logic to boot user queries consistently could use some work. There is a lack of data for determining proper statistics, and it is difficult to tell whether a query is slow because it is concentrated in a single chunk, because the system is simply busy, or because other events are hurting performance.

