CSS v2

Related epic: DM-2882

Related stories: DM-2966

Problem Description

The existing version of CSS reads all information from Zookeeper (in Python) when czar starts, makes a snapshot, and makes it available to the Qserv C++ layer. All the information is then statically cached throughout the lifetime of czar. If any information in Zookeeper changes, czar will not know. With the introduction of features such as creating/deleting databases, or keeping track of long-running queries, this model of static caching is no longer sufficient.

Information about databases and tables has to be delivered very quickly. Querying CSS for each query in real time is generally undesirable, both for (a) performance and (b) the risk of overloading CSS – there can be O(100-200) calls for various CSS-based information per query.

Information about all databases and tables involved in a single query has to be consistent. The current system reads individual keys one by one. This is OK while the information is all static, but once we switch to a more dynamic system we need to pay attention, because something might be updated or deleted between key fetches. So we need to ensure that all information fetched from CSS for a given query is consistent.

Information does not have to be completely up-to-date, e.g., it is OK to use a few-second-old cache.

Current Implementation 

When a query arrives, Qserv parses it, extracts the list of databases and tables involved, and then asks for various pieces of information pertaining to each of those databases and tables. The CSS information is cached through the Facade, which relies on an underlying memory-based "KVInterface" owned by UserQueryFactory. The cache is initialized as follows:

UserQueryFactory::Impl::initFacade()       at ccontrol/UserQueryFactory.cc:213
UserQueryFactory::Impl::readConfigFacade() at ccontrol/UserQueryFactory.cc:192
UserQueryFactory::UserQueryFactory()       at ccontrol/UserQueryFactory.cc:87
Context::_initFactory()                    at czar/python/app.py:278
Context::_init()                           at czar/python/app.py:259
AppInterface::submitQuery()                at czar/python/appInterface.py:109

_initFactory() is called only once ("if not initialized then initialize"). It creates a Facade object, which is then cached in UserQueryFactory::Impl for the lifetime of czar. As new queries arrive, we create a new QuerySession for each query. QuerySession triggers code in qana (QservRestrictorPlugin.cc, TableInfoPool.cc, MatchTablePlugin.cc, RelationGraph.cc) and query (ChunkMapping.cc); that code requests information from the Facade. Details about these interactions are provided below in Appendix A: Qserv Interactions with CSS. QuerySession then creates a UserQuery object, which does not hold a pointer to the Facade. After creating the UserQuery, the QuerySession goes out of scope. That means that for each query we hold a pointer to the Facade only for a very short time.
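The lifetime described above can be sketched with shared_ptr: the factory owns the Facade for the lifetime of czar, and each QuerySession borrows it only while the query is being analyzed. All class and method names here are simplified stand-ins for the real Qserv classes, not the actual interfaces.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the real CSS Facade.
struct Facade {
    bool containsDb(std::string const& db) const { return db == "LSST"; }
};

// The session shares the Facade pointer only for its own (short) lifetime.
struct QuerySession {
    explicit QuerySession(std::shared_ptr<Facade> f) : _facade(std::move(f)) {}
    bool check(std::string const& db) const { return _facade->containsDb(db); }
private:
    std::shared_ptr<Facade> _facade;
};

struct UserQueryFactory {
    UserQueryFactory() : _facade(std::make_shared<Facade>()) {}  // cached for czar's lifetime
    long facadeUseCount() const { return _facade.use_count(); }
    bool runQuery(std::string const& db) {
        QuerySession session(_facade);  // Facade shared during query analysis only
        return session.check(db);       // session goes out of scope on return
    }
private:
    std::shared_ptr<Facade> _facade;
};
```

Because the UserQuery created by the session does not keep the pointer, the use count drops back to one as soon as analysis finishes.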

Proposed Design

  • Define how often we want to refresh the cache. Configurable, with default = 15 sec.
  • Keep track of when the last refresh was done (in Python land, because only Python land currently knows how to talk to Zookeeper), i.e., in app.py.
  • Instead of one Facade, keep a vector of Facades in UserQueryFactory.
  • When a new query comes in and CSS is due for a refresh, make a new snapshot of the CSS information (fetch everything from Zookeeper) and add a new Facade to UserQueryFactory. All new queries will then use the latest Facade.
  • When older Facades are no longer needed (no QuerySessions hold pointers to them), remove them.

Note that in this design it is unlikely that we will ever have more than two active Facades at any given time, because each query, even a very long one, holds a pointer to a Facade only for a very short time.
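A minimal sketch of the scheme, assuming shared_ptr ownership as in the current code (the FacadePool name and its methods are hypothetical): new queries always get the newest snapshot, and snapshots referenced only by the pool itself are pruned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for a full CSS snapshot.
struct Facade {
    explicit Facade(int v) : version(v) {}
    int version;
};

class FacadePool {
public:
    // A refresh takes a fresh CSS snapshot and makes it the current Facade.
    void refresh() { _facades.push_back(std::make_shared<Facade>(++_version)); }

    // New queries always use the latest snapshot.
    std::shared_ptr<Facade> current() const { return _facades.back(); }

    // Drop snapshots held only by the pool (i.e., no QuerySession references).
    void prune() {
        auto newest = _facades.back();  // never drop the newest snapshot
        _facades.erase(
            std::remove_if(_facades.begin(), _facades.end(),
                           [&](std::shared_ptr<Facade> const& f) {
                               return f != newest && f.use_count() == 1;
                           }),
            _facades.end());
    }

    std::size_t size() const { return _facades.size(); }

private:
    int _version = 0;
    std::vector<std::shared_ptr<Facade>> _facades;
};
```

With this ownership model, "removing older Facades when no longer needed" reduces to checking the shared_ptr use count, which matches the observation that at most two snapshots are typically alive.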

1. Optional optimization: avoid blocking a new query while refreshing CSS

In the above design, when the Facade is due for a refresh, one unlucky query will be blocked until the refresh completes. Based on some limited observations, a refresh takes ~60 milliseconds (local Zookeeper, the small data set from integration test case 01). It might get worse once the cache refresh starts obeying reader/writer locks, which is currently not implemented.

If that becomes a problem, we could trigger the refresh asynchronously rather than when a new query comes in.
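The asynchronous variant could look like the following sketch (hypothetical names): a refresher, e.g. a timer thread, rebuilds the snapshot off the query path, and queries only take a brief lock to grab the current pointer, so no query ever waits for the slow CSS fetch.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Stand-in for a full CSS snapshot.
struct Snapshot { int version; };

class AsyncFacade {
public:
    AsyncFacade() : _current(std::make_shared<Snapshot>(Snapshot{0})) {}

    // Called by a background refresher. Queries keep using the old snapshot
    // while the new one is built; they never block on the fetch itself.
    void refresh() {
        int next;
        {
            std::lock_guard<std::mutex> lock(_mutex);
            next = _current->version + 1;
        }
        // ...the expensive CSS fetch would happen here, outside the lock...
        auto fresh = std::make_shared<Snapshot>(Snapshot{next});
        std::lock_guard<std::mutex> lock(_mutex);
        _current = fresh;  // cheap pointer swap under the lock
    }

    // Queries call this: only a brief lock to copy the shared_ptr.
    std::shared_ptr<Snapshot> current() {
        std::lock_guard<std::mutex> lock(_mutex);
        return _current;
    }

private:
    std::mutex _mutex;
    std::shared_ptr<Snapshot> _current;
};
```

In-flight queries that already copied the old shared_ptr keep a consistent view until they finish, which preserves the per-query consistency requirement.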

2. Optional optimization: fetch only when something changed

Instead of blindly re-fetching every few seconds, first check whether there were any updates in Zookeeper; if nothing changed, don't re-fetch. There is one complication here: Zookeeper keeps a last-modified timestamp for each zk node, but it is not recursive, so there is no single last-updated value we could check. To determine whether anything changed we would have to either scan all nodes and check their timestamps, or introduce a new node that serves as a change flag.
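The flag-node idea can be sketched as follows (all names hypothetical, with a trivial in-memory map standing in for Zookeeper): every writer bumps a single version key, and the cache re-fetches only when that one value differs from what it saw last.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for Zookeeper with a single change-flag node.
struct KvStore {
    std::map<std::string, std::string> data;
    int version = 0;  // value of the hypothetical flag node
    void update(std::string const& key, std::string const& value) {
        data[key] = value;
        ++version;  // every writer bumps the flag
    }
};

class CssCache {
public:
    // Returns true if a full re-fetch was actually performed.
    bool maybeRefresh(KvStore const& store) {
        if (store.version == _seenVersion) return false;  // nothing changed: skip
        _snapshot = store.data;  // full snapshot fetch
        _seenVersion = store.version;
        ++_fetchCount;
        return true;
    }
    int fetchCount() const { return _fetchCount; }
private:
    std::map<std::string, std::string> _snapshot;
    int _seenVersion = -1;
    int _fetchCount = 0;
};
```

The cost of the check is one read of one node per refresh interval, versus a full recursive scan today.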

3. Optional optimization: fetch only what changed

Instead of blindly re-fetching everything, re-fetch only the parts that changed. It is not clear whether the extra complexity is worth the savings.

4. Optional optimization: fetch only what is needed

Instead of making a full snapshot, take a snapshot of the metadata for the most commonly used databases/tables. When a new query comes in and needs metadata about a table or database that the snapshot does not have, fetch it and append it to the existing snapshot.
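A sketch of this on-demand variant (class and callback names are made up for illustration): the snapshot is seeded with the hot set, and anything else is fetched from CSS on first use and appended.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

class LazySnapshot {
public:
    // Callback that fetches one table's metadata from CSS on a cache miss.
    using Fetcher = std::function<std::string(std::string const&)>;

    LazySnapshot(std::map<std::string, std::string> hotSet, Fetcher fetch)
        : _cache(std::move(hotSet)), _fetch(std::move(fetch)) {}

    std::string const& tableMeta(std::string const& table) {
        auto it = _cache.find(table);
        if (it == _cache.end()) {  // miss: fetch from CSS and append to snapshot
            it = _cache.emplace(table, _fetch(table)).first;
        }
        return it->second;
    }

private:
    std::map<std::string, std::string> _cache;
    Fetcher _fetch;
};
```

Note that appending to a snapshot after it was taken weakens the consistency guarantee discussed earlier, so the lazily fetched entries would still need some versioning or locking.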

Make empty-chunk info per-database

I believe information about empty chunks should not be global; it should probably be part of the per-database partitioning information. But that is a separate story that we should handle separately.

The New Design

Per the Database Meeting of 2015-07-15, we will get rid of Zookeeper and use MySQL instead. We will implement a MySQL-based KVInterface in C++ and expose it to the Python layer (thus keeping just one implementation instead of maintaining two). We will extend the KVInterface to support updates. And, finally, we will synchronize CSS updates with the new Query Metadata through locks, to make sure the CSS seen by Qserv is always consistent.
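A rough sketch of what a writable KVInterface could look like; the method names here are approximations, not the actual Qserv interface, and a trivial in-memory implementation stands in for the MySQL-backed one.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Abstract key-value interface; the MySQL-backed implementation would
// derive from this, just like the in-memory stand-in below.
class KvInterface {
public:
    virtual ~KvInterface() = default;
    virtual void create(std::string const& key, std::string const& value) = 0;
    virtual void set(std::string const& key, std::string const& value) = 0;  // update support
    virtual std::string get(std::string const& key) const = 0;
    virtual bool exists(std::string const& key) const = 0;
};

class MemKvInterface : public KvInterface {
public:
    void create(std::string const& key, std::string const& value) override {
        if (_data.count(key)) throw std::runtime_error("key exists: " + key);
        _data[key] = value;
    }
    void set(std::string const& key, std::string const& value) override {
        _data[key] = value;  // create-or-overwrite, unlike create()
    }
    std::string get(std::string const& key) const override { return _data.at(key); }
    bool exists(std::string const& key) const override { return _data.count(key) != 0; }
private:
    std::map<std::string, std::string> _data;
};
```

Exposing one C++ implementation to Python (e.g. via SWIG, which Qserv already uses elsewhere) is what lets us retire the separate Python Zookeeper code path.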

Appendix A: Qserv Interactions with CSS

qana/QservRestrictorPlugin.cc

  • in lookupSecIndex():
    • containsDb()
    • containsTable()
    • getSecIndexColNames()
  • in operator()():
    • containsDb()
    • containsTable()
    • tableIsChunked()
    • getPartitionCols()
  • in _convertObjectId():
    • containsDb()
    • containsTable()
    • getDirColName()

qana/TableInfoPool.cc

  • in get()
    • getChunkLevel()
    • isMatchTable()
    • getMatchTableParams()
    • getDirTable()
    • getPartitionCols()
    • getDbStriping()
    • getDirTable()
    • getDirColName()

qana/MatchTablePlugin.cc

  • in applyLogical()
    • isMatchTable()
    • getMatchTableParams()

qana/RelationGraph.cc

  • getOverlap()

query/ChunkMapping.cc

  • getChunkLevel() called in a loop

We are not calling:

  • tableIsSubChunked()
  • getAllowedDbs()
  • getChunkedTables()
  • getSubChunkedTables()

in any qserv/core file, with the exception of some test programs.


6 Comments

  1. I think that in other places (table deletion) we expect to use CSS for some sort of synchronization between czars and other entities (e.g. watcher). I'm afraid that this kind of periodical snapshotting is not going to play well with that synchronization. Just thinking about all possible inconsistencies between different snapshots makes my head explode, I'm not sure how well we could reason about that or how fragile situation is going to become.

    Do we know why we need 100s of CSS accesses per query? Maybe we need better granularity for the data in CSS?

  2. We pretty much need all basic metadata about all tables involved in a query. If you worry about things getting out of sync, then how about we take a consistent snapshot per query, and that snapshot will contain just the metadata of the tables involved in the given query?

    To see why we need 100s of CSS accesses, just turn on logging for the css module and you will see it all. I'll attach something as an example. It looks like it is ~60 for this particular query. We are doing repeated checks that could be removed, so we could optimize it down to ~10 calls × the number of tables in a query.

  3. I see a lot of repeated calls to containsDb/containsTable in that log; if you eliminate all of those, then there are probably ~10 meaningful requests (isMatchTable, getPartitionCols, getOverlap, etc.). I think with proper granularity we may need just a few CSS accesses per table. My worry would be that if we needed per-chunk info from CSS, it could quickly jump to some insane numbers, but I believe we don't need that for czar. Should we worry about 10 Zookeeper queries per user query — do we have numbers to say how much that is going to cost us?

  4. I think fetching individual bits of metadata directly from Zookeeper is not going to work, not because of the load or volume, but because things might get changed in between. Locking the entire metadata for a given database and fetching everything consistently/atomically in one shot is the only option, I think. (Whether we still call it "taking a snapshot" or not does not matter.) Once we do that and we have the information handy in memory for whichever piece of Qserv needs it, the extra 50 calls are not really changing much. But of course it wouldn't hurt to optimize it.

  5. Hmm, I looked into reducing the number of calls such as containsDb/containsTable. Currently, if we call a function like Facade::tableIsChunked(db, table) and the db or table does not exist, we throw an exception. If we don't do the checking, the function will simply return "false", which feels very misleading. And if I start doing extra checks in Zookeeper, we are not really optimizing much compared to what we have right now. Suggestions?

  6. I think the Facade needs a serious redesign. Instead of querying individual pieces of partitioning parameters, it should return a complete set of the parameters for one table or a bunch of tables. This would also help with consistency (but I do not think we can reasonably handle updates to partitioning parameters at all, so I would not worry about locking).