This page covers discussion related to DM-1896.

Table deletion is triggered by "DROP TABLE". A table can be distributed, so dropping one may involve deleting thousands of chunks spread across hundreds of worker nodes.

  1. When "DROP TABLE XYZ" is requested through one of our czars, the czar sets the value of /DBS/<dbName>/TABLES/XYZ in CSS to "PRE_DELETE_<date>", and returns the uuid for that table. The <date> indicates when the pre_delete was initiated. Note that the query execution has to pay attention to values of keys /DBS/<dbName>/TABLES/<tableName> and it can not schedule a query against any table unless it is in "READY" state
  2. The value of /DBS/<dbName>/TABLES/XYZ is watched by a deletion watcher. There is only one such watcher (e.g., it is not per worker). The watcher is considered best-effort, it can fail, it can miss deletes, it is unreliable. When watcher wakes up on "PRE_DELETE", it ensures all czars had enough time to refresh their state and know about pending delete by sleeping a short amount of time: 30 sec. If, for some reason some czars will fail to refresh their state during that time and start scheduling queries about the table XYZ, these queries will most likely die.
  3. The watcher then scans the list of long running queries and look for queries that involve the XYZ table. Note, that means the CSS metadata keeping track of active queries needs to keep track of tables involved in each query. If there are queries on that list involving the XYZ table, wait and periodically re-check, and proceed only when all such queries complete.
  4. When there are no more active queries on that table, the watcher removes the entry /DBS/<dbName>/TABLES/XYZ and enters /DELETING/DBS/<dbName>/TABLES/XYZ_<uuid>. Note that when this happens, queries on the XYZ that is being deleted will fail. Also note that at this point "CREATE TABLE XYZ" will be accepted, however individual workers can reject if it chunks for the XYZ that is being deleted didn't get removed.
  5. The watcher then sends a message to the Data Distribution System: "DROP TABLE XYZ" (or maybe it does it for each chunk of XYZ, tbd depending how much Data Distribution System will know).
  6. Data Distribution System is responsible for deleting all replicas for a given chunk.
  7. A separate process that watches overall health of the system will periodically clean entries in /DELETING/DBS.

Note that the above is missing steps needed for provenance tracking. I assume that will be dealt with under a separate ticket.
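
To make the flow above concrete, here is a minimal sketch of the watcher's side of steps 2-5 (in Python, in the spirit of qserv/admin/bin/watcher.py). It is illustrative only: the css client interface (delete/create/active_queries), the ddist handle to the Data Distribution System, and all names are hypothetical placeholders, not the actual Qserv API.

import time

CZAR_SYNC_GRACE_SEC = 30   # step 2: time for every czar to notice the PRE_DELETE state
QUERY_POLL_SEC = 60        # step 3: how often to re-check the active-query list (illustrative)

def handle_pre_delete(css, ddist, db_name, table_name, table_uuid):
    """Watcher reaction to /DBS/<dbName>/TABLES/<tableName> being set to PRE_DELETE_<date>."""
    table_key = "/DBS/%s/TABLES/%s" % (db_name, table_name)

    # Step 2: give every czar time to refresh its state and stop scheduling
    # new queries against the table.
    time.sleep(CZAR_SYNC_GRACE_SEC)

    # Step 3: wait until no active (long-running) query still involves the table.
    while any(table_name in q.tables for q in css.active_queries(db_name)):
        time.sleep(QUERY_POLL_SEC)

    # Step 4: retire the table entry and record the pending physical deletion.
    css.delete(table_key)
    css.create("/DELETING/DBS/%s/TABLES/%s_%s" % (db_name, table_name, table_uuid))

    # Step 5: ask the Data Distribution System to remove all chunks and their replicas.
    ddist.drop_table(db_name, table_name, table_uuid)

The /DELETING entry created in step 4 is what the health-checking process from step 7 later garbage-collects.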

Related CSS structures

  • /DBS/<dbName>/TABLES/<tableName>
    • PENDING: the table is currently being created
    • READY: indicates the table is ready to be used / queried
    • PRE_DELETE_<date>: indicates that the table is about to be deleted; <date> is the date/time when the deletion was requested
  • /DELETING/DBS/<dbName>/TABLES/<tableName>_<uuid>

(When approved, the above should be added to https://dev.lsstcorp.org/trac/wiki/db/Qserv/CSS#Table-related)
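
For illustration, the keys for a hypothetical table XYZ in a database LSST could look like this while a deletion is in progress (the date format and uuid are made up):

/DBS/LSST/TABLES/Object  =  READY
/DBS/LSST/TABLES/XYZ     =  PRE_DELETE_2015-06-01T12:00:00     (steps 1-3)
/DELETING/DBS/LSST/TABLES/XYZ_a1b2c3d4                         (after step 4, once the key above has been removed; cleaned up in step 7)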

Discussion about unusual conditions

If DROP is immediately followed by CREATE

  •  It should be possible to create a table with the same name shortly after it was deleted, e.g.:
CREATE TABLE t (id int);
DROP TABLE t;
CREATE TABLE t(id float);
  • If there is a pending delete, "create table" will fail.
  • If the entry in CSS has been deleted but the corresponding chunks have not yet been removed by the distribution system, the worker should fail (in version 1; later, the worker can try to delete the orphan chunks before giving up). That means we have to send, along with each query, the uuids of the tables it involves, and the worker must check those uuids against the uuids of the tables it has on disk (see the sketch after this list).
  • (In general, note that if we adopt the above design, there will be a lag of at least ~30 sec, introduced to allow all czars to synchronize their state.)
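
A minimal sketch of the worker-side uuid check mentioned above; the function and field names are invented for illustration, and the real worker code will differ.

def verify_table_uuids(query_table_uuids, local_table_uuids):
    """Reject a query whose table uuids do not match what the worker has on disk.

    query_table_uuids: {table name: uuid} attached to the query by the czar.
    local_table_uuids: {table name: uuid} of the chunks this worker currently holds.
    """
    for table, expected_uuid in query_table_uuids.items():
        local_uuid = local_table_uuids.get(table)
        if local_uuid != expected_uuid:
            # The worker still holds chunks of an old (dropped) incarnation of the
            # table, or has not loaded the new one yet: fail rather than answer
            # from the wrong data.
            raise RuntimeError("uuid mismatch for table %s: query expects %s, worker has %s"
                               % (table, expected_uuid, local_uuid))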

If the watcher dies

  • the process that periodically checks the health of the system will restart it. Then, when the watcher starts up, it will:
    • scan all keys /DBS/<dbName>/TABLES/<tableName> and act on any whose value is set to "PRE_DELETE"
    • scan all keys in /DELETING/DBS and send a message to the Data Distribution System: "DROP TABLE XYZ_<chunkId>" for each chunk of the XYZ table (a recovery sketch follows this list)
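
A sketch of that startup recovery, reusing the hypothetical css/ddist interfaces and the handle_pre_delete sketch from above (css.table_uuid is likewise an invented helper):

def recover_on_startup(css, ddist):
    """Re-scan CSS after a watcher restart; the watcher keeps no state across restarts."""
    # Resume deletions that were requested but not yet handed to the Data Distribution System.
    for db_name in css.children("/DBS"):
        for table_name in css.children("/DBS/%s/TABLES" % db_name):
            value = css.get("/DBS/%s/TABLES/%s" % (db_name, table_name))
            if value.startswith("PRE_DELETE"):
                handle_pre_delete(css, ddist, db_name, table_name,
                                  css.table_uuid(db_name, table_name))

    # Re-issue drop requests for deletions that are already past step 4 but may be unfinished.
    for db_name in css.children("/DELETING/DBS"):
        for entry in css.children("/DELETING/DBS/%s/TABLES" % db_name):
            table_name, table_uuid = entry.rsplit("_", 1)
            ddist.drop_table(db_name, table_name, table_uuid)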

If an active query never ends and blocks deletion

  • We can have a maximum query run time, and if a query exceeds the limit, the process checking the health of the system will kill it
  • Alternatively, an admin can kill the query

If the watcher deletes the entry in CSS but does not create the entry in /DELETING/DBS

  • process checking health of the system detects orphan entries and requests deletion

If data distribution never deletes some chunks

  • process checking health of the system intervenes

Debugging the system and getting status

Suppose the user or administrator wants to check the status of a just-executed "DROP TABLE XYZ".

Introduce a command "SHOW STATUS FOR TABLE <tableName>". It shows the value of /DBS/<dbName>/TABLES/<tableName>. The command can have different levels of detail:

  • if "PRE-DELETE" is detected, it reports when the delete was triggered. Optionally, it can report why deletion is pending (e.g. maybe it was just triggered and there is 30 sec grace period, or maybe active queries are blocking delete. If active queries are blocking, it can list the queries)
  • We can have add a flag "-scan" to SHOW STATUS FOR TABLE, which would broadcast to all workers, and get information about all chunks. Such broadcast will be needed anyway by the process checking health of  the system
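
A rough sketch of the status logic, again against the hypothetical css interface used in the sketches above; the message texts and helpers are invented and only meant to show the different levels of detail. The "-scan" flag (broadcast to all workers for per-chunk information) is not shown.

def show_table_status(css, db_name, table_name, verbose=False):
    """Back-end of "SHOW STATUS FOR TABLE": report the value of /DBS/<dbName>/TABLES/<tableName>."""
    value = css.get("/DBS/%s/TABLES/%s" % (db_name, table_name))
    if value is None:
        return "table %s.%s does not exist (or has already been deleted)" % (db_name, table_name)
    if not value.startswith("PRE_DELETE"):
        return "table %s.%s is %s" % (db_name, table_name, value)

    # PRE_DELETE: report when the delete was triggered and, optionally, why it is still pending.
    requested = value[len("PRE_DELETE_"):]
    report = "delete of %s.%s requested at %s" % (db_name, table_name, requested)
    if verbose:
        blocking = [q for q in css.active_queries(db_name) if table_name in q.tables]
        if blocking:
            report += "; blocked by %d active queries: %s" % (
                len(blocking), ", ".join(str(q.query_id) for q in blocking))
        else:
            report += "; within the ~30 sec czar synchronization grace period"
    return report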

Related reading


12 Comments

  1. We need to identify who is responsible for doing items 4-8 from the list. We probably need a new type of service, something like our non-existent "watcher", which is supposed to take care of operations like this (and many others).

    I think we need more synchronization in the system than just with query scheduling; e.g., table/chunk deletions need to be synchronized with replication activity as well to avoid possible races. Could this be resolved at the level of per-chunk CSS information?

     

  2. I have been envisioning the data replication system as being very generic, and treating the files being distributed as opaque data.  It seems we should probably have some sort of callback or event system on each node that gets activated by the replication system when new data has arrived on the node (in which case we could load into the mysql on that node) and also immediately before the replication system deletes data on the node (in which case we could drop tables, etc.)?

  3. Unknown User (danielw)

    When you say "watcher" you mean wmgr, right?

  4. By "watcher" I mean here a process that runs continuously and watches for zookeeper events. Something like qserv/admin/bin/watcher.py. Maybe we do want to run it inside wmgr, I am not sure. 

  5. Unknown User (danielw)

    If some workers are offline, deletion shouldn't block. Each database and table has a UUID, and each worker knows the UUIDs for the tables and databases it hosts. Presumably, when the czar dispatches, it dispatches the set of UUIDs involved in its task, and if the worker notices unmatched UUIDs, then there is something wrong. The worker, when maintaining sync, should know when its tables are out-of-date this way.

    I think Serge suggested at one point that internally qserv should always work with tables identified by these uuids, so they would never collide, but I pushed back, saying that things would be too hard to debug: you would never see "Object", "Source", etc. after query analysis (except through fancy monitoring/debugging tools that do the reverse map). If every update/delete creates a new UUID, then we would essentially have a single-assignment system, and the complexity gets pushed to doing the mapping and garbage collection.

     

  6. I agree UUIDs will help with consistency, but if we rely on them for deletion, we need a more complex structure in CSS, I think.

    In "my" version, we simply use /DBS/x/TABLES/t to represent x.t table. Say the table x.t is on 3 nodes: A, B, and C, and node C is down. We just put instructions for node "C" "hey drop your chunks for x.t right after you get back up" and we are done: we can safely assume that no chunks for the x.t that we just deleted will suddenly surface, and when a request "create table x.t" comes, we can safely reuse /DBS/x/TABLES/t. Right? Note, the deletion on worker is triggered by "DELETE" value of the key (when node is up), or instructions "delete this and that" clearly entered for that node. 

    Now, if we rely on UUIDs, then what do we do if a node C hosting some chunks of x.t is down? Do we still proceed and delete the css metadata for x.t? I suspect we would want to keep it around until all chunks are truly gone. Yes? No?

    • If yes, to not block "create table x.t", we would need to keep in CSS something like /DBS/x_<dbUuid>/TABLES/t_<tbUuid>, or maybe /DBS/x/<dbUuid>/TABLES/t/<tbUuid> and we would need to deal with uuids in many places. This feels overly complicated to me.
    • If no, that means that we'd just go ahead and delete the metadata for table x.t in CSS (to avoid blocking a "create table x.t" which may come at any time), and worker nodes would act upon the fact that /DBS/x/TABLES/t has been removed from CSS. We agreed some time ago that this is awfully dangerous. If there is a glitch in zookeeper and it does not list children for /DBS or something like that, it could trigger deletion of all our databases on all worker nodes. Also, it'd be hard to track down whether the deletion completed everywhere. To do that, we'd have to scan all worker nodes, I guess.
  7. Unknown User (danielw)

    Well, there are less-aggressive and more-aggressive ways to use UUIDs. Simply including a UUID in the table's metadata is small, and if we include this in the czar dispatch, workers that were offline during the deletion will have a mismatched UUID when they receive tasks and will know to reject such queries. Because the deletion completed successfully while the nodes were down, this means that a quorum of nodes agree on the deletion and that there are sufficient copies of the correct data in the cluster. So the out-of-date worker can reject the task and be assured that some other node can handle the work.

     

    Any scheme that can't function when some nodes are down will not scale, because at scale normal operation must be able to continue with some nodes down. Normal operation shouldn't continue if the data is corrupt (really bad) or if there are not enough replicas (the system is degraded and should stop accepting new work while it tries to get back to fully replicated gracefully).

  8. Daniel, walk me through a use case where we delete table x.t, but do not create a new one. There is no uuid "mismatch" then, right? What happens to metadata for table x.t in CSS when we drop it, but some nodes are down? Would you just delete its metadata in CSS?

    One option to consider for such a case, instead of removing the metadata for x.t: reset its uuid to some magic number, say 9999, and teach the worker to treat this number as a flag meaning "this table is being deleted".

  9. Unknown User (danielw)

    0. cluster is good, table x.t exists, some nodes are down

    1. Issue "dropTable("t", "x")" to the system.

    2. Deletion/admin module updates css immediately (x.t.state=deletePending (uuidX)). Czars eventually get the update and issue no more queries on x.t. Queries already in flight on a czar are currently unaffected. Do czars need to do anything else?

    3a. Workers don't have to notice immediately. The main disadvantage of not dropping tables locally is the extra space consumed and conflicts with a future table. A maintenance thread/process on the worker runs periodically to maintain schema sync, and when it notices the new state in CSS, it marks x.t (uuidX) locally for deletion. If the worker has queries using x.t (uuidX) in its schedule, then it doesn't drop the x.t chunks yet. As it completes queries involving x.t, it checks whether there are any more queries pending or in-flight on x.t (uuidX), and deletes the x.t chunks if no more queries need them. Perhaps a data structure could optimize this portion. Note that the worker could be notified that there are updates to the CSS (and this is what zk watches do), but the notification it receives is only that updates exist. To improve latency, a lossy broadcast mechanism can be used, but the system must tolerate missing the broadcasts, hence a mechanism that doesn't rely on the broadcast message must exist.

    3b. Blocking network RPC is problematic in general: deletion could block for days because of a long-running query, but a probe on the workers (a broadcast getTableUuid("x", "t")) should reveal whether a worker still has the table. While a worker holds on to x.t (uuidX) chunks, it cannot participate in loading or replicating x.t (uuidY). createTable("x","t") can fail or block if getTableUuid("x","t") returns non-null for too many workers.

    3c. You can always issue a broadcast dropTableForce("x","t", uuidX). This tells workers to drop x.t for all chunks, regardless of scheduled/in-flight queries, and cancels/aborts queries on x.t, provided the worker holds x.t for uuidX.

    4. Eventually, a probe on getTableUuid("x","t") will return null for everyone, and the css can drop/archive its entry for x.t .

    I think whatever strategy we use, we should keep these ideas in mind:

    • Workers are disposable.
    • Workers don't have to be model citizens, but they will try. This means they offer service for the resources (i.e., chunks) they have, and try to increase the scope of their service by helping out (replicating resources to help the system).
    • Sometimes, workers will miss updates or not participate in the best way. The overall system should still provide service, and workers should be able to catch up.
    • When workers "go on vacation", they should not need to "read the entire email backlog" in order to "catch up" when they return. They can look at the current state, compare it with their own, and adjust. In the worst case, they can act like new workers entering the system.
    • As long as workers "try" to be good citizens, there should not be a need to manage their internal state, although they may publish stats that an auditor can use for debugging and monitoring.
    • If a node is faulty or misbehaving, it should be possible to improve things by adding a new node and removing the faulty one. The new node shouldn't have to be told that it's replacing a particular old node, and the cluster doesn't need to be told that a node is going away.

    Okay, you are welcome to disassemble the strawman now. As someone famous would say "Brickbats welcome."

  10. A few points from the discussion with Daniel:

    • Delete the entry in CSS for a table that is being deleted as soon as all queries that need that table are gone; i.e., do not wait until the last chunk has been successfully deleted.
    • If a worker is lagging behind in deleting chunks for a given table, then when a request comes in to create a new table with the same name, the worker should reject it.
    • Each worker should verify the uuid of every table involved in a query before executing the query (check whether it matches the expected uuids attached to the query).
    • dropTableForce is only for special situations, debugging, and administrators. It is never exposed to the user. In practice, it is basically "find the queries that are blocking the drop table, and force-close those queries".

     

  11. Does the /DBS/DELETING key mean that we cannot have a database named DELETING? I would probably move the DELETING key outside of /DBS; otherwise the tools need to know that this name is special and is not a database name.

  12. Yes, good point! I'll fix it.