This page covers discussion related to DM-1896.

Table deletion is triggered by "DROP TABLE". A table can be distributed, so dropping one may involve deleting thousands of chunks spread across hundreds of worker nodes.

  1. When "DROP TABLE XYZ" is requested through one of our czars, the czar sets the value of /DBS/<dbName>/TABLES/XYZ in CSS to "PRE_DELETE_<date>", and returns the uuid for that table. The <date> indicates when the pre_delete was initiated. Note that the query execution has to pay attention to values of keys /DBS/<dbName>/TABLES/<tableName> and it can not schedule a query against any table unless it is in "READY" state
  2. The value of /DBS/<dbName>/TABLES/XYZ is watched by a deletion watcher. There is only one such watcher (e.g., it is not per worker). The watcher is considered best-effort, it can fail, it can miss deletes, it is unreliable. When watcher wakes up on "PRE_DELETE", it ensures all czars had enough time to refresh their state and know about pending delete by sleeping a short amount of time: 30 sec. If, for some reason some czars will fail to refresh their state during that time and start scheduling queries about the table XYZ, these queries will most likely die.
  3. The watcher then scans the list of long running queries and look for queries that involve the XYZ table. Note, that means the CSS metadata keeping track of active queries needs to keep track of tables involved in each query. If there are queries on that list involving the XYZ table, wait and periodically re-check, and proceed only when all such queries complete.
  4. When there are no more active queries on that table, the watcher removes the entry /DBS/<dbName>/TABLES/XYZ and enters /DELETING/DBS/<dbName>/TABLES/XYZ_<uuid>. Note that when this happens, queries on the XYZ that is being deleted will fail. Also note that at this point "CREATE TABLE XYZ" will be accepted, however individual workers can reject if it chunks for the XYZ that is being deleted didn't get removed.
  5. The watcher then sends a message to the Data Distribution System: "DROP TABLE XYZ" (or maybe it does it for each chunk of XYZ, tbd depending how much Data Distribution System will know).
  6. Data Distribution System is responsible for deleting all replicas for a given chunk.
  7. A separate process that watches overall health of the system will periodically clean entries in /DELETING/DBS.

Note that the above is missing steps needed for provenance tracking. I assume that will be dealt with under a separate ticket.
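
To make the flow above concrete, here is a minimal sketch of the watcher's side of steps 2-5 (in Python, in the spirit of qserv/admin/bin/watcher.py). It is illustrative only: the css client interface (delete/create/active_queries), the ddist handle to the Data Distribution System, and all names are hypothetical placeholders, not the actual Qserv API.

import time

CZAR_SYNC_GRACE_SEC = 30   # step 2: time for every czar to notice the PRE_DELETE state
QUERY_POLL_SEC = 60        # step 3: how often to re-check the active-query list (illustrative)

def handle_pre_delete(css, ddist, db_name, table_name, table_uuid):
    """Watcher reaction to /DBS/<dbName>/TABLES/<tableName> being set to PRE_DELETE_<date>."""
    table_key = "/DBS/%s/TABLES/%s" % (db_name, table_name)

    # Step 2: give every czar time to refresh its state and stop scheduling
    # new queries against the table.
    time.sleep(CZAR_SYNC_GRACE_SEC)

    # Step 3: wait until no active (long-running) query still involves the table.
    while any(table_name in q.tables for q in css.active_queries(db_name)):
        time.sleep(QUERY_POLL_SEC)

    # Step 4: retire the table entry and record the pending physical deletion.
    css.delete(table_key)
    css.create("/DELETING/DBS/%s/TABLES/%s_%s" % (db_name, table_name, table_uuid))

    # Step 5: ask the Data Distribution System to remove all chunks and their replicas.
    ddist.drop_table(db_name, table_name, table_uuid)

The /DELETING entry created in step 4 is what the health-checking process from step 7 later garbage-collects.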

Related CSS structures

  • /DBS/<dbName>/TABLES/<tableName>
    • PENDING: the table is currently being created
    • READY: indicates the table is ready to be used / queried
    • PRE_DELETE_<date>: indicates that the table is about to be deleted; <date> is the date/time when the deletion was requested
  • /DELETING/DBS/<dbName>/TABLES/<tableName>_<uuid>

(When approved, the above should be added to https://dev.lsstcorp.org/trac/wiki/db/Qserv/CSS#Table-related)
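
For illustration, the keys for a hypothetical table XYZ in a database LSST could look like this while a deletion is in progress (the date format and uuid are made up):

/DBS/LSST/TABLES/Object  =  READY
/DBS/LSST/TABLES/XYZ     =  PRE_DELETE_2015-06-01T12:00:00     (steps 1-3)
/DELETING/DBS/LSST/TABLES/XYZ_a1b2c3d4                         (after step 4, once the key above has been removed; cleaned up in step 7)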

Discussion about unusual conditions

If DROP is immediately followed by CREATE

  •  It should be possible to create a table with the same name shortly after it was deleted, e.g.:
CREATE TABLE t (id int);
DROP TABLE t;
CREATE TABLE t(id float);
  • If there is a pending delete, "create table" will fail.
  • If the entry in CSS has been deleted but the corresponding chunks have not yet been removed by the distribution system, the worker should fail (in version 1; later, the worker can try to delete the orphan chunks before giving up). That means we have to send, along with each query, the uuids of the tables it involves, and the worker must check those uuids against the uuids of the tables it has on disk (see the sketch after this list).
  • (In general, note that if we adopt the above design, there will be a lag of at least ~30 sec, introduced to allow all czars to synchronize their state.)
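
A minimal sketch of the worker-side uuid check mentioned above; the function and field names are invented for illustration, and the real worker code will differ.

def verify_table_uuids(query_table_uuids, local_table_uuids):
    """Reject a query whose table uuids do not match what the worker has on disk.

    query_table_uuids: {table name: uuid} attached to the query by the czar.
    local_table_uuids: {table name: uuid} of the chunks this worker currently holds.
    """
    for table, expected_uuid in query_table_uuids.items():
        local_uuid = local_table_uuids.get(table)
        if local_uuid != expected_uuid:
            # The worker still holds chunks of an old (dropped) incarnation of the
            # table, or has not loaded the new one yet: fail rather than answer
            # from the wrong data.
            raise RuntimeError("uuid mismatch for table %s: query expects %s, worker has %s"
                               % (table, expected_uuid, local_uuid))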

If the watcher dies

  • the process that periodically checks the health of the system will restart it. Then, when the watcher starts up, it will:
    • scan all keys /DBS/<dbName>/TABLES/<tableName> and act on any whose value is set to "PRE_DELETE"
    • scan all keys in /DELETING/DBS and send a message to the Data Distribution System: "DROP TABLE XYZ_<chunkId>" for each chunk of the XYZ table (a recovery sketch follows this list)
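
A sketch of that startup recovery, reusing the hypothetical css/ddist interfaces and the handle_pre_delete sketch from above (css.table_uuid is likewise an invented helper):

def recover_on_startup(css, ddist):
    """Re-scan CSS after a watcher restart; the watcher keeps no state across restarts."""
    # Resume deletions that were requested but not yet handed to the Data Distribution System.
    for db_name in css.children("/DBS"):
        for table_name in css.children("/DBS/%s/TABLES" % db_name):
            value = css.get("/DBS/%s/TABLES/%s" % (db_name, table_name))
            if value.startswith("PRE_DELETE"):
                handle_pre_delete(css, ddist, db_name, table_name,
                                  css.table_uuid(db_name, table_name))

    # Re-issue drop requests for deletions that are already past step 4 but may be unfinished.
    for db_name in css.children("/DELETING/DBS"):
        for entry in css.children("/DELETING/DBS/%s/TABLES" % db_name):
            table_name, table_uuid = entry.rsplit("_", 1)
            ddist.drop_table(db_name, table_name, table_uuid)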

If an active query never ends and blocks deletion

  • We can have a maximum query run time, and if a query exceeds the limit, the process checking the health of the system will kill it
  • Alternatively, an admin can kill the query

If the watcher deletes the entry in CSS but does not create the entry in /DELETING/DBS

  • process checking health of the system detects orphan entries and requests deletion

If data distribution never deletes some chunks

  • process checking health of the system intervenes

Debugging the system and getting status

Suppose the user or administrator wants to check the status of a just-executed "DROP TABLE XYZ".

Introduce a command "SHOW STATUS FOR TABLE <tableName>". It shows the value of /DBS/<dbName>/TABLES/<tableName>. The command can have different levels of detail:

  • if "PRE-DELETE" is detected, it reports when the delete was triggered. Optionally, it can report why deletion is pending (e.g. maybe it was just triggered and there is 30 sec grace period, or maybe active queries are blocking delete. If active queries are blocking, it can list the queries)
  • We can have add a flag "-scan" to SHOW STATUS FOR TABLE, which would broadcast to all workers, and get information about all chunks. Such broadcast will be needed anyway by the process checking health of  the system
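
A rough sketch of the status logic, again against the hypothetical css interface used in the sketches above; the message texts and helpers are invented and only meant to show the different levels of detail. The "-scan" flag (broadcast to all workers for per-chunk information) is not shown.

def show_table_status(css, db_name, table_name, verbose=False):
    """Back-end of "SHOW STATUS FOR TABLE": report the value of /DBS/<dbName>/TABLES/<tableName>."""
    value = css.get("/DBS/%s/TABLES/%s" % (db_name, table_name))
    if value is None:
        return "table %s.%s does not exist (or has already been deleted)" % (db_name, table_name)
    if not value.startswith("PRE_DELETE"):
        return "table %s.%s is %s" % (db_name, table_name, value)

    # PRE_DELETE: report when the delete was triggered and, optionally, why it is still pending.
    requested = value[len("PRE_DELETE_"):]
    report = "delete of %s.%s requested at %s" % (db_name, table_name, requested)
    if verbose:
        blocking = [q for q in css.active_queries(db_name) if table_name in q.tables]
        if blocking:
            report += "; blocked by %d active queries: %s" % (
                len(blocking), ", ".join(str(q.query_id) for q in blocking))
        else:
            report += "; within the ~30 sec czar synchronization grace period"
    return report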

Related reading


12 Comments

  1. We need to identify who is responsible for doing items 4-8 from the list. We probably need a new type of service, something like our non-existent "watcher", which is supposed to take care of operations like this (and many others).

    I think we need more synchronization in the system than just with query scheduling; e.g., table/chunk deletions need to be synchronized with replication activity as well to avoid possible races. Could this be resolved at the level of per-chunk CSS information?

     

  2. I have been envisioning the data replication system as being very generic, and treating the files being distributed as opaque data.  It seems we should probably have some sort of callback or event system on each node that gets activated by the replication system when new data has arrived on the node (in which case we could load into the mysql on that node) and also immediately before the replication system deletes data on the node (in which case we could drop tables, etc.)?

  3. Unknown User (danielw)

    When you say "watcher" you mean wmgr, right?

  4. By "watcher" I mean here a process that runs continuously and watches for zookeeper events. Something like qserv/admin/bin/watcher.py. Maybe we do want to run it inside wmgr, I am not sure. 

  5. Unknown User (danielw)

    If some workers are offline, deletion shouldn't block. Each database and table has a UUID, and each worker knows the UUIDs for the tables and databases it hosts. Presumably, when the czar dispatches, it dispatches the set of UUIDs involved in its task, and if the worker notices unmatched UUIDs, then there is something wrong. The worker, when maintaining sync, should know when its tables are out-of-date this way.

    I think Serge suggested at one point that internally qserv should always work with tables identified by these uuids, so they would never collide, but I pushed back, saying that things would be too hard to debug: you would never see "Object", "Source", etc. after query analysis (except through fancy monitoring/debugging tools that do the reverse map). If every update/delete creates a new UUID, then we would essentially have a single-assignment system, and the complexity gets pushed to doing the mapping and garbage collection.

     

  6. I agree UUIDs will help with consistency, but if we rely on them for deletion, we need a more complex structure in CSS, I think.

    In "my" version, we simply use /DBS/x/TABLES/t to represent x.t table. Say the table x.t is on 3 nodes: A, B, and C, and node C is down. We just put instructions for node "C" "hey drop your chunks for x.t right after you get back up" and we are done: we can safely assume that no chunks for the x.t that we just deleted will suddenly surface, and when a request "create table x.t" comes, we can safely reuse /DBS/x/TABLES/t. Right? Note, the deletion on worker is triggered by "DELETE" value of the key (when node is up), or instructions "delete this and that" clearly entered for that node. 

    Now, if we rely on UUIDs, then what do we do if a node C hosting some chunks of x.t is down? Do we still proceed and delete the css metadata for x.t? I suspect we would want to keep it around until all chunks are truly gone. Yes? No?

    • If yes, to not block "create table x.t", we would need to keep in CSS something like /DBS/x_<dbUuid>/TABLES/t_<tbUuid>, or maybe /DBS/x/<dbUuid>/TABLES/t/<tbUuid> and we would need to deal with uuids in many places. This feels overly complicated to me.
    • If no, that means that we'd just go ahead and delete the metadata for table x.t in CSS (to avoid blocking a "create table x.t" which may come at any time), and worker nodes would act upon the fact that /DBS/x/TABLES/t has been removed from CSS. We agreed some time ago that this is awfully dangerous. If there is a glitch in zookeeper and it does not list children for /DBS or something like that, it could trigger deletion of all our databases on all worker nodes. Also, it'd be hard to track down whether the deletion completed everywhere. To do that, we'd have to scan all worker nodes, I guess.
  7. Unknown User (danielw)

    Well, there are less-aggressive and more-aggressive ways to use UUIDs. Simply including a UUID in the table's metadata is small, and if we include this in the czar dispatch, workers that were offline during the deletion will have a mismatched UUID when they receive tasks and will know to reject such queries. Because the deletion completed successfully while the nodes were down, this means that a quorum of nodes agree on the deletion and that there are sufficient copies of the correct data in the cluster. So the out-of-date worker can reject the task and be assured that some other node can handle the work.

     

    Any scheme that can't function when some nodes are down will not scale, because at scale normal operation must be able to continue with some nodes down. Normal operation shouldn't continue if the data is corrupt (really bad) or if there are not enough replicas (the system is degraded and should stop accepting new work while it tries to get back to fully replicated gracefully).

  8. Daniel, walk me through a use case where we delete table x.t, but do not create a new one. There is no uuid "mismatch" then, right? What happens to metadata for table x.t in CSS when we drop it, but some nodes are down? Would you just delete its metadata in CSS?

    One option to consider for such a case, instead of removing the metadata for x.t: reset its uuid to some magic number, say 9999, and teach the worker to treat this number as a flag meaning "this table is being deleted".

  9. Unknown User (danielw)

    0. cluster is good, table x.t exists, some nodes are down

    1. Issue "dropTable("t", "x")" to the system.

    2. Deletion/admin module updates css immediately (x.t.state=deletePending (uuidX)). Czars eventually get the update and issue no more queries on x.t. Queries already in flight on a czar are currently unaffected. Do czars need to do anything else?

    3a. Workers don't have to notice immediately. The main disadvantage of not dropping tables locally is the extra space consumed and conflicts with a future table. A maintenance thread/process on the worker runs periodically to maintain schema sync, and when it notices the new state in CSS, it marks x.t (uuidX) locally for deletion. If the worker has queries using x.t (uuidX) in its schedule, then it doesn't drop the x.t chunks yet. As it completes queries involving x.t, it checks whether there are any more queries pending or in-flight on x.t (uuidX), and deletes the x.t chunks if no more queries need them. Perhaps a data structure could optimize this portion. Note that the worker could be notified that there are updates to the CSS (and this is what zk watches do), but the notification it receives is only that updates exist. To improve latency, a lossy broadcast mechanism can be used, but the system must tolerate missing the broadcasts, hence a mechanism that doesn't rely on the broadcast message must exist.

    3b. Blocking network RPC is problematic in general: deletion could block for days because of a long-running query, but a probe on the workers (a broadcast getTableUuid("x", "t")) should reveal whether a worker still has the table. While a worker holds on to x.t (uuidX) chunks, it cannot participate in loading or replicating x.t (uuidY). createTable("x","t") can fail or block if getTableUuid("x","t") returns non-null for too many workers.

    3c. You can always issue a broadcast dropTableForce("x","t", uuidX). This tells workers to drop x.t for all chunks, regardless of scheduled/in-flight queries, and cancels/aborts queries on x.t, provided the worker holds x.t for uuidX.

    4. Eventually, a probe on getTableUuid("x","t") will return null for everyone, and the css can drop/archive its entry for x.t .

    I think whatever strategy we use, we should keep these ideas in mind:

    • Workers are disposable.
    • Workers don't have to be model citizens, but they will try. This means they offer service for the resources (i.e., chunks) they have, and try to increase the scope of their service by helping out (replicating resources to help the system).
    • Sometimes, workers will miss updates or not participate in the best way. The overall system should still provide service, and workers should be able to catch up.
    • When workers "go on vacation", they should not need to "read the entire email backlog" in order to "catch up" when they return. They can look at the current state, compare it with their own, and adjust. In the worst case, they can act like new workers entering the system.
    • As long as workers "try" to be good citizens, there should not be a need to manage their internal state, although they may publish stats that an auditor can use for debugging and monitoring.
    • If a node is faulty or misbehaving, it should be possible to improve things by adding a new node and removing the faulty one. The new node shouldn't have to be told that it's replacing a particular old node, and the cluster doesn't need to be told that a node is going away.

    Okay, you are welcome to disassemble the strawman now. As someone famous would say "Brickbats welcome."

  10. A few points from the discussion with Daniel:

    • Delete the entry in CSS for a table that is being deleted as soon as all queries that need that table are gone; i.e., do not wait until the last chunk has been successfully deleted.
    • If a worker is lagging behind in deleting chunks for a given table, then when a request comes in to create a new table with the same name, the worker should reject it.
    • Each worker should verify the uuid of every table involved in a query before executing the query (check whether it matches the expected uuids attached to the query).
    • dropTableForce is only for special situations, debugging, and administrators. It is never exposed to the user. In practice, it is basically "find the queries that are blocking the drop table, and force-close those queries".

     

  11. Does the /DBS/DELETING key mean that we cannot have a database named DELETING? I would probably move the DELETING key outside of /DBS; otherwise the tools need to know that this name is special and is not a database name.

  12. Yes, good point! I'll fix it.