Hardware design document for FY16 version: https://docs.google.com/document/d/1z7kyLLkTCWI2Ye-jKju0kHjVJAjB49uksXoRhWWbqdM/edit

Deployment diagram for FY16 version: https://docs.google.com/drawings/d/12KibzTgviTR5VoQLdZN3LMUfZ1-XS6YoLqG76FraJJ8/edit

Schemas definition: https://github.com/lsst/cat  

Schema browser: https://lsst-web.ncsa.illinois.edu/schema/

Report on user testing of PDACv1 (January-April 2017): DMTR-22

Additional pages:

metaserv API v1 Doc


  1. Gregory Dubois-FelsmannKian-Tat Lim: There is something in the "Hardware design document" which I don't quite understand. The catalogs (MySQL databases) are supposed to be located on the non-RAID plain vanilla (not even ZFS) filesystem?

    • 30x Qserv worker requirements

      • 10x2TB (No RAID) (XFS) (Catalogs)

    I would naively expect this to be a potential source of problems in the future.

    This (no RAID) option may probably work in case if chunks are replicated across worker nodes, provided Qserv can properly handle and recognize MySQL exceptions caused by the disk failures.

    1. Yes, Qserv is specified as storing its worker databases on JBOD disks.  Replication of chunks across workers is indeed the intended fault protection mechanism.

      1. If understand it correctly the idea is that a whole worker node will go down due to just one failed disk. A probability for spinning disks to fail under a heavily usage scenario is far from negligible. In a setup of 10 disks per node the multiplication effect of the failures will be x10. It will require fixing "failed" nodes on a very short notice instead of just "walking around" 200 nodes and replacing failed disks in redundant arrays (RAID or ZFS) on a regular schedule (say, daily or weekly).

        What will make the situation worse is that the whole file system of the worker node will need to be rebuild and repopulated with terabytes of data. This translates into downtime latency on the order of many hours (realistically speaking 1 day) and higher maintenance cost of such system. The long latency will obviously increase a probability (yet to be assessed in quantitive terms) that another node with a replica of a failed chunk may also become unavailable. This will result in a total failure of the system (shared scans will be based on a subset of data).

        Has Qserve ever been tested for the failed file system scenario? My understanding is that Qserv is a complex stack of technologies in which the low-level hardware failure is supposed to be properly handled through a chain of events:

          disk -> file system -> database server -> xrootd -> Qserv fault processing logic

        I have no doubt in a distributed system based on this stack there will be many other failure modes beside disks. Eliminating (by design) at least of of those would obviously help to increase the overall stability of the system.

  2. This might cause some issues with mlock and we should do some testing to find out. With the machines at in2p3, using mlock causes queries to run about twice as fast as not using it, but only one thread can be calling mlock at any given time, the mlock call has to complete before sending the query to mysql, and the mlock call can take several seconds. I'm not sure what mlock is doing, my guess it is pre-loading the file, but that isn't what the documentation implies. Part of the idea for shared scans was to try and read from multiple drives at the same time, I don't know if that will work well with mlock and its limitations. Several drives would probably be fine for non-shared scan tables as we wont call mlock for those, but there may be a real benefit for using striping raid on the shared scan table files as we may only be able to read one shared scan file at a time. We should find out what mlock is really doing and run run some experiments.