1. Scope

This is an informal overview of the present state of Qserv. Specific JIRA tickets should be attached as needed.

2. Status

  • architecture
    • Still "shared-nothing" based on XROOTD/SSI 
  • implementation
    • memory-to-memory result streaming from workers to Czar
      • Problem 1: requires a lot of memory for large results
      • Problem 2: complex synchronization to manage memory
      • Problem 3: MySQL connections are unavailable while results are being streamed all the way from the connections, to the worker's memory buffers, to XROOTD streams, to Czar's memory buffers, and (finally) to the query result table at Czar
    • New file-based result delivery protocol (still under testing)
    • ANTLR4 for parsing user queries
  • stability
    • Occasional crashes of various components
      • OOM crashes of workers when processing many large-result queries
      • Occasional SEGFAULT crashes at workers when canceling queries
        • could be a race condition
        • waiting for the SLAC IT folks to configure the core-dump service at USDF to allow further investigation
      • The Replication Controller still crashes when Qserv workers crash (possibly because the complicated memory-management model of the XROOTD/SSI API is not fully satisfied by the Controller's implementation)
  • performance
    • Has not been formally tested
    • Stopped doing KPMs a few years ago due to a manpower shortage
      • A formal testing framework, Kraken, still exists in the Qserv source code tree
  • scalability
    • Limited by a number of factors (see Problems below for more info)
  • work-in-progress
    • Improving the internal monitoring
    • Testing the file-based result delivery protocol
    • Migrating to AlmaLinux 9 (C++20, etc.)
  • deployments
    • deployment modes
      • docker-compose for single-node integration testing and GHA CI
      • Kubernetes (k8s) at IDF, FrDF, and UKDF (new)
      • Docker-based (so-called "Igor" node) at USDF
    • installations & catalogs
      • DP01, DP02 at IDF, USDF, FrDF
      • all-sky Gaia DR2 and WISE (PSD and MEP) at USDF
  • the Replication/Ingest system
    • architecture (semi-decoupled from Qserv)
    • mostly implemented
    • A relatively low-level, highly versatile, and fully documented (REST-based) Ingest API exists. It forms a foundation for building various ingest workflows (a sketch appears after the Status list below)
    • (new) A high-performance Parquet-to-CSV translator is now available
  • ingest workflows
    • qserv-ingest is not suitable for LSST-scale catalogs for a variety of reasons
  • backups/archiving
    • none exists so far, either for data or for metadata
    • there used to be an assumption that the Replication system would take care of this, which is wrong since that system serves different purposes
  • monitoring
    • internal monitoring
      • almost done
      • explain more details, demo(?)
    • external monitoring
      • none, apart from Ganglia monitoring of the Qserv cluster infrastructure at USDF
    • logger
      • too verbose
      • badly slows down Qserv in the verbose mode
      • logs are truncated at IDF; GCE logs are "hidden" somewhere
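
To make the Replication/Ingest items above more concrete, here is a minimal sketch of what a custom ingest workflow built on the REST-based Ingest API could look like, including a Parquet-to-CSV translation step (here done with pyarrow purely for illustration; the translator mentioned above is a separate tool). All endpoint paths, payloads, ports, and catalog/table names below are hypothetical placeholders, not the documented API.

  # Illustrative only: endpoint paths, parameter names, and the catalog/table
  # names are hypothetical placeholders, not the documented Ingest API.
  import pyarrow.parquet as pq
  import pyarrow.csv as pacsv
  import requests

  CONTROLLER = "http://repl-controller:25081"   # hypothetical Controller address

  # 1. Translate a Parquet file into CSV, which is what the ingest service loads.
  table = pq.read_table("Object_chunk_57.parquet")
  pacsv.write_csv(table, "Object_chunk_57.csv")

  # 2. Open a super-transaction for the target database (hypothetical endpoint).
  resp = requests.post(f"{CONTROLLER}/ingest/trans",
                       json={"database": "dp02_dc2", "auth_key": "..."})
  trans_id = resp.json()["id"]

  # 3. Ask the ingest service to pull the CSV contribution into a chunk table
  #    (again, a hypothetical endpoint and payload).
  requests.post(f"{CONTROLLER}/ingest/contrib",
                json={"transaction_id": trans_id, "table": "Object",
                      "chunk": 57, "url": "file:///data/Object_chunk_57.csv"})

  # 4. Commit the super-transaction (hypothetical endpoint).
  requests.put(f"{CONTROLLER}/ingest/trans/{trans_id}", json={"abort": 0})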

2.1. Known problems

  • implementation
    • the code has grown semi-"organically" and now requires serious refactoring in a number of areas
      • the current code (as-is) would be hard to maintain in the long run
      • The MySQL connection layer in Qserv is a mess. This has to be reworked (ideally ASAP)
      • code simplification is needed to get rid of several original ideas that have not proven useful, or that turned out to be harmful
  • stability
    • unexplained (though, rare) crashes in workers
    • predictable crashes in workers due to OOM when processing many large-result (especially N-N) queries
    • predictable crashes of Czar when it's overloaded by (all-sky) queries
    • unpredictable (though rare) crashes in the Replication Controller
  • scalability
    • Poor performance of the partial result aggregation at Czar for (very) large-result queries
      • there are good ideas on how to address this: partitioned MyISAM tables (see the roadmap in Section 3.1)
    • Latency, thread count, and memory usage within the XROOTD/SSI message delivery service escalate when the number of chunks exceeds 100,000, and the problem grows exponentially with the number of chunks
  • performance
    • small queries may be blocked when Qserv is processing large-result queries. This could be a problem with the query scheduling or with the above-mentioned issues in the XROOTD/SSI message delivery service.


2.2. Are Qserv and the Replication/Ingest system ready for prime time?

  • Qserv: 50%
  • Replication/Ingest system: 75%

3. Preliminary development roadmap

3.1. Qserv

  • A more efficient message delivery service is needed to avoid the poor scalability of the current XROOTD/SSI-based one.
    • Options (that have been discussed so far):
      • Batching requests into uber-jobs (still XROOTD/SSI)
      • A totally different protocol (Boost ASIO)
    • Both require maintaining chunk-to-worker maps at Czar (effectively giving up the "shared-nothing" architecture mentioned earlier); see the first sketch after this list
  • An improved implementation of the partial result merger at Czar
    • Options:
      • conservative: "beefier" MySQL service for results
        • This aims at increasing the scalability of Qserv to process many queries in parallel
      • high-performance: "parallel" merger based on MySQL table partitions (the same technology that is used at the heart of the Qserv Ingest system); a sketch follows this list
        • The focus is on speeding up the processing of individual queries
      • In reality, we may need both options.
  • An improved query dispatching mechanism for fair treatment of small queries
  • Container entry points need to be improved
    • The overall approach for managing Qserv in the Kubernetes-based deployment needs to be revisited
  • (Ideally) we need a tracer to replace the LSST Logger for most messages (the Logger would still be needed for posting important events). This would prevent Qserv performance from being affected by verbose logging.
  • (possibly) moving away from the (MySQL-like) connection-oriented API to a REST API at the Qserv front-end (this includes removing the MySQL proxy)
  • General code cleanup and refactoring are needed in many areas.
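
To illustrate the chunk-to-worker map mentioned in the first roadmap item, below is a minimal sketch of how Czar could batch per-chunk query fragments into per-worker "uber-jobs". The data structures, worker names, and the map itself are hypothetical; in practice the map would have to be kept in sync with the actual data placement (e.g. via the Replication system).

  # A minimal sketch (hypothetical types and names) of batching per-chunk query
  # fragments into per-worker "uber-jobs" using a chunk-to-worker map at Czar.
  from collections import defaultdict

  # Hard-coded here for illustration; a real map would track actual replica placement.
  chunk_to_workers = {
      57:  ["worker-01", "worker-07"],   # chunk id -> replica locations
      58:  ["worker-02"],
      159: ["worker-01"],
  }

  def build_uber_jobs(chunk_queries):
      """Group (chunk_id, query_fragment) pairs into one batch per worker."""
      uber_jobs = defaultdict(list)
      for chunk_id, fragment in chunk_queries:
          # Pick the first available replica; a real scheduler would balance load.
          worker = chunk_to_workers[chunk_id][0]
          uber_jobs[worker].append((chunk_id, fragment))
      return uber_jobs

  jobs = build_uber_jobs([
      (57,  "SELECT ... FROM Object_57 WHERE ..."),
      (58,  "SELECT ... FROM Object_58 WHERE ..."),
      (159, "SELECT ... FROM Object_159 WHERE ..."),
  ])
  # One message per worker instead of one message per chunk.
  for worker, batch in jobs.items():
      print(worker, len(batch), "chunk queries in one uber-job")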
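
The "parallel" merger option could work along the lines of the partition-based mechanism used by the Ingest system. The sketch below is only illustrative: the result table name, column layout, and the MariaDB-style partitioned MyISAM table are all assumptions. The point it shows is that each worker's partial results land in their own partition of the same result table, so loads can proceed in parallel without contending for one physical table and no separate merge pass is needed.

  # Illustrative only: table/column names and the use of a partitioned MyISAM
  # (MariaDB-style) result table are assumptions, not the actual Czar schema.
  NUM_WORKERS = 4

  # One LIST partition per worker, so partial results from different workers
  # can be loaded concurrently.
  partitions = ",\n  ".join(
      f"PARTITION pw{i} VALUES IN ({i})" for i in range(NUM_WORKERS))

  create_result_table = f"""
  CREATE TABLE result_12345 (
    worker_id INT NOT NULL,
    objectId  BIGINT,
    ra DOUBLE, decl DOUBLE
  ) ENGINE=MyISAM
  PARTITION BY LIST (worker_id) (
    {partitions}
  );
  """

  def load_partial_result(worker_id: int, csv_path: str) -> str:
      # Each merger thread would issue one of these against its own partition.
      return (f"LOAD DATA INFILE '{csv_path}' "
              f"INTO TABLE result_12345 PARTITION (pw{worker_id}) "
              f"FIELDS TERMINATED BY ',' "
              f"(objectId, ra, decl) SET worker_id = {worker_id};")

  print(create_result_table)
  for w in range(NUM_WORKERS):
      print(load_partial_result(w, f"/qserv/results/q12345_w{w}.csv"))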

3.2. Replication/Ingest system

  • The new Parquet-to-CSV translator still needs to be tested at scale and for correctness
  • Need qhttp to support file uploads in the POST body (multi-part requests); see the client-side sketch after this list
    • For ingesting user-generated data products
    • DM-41216
  • Replication of the partially-loaded tables
    • to increase stability when ingesting (very) large-scale catalogs in case one of the workers is lost during ingest
    • the commit stage of the super-transactions would include the creation (or update/resync) of the replicas
  • Priority-based ingests to fast-track requests for user-generated data products
  • Improved monitoring
    • the present implementation of the ingest monitor doesn't work well when the number of contributions exceeds ~100,000, and it's really slow for millions of contributions
  • Backups and archiving are to be managed by the Replication system, keeping copies of tables in an object store (see the object-store sketch after this list)
    • compared with the low-level filesystem-based backups, this mechanism provides a number of advantages:
      • selective backups or restores of specific catalogs, tables, or chunks
      • table versioning before applying in-place modifications of the tables (removing columns, adding synthetic columns, data corrections, etc.) by keeping older versions of the tables in the object store
        • though, the previous versions could also be stored locally in "hidden" catalogs
    • this may replace or complement the filesystem-level backups
  • Design a systematic approach for managing in-place table fix-ups (metadata, bookkeeping, versioning, rollbacks)
    • this concerns the REST services of the Replication Controller
  • Add support for ingesting user-generated data products
    • TBC
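
As an illustration of the multi-part upload item above, here is what a client-side request could look like once qhttp accepts file uploads in the POST body. The endpoint path and form-field names are hypothetical placeholders, not an existing qhttp or Ingest interface.

  # Hypothetical client-side example of a multi-part POST; the endpoint path and
  # form-field names are placeholders, not an existing qhttp/Ingest interface.
  import requests

  with open("user_table.csv", "rb") as f:
      resp = requests.post(
          "http://repl-controller:25081/ingest/upload",   # hypothetical endpoint
          data={"database": "user_jdoe", "table": "my_sources"},
          files={"contribution": ("user_table.csv", f, "text/csv")},
      )
  resp.raise_for_status()
  print(resp.json())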
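
The object-store backup item could translate into a per-catalog/table/chunk key layout with an explicit version component, which is what enables the selective restores and table versioning listed above. The sketch below is only an illustration of that idea: the bucket name, key scheme, and the choice of S3/boto3 are assumptions, not a design decision.

  # Hypothetical sketch of an object-store layout for table backups; bucket name,
  # key scheme, and the choice of S3/boto3 are assumptions, not a design decision.
  import boto3

  s3 = boto3.client("s3")
  BUCKET = "qserv-backups"

  def backup_key(catalog: str, table: str, chunk: int, version: str) -> str:
      # catalog/table/version/chunk -> allows selective restore at any of these levels
      return f"{catalog}/{table}/{version}/chunk_{chunk}.tsv.gz"

  def backup_chunk(path: str, catalog: str, table: str, chunk: int, version: str):
      s3.upload_file(path, BUCKET, backup_key(catalog, table, chunk, version))

  def restore_chunk(path: str, catalog: str, table: str, chunk: int, version: str):
      s3.download_file(BUCKET, backup_key(catalog, table, chunk, version), path)

  # Keeping the old version around before an in-place schema change, e.g.:
  # backup_chunk("/qserv/data/dp02/Object_57.tsv.gz", "dp02", "Object", 57, "v1")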