1. Introduction

Database initialization is performed in 2 parts:

  • schema+users creation
  • adding some data inside the database (does not require root access to the database)

This document focuses on the first part, but the second one could be implemented in the very same step.

Below are the motivations:

  • Prevent accidental schema upgrades
  • Provide administrators with more detail and debugging information so they know how a schema upgrade is progressing
    • Add an init method for xrootd to wait for the schema upgrade before proceeding
  • Separation of xrootd and mariadb for development


Technical requirements:

  • xrootd must communicate with mariadb via Unix sockets; TCP connections have not been performant in past testing (see the sketch after this list)
  • mariadb connections via sockets are authenticated
  • mariadb needs to start before xrootd
  • One xrootd instance must be paired with one mariadb instance.  This pairing should run as one per node.
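
A minimal sketch of the pairing above (pod name, image tags, and paths are illustrative assumptions): one pod runs mariadb and xrootd side by side and shares the mysql Unix socket through an emptyDir volume, so no TCP connection is needed.

    apiVersion: v1
    kind: Pod
    metadata:
      name: qserv-worker-example          # hypothetical name
    spec:
      volumes:
        - name: mysql-socket              # holds mysql.sock, visible to both containers
          emptyDir: {}
      containers:
        - name: mariadb
          image: mariadb:10.6             # hypothetical tag
          args: ["--socket=/qserv/socket/mysql.sock"]
          volumeMounts:
            - name: mysql-socket
              mountPath: /qserv/socket
        - name: xrootd
          image: qserv/lite:latest        # hypothetical image
          volumeMounts:
            - name: mysql-socket          # xrootd connects through the socket, not TCP
              mountPath: /qserv/socket

Running one such pod per node keeps the 1:1 xrootd/mariadb pairing described above.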


2. Strategies

2.1. From inside the Pod running mysql

Reminder: Both the Qserv worker and Czar Pods embed a mysql container, so running the entrypoint script in any of their containers gives access to the mysql file socket.


2.1.1. From an initContainer

Pros:

Network safe: Can use local mysql socket

User friendliness for k8s admin: +++++ (the initContainer of the database pod initializes the database schema)

For xrootd, add a schema-upgrade flag file (in an emptyDir) to signal to xrootd that a schema upgrade is occurring; the xrootd startup script pauses while that flag is present (see the sketch below).
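
A minimal sketch of the flag-file mechanism (image names, paths, and commands are illustrative assumptions). If xrootd runs in the same pod, the flag volume can be the emptyDir mentioned above; if xrootd runs as a separate Deployment on the same node, a shared hostPath would be needed instead (see the comments at the end of this page).

    # In the database pod: the initContainer raises the flag, upgrades, then clears it.
    initContainers:
      - name: schema-upgrade
        image: qserv/lite:latest                      # hypothetical image
        command: ["/bin/sh", "-c"]
        args:
          - |
            touch /flags/schema-upgrade.in-progress   # signal that an upgrade is running
            # ... start mysqld on the local socket, run the schema migration, stop mysqld ...
            rm /flags/schema-upgrade.in-progress      # signal completion
        volumeMounts:
          - name: upgrade-flag
            mountPath: /flags

    # In the xrootd startup wrapper: pause while the flag is present.
    containers:
      - name: xrootd
        image: qserv/lite:latest                      # hypothetical image
        command: ["/bin/sh", "-c"]
        args:
          - |
            while [ -f /flags/schema-upgrade.in-progress ]; do
              echo "schema upgrade in progress, waiting"; sleep 5
            done
            exec /entrypoint.sh xrootd                # hypothetical entrypoint invocation
        volumeMounts:
          - name: upgrade-flag
            mountPath: /flags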

Cons:

Need to start/stop mysqld inside the initContainer

mysql and the entrypoint must run in the same initContainer

xrootd will fail until the schema update and the start/stop of mysqld finish.

xrootd is a stateless application, so it can be run as a Deployment

Example:

The current Qserv setup; it has worked this way for about two years

Also used for creating the qserv-ingest schema

2.1.2. From a sidecar container

Pros:

Network safe: Can use local mysql socket

User friendliness for k8s admin: ++++

The entrypoint container does not need to embed mysql

For the Qserv worker the sidecar container could be xrootd or cmsd, but that would not be a good separation of concerns. And it would not be trivial for the

Cons:

Need to run `sleep infinity` at the end of the sidecar container command

Example:

From the k8s official book: MongoDB is initialized inside a sidecar container which finishes with an infinite loop: http://eddiejackson.net/azure/Kubernetes_book.pdf#%5B%7B%22num%22%3A547%2C%22gen%22%3A0%7D%2C%7B%[…]%22%3A%22XYZ%22%7D%2Cnull%2C125.91296%2Cnull%5D
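
A minimal sketch of the sidecar pattern (image names and the entrypoint invocation are illustrative assumptions): the sidecar waits for the local socket, runs the schema initialization, then idles with `sleep infinity` so the pod stays Running.

    spec:
      volumes:
        - name: mysql-socket
          emptyDir: {}
      containers:
        - name: mariadb
          image: mariadb:10.6                    # hypothetical tag
          args: ["--socket=/qserv/socket/mysql.sock"]
          volumeMounts:
            - name: mysql-socket
              mountPath: /qserv/socket
        - name: schema-init                      # sidecar running the entrypoint
          image: qserv/lite:latest               # hypothetical image
          command: ["/bin/sh", "-c"]
          args:
            - |
              until [ -S /qserv/socket/mysql.sock ]; do sleep 2; done   # wait for mysqld
              /entrypoint.sh smig-update                                # hypothetical schema init
              sleep infinity                                            # keep the sidecar alive
          volumeMounts:
            - name: mysql-socket
              mountPath: /qserv/socket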

2.1.3. From the database container

This behavior is the one used by two mainstream products: the MongoDB Helm chart and the Vitess operator.

Question: does mariadb provide a feature to create a custom schema at startup (ping Fritz Mueller)?

Pros:

Network safe: Can use local mysql socket

User friendliness for k8s admin: +++

Cons:

mysql and the entrypoint must run in the same container

Example:

  1. Vitess initializes the schemas from inside its mysql pods:

    kubectl describe pods example-vttablet-zone1-2469782763-bfadd780
    ...
    mysqld:
        Container ID:  
        Image:         vitess/lite:latest
        Image ID:      
        Port:          3306/TCP
        Host Port:     0/TCP
        Command:
          /vt/bin/mysqlctld
        Args:
          --db-config-dba-uname=vt_dba
          --db_charset=utf8
          --init_db_sql_file=/vt/secrets/db-init-script/init_db.sql
          --logtostderr=true
          --mysql_socket=/vt/socket/mysql.sock
          --socket_file=/vt/socket/mysqlctl.sock
          --tablet_uid=2469782763
          --wait_time=2h0m0s
    ...

    Look at the option --init_db_sql_file above. It seems the Vitess team has developed a mysqlctld process to start and initialize its mysql pods; this looks a bit like the Qserv entrypoint, but it runs from inside the MySQL container.
  2. The MongoDB Helm chart initializes the clustered database in the same pod that starts mongodb (see the setup.sh script): https://github.com/bitnami/charts/blob/master/bitnami/mongodb/templates/replicaset/scripts-configmap.yaml

2.2. From outside the Pod running mysql

WARNING: this strategy requires an additional network 'root@%' access to mysql, which is not recommended for security.  The security consideration is that another pod could be scheduled so as to read data from localhost; note it would still need the authentication credentials to access the data.

This strategy could apply to the replication controller/replication database, but not to the Czar and Worker, which both embed their mysql container.


2.2.1. From an initContainer

Pros:

entrypoint container does not need to embed mysql

Cons:

Network unsafe: requires 'root@%' access to mysql (`%` is required because the pod running the container will not be registered in DNS until its main service has started, because of the readinessProbe)

User friendliness for k8s admin: ++, difficult to understand which service initializes which database without a good knowledge of the project

Not useful for Czar and Worker as long as they embed a mysql container?

Example:

None yet

2.2.2. From a Container

Pros:

entrypoint container does not need to embed mysql

separate init for xrootd

Cons:

Network unsafe: requires `root@%` access to mysql (`%` is required because the pod running the container will not be registered in DNS until its main service has started, because of the readinessProbe)

User friendliness for k8s admin: +, difficult to understand which service initializes which database without a good knowledge of the project

Not useful for Czar and Worker as long as they embed a mysql container?

Example:

None yet


Remark: `root@%` access to mysql can be removed after database initialization in order to reduce risk, but this may end up in a complex database upgrade procedure (add this access at the beginning of each upgrade, then remove it).


8 Comments

  1. Thanks, Fabrice Jammes, for this analysis!  I have been putting some thought into this lately as well.  I'd like to call out a few needs/goals that I think we should try to accommodate in our chosen approach, some of which are already touched on in the summary above:

    • Leverage init containers if possible for best visibility into init activity/problems for the k8s admin and to best leverage k8s tooling ("principle of least surprise" for k8s)
    • Keep service-specific (e.g. czar-specific, RI-specific, etc.) init code closely coupled with the corresponding service code, rather than splitting/mingling init logic into either new flavors of db container images or administrative scripts managed/maintained separately as part of the operator (principle of "separation of concerns", per recent progress with Qserv container entrypoints, etc.)
    • Provide a well-defined distinction between "initializing" and "operating" modes of a Qserv deployment
    • Recognize and accommodate the separate needs of both db init and schema evolution
    • Provide a straightforward "dbs running only" operating mode for Qserv, to facilitate troubleshooting / disaster-recovery / offline maintenance per previous discussions
    • Allow continued use of Unix-domain socket for db access, for performance reasons
    • Avoid potential db security issues such as "carte-blanche" persistent global root access

    In thinking about this and surveying the literature, I think we've been swimming upstream a little against some k8s principles by sticking to designs which combine Qserv service containers and db service containers into multi-container pods.  I think we have done this primarily to make the socket and security points above easy to address, but now we are causing ourselves some troubles.

    Proposal

    I would like to propose that we re-architect our system to move db service containers out into their own pods, and furthermore consider a tiered service design, where db services are a lower tier and Qserv services such as czars, workers, and RI ctrlrs are considered a second tier layered above.  Very broadly, this would look like:

    Database Tier:

      • just run database service instance, no Qserv-specific services
      • init containers which just bootstrap security users/grants for the deployment (more on this later)

    Qserv Service Tier (czars, workers, RI ctrlrs):

      • each service configured with service name for db service it is to use
      • init containers for service tier containers (a sketch follows this list):
        • wait for needed db services to become available
        • if "dbs-only" mode is in effect (attribute on qserv CR?) wait here
        • check schema version for this service's dbs in db service
          • if not authorized to upgrade schema (attribute on qserv CR?), log and wait here
          • if authorized, do schema upgrade
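
    A minimal sketch of such a service-tier init container, assuming a unified qserv/lite image, hypothetical qserv-smig.py options, and environment variables derived from the qserv CR:

      initContainers:
        - name: wait-and-migrate
          image: qserv/lite:latest                # hypothetical unified binary image
          env:
            - name: DBS_ONLY                      # would be derived from the qserv CR
              value: "false"
            - name: ALLOW_SCHEMA_UPGRADE          # would be derived from the qserv CR
              value: "false"
          command: ["/bin/sh", "-c"]
          args:
            - |
              # 1. wait for the db service to accept connections (credentials omitted)
              until mysql -h qserv-worker-db -e "SELECT 1" >/dev/null 2>&1; do sleep 5; done
              # 2. if "dbs-only" mode is in effect, hold here (pod is recreated when the CR changes)
              while [ "$DBS_ONLY" = "true" ]; do sleep 60; done
              # 3. check the schema version; upgrade only if authorized
              if [ "$ALLOW_SCHEMA_UPGRADE" = "true" ]; then
                qserv-smig.py --upgrade           # hypothetical option
              else
                qserv-smig.py --check             # hypothetical option; log and retry if outdated
              fi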

    I think such a design would have these advantages:

    • orderly and observable startup sequencing in k8s tooling
    • dbs-only mode and gated schema upgrade are accommodated
    • service entrypoints simplified (sequencing and schema update concerns moved to service init entrypoints)
    • additional modularity wrt. Qserv services and the db services which support them (change sharing of db service instances, etc.)

    Some possible complications:

    • Unix domain socket.  Potentially addressed by setting affinities such that service pods and supporting db pods are co-located on the same node in cases where the socket must be used, then locating the socket file in a shared local filesystem (see the sketch after this list).
    • db security.  Would host grants based on service names work here?  If so, that might be a nice solution.  If not, something more complicated (temporary accounts, privilege de-escalation, etc.) might need to be pursued.  We should also analyze the threat model a bit before going bonkers on this...
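
    For the socket point, a minimal sketch of the affinity involved (labels, paths, and the hostPath choice are illustrative assumptions; the worker-X/db-X pairing concern raised in the replies below still applies):

      # On the service pod: require co-location with a db pod on the same node,
      # and mount the shared local directory holding the mysql socket.
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: qserv-worker-db            # hypothetical label on the db pods
              topologyKey: kubernetes.io/hostname
      volumes:
        - name: mysql-socket
          hostPath:
            path: /qserv/socket                   # shared local filesystem for the socket
            type: DirectoryOrCreate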

    (More to come re. qserv service and db service init containers in a subsequent comment...)

  2. More on init containers

    Following the ideas developed in the comment above, I would propose that init containers for the Qserv service tier be further instantiations of the existing Qserv unified binary container, implemented as additional container entrypoint "flavors".  This maintains a low artifact count for our builds and keeps a tight coupling between service versions and their associated schema maintenance.

    What about init containers for the proposed db service pods?  In our current system, these are MariaDB instances.

    The off-the-shelf MariaDB container image accommodates initialization by allowing mount-in of an init directory.  When the container is started with the factory-supplied entry point, that entry point code determines whether there is a pre-existing db state directory or if the server is being started with a "blank slate".  In the blank-slate case, it will look for and execute any shell scripts or SQL scripts provided in the (presumed mounted-in) init directory.

    What seems to be commonly done here is to create an init container for a MariaDB instance which bundles in the desired init shell/SQL scripts.  This init container then disgorges these scripts into a new emptyDir volume, which is also provided to the MariaDB container as a mount for the init dir in the MariaDB pod descriptor.  This seems a practical and workable approach to me, and preferable to either using an entire separate instance of MariaDB (modified by script injection) as its own init container, or building additional layers onto the stock MariaDB images.  It may also be the case that this mechanism is sufficient to address our needs of setting the process user and installing scisql.  If this is the case, we could eliminate customized MariaDB image flavors from our build entirely.  Needs a little further investigation...
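
    A minimal sketch of that pattern (image names are illustrative assumptions); the stock mariadb entrypoint executes whatever it finds in /docker-entrypoint-initdb.d when the data directory is empty:

      volumes:
        - name: initdb
          emptyDir: {}
      initContainers:
        - name: provide-init-scripts
          image: qserv/db-init-scripts:latest     # hypothetical image bundling the .sh/.sql files
          command: ["/bin/sh", "-c", "cp /scripts/* /initdb/"]
          volumeMounts:
            - name: initdb
              mountPath: /initdb
      containers:
        - name: mariadb
          image: mariadb:10.6                     # stock image with its factory entrypoint
          volumeMounts:
            - name: initdb
              mountPath: /docker-entrypoint-initdb.d   # scripts run only on first start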

    A remaining question would be "whither the container image for MariaDB init containers?"  One possibility might be to continue the pattern of adding these responsibilities as an additional entrypoint flavor to the existing unified binary container.  Is this starting to go too far, though?


  3. Fritz Mueller thanks for these interesting ideas; most of them are great and will lead to a nicer design. However I think we need to fine-tune it for some Qserv components:

    1. About splitting the czar and worker pod into a service pod and a db pod (please note that the replication service and qserv-ingest are already split):
      1. This might be strongly recommended for multi-czar mode, especially if all czar instances share the same secondary index (this might not be useful for result tables if they are not shared between czar instances, because we may have huge I/O here)
      2. for workers, this would require having 2 StatefulSets, one for worker Pods and one for db Pods, plus 2 headless Services (instead of one currently) and some affinity. I think this complex setup might not be compliant with the "principle of least surprise" for the Qserv admin; having one unique statefulset with one unique pod looks much simpler to me. In addition, I think it will be very complex, or maybe even impossible, to use affinity to colocate pod worker-1 with pod db-1. Indeed, in my understanding, affinity will allow colocating any worker pod with any db pod, but not worker-X with db-X. It will also be difficult to have the worker-1 and db-1 pods share the same persistentvolume for the mysql data + socket (the volumeClaimTemplate in the statefulset creates PVCs named `qserv-data-qserv-worker-X`), and it is not good practice to have a ReadWriteOnce volume shared by multiple pods. I am not sure this setup is impossible to do, but I strongly think it will be complicated and difficult to understand and maintain, compared to having a unique statefulset with a unique pod
    2. I think the idea of using the off-the-shelf MariaDB container image's database initialization tooling is very good. Why not create an initContainer inside each DB pod with an entrypoint script which creates the .sh/.sql files used to fully initialize the DB pod? I think this would fit all of our requirements:
      1. orderly and observable startup sequencing in k8s tooling: ok, for a Qserv admin, it would look reasonable that the db initContainer(s) prepare the db initialization(/upgrade)
      2. dbs-only mode and gated schema upgrade are accommodated: dbs-only mode would work by changing the command(s) for starting the Qserv processes from the qserv.yaml crd; a gated schema upgrade might also be triggered from the qserv.yaml crd and would spawn a db initContainer for the upgrade
      3. service entrypoints simplified (sequencing and schema update concerns moved to db init entrypoints)
      4. additional modularity w.r.t. Qserv services and the db services which support them (change sharing of db service instances, etc.): this is already the case for replication service and qserv-ingest, and this might also be applied to czar secondary index (see 1.a), modularity is not needed for worker because of the huge I/O between worker and their related db.
      5. Unix domain socket: no problem as it continue being used for the worker (and maybe for czar+resultTable, see 1.a), the socket problem does not apply to replication service and qserv-ingest
      6. db security: great! no root access via the network, everything goes through the socket
      7. Leverage init containers if possible for best visibility into init activity/problems: ok, we may also add an initContainer running entrypoint for db upgrade to the db pods, if specified in qserv.yaml crd
      8. Keep service-specific (e.g. czar-specific, RI-specific, etc.) init code closely coupled with the corresponding service code, rather than splitting/mingling init logic into either new flavors of db container images or administrative scripts managed/maintained separately as part of the operator (principle of "separation of concerns", per recent progress with Qserv container entrypoints, etc.): no new flavor of db image is needed, for workers (+czar ?) the db initContainer would be inside the (worker+db) pod, for replication (+ czar ?) it would be inside the db pod, which is called "qserv-repl-db", which is replication service specific. The db initContainer image would be the qserv-lite image.
      9. Provide a well-defined distinction between "initializing" and "operating" modes of a Qserv deployment: yes, thanks to initContainers
      10. Recognize and accommodate the separate needs of both db init and schema evolution: we can provide two different initContainers to do that, and use the qserv.yaml crd to decide which one(s) to enable.

    Do you think we need to fine-tune this proposal (do not split the worker pod + use an initContainer next to the db container to fully init/upgrade the db)? On my side I think it looks like a good compromise between modularity, simplicity and security, and in addition it seems to fit most of our requirements pretty well. I'll be happy to discuss it more if needed.

  4. Added comments.  Option 2.1.1 with an init container seems the most viable approach, with the addition of a flag file to signify if an upgrade is occurring.  For the split design, the use of a hostPath for the sockets is not much of an incremental security risk.

  5. Fritz Mueller, Fabrice Jammes: added some notes and diagrams above.  Option 2.1.1 looks most viable, with the recommended addition of a flag file to signify upgrades.  2.2 would be nice, but scheduling would be complex because of the exactly-once requirement for xrootd.  Could be done with hard anti-affinity, but not ideal.  I can meet to discuss when convenient.

  6. Hi Dan Speck, thanks so much for your feedback on this complex feature!

  7. Hi Fritz Mueller Unknown User (npease) , according to Dan Speck and what we have now inside Qserv, which is working pretty well for database initialization. I also think option 2.1.1 looks most viable (initialiaze the datase inside an initContainer running the pods which will run the mysql container). In order to do that, I think that extracting qserv-smig.py from qserv image and adding it to our mysql image would be enough. Would you please allow me to implement this solution?