1. Introduction
Database initialization is performed in 2 parts:
- schema+users creation
- adding some data inside the database (does not require root access to the database)
This document focuses on the first part, but the second one could be implemented in the very same step.
Below are the motivations:
- Prevent accidental schema upgrades
- Provide administrators with more detail and debugging information so they know how a schema upgrade is progressing
- Add an init mechanism so that xrootd waits for the schema upgrade to complete before proceeding
- Separate xrootd and mariadb for development
Technical requirements:
- xrootd must communicate with mariadb via Unix sockets; TCP connections have not been performant in past testing (see the pod sketch after this list)
- mariadb connections via sockets are authenticated
- mariadb needs to start before xrootd
- One xrootd instance must be paired with one mariadb instance, and this pairing should run as one pair per node.
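For reference, below is a minimal sketch of such a pairing inside a single pod, sharing the mysql socket through an emptyDir volume. Container names, image tags and paths are illustrative assumptions, not the actual Qserv manifests.

```yaml
# Illustrative sketch: mariadb and xrootd share a Unix socket via an emptyDir volume,
# so xrootd never needs a TCP connection to the database.
apiVersion: v1
kind: Pod
metadata:
  name: qserv-worker-example          # hypothetical name
spec:
  volumes:
    - name: mysql-socket
      emptyDir: {}
  containers:
    - name: mariadb
      image: qserv/mariadb:latest     # illustrative image
      args: ["--socket=/qserv/mysql/mysql.sock"]
      volumeMounts:
        - name: mysql-socket
          mountPath: /qserv/mysql
    - name: xrootd
      image: qserv/lite-qserv:latest  # illustrative image
      volumeMounts:
        - name: mysql-socket
          mountPath: /qserv/mysql     # xrootd connects via /qserv/mysql/mysql.sock
```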
2. Strategies
2.1. From inside the Pod running mysql
Reminder: both the Qserv worker and Czar Pods embed a mysql container, so running the entrypoint script in any of their containers allows access to the mysql file socket.
2.1.1. From an initContainer
Pros:
Network safe: Can use local mysql socket
User friendliness for k8s admin: +++++ (the initContainer of the database pod initializes the database schema)
For xrootd, add a schema-upgrade flag file (on an emptyDir volume) to signal that a schema upgrade is occurring; the xrootd startup script pauses while that flag is present (a sketch illustrating this follows the example below).
Cons:
Need to start/stop mysqld inside the initContainer
mysql and entrypoint must run in the same initContainer
xrootd will fail until the schema update and the mysqld start/stop cycle finish.
xrootd is a stateless application, so it can be run as a Deployment
Example:
Current Qserv setup, which has been working for ~2 years
Also used for creating the qserv-ingest schema
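Below is a minimal sketch of this option, assuming xrootd shares the pod (and therefore the emptyDir volume carrying the flag file) with mariadb; the image names and the init-or-upgrade-schema.sh script are hypothetical placeholders for the actual Qserv entrypoint.

```yaml
# Sketch of option 2.1.1: an initContainer starts mysqld, applies/upgrades the schema and
# stops mysqld before the main containers start. A flag file on an emptyDir volume signals
# that a schema upgrade is in progress; the xrootd startup script pauses while it exists.
apiVersion: v1
kind: Pod
metadata:
  name: qserv-worker                   # illustrative name
spec:
  volumes:
    - name: upgrade-flag
      emptyDir: {}
    - name: mysql-data
      persistentVolumeClaim:
        claimName: qserv-worker-data   # illustrative claim
  initContainers:
    - name: init-schema
      image: qserv/mariadb:latest      # illustrative: must embed mysqld AND the entrypoint
      command: ["/bin/sh", "-c"]
      args:
        - |
          touch /flags/schema-upgrade-in-progress
          /entrypoint/init-or-upgrade-schema.sh   # hypothetical: starts mysqld, migrates, stops mysqld
          rm /flags/schema-upgrade-in-progress
      volumeMounts:
        - { name: mysql-data, mountPath: /var/lib/mysql }
        - { name: upgrade-flag, mountPath: /flags }
  containers:
    - name: mariadb
      image: qserv/mariadb:latest      # illustrative image
      volumeMounts:
        - { name: mysql-data, mountPath: /var/lib/mysql }
    - name: xrootd
      image: qserv/lite-qserv:latest   # illustrative image
      # startup script: while [ -f /flags/schema-upgrade-in-progress ]; do sleep 5; done
      volumeMounts:
        - { name: upgrade-flag, mountPath: /flags }
```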
2.1.2. From a sidecar container
Pros:
Network safe: Can use local mysql socket
User friendliness for k8s admin: ++++
entrypoint container does not need to embed mysql
For the Qserv worker, the sidecar container could be xrootd or cmsd, but that would not be a good separation of concerns. And it would not be trivial for the
Cons:
Need to run `sleep infinity` at the end of the sidecar container command
Example:
From the k8s official book: MongoDB is initialized inside a sidecar container which finishes with an infinite loop: http://eddiejackson.net/azure/Kubernetes_book.pdf#%5B%7B%22num%22%3A547%2C%22gen%22%3A0%7D%2C%7B%[…]%22%3A%22XYZ%22%7D%2Cnull%2C125.91296%2Cnull%5D
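A minimal sketch of the sidecar pattern, assuming a hypothetical entrypoint image and init script (not the actual Qserv artifacts):

```yaml
# Sketch of option 2.1.2: a sidecar container initializes/upgrades the schema through the
# local mysql socket, then sleeps forever so the pod stays Running.
apiVersion: v1
kind: Pod
metadata:
  name: qserv-czar                     # illustrative name
spec:
  volumes:
    - name: mysql-socket
      emptyDir: {}
  containers:
    - name: mariadb
      image: qserv/mariadb:latest      # illustrative image
      args: ["--socket=/qserv/mysql/mysql.sock"]
      volumeMounts:
        - { name: mysql-socket, mountPath: /qserv/mysql }
    - name: init-sidecar
      image: qserv/entrypoint:latest   # hypothetical image: entrypoint only, no mysqld
      command: ["/bin/sh", "-c"]
      args:
        - |
          # wait for the socket, initialize/upgrade the schema, then idle forever
          until [ -S /qserv/mysql/mysql.sock ]; do sleep 2; done
          /entrypoint/init-or-upgrade-schema.sh   # hypothetical script
          sleep infinity
      volumeMounts:
        - { name: mysql-socket, mountPath: /qserv/mysql }
```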
2.1.3. From the database container
This behavior is the one used by two mainstream products: the MongoDB helm chart and the Vitess operator.
Question: does mariadb provide a feature to create a custom schema at startup (ping Fritz Mueller)?
Pros:
Network safe: Can use local mysql socket
User friendliness for k8s admin: +++
Cons:
mysql and entrypoint must run in the same container
Example:
Vitess initializes the schemas from inside its mysql pods:
kubectl describe pods example-vttablet-zone1-2469782763-bfadd780
...
  mysqld:
    Container ID:
    Image:          vitess/lite:latest
    Image ID:
    Port:           3306/TCP
    Host Port:      0/TCP
    Command:
      /vt/bin/mysqlctld
    Args:
      --db-config-dba-uname=vt_dba
      --db_charset=utf8
      --init_db_sql_file=/vt/secrets/db-init-script/init_db.sql
      --logtostderr=true
      --mysql_socket=/vt/socket/mysql.sock
      --socket_file=/vt/socket/mysqlctl.sock
      --tablet_uid=2469782763
      --wait_time=2h0m0s
...
Look at the option `--init_db_sql_file` above. It seems the Vitess team has developed a `mysqlctld` process to start and initialize its mysql pods; this looks a bit like the Qserv `entrypoint`, but it runs from inside the MySQL container.
The MongoDB helm chart initializes the clustered database in the same pod that starts mongodb (see the setup.sh script): https://github.com/bitnami/charts/blob/master/bitnami/mongodb/templates/replicaset/scripts-configmap.yaml
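Regarding the question above: the stock mariadb (and mysql) images execute any .sql/.sh files found in /docker-entrypoint-initdb.d, but only when the data directory is empty, so this covers initial creation rather than upgrades. A minimal sketch, where the ConfigMap and Secret names are illustrative:

```yaml
# Sketch of option 2.1.3 with the stock mariadb image: SQL scripts mounted into
# /docker-entrypoint-initdb.d are executed on first start (empty data directory only).
apiVersion: v1
kind: Pod
metadata:
  name: qserv-repl-db                  # illustrative name
spec:
  volumes:
    - name: init-sql
      configMap:
        name: qserv-init-sql           # hypothetical ConfigMap holding schema/users SQL
  containers:
    - name: mariadb
      image: mariadb:10.6              # illustrative tag
      env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef: { name: qserv-db-secret, key: root-password }  # illustrative secret
      volumeMounts:
        - name: init-sql
          mountPath: /docker-entrypoint-initdb.d
```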
2.2. From outside the Pod running mysql
WARNING: this strategy requires additional network `root@%` access to mysql, which is not recommended for security reasons. The security consideration is that another pod could be scheduled alongside and read data via localhost; note that it would still need the authentication credentials to access the data.
This strategy could apply to the replication controller/replication database, but not to the Czar and Worker, which both embed their mysql container.
2.2.1. From an initContainer
Pros:
entrypoint container does not need to embed mysql
Cons:
Network unsafe: requires `root@%` access to mysql (`%` will be required because the pod running the container will not be registered in DNS if its main service is not started, because of the readinessProbe)
User friendliness for k8s admin: ++, difficult to understand which service initializes which database without a good knowledge of the project
Not useful for Czar and Worker as long as they embed a mysql container?
Example:
None yet
2.2.2. From a Container
Pros:
entrypoint container does not need to embed mysql
separate init for xrootd
Cons:
Network unsafe: requires `root@%` access to mysql (`%` will be required because the pod running the container will not be registered in DNS if its main service is not started, because of the readinessProbe)
User friendliness for k8s admin: +, difficult to understand which service initializes which database without a good knowledge of the project
Not useful for Czar and Worker as long as they embed a mysql container?
Example:
None yet
Remark: `root@%` access to mysql can be removed after database initialization in order to reduce risk, but this may end up requiring a more complex database upgrade procedure (add this access at the beginning of each upgrade, then remove it)
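For illustration, here is a sketch of option 2.2 implemented as a one-shot Job connecting over TCP; the host name, image and script are assumptions, not existing Qserv artifacts.

```yaml
# Sketch of option 2.2: schema initialization from outside the database pod, over TCP.
# Requires `root@%` (or an equivalently privileged network account) on the target mariadb.
apiVersion: batch/v1
kind: Job
metadata:
  name: qserv-db-init                  # illustrative name
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: init-schema
          image: qserv/entrypoint:latest   # hypothetical image with a mysql client + migration scripts
          command: ["/bin/sh", "-c"]
          args:
            - |
              # wait until the database answers over TCP, then run the migrations
              until mysql -h qserv-repl-db -u root -p"$MYSQL_ROOT_PASSWORD" -e "SELECT 1"; do sleep 5; done
              /entrypoint/init-or-upgrade-schema.sh   # hypothetical script
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef: { name: qserv-db-secret, key: root-password }  # illustrative secret
```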
8 Comments
Fritz Mueller
Thanks, Fabrice Jammes, for this analysis! I have been putting some thought into this lately as well. I'd like to call out a few needs/goals that I think we should try to accommodate in our chosen approach, some of which are already touched on in the summary above:
In thinking about this and surveying the literature, I think we've been swimming upstream a little against some k8s principles by sticking to designs which combine Qserv service containers and db service containers into multi-container pods. I think we have done this primarily to make the socket and security points above easy to address, but now we are causing ourselves some troubles.
Proposal
I would like to propose that we re-architect our system to move db service containers out into their own pods, and furthermore consider a tiered service design, where db services are a lower tier and Qserv services such as czars, workers, and RI ctrlrs are considered a second tier layered above. Very broadly, this would look like:
Database Tier:
Qserv Service Tier (czars, workers, RI ctrlrs):
I think such a design would have these advantages:
Some possible complications:
(More to come re. qserv service and db service init containers in a subsequent comment...)
Fritz Mueller
More on init containers
Following the ideas developed in the comment above, I would propose that init containers for the Qserv service tier be further instantiations of the existing Qserv unified binary container, implemented as additional container entrypoint "flavors". This maintains a low artifact count for our builds and keeps a tight coupling between service versions and their associated schema maintenance.
What about init containers for the proposed db service pods? In our current system, these are MariaDB instances.
The off-the-shelf MariaDB container image accommodates initialization by allowing mount-in of an init directory. When the container is started with the factory-supplied entry point, that entry point code determines whether there is a pre-existing db state directory or if the server is being started with a "blank slate". In the blank-slate case, it will look for and execute any shell scripts or SQL scripts provided in the (presumed mounted-in) init directory.
What seems to be commonly done here is to create an init container for a MariaDB instance which bundles in the desired init shell/SQL scripts. This init container then disgorges these scripts into a new `emptyDir` volume, which is also provided to the MariaDB container as a mount for the init dir in the MariaDB pod descriptor. This seems a practical and workable approach to me, and preferable to either using an entire separate instance of MariaDB (modified by script injection) as its own init container, or to building additional layers onto the stock MariaDB images. It may also be the case that this mechanism is sufficient to address our needs of setting the process user and installing `scisql`. If this is the case, we could eliminate customized MariaDB image flavors from our build entirely. Needs a little further investigation...
A remaining question would be "whither container image for MariaDB init containers?" One possibility might be to continue in the pattern of adding these responsibilities as an additional entrypoint flavor to the existing unified binary container. Is this starting to go too far, though?
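A minimal sketch of the pattern described in this comment; the init-scripts image and file names are illustrative.

```yaml
# Sketch: a small init container copies bundled init scripts into an emptyDir volume, which
# the stock MariaDB container then mounts as its init directory (/docker-entrypoint-initdb.d).
apiVersion: v1
kind: Pod
metadata:
  name: qserv-db                       # illustrative name
spec:
  volumes:
    - name: initdb
      emptyDir: {}
  initContainers:
    - name: disgorge-init-scripts
      image: qserv/db-init-scripts:latest   # hypothetical image bundling the SQL/shell scripts
      command: ["/bin/sh", "-c", "cp /scripts/* /initdb/"]
      volumeMounts:
        - { name: initdb, mountPath: /initdb }
  containers:
    - name: mariadb
      image: mariadb:10.6              # stock image, factory entrypoint
      env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef: { name: qserv-db-secret, key: root-password }  # illustrative secret
      volumeMounts:
        - { name: initdb, mountPath: /docker-entrypoint-initdb.d }
```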
Fabrice Jammes
Fritz Mueller thanks for these interesting ideas, most of them are great and will lead to a nicer design. However, I think we need to fine-tune it for some Qserv components:
Do you think we need to fine-tune this proposal (do not split the worker pod + use an initContainer next to the db container to fully init/upgrade the db)? On my side I think it looks like a good compromise between modularity, simplicity and security, and in addition it seems to fit most of our requirements pretty well. I'll be happy to discuss it further if needed.
Fabrice Jammes
I have asked a question related to the scheduling problem: https://stackoverflow.com/questions/71334775/is-it-possible-to-colocate-two-pods-with-the-same-index-but-belonging-to-two-dif
Dan Speck
Added comments. Option 2.1.1 with an init container seems the most viable approach, with the addition of a flag file to signify that an upgrade is occurring. For splitting, the use of a hostPath with sockets is not much of an incremental security risk.
Dan Speck
Fritz Mueller Fabrice Jammes added some notes and diagrams above. Option 2.1.1 looks most viable, with the recommended addition of a flag file to signify upgrades. 2.2 would be nice, but scheduling would be complex because of the exactly-once requirement for xrootd. It could be done with hard anti-affinity, but that is not ideal. I can meet to discuss when convenient.
Fabrice Jammes
Hi Dan Speck, thanks so much for your feedback on this complex feature!
Fabrice Jammes
Hi Fritz Mueller, Unknown User (npease): according to Dan Speck and to what we have now inside Qserv, which is working pretty well for database initialization, I also think option 2.1.1 looks most viable (initialize the database inside an initContainer of the pods that will run the mysql container). In order to do that, I think that extracting qserv-smig.py from the qserv image and adding it to our mysql image would be enough. Would you please allow me to implement this solution?