This page responds to the DM-SST action to write a description of how our deployment strategy meets DMS-REQ-0297, and it provides the input to the verification of LVV-T146.
DMSR Requirement & Interpretation
3.3.1 DMS Initialization Component
ID: DMS-REQ-0297 (Priority: 1a)
Specification: The DMS shall contain a component that, at each Center, can initialize the DM Subsystem into a well-defined safe state when powered up.
Discussion: A safe state is one that does not permit the corruption or loss of previously archived data, nor of sending spurious information over any interface.
Derived from requirements:
- OSS-REQ-0041: Subsystem Activation
- OSS-REQ-0307: Subsystem Initialization
- OSS-REQ-0121: Open Source, Open Configuration
- OSS-REQ-0122: Provenance
The requirement was written 10 years ago and does not map to how we manage services today. Hence, we interpret the intent of this requirement to be that:
- DM services can be automatically (re-)started into a defined state without manual procedures, e.g. without someone needing to log on to potentially many different consoles to start services.
DM Deployment Strategy
The DM service deployment model comprises:
- Kubernetes (K8s): infrastructure for managing containerized services in one or more cluster environments. Most DM services run on K8s.
- Knative: built on top of Kubernetes; adds a higher-level abstraction specifically for serverless workloads.
- Phalanx: Rubin Observatory’s GitOps repository for managing Kubernetes environments. Phalanx provides an installation and configuration platform for services deployed on Kubernetes clusters.
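The declarative model behind this stack can be illustrated with a minimal Python sketch. It is not Kubernetes or Phalanx code: `ServiceSpec`, `reconcile`, and the service names and replica counts are hypothetical, chosen only to show the general idea that a controller compares the declared state with what is actually running and issues whatever actions close the gap. After a power-up nothing is running, so the same comparison restarts every declared service without manual steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Desired state of one service, as it might be declared in a Git repository."""
    name: str       # hypothetical service name
    replicas: int   # hypothetical number of instances that should be running


def reconcile(desired: list[ServiceSpec], running: dict[str, int]) -> list[str]:
    """Return the actions needed to bring `running` in line with `desired`."""
    actions = []
    wanted = {spec.name: spec.replicas for spec in desired}

    # Start or stop instances of every declared service until counts match.
    for name, replicas in wanted.items():
        current = running.get(name, 0)
        if current < replicas:
            actions.append(f"start {replicas - current} x {name}")
        elif current > replicas:
            actions.append(f"stop {current - replicas} x {name}")

    # Anything running that is not declared is stopped.
    for name, count in running.items():
        if name not in wanted and count > 0:
            actions.append(f"stop {count} x {name}")

    return actions


# Illustrative declaration only; these are not real deployment settings.
desired = [ServiceSpec("telemetry-gateway", 2), ServiceSpec("rsp-portal", 1)]

# After a power-up nothing is running, so every declared service is started.
print(reconcile(desired, running={}))
```

Once the running state matches the declaration, `reconcile` returns an empty list. In that sense the declared configuration itself defines the state the system returns to after a restart.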
Other related components
- Safir: Rubin Observatory’s library for building FastAPI services for Phalanx / Kubernetes clusters
- Rucio: Service for managing large volumes of data spread across facilities at multiple institutions and organisations.
- FTS3: Data movement service (see https://iopscience.iop.org/article/10.1088/1742-6596/513/3/032081)
- S3: Amazon Simple Storage Service, an object storage service accessed through a web service interface
DM Services – current deployment status
The following table lists the DM services described in LDM-148, with the deployment plan and current status of each (from a Slack discussion in #dm-arch with Kian-Tat Lim).
The exact services have changed since the last iteration of LDM-148, but the categories remain the same. The only change to the categories is that all services originally planned for the base are now at the summit.
All of the following services are on Kubernetes or soon will be, except for the batch system, which is independent but has its own "safe" startup state.
Service | Where | Notes | Status |
---|---|---|---|
Archiving (Prompt Base) | Summit, USDF | S3 on K8s. The Summit currently uses ssh/NFS to a bare-metal machine, which also comes up in a known/safe state. | Deployed |
Planned Observation Publication (Prompt Base) (a.k.a. ObsLocTAP) | USDF | Will be Safir/Phalanx on K8s | Not deployed |
Prompt Processing Ingest (Prompt Base) (a.k.a. auto-ingest/embargo_butler) | USDF | On K8s | Deployed |
Observatory Operations Data (Prompt Base) | Summit | On K8s | Deployed |
Observatory Control System (OCS) Controlled Pipeline (Prompt Base) | Summit | On K8s | Deployed |
Telemetry Gateway (Prompt Base) | Summit | Sasquatch/Phalanx/Kafka on K8s | Deployed |
Prompt Processing (Prompt US) | USDF | Knative on K8s | Deployed |
Alert Distribution (Prompt US) | USDF | On K8s | Test deployment only; operational system not yet deployed |
Prompt Quality Control (QC) (Prompt US) | USDF | Part of Prompt Processing payload, with publication to Sasquatch/Phalanx/Kafka and InfluxDB on K8s | Deployed |
Batch Production (Offline Production, Satellite Facility) | USDF | The batch system is independent. It uses Slurm, which does not run on K8s and has its own power-up safe state. PanDA and its RCEs are either on K8s or behave like Slurm. | Deployed |
Offline QC (Offline Production) | USDF | Part of Batch Production payloads, with publication to Sasquatch/Phalanx/Kafka and InfluxDB on K8s | Deployed |
Bulk Distribution (Offline Production) | USDF | Rucio/FTS3 on K8s | Deployed |
Data Backbone (Archive Base and US) | USDF | Rucio/FTS3 + ingested on K8s | Deployed |
RSP Portal (Commissioning Cluster and DACs) | Summit, USDF | Phalanx on K8s | Deployed |
RSP Notebook (Commissioning Cluster and DACs) | Summit, USDF | Phalanx on K8s | Deployed |
RSP Web API (Commissioning Cluster and DACs) | Summit, USDF | Phalanx on K8s | Deployed |