MON-3547: Configuration CRD for the Cluster Monitoring Operator #1627

enhancements/monitoring/monitoring-config.md

---
title: monitoring-config
authors:
- "@marioferh"
- "@danielmellado"
- "@jan--f"
- "@moadz"
- "@simonpasquier"
reviewers:
- "@jan--f"
- "@moadz"
- "@simonpasquier"
approvers:
- "@jan--f"
- "@moadz"
- "@simonpasquier"
- "@openshift/openshift-team-monitoring"
api-approvers:
- "@deads2k"
- "@JoelSpeed"
creation-date: 2024-04-26
last-updated: 2024-04-26
tracking-link:
- "https://issues.redhat.com/browse/MON-1100"

---

# CRD ClusterMonitoring

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

* Currently, the OCP monitoring stacks are configured using ConfigMaps (one for platform monitoring and one for user-defined monitoring). In OpenShift, however, the best practice is to configure operators through custom resources.


## Motivation

* The specification is well known and, to a degree, self-documenting.
* The Kubernetes API server validates custom resources against their API specification, so users get immediate feedback on errors instead of having to check the ClusterOperator object and the Cluster Monitoring Operator's logs. The Common Expression Language (CEL) even allows writing complex validation logic, including cross-field rules (see the illustrative sketch after this list).
* Many users expect to interact with operators through a CRD.
* Custom resources play better with GitOps workflows.
* CRDs support multiple actors managing the same resource, which is a key property for the Observability service of Advanced Cluster Management.
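
To illustrate the validation point, here is a minimal, hypothetical sketch of a CRD type carrying a CEL cross-field rule (the field names and the rule itself are illustrative assumptions, not the proposed API):

```
// Hypothetical illustration only, not the proposed ClusterMonitoring API.
// The API server enforces these rules at admission time, so users get an
// immediate error instead of having to check CMO's logs afterwards.

// +kubebuilder:validation:XValidation:rule="!has(self.retentionSize) || has(self.retention)",message="retention must be set when retentionSize is set"
type RetentionExample struct {
	// retention defines how long samples are kept.
	// +optional
	Retention string `json:"retention,omitempty"`

	// retentionSize defines the maximum on-disk size of the TSDB.
	// +optional
	RetentionSize string `json:"retentionSize,omitempty"`
}
```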

### User Stories

As a cluster administrator, I want to be able to create a custom resource so that I can configure the monitoring stack.


### Goals

- Replace the existing ConfigMaps with CRDs.
- Automated and friction-less upgrade for users.

### Non-Goals

- Make the configuration more complicated or easier to break.
- Change any behavior of the monitoring stack.

## Proposal

Currently in CMO, ConfigMaps provide the way to inject configuration data into the monitoring stacks. There are two ConfigMaps, one for each stack:

cluster-monitoring-config: configures the default platform monitoring components. A set of platform monitoring components is installed in the openshift-monitoring project by default during an OpenShift Container Platform installation, providing monitoring for core cluster components including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.

user-workload-monitoring-config: configures the components that monitor user-defined projects. After monitoring for user-defined projects is optionally enabled, additional monitoring components are installed in the openshift-user-workload-monitoring project to provide monitoring for user-defined projects.


Two distinct CRDs are necessary because they are managed by different personas with specific roles and responsibilities:

- UWM admins: manage the configuration of the UWM components (edit permissions on the openshift-user-workload-monitoring/user-workload-monitoring-config ConfigMap).
- Cluster admins: manage the configuration of the Platform monitoring components.

In managed OpenShift clusters like OSD/ROSA, two separate CRDs are necessary because platform SREs manage the cluster's platform monitoring stack, while customers manage the user-defined monitoring stack. This separation ensures that each group maintains control over their specific monitoring configurations, reducing conflicts and enhancing system management.

[More info](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/multi-tenant-alerting.md)


- Replace the ConfigMaps with CRDs:

Current Config struct in CMO:

```
type Config struct {
	Images                               *Images `json:"-"`
	RemoteWrite                          bool    `json:"-"`
	CollectionProfilesFeatureGateEnabled bool    `json:"-"`

	ClusterMonitoringConfiguration *ClusterMonitoringConfiguration `json:"-"`
	UserWorkloadConfiguration      *UserWorkloadConfiguration      `json:"-"`
}
```

We will strive to maintain the previous structure as much as possible while adapting it to OpenShift API standards.

Current ClusterMonitoringConfiguration struct:
```
type ClusterMonitoringConfiguration struct {
	AlertmanagerMainConfig                   *AlertmanagerMainConfig                   `json:"alertmanagerMain,omitempty"`
	UserWorkloadEnabled                      *bool                                     `json:"enableUserWorkload,omitempty"`
	HTTPConfig                               *HTTPConfig                               `json:"http,omitempty"`
	K8sPrometheusAdapter                     *K8sPrometheusAdapter                     `json:"k8sPrometheusAdapter,omitempty"`
	MetricsServerConfig                      *MetricsServerConfig                      `json:"metricsServer,omitempty"`
	KubeStateMetricsConfig                   *KubeStateMetricsConfig                   `json:"kubeStateMetrics,omitempty"`
	PrometheusK8sConfig                      *PrometheusK8sConfig                      `json:"prometheusK8s,omitempty"`
	PrometheusOperatorConfig                 *PrometheusOperatorConfig                 `json:"prometheusOperator,omitempty"`
	PrometheusOperatorAdmissionWebhookConfig *PrometheusOperatorAdmissionWebhookConfig `json:"prometheusOperatorAdmissionWebhook,omitempty"`
	OpenShiftMetricsConfig                   *OpenShiftStateMetricsConfig              `json:"openshiftStateMetrics,omitempty"`
	TelemeterClientConfig                    *TelemeterClientConfig                    `json:"telemeterClient,omitempty"`
	ThanosQuerierConfig                      *ThanosQuerierConfig                      `json:"thanosQuerier,omitempty"`
	NodeExporterConfig                       NodeExporterConfig                        `json:"nodeExporter,omitempty"`
	MonitoringPluginConfig                   *MonitoringPluginConfig                   `json:"monitoringPlugin,omitempty"`
}
```


Each component within the ConfigMap will be migrated to the OpenShift API in separate PRs. This approach allows for a thorough review, improvement, and modification of each ConfigMap component to ensure it aligns with OpenShift API standards. As part of this process, types will be modified, outdated elements will be removed, and names and configurations will be refined. Closely related, co-dependent fields will be brought in for consideration at the same time so that the API can be shaped around them together.
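
As a concrete illustration of such a change (a hedged sketch; the exact names chosen in openshift/api may differ), OpenShift API conventions favor enumerated values over booleans, so a ConfigMap-era boolean such as `enableUserWorkload` could become an enumerated field in the new API:

```
// Illustrative sketch only; the final field and value names live in openshift/api.
//
// Today, in the ConfigMap-based types:
//   UserWorkloadEnabled *bool `json:"enableUserWorkload,omitempty"`

// UserWorkloadMonitoringState is a hypothetical enumerated replacement.
// +kubebuilder:validation:Enum=Enabled;Disabled
type UserWorkloadMonitoringState string

const (
	UserWorkloadMonitoringEnabled  UserWorkloadMonitoringState = "Enabled"
	UserWorkloadMonitoringDisabled UserWorkloadMonitoringState = "Disabled"
)
```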


## Design Details

To initiate the process, we will establish a feature gate that serves as the entry point for the CRD-based configuration approach. This strategy lets us make incremental progress without the immediate burden of achieving complete feature parity with the ConfigMap: we can start with the basics and progressively incorporate additional functionality as it develops.

One proposal for a minimal definition of done (DoD) was:
- Feature gate in openshift/api
- API types moved to openshift/api
- Initial CRD development: https://github.com/openshift/api/pull/1929
  - Add controller-gen logic to the Makefile
  - Add the API to api/config/v1
  - Add the generated CRD: config/v1/zz_generated.crd-manifests/0000_10_config-operator_01_clustermonitoring.crd.yaml
  - Add an example custom resource: config/examples/clustermonitoringoperator.yaml
- Client codegen
- Reconcile logic: https://github.com/openshift/cluster-monitoring-operator/pull/2350
- Decouple the ConfigMap and the custom resource: the controller logic is strongly dependent on the *manifests.Config struct (see the sketch after this list).
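
A minimal sketch of one possible decoupling approach, assuming hypothetical type, field, and function names: the custom resource is translated onto the existing *manifests.Config so that the downstream controller logic can stay unchanged.

```
// Hypothetical sketch: overlay the ClusterMonitoring custom resource onto the
// legacy *manifests.Config so the rest of CMO keeps consuming a single
// configuration structure. Type, field, and function names are illustrative.
func applyCustomResource(cfg *Config, cr *ClusterMonitoring) {
	if cr == nil || cfg.ClusterMonitoringConfiguration == nil {
		return
	}
	if cr.Spec.UserWorkloadMonitoring == UserWorkloadMonitoringEnabled {
		enabled := true
		cfg.ClusterMonitoringConfiguration.UserWorkloadEnabled = &enabled
	}
	// ...map the remaining CR fields onto the legacy structure...
}
```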


### Example configuration


#### Current ConfigMap

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    telemeterClient:
      enabled: false
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-data
          annotations:
            openshift.io/cluster-monitoring-drop-pvc: "yes"
        spec:
          resources:
            requests:
              storage: 20Gi
```

#### CRD

```
apiVersion: clustermonitoring.config.openshift.io
kind: ClusterMonitoring
metadata:
  name: cluster
  namespace: openshift-monitoring
spec:
  prometheusK8s:
    volumeClaimTemplate:
      metadata:
        name: prometheus-data
        annotations:
          openshift.io/cluster-monitoring-drop-pvc: "yes"
      spec:
        resources:
          requests:
            storage: 20Gi
```


### Workflow Description

See the example in the [Proposal](#proposal) section for details.

### API Extensions

Add two new CRDs:
- ClusterMonitoring
- UserWorkloadClusterMonitoring

### Topology Considerations

The feature will function identically in all topologies.

#### Hypershift / Hosted Control Planes
No special considerations.

#### Standalone Clusters
No special considerations.

#### Single-node Deployments or MicroShift
No special considerations.

### Implementation Details/Notes/Constraints


### Migration path


- Switch mechanism: cluster admins can switch between ConfigMap-based and CRD-based configuration during the transition.
- Both ConfigMap and CRD will coexist for a while until the transition is complete.
- The CRD and the ConfigMap will be merged into a single structure, and this merged structure will configure CMO.
- Extensive testing will be required to validate the new behavior.
- If a field is set differently in the CRD and in the ConfigMap, the value from the ConfigMap will take precedence.
- Some fields in the CRD and the ConfigMap diverge due to API requirements; helper functions will be needed to resolve this (see the sketch after this list).
- The end user will be able to choose to use the ConfigMap or the CRD at any time, keeping in mind that the fields covered by the CRD will gradually expand.
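
A minimal sketch of what such helper functions could look like, reusing the hypothetical enum from the earlier sketch (all names and signatures are assumptions, not the actual implementation):

```
// Hypothetical helpers illustrating the migration rules above.

// A value set in the ConfigMap takes precedence over the custom resource.
func resolveRetention(fromConfigMap, fromCR string) string {
	if fromConfigMap != "" {
		return fromConfigMap
	}
	return fromCR
}

// Some fields diverge between the ConfigMap types and the new API (for
// example an enum in the CRD versus a *bool in the ConfigMap), so small
// conversion helpers bridge the two representations.
func userWorkloadEnabled(state UserWorkloadMonitoringState) *bool {
	enabled := state == UserWorkloadMonitoringEnabled
	return &enabled
}
```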

### Risks and Mitigations

- Issues while the ConfigMap and the CRD coexist: because both are merged into a single structure, this could be error prone, and extensive testing will be needed.
- Mismatches between the configuration expressed in the ConfigMap and in the custom resource.

### Drawbacks

- Migration: users can use either the ConfigMap or the CRD, or both; we need to explain how to make the transition, and the preference is to use the CRD.
- Coexistence of both the ConfigMap and the CRD.


## Test Plan

- Unit tests for the feature
- e2e tests covering the feature
- Upgrade testing
- FeatureGate testing

### Open questions

- Merging the CRD and the ConfigMap could be error prone and will need extra tests.
- How do we handle changes of types and names while both the CRD and the ConfigMap exist?
- Which tools/mechanisms do we provide to cluster admins to minimize the migration burden? Should CMO automatically create the custom resource from the current ConfigMap when the CRD becomes GA? Or do we want a semi-automated approach with explicit approval from the cluster admins?
- Should we provide an escape hatch (via some annotation?) telling CMO to ignore the custom resource and only use the ConfigMap in case something goes bad?
- How do we tell cluster admins that they need to migrate? CMO conditions could be an option, as well as alerting and Insights.
- How do we know how many clusters have migrated to the new CRD? Telemetry, presumably.
- How long should we keep support for the ConfigMap? How will we know that all users have migrated, and how can we move the remaining ones over?

## Graduation Criteria

The feature will graduate from Tech Preview to GA.

### Dev Preview -> Tech Preview

### Tech Preview -> GA

- Tech Preview was available for at least one release
- High severity bugs are fixed
- Reliable CI signal, minimal test flakes.
- e2e tests are in place

### Removing a deprecated feature

## Upgrade / Downgrade Strategy

The feature will not affect the standard upgrade or downgrade process of the cluster.

## Version Skew Strategy

N/A

## Implementation History

## Operational Aspects of API Extensions

## Support Procedures

## Alternatives