MON-3547: Configuration CRD for the Cluster Monitoring Operator #1627

enhancements/monitoring/monitoring-config.md

---
title: monitoring-config
authors:
- "@marioferh"
- "@danielmellado"
- "@jan--f"
- "@moadz"
- "@simonpasquier"
reviewers:
- "@jan--f"
- "@moadz"
- "@simonpasquier"
approvers:
- "@jan--f"
- "@moadz"
- "@simonpasquier"
- "@openshift/openshift-team-monitoring"
api-approvers:
- "@deads2k"
- "@JoelSpeed"
creation-date: 2024-04-26
last-updated: 2024-04-26
tracking-link:
- "https://issues.redhat.com/browse/MON-1100"

---

# CRD ClusterMonitoring

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

* Currently, the OCP monitoring stacks are configured using ConfigMaps (one for platform monitoring and one for user-defined monitoring). In OpenShift, however, the best practice is to configure operators through custom resources.


## Motivation

* The specification is well known and, to a degree, self-documenting.
* The Kubernetes API server validates custom resources against their API specification, so users get immediate feedback on errors instead of having to check the ClusterOperator object and the Cluster Monitoring Operator's logs. The Common Expression Language (CEL) even allows writing complex validation logic, including cross-field rules (see the illustrative sketch after this list).
* Many users expect to interact with operators through a CRD.
* Custom resources play better with GitOps workflows.
* CRDs support multiple actors managing the same resource, which is a key property for the Observability service of Advanced Cluster Management.
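
To illustrate the validation point, here is a minimal, hypothetical sketch of a CRD type carrying a CEL cross-field rule (the field names and the rule itself are illustrative assumptions, not the proposed API):

```
// Hypothetical illustration only, not the proposed ClusterMonitoring API.
// The API server enforces these rules at admission time, so users get an
// immediate error instead of having to check CMO's logs afterwards.

// +kubebuilder:validation:XValidation:rule="!has(self.retentionSize) || has(self.retention)",message="retention must be set when retentionSize is set"
type RetentionExample struct {
	// retention defines how long samples are kept.
	// +optional
	Retention string `json:"retention,omitempty"`

	// retentionSize defines the maximum on-disk size of the TSDB.
	// +optional
	RetentionSize string `json:"retentionSize,omitempty"`
}
```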

### User Stories

As a cluster administrator, I want to be able to create a custom resource so that I can configure the monitoring stack.


### Goals

- Replace the existing ConfigMaps with CRDs.
- Automated and friction-less upgrade for users.

### Non-Goals

- Make the configuration more complicated or easier to break.
- Change any behavior of the monitoring stack.

## Proposal

Currently in CMO, ConfigMaps provide the way to inject configuration data into the monitoring stacks. There are two ConfigMaps, one for each stack:

cluster-monitoring-config: configures the default platform monitoring components. A set of platform monitoring components is installed in the openshift-monitoring project by default during an OpenShift Container Platform installation, providing monitoring for core cluster components including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.

user-workload-monitoring-config: configures the components that monitor user-defined projects. After monitoring for user-defined projects is optionally enabled, additional monitoring components are installed in the openshift-user-workload-monitoring project to provide monitoring for user-defined projects.


Two distinct CRDs are necessary because they are managed by different personas with specific roles and responsibilities:

- UWM admins: manage the configuration of the UWM components (edit permissions on the openshift-user-workload-monitoring/user-workload-monitoring-config ConfigMap).
- Cluster admins: manage the configuration of the Platform monitoring components.

In managed OpenShift clusters like OSD/ROSA, two separate CRDs are necessary because platform SREs manage the cluster's platform monitoring stack, while customers manage the user-defined monitoring stack. This separation ensures that each group maintains control over their specific monitoring configurations, reducing conflicts and enhancing system management.

[More info](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/multi-tenant-alerting.md)


- Replace the ConfigMaps with CRDs:

Current Config struct in CMO:

```
type Config struct {
	Images                               *Images `json:"-"`
	RemoteWrite                          bool    `json:"-"`
	CollectionProfilesFeatureGateEnabled bool    `json:"-"`

	ClusterMonitoringConfiguration *ClusterMonitoringConfiguration `json:"-"`
	UserWorkloadConfiguration      *UserWorkloadConfiguration      `json:"-"`
}
```

We will strive to maintain the previous structure as much as possible while adapting it to OpenShift API standards.

Current ClusterMonitoringConfiguration struct:
```
type ClusterMonitoringConfiguration struct {
	AlertmanagerMainConfig                   *AlertmanagerMainConfig                   `json:"alertmanagerMain,omitempty"`
	UserWorkloadEnabled                      *bool                                     `json:"enableUserWorkload,omitempty"`
	HTTPConfig                               *HTTPConfig                               `json:"http,omitempty"`
	K8sPrometheusAdapter                     *K8sPrometheusAdapter                     `json:"k8sPrometheusAdapter,omitempty"`
	MetricsServerConfig                      *MetricsServerConfig                      `json:"metricsServer,omitempty"`
	KubeStateMetricsConfig                   *KubeStateMetricsConfig                   `json:"kubeStateMetrics,omitempty"`
	PrometheusK8sConfig                      *PrometheusK8sConfig                      `json:"prometheusK8s,omitempty"`
	PrometheusOperatorConfig                 *PrometheusOperatorConfig                 `json:"prometheusOperator,omitempty"`
	PrometheusOperatorAdmissionWebhookConfig *PrometheusOperatorAdmissionWebhookConfig `json:"prometheusOperatorAdmissionWebhook,omitempty"`
	OpenShiftMetricsConfig                   *OpenShiftStateMetricsConfig              `json:"openshiftStateMetrics,omitempty"`
	TelemeterClientConfig                    *TelemeterClientConfig                    `json:"telemeterClient,omitempty"`
	ThanosQuerierConfig                      *ThanosQuerierConfig                      `json:"thanosQuerier,omitempty"`
	NodeExporterConfig                       NodeExporterConfig                        `json:"nodeExporter,omitempty"`
	MonitoringPluginConfig                   *MonitoringPluginConfig                   `json:"monitoringPlugin,omitempty"`
}
```


Each component within the ConfigMap will be migrated to the OpenShift API in separate PRs. This approach allows for a thorough review, improvement, and modification of each ConfigMap component to ensure it aligns with OpenShift API standards. As part of this process, types will be modified, outdated elements will be removed, and names and configurations will be refined. Closely related, co-dependent fields will be brought in for consideration at the same time so that the API can be shaped around them together.
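
As a concrete illustration of such a change (a hedged sketch; the exact names chosen in openshift/api may differ), OpenShift API conventions favor enumerated values over booleans, so a ConfigMap-era boolean such as `enableUserWorkload` could become an enumerated field in the new API:

```
// Illustrative sketch only; the final field and value names live in openshift/api.
//
// Today, in the ConfigMap-based types:
//   UserWorkloadEnabled *bool `json:"enableUserWorkload,omitempty"`

// UserWorkloadMonitoringState is a hypothetical enumerated replacement.
// +kubebuilder:validation:Enum=Enabled;Disabled
type UserWorkloadMonitoringState string

const (
	UserWorkloadMonitoringEnabled  UserWorkloadMonitoringState = "Enabled"
	UserWorkloadMonitoringDisabled UserWorkloadMonitoringState = "Disabled"
)
```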


## Design Details

To initiate the process, we will establish a feature gate that serves as the entry point for the CRD-based configuration approach. This strategy lets us make incremental progress without the immediate burden of achieving complete feature parity with the ConfigMap: we can start with the basics and progressively incorporate additional functionality as it develops.

One proposal for a minimal definition of done (DoD) was:
- Feature gate in openshift/api
- API types moved to openshift/api
- Initial CRD development: https://github.com/openshift/api/pull/1929
  - Add controller-gen logic to the Makefile
  - Add the API to api/config/v1
  - Add the generated CRD: config/v1/zz_generated.crd-manifests/0000_10_config-operator_01_clustermonitoring.crd.yaml
  - Add an example custom resource: config/examples/clustermonitoringoperator.yaml
- Client codegen
- Reconcile logic: https://github.com/openshift/cluster-monitoring-operator/pull/2350
- Decouple the ConfigMap and the custom resource: the controller logic is strongly dependent on the *manifests.Config struct (see the sketch after this list).
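
A minimal sketch of one possible decoupling approach, assuming hypothetical type, field, and function names: the custom resource is translated onto the existing *manifests.Config so that the downstream controller logic can stay unchanged.

```
// Hypothetical sketch: overlay the ClusterMonitoring custom resource onto the
// legacy *manifests.Config so the rest of CMO keeps consuming a single
// configuration structure. Type, field, and function names are illustrative.
func applyCustomResource(cfg *Config, cr *ClusterMonitoring) {
	if cr == nil || cfg.ClusterMonitoringConfiguration == nil {
		return
	}
	if cr.Spec.UserWorkloadMonitoring == UserWorkloadMonitoringEnabled {
		enabled := true
		cfg.ClusterMonitoringConfiguration.UserWorkloadEnabled = &enabled
	}
	// ...map the remaining CR fields onto the legacy structure...
}
```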


### Example configuration


#### Current ConfigMap

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    telemeterClient:
      enabled: false
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: prometheus-data
          annotations:
            openshift.io/cluster-monitoring-drop-pvc: "yes"
        spec:
          resources:
            requests:
              storage: 20Gi
```

#### CRD

```
apiVersion: clustermonitoring.config.openshift.io
kind: ClusterMonitoring
metadata:
  name: cluster
  namespace: openshift-monitoring
spec:
  prometheusK8s:
    volumeClaimTemplate:
      metadata:
        name: prometheus-data
        annotations:
          openshift.io/cluster-monitoring-drop-pvc: "yes"
      spec:
        resources:
          requests:
            storage: 20Gi
```


### Workflow Description

See the example in the [Proposal](#proposal) section for details.

### API Extensions

Add two new CRDs:
- ClusterMonitoring
- UserWorkloadClusterMonitoring

### Topology Considerations

The feature will function identically in all topologies.

#### Hypershift / Hosted Control Planes
No special considerations.

#### Standalone Clusters
No special considerations.

#### Single-node Deployments or MicroShift
No special considerations.

### Implementation Details/Notes/Constraints


### Migration path


- Switch mechanism: cluster admins can switch between ConfigMap-based and CRD-based configuration during the transition.
- Both ConfigMap and CRD will coexist for a while until the transition is complete.
- The CRD and the ConfigMap will be merged into a single structure, and this merged structure will configure CMO.
- Extensive testing will be required to validate the new behavior.
- If a field is set differently in the CRD and in the ConfigMap, the value from the ConfigMap will take precedence.
- Some fields in the CRD and the ConfigMap diverge due to API requirements; helper functions will be needed to resolve this (see the sketch after this list).
- The end user will be able to choose to use the ConfigMap or the CRD at any time, keeping in mind that the fields covered by the CRD will gradually expand.
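
A minimal sketch of what such helper functions could look like, reusing the hypothetical enum from the earlier sketch (all names and signatures are assumptions, not the actual implementation):

```
// Hypothetical helpers illustrating the migration rules above.

// A value set in the ConfigMap takes precedence over the custom resource.
func resolveRetention(fromConfigMap, fromCR string) string {
	if fromConfigMap != "" {
		return fromConfigMap
	}
	return fromCR
}

// Some fields diverge between the ConfigMap types and the new API (for
// example an enum in the CRD versus a *bool in the ConfigMap), so small
// conversion helpers bridge the two representations.
func userWorkloadEnabled(state UserWorkloadMonitoringState) *bool {
	enabled := state == UserWorkloadMonitoringEnabled
	return &enabled
}
```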

### Risks and Mitigations

- Issues while the ConfigMap and the CRD coexist: because both are merged into a single structure, this could be error prone, and extensive testing will be needed.
- Mismatches between the configuration expressed in the ConfigMap and in the custom resource.

### Drawbacks

- Migration: users can use either the ConfigMap or the CRD, or both; we need to explain how to make the transition, and the preference is to use the CRD.
- Coexistence of both the ConfigMap and the CRD.


## Test Plan

- Unit tests for the feature
- e2e tests covering the feature
- Upgrade testing
- FeatureGate testing

### Open questions

- Merging the CRD and the ConfigMap could be error prone and will need extra tests.
- How do we handle changes of types and names while both the CRD and the ConfigMap exist?
- Which tools/mechanisms do we provide to cluster admins to minimize the migration burden? Should CMO automatically create the custom resource from the current ConfigMap when the CRD becomes GA? Or do we want a semi-automated approach with explicit approval from the cluster admins?
- Should we provide an escape hatch (via some annotation?) telling CMO to ignore the custom resource and only use the ConfigMap in case something goes bad?
- How do we tell cluster admins that they need to migrate? CMO conditions could be an option, as well as alerting and Insights.
- How do we know how many clusters have migrated to the new CRD? Telemetry, presumably.
- How long should we keep support for the ConfigMap? How will we know that all users have migrated, and how can we move the remaining ones over?

## Graduation Criteria

The feature will graduate from Tech Preview to GA.

### Dev Preview -> Tech Preview

### Tech Preview -> GA

- Tech Preview was available for at least one release
- High severity bugs are fixed
- Reliable CI signal, minimal test flakes.
- e2e tests are in place

### Removing a deprecated feature

## Upgrade / Downgrade Strategy

The feature will not affect the standard upgrade or downgrade process of the cluster.

## Version Skew Strategy

N/A

## Implementation History

## Operational Aspects of API Extensions

## Support Procedures

## Alternatives