
WIP: MON-3547: Configuration CRD for the Cluster Monitoring Operator #1627

Open
wants to merge 6 commits into master

Conversation

marioferh

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2024
Contributor

openshift-ci bot commented May 15, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jan--f for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@simonpasquier
Contributor

/cc @jan--f

@openshift-ci openshift-ci bot requested a review from jan--f May 23, 2024 07:58
@marioferh force-pushed the crd_based_cmo branch 2 times, most recently from f9b89e6 to 0705ade on May 23, 2024 14:41
@marioferh changed the title from "WIP: CRD based CMO" to "WIP: MON-3547: CRD based CMO" on May 23, 2024
@openshift-ci-robot

openshift-ci-robot commented May 23, 2024

@marioferh: This pull request references MON-3547 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 23, 2024
Signed-off-by: Mario Fernandez <mariofer@redhat.com>
enhancements/monitoring/crd-based-cmo.md — outdated review threads (resolved)

### Transition to the user

- How the user could adopt CR instead of configmap.
Contributor

IMHO the upgrade experience brings many questions (not limited to):

  • Which tools/mechanisms do we provide to cluster admins to minimize the migration burden? Should CMO automatically create the custom resource from the current configmap when the CRD becomes GA? Or do we want a semi-automated approach with explicit approval from the cluster admins?
  • Should we provide an escape hatch (via some annotation?) telling CMO to ignore the custom resource and only use the configmap in case something goes bad?
  • How do we tell cluster admins that they need to migrate? CMO conditions could be an option as well as alerting and Insights?
  • How do we know how many clusters have migrated to the new CRD? Telemetry I guess.
  • How long should we keep the support for the configmap?
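One possible shape for the escape-hatch idea floated above — an annotation telling CMO to ignore the custom resource and keep honoring the ConfigMap — could look like the following. This is purely hypothetical: neither the annotation key nor its semantics exist today, and the `apiVersion`/`kind` assume the `ClusterMonitoring` type from openshift/api#1929.

```yaml
# Hypothetical escape hatch: an annotation on the custom resource telling
# CMO to fall back to the ConfigMap. No such annotation exists yet.
apiVersion: config.openshift.io/v1
kind: ClusterMonitoring
metadata:
  name: cluster
  annotations:
    monitoring.openshift.io/ignore-custom-resource: "true"
```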

Author

Thinking about it

Contributor

Given how CMO works I think we might want to prioritize a low friction experience. Ideally admins won't have to do anything even after we enable the feature gate by default...if possible.
It seems that either the configmaps and CRs coexisting, or a translation from configmap to CRD, will have to be part of this.

Currently I lean towards the coexisting approach. I think this allows more flexibility when adding CRD fields. For example we might want to add stricter validations to a field in the CRD. With the translation approach we might not be able to pull something in without user intervention.
Otoh both config artifacts coexisting might raise issues with how a feature is configured. Perhaps that could be clarified through status conditions and additional logging?
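The status-condition idea mentioned above could take a shape like the following. The condition type, reason, and field name are all hypothetical illustrations, not an existing CMO API:

```yaml
# Hypothetical status condition surfacing which config artifact is in effect
# when both the ConfigMap and the CR configure the same feature.
status:
  conditions:
  - type: ConfigSourceDegraded      # hypothetical condition type
    status: "True"
    reason: ConflictingConfiguration
    message: >-
      prometheusK8s.retention is set in both the cluster-monitoring-config
      ConfigMap and the ClusterMonitoring resource; the ConfigMap value is
      in effect.
```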

Signed-off-by: Mario Fernandez <mariofer@redhat.com>
@marioferh changed the title from "WIP: MON-3547: CRD based CMO" to "WIP: MON-3547: Configuration CRD for the Cluster Monitoring Operator" on Jun 12, 2024
Co-authored-by: Jan Fajerski <jan--f@users.noreply.github.com>
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 30, 2024
Contributor @JoelSpeed left a comment

At the moment, it seems there are a lot of open questions and things that have not yet been thought through. It would be good to get some of those open questions discussed before we get too far down the road of implementing an API

Comment on lines 99 to 100
ClusterMonitoringConfiguration
UserWorkloadConfiguration
Contributor

Why 2 CRDs instead of one?

Contributor

Because they can have different personas managing each. For instance with managed OpenShift clusters (OSD/ROSA), the platform SREs own the cluster (platform) monitoring stack while platform users (e.g. customers) own the user-defined monitoring stack.

Signed-off-by: Mario Fernandez <mariofer@redhat.com>
Contributor @simonpasquier left a comment

It looks like the proposal lacks a few sections from the template (like "Risks and mitigations" for instance). Can you add them?

enhancements/monitoring/monitoring-config.md — outdated review threads (resolved)

### Overview

- Replace confimgaps with CRD:
Contributor

Can you add a bit more details on the current implementation and the overall context for readers not familiar with the product?

  • Describe the 2 stacks (platform & UWM)
  • How does an admin configure each stack? Relationship between the 2?
  • Explain that there can be 2 personas responsible for each stack (the managed OpenShift use case).

A link to https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/multi-tenant-alerting.md would be useful.

Author

Add improvements

UserWorkloadConfiguration *UserWorkloadConfiguration `json:"-"`
}
```

Contributor

Maybe add here the general direction that we're going to keep more or less the structure of ClusterMonitoringConfiguration and UserWorkloadConfiguration but align them to the OpenShift API standards.

Author

Add improvements

marioferh and others added 2 commits August 7, 2024 15:17
Co-authored-by: Simon Pasquier <spasquie@redhat.com>
Signed-off-by: Mario Fernandez <mariofer@redhat.com>
Contributor

openshift-ci bot commented Aug 9, 2024

@marioferh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | 1ed7227 | link | true | /test markdownlint |


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

## Motivation

* The specification is well known and to a degree self-documenting.
* The Kubernetes API Server validate custom resources based on their API specification, so users get immediate feedback on errors instead of checking the ClusterOperator object and the cluster monitoring operator's logs. Custom expression language (CEL) even allows to write complex validation logic, including cross-field tests.
Contributor

Nit: Common expressions language, not Custom expressions language
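To illustrate the CEL point in the motivation above: a CRD schema can embed validation rules that the API server enforces at admission time, including cross-field checks. This is a generic sketch using `x-kubernetes-validations`; the field names are illustrative, not the proposed ClusterMonitoring schema.

```yaml
# Generic sketch of a CEL cross-field rule in a CRD's OpenAPI schema.
# Field names (retention, retentionSize) are illustrative only.
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      x-kubernetes-validations:
      - rule: "!has(self.retentionSize) || has(self.retention)"
        message: "retention must be set whenever retentionSize is set"
      properties:
        retention:
          type: string
        retentionSize:
          type: string
```

With this in place, an invalid object is rejected at `kubectl apply` time instead of surfacing later in operator logs.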

- Replace confimgaps with CRD:

```
type Config struct {
Contributor

Is this the existing config?



```
type ClusterMonitoringConfiguration struct {
Contributor

Is this the existing state? Or desired?



- Switch mecanishm
- Both ConfigMap and CRD will coexist for a while until the transition is complete.
Contributor

How will clusters that do not currently have a CR configured, but do have the configmap, continue to function without manual intervention as you go through upgrades?

- Extensive testing will be required to validate the new behavior.
- If there are different fields in CRD and ConfigMap, the field in the ConfigMap will take precedence.
- Some fields in CRD and ConfigMap diverge due to API requirements. Helper functions will be needed to resolve this.
- The end user will be able to choose to use ConfigMap or CRD at any time, keeping in mind that the fields in the CRD will gradually expand.
Contributor

Will this be a conscious choice? Is there some explicit switch for this or is it just a case of if one or the other exists?


Can users still choose between the ConfigMap and the CRD after the feature goes GA?
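The precedence rule quoted above (the ConfigMap field wins over the CRD field while both coexist) could be sketched as a small resolution helper. All names here are hypothetical illustrations, not CMO's actual code.

```go
package main

import "fmt"

// resolveRetention picks the effective value for a single field during the
// transition window: the ConfigMap value wins whenever both sources set it.
// Function and parameter names are illustrative, not CMO's real API.
func resolveRetention(fromConfigMap, fromCRD string) string {
	if fromConfigMap != "" {
		return fromConfigMap // ConfigMap takes precedence while both coexist
	}
	return fromCRD // fall back to the CRD value (empty if neither is set)
}

func main() {
	fmt.Println(resolveRetention("15d", "30d")) // both set: prints 15d
	fmt.Println(resolveRetention("", "30d"))    // only CRD set: prints 30d
}
```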

[More info](https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/multi-tenant-alerting.md)


- Replace confimgaps with CRD:

Suggested change
- Replace confimgaps with CRD:
- Replace configmaps with CRD:

Add example CustomResource: config/examples/clustermonitoringoperator.yaml
- Client codegen
- Reconcile logic: https://github.com/openshift/cluster-monitoring-operator/pull/2350
- Add decoupling Confimgap / CustomResource:

Suggested change
- Add decoupling Confimgap / CustomResource:
- Add decoupling Configmap / CustomResource:

## Motivation

* The specification is well known and to a degree self-documenting.
* The Kubernetes API Server validate custom resources based on their API specification, so users get immediate feedback on errors instead of checking the ClusterOperator object and the cluster monitoring operator's logs. Custom expression language (CEL) even allows to write complex validation logic, including cross-field tests.

Suggested change
* The Kubernetes API Server validate custom resources based on their API specification, so users get immediate feedback on errors instead of checking the ClusterOperator object and the cluster monitoring operator's logs. Custom expression language (CEL) even allows to write complex validation logic, including cross-field tests.
* The Kubernetes API Server validates custom resources based on their API specification, so users get immediate feedback on errors instead of checking the ClusterOperator object and the cluster monitoring operator's logs. Custom expression language (CEL) even allows to write complex validation logic, including cross-field tests.

### Migration path


- Switch mecanishm

Suggested change
- Switch mecanishm
- Switch mechanism


## Design Details

To initiate the process, let's establish a feature gate that will serve as the entry point for implementing a CRD configuration approach. This strategy enables us to make incremental advancements without the immediate burden of achieving complete feature equivalence with the config map. We can commence with a the basics and progressively incorporate additional functionalities as they develop.

Suggested change
To initiate the process, let's establish a feature gate that will serve as the entry point for implementing a CRD configuration approach. This strategy enables us to make incremental advancements without the immediate burden of achieving complete feature equivalence with the config map. We can commence with a the basics and progressively incorporate additional functionalities as they develop.
To initiate the process, let's establish a feature gate that will serve as the entry point for implementing a CRD configuration approach. This strategy enables us to make incremental advancements without the immediate burden of achieving complete feature equivalence with the config map. We can commence with the basics and progressively incorporate additional functionalities as they develop.
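The feature-gate approach described in the design details could gate the reconcile path roughly as below. The gate name `ClusterMonitoringConfig` and all function names are hypothetical stand-ins, not the actual openshift/api gate.

```go
package main

import "fmt"

// featureGateEnabled stands in for a lookup against the cluster's
// FeatureGate resource; the gate name used here is hypothetical.
func featureGateEnabled(gates map[string]bool, name string) bool {
	return gates[name] // missing key yields false, i.e. gate off
}

// configSource decides which configuration path the operator reconciles
// from: the new custom resource when the gate is on, else the configmap.
func configSource(gates map[string]bool) string {
	if featureGateEnabled(gates, "ClusterMonitoringConfig") {
		return "crd"
	}
	return "configmap"
}

func main() {
	fmt.Println(configSource(map[string]bool{"ClusterMonitoringConfig": true})) // crd
	fmt.Println(configSource(nil))                                              // configmap
}
```

This keeps the default behavior untouched for clusters without the gate, which is what allows incremental progress without full feature parity up front.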

- "@danielmellado"
- "@jan--f"
- "@moadz"
- "simonpasquier"

Suggested change
- "simonpasquier"
- "@simonpasquier"

reviewers:
- "@jan--f"
- "@moadz"
- "simonpasquier"

Suggested change
- "simonpasquier"
- "@simonpasquier"

approvers:
- "@jan--f"
- "@moadz"
- "simonpasquier"

Suggested change
- "simonpasquier"
- "@simonpasquier"


### Overview

Currently in CMO a config map provides a way to inject configuration data into pods. There are two configmaps for the different stacks:

Suggested change
Currently in CMO a config map provides a way to inject configuration data into pods. There are two configmaps for the different stacks:
Currently in CMO, a config map provides a way to inject configuration data into pods. There are two configmaps for the different stacks:


Currently in CMO a config map provides a way to inject configuration data into pods. There are two configmaps for the different stacks:

cluster-monitoring-configmap: Default platform monitoring components. A set of platform monitoring components are installed in the openshift-monitoring project by default during an OpenShift Container Platform installation. This provides monitoring for core cluster components including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.

Suggested change
cluster-monitoring-configmap: Default platform monitoring components. A set of platform monitoring components are installed in the openshift-monitoring project by default during an OpenShift Container Platform installation. This provides monitoring for core cluster components including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.
cluster-monitoring-config: Default platform monitoring components. A set of platform monitoring components are installed in the openshift-monitoring project by default during an OpenShift Container Platform installation. This provides monitoring for core cluster components including Kubernetes services. The default monitoring stack also enables remote health monitoring for clusters.
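For readers unfamiliar with the current mechanism, the ConfigMap described above carries the whole stack configuration as a YAML blob under a single `config.yaml` key, which is what the proposal wants to replace with a typed CRD. A typical example (field values illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi
```

Because everything lives inside an opaque string, the API server cannot validate any of it — errors only surface in CMO's status and logs.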


One proposal for a minimal DoD was:
- Feature gate in openshift/api
- Api types moved to openshift/api

Suggested change
- Api types moved to openshift/api
- API types moved to openshift/api

To initiate the process, let's establish a feature gate that will serve as the entry point for implementing a CRD configuration approach. This strategy enables us to make incremental advancements without the immediate burden of achieving complete feature equivalence with the config map. We can commence with a the basics and progressively incorporate additional functionalities as they develop.

One proposal for a minimal DoD was:
- Feature gate in openshift/api

Suggested change
- Feature gate in openshift/api
- Feature gate in openshift/api

it seems you want to get a list, but the format is

One proposal for a minimal DoD was: - Feature gate in openshift/api - Api types moved to openshift/api - CRD Initial dev https://github.com/openshift/api/pull/1929 Add controller-gen logic to makefile Add API to api/config/v1 Add Generated CRD: config/v1/zz_generated.crd-manifests/0000_10_config-operator_01_clustermonitoring.crd.yaml Add example CustomResource: config/examples/clustermonitoringoperator.yaml - Client codegen - Reconcile logic: https://github.com/openshift/cluster-monitoring-operator/pull/2350 - Add decoupling Confimgap / CustomResource: Controller logic is strongly dependant of *manifests.Config struct.

resources:
  requests:
    storage: 20Gi


could remove the empty line

spec:
  resources:
    requests:
      storage: 20G`
@juzhao Aug 12, 2024

Suggested change
storage: 20G`
storage: 20G

or

Suggested change
storage: 20G`
storage: 20Gi

### Test Plan

- Unit tests for the feature
- e2e tests covering the feature
@juzhao Aug 12, 2024

I think upgrade testing should also be covered

@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 20, 2024