Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MON-3802: implement cross-namespace rules for UWM #2307

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

simonpasquier
Copy link
Contributor

@simonpasquier simonpasquier commented Apr 9, 2024

This change introduces a way to deploy user-defined rules which are not
scoped to their namespace of origin.

To enable the feature, a user-defined monitoring admin needs to
configure at least one namespace in the UWM ConfigMap:

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |-
        namespacesWithoutLabelEnforcement: [ user-monitoring-shared ]

For all PrometheusRule objects defined in the user-monitoring-shared
namespace, Prometheus and Thanos Ruler evaluate the PromQL expressions
without enforcing the namespace label of origin. It makes it possible to
have generic rules that get applied to all (or a subset of) the user
projects instead of having individual rule objects in each user project.

The capability is enabled by default but a cluster admin can decide to
disable it in the CMO ConfigMap:

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    data:
      config.yaml: |-
        userWorkloadEnabled: true
        userWorkload:
         rulesWithoutLabelEnforcementAllowed: false

For example, a user-defined admin can create a single rule that fires
when a user namespace doesn't enforce the Restricted pod security
policy.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: security
      namespace: user-monitoring-shared
    spec:
      groups:
        - name: pod-security-policy
          rules:
            - alert: "NamespaceNotEnforcingRestrictedPolicy"
              expr: kube_namespace_labels{namespace!~"(openshift|kube).*|default",label_pod_security_kubernetes_io_enforce!="restricted"}
              for: 5m
              annotations:
                summary: "Restricted policy not enforced"
                description: "Namespace {{ $labels.namespace }} doesn't enforce the Restricted pod security policy."
              labels:
                severity: warning
  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 9, 2024
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Apr 9, 2024

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@simonpasquier simonpasquier changed the title MON-3802: implement cross-namespace rules for UWM [WIP] MON-3802: implement cross-namespace rules for UWM Apr 9, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 9, 2024
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Apr 9, 2024

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented Apr 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 9, 2024
@simonpasquier simonpasquier force-pushed the MON-3802 branch 5 times, most recently from 5db1763 to 3b2e644 Compare April 11, 2024 13:38
@simonpasquier
Copy link
Contributor Author

/skip

@simonpasquier simonpasquier changed the title [WIP] MON-3802: implement cross-namespace rules for UWM MON-3802: implement cross-namespace rules for UWM Apr 12, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 12, 2024
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Apr 12, 2024

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

This change introduces a way to deploy user-defined rules which are not
scoped to their namespace of origin.

To enable the feature, a user-defined monitoring admin needs to
configure at least one namespace in the UWM ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: user-workload-monitoring-config
     namespace: openshift-user-workload-monitoring
   data:
     config.yaml: |-
       namespacesWithoutLabelEnforcement: [ user-monitoring-shared ]

For all PrometheusRule objects defined in the user-monitoring-shared
namespace, Prometheus and Thanos Ruler evaluate the PromQL expressions
without enforcing the namespace label of origin. It makes it possible to
have generic rules that get applied to all (or a subset of) the user
projects instead of having individual rule objects in each user project.

The capability is enabled by default but a cluster admin can decide to
disable it in the CMO ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: cluster-monitoring-config
     namespace: openshift-monitoring
   data:
     config.yaml: |-
       userWorkloadEnabled: true
       userWorkload:
        rulesWithoutNamespaceLabelEnforcementEnabled: false
  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@simonpasquier
Copy link
Contributor Author

/cc @bburt-rh
/cc @jan--f

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Apr 12, 2024

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

This change introduces a way to deploy user-defined rules which are not
scoped to their namespace of origin.

To enable the feature, a user-defined monitoring admin needs to
configure at least one namespace in the UWM ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: user-workload-monitoring-config
     namespace: openshift-user-workload-monitoring
   data:
     config.yaml: |-
       namespacesWithoutLabelEnforcement: [ user-monitoring-shared ]

For all PrometheusRule objects defined in the user-monitoring-shared
namespace, Prometheus and Thanos Ruler evaluate the PromQL expressions
without enforcing the namespace label of origin. It makes it possible to
have generic rules that get applied to all (or a subset of) the user
projects instead of having individual rule objects in each user project.

The capability is enabled by default but a cluster admin can decide to
disable it in the CMO ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: cluster-monitoring-config
     namespace: openshift-monitoring
   data:
     config.yaml: |-
       userWorkloadEnabled: true
       userWorkload:
        rulesWithoutLabelEnforcementAllowed: false
  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Apr 12, 2024

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

This change introduces a way to deploy user-defined rules which are not
scoped to their namespace of origin.

To enable the feature, a user-defined monitoring admin needs to
configure at least one namespace in the UWM ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: user-workload-monitoring-config
     namespace: openshift-user-workload-monitoring
   data:
     config.yaml: |-
       namespacesWithoutLabelEnforcement: [ user-monitoring-shared ]

For all PrometheusRule objects defined in the user-monitoring-shared
namespace, Prometheus and Thanos Ruler evaluate the PromQL expressions
without enforcing the namespace label of origin. It makes it possible to
have generic rules that get applied to all (or a subset of) the user
projects instead of having individual rule objects in each user project.

The capability is enabled by default but a cluster admin can decide to
disable it in the CMO ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: cluster-monitoring-config
     namespace: openshift-monitoring
   data:
     config.yaml: |-
       userWorkloadEnabled: true
       userWorkload:
        rulesWithoutLabelEnforcementAllowed: false

For example, a user-defined admin can create a single rule that fires
when a user namespace doesn't enforce the Restricted pod security
policy.

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: security
     namespace: user-monitoring-shared
   spec:
     groups:
       - name: pod-security-policy
         rules:
           - alert: "NamespaceNotEnforcingRestrictedPolicy"
             expr: kube_namespace_labels{namespace!~"(openshift|kube).*|default)",label_pod_security_kubernetes_io_enforce!="restricted"}
             for: 5m
             annotations:
               summary: "Restricted policy not enforced"
               description: "Namespace {{ $labels.namespace }} doesn't enforce the Restricted pod security policy."
             labels:
               severity: warning
  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@simonpasquier simonpasquier force-pushed the MON-3802 branch 2 times, most recently from d319656 to 22bb813 Compare April 15, 2024 08:52
@simonpasquier
Copy link
Contributor Author

current status:

  • Need to fix the e2e tests when cross-namespace rules are disabled.
  • I found an issue with the dev console: the alerts generated from cross-namespace rules aren't visible because the console queries for the rules with the exact namespace value.

@jan--f
Copy link
Contributor

jan--f commented Apr 15, 2024

As mentioned in the sync earlier, I would like to make sure we make it very clear to the user what we mean by cross-namespace. Iiuc this means user namespace as well as system namespaces. Since this currently talks about cross namespace in the context of UWM, I think we should make it explicit that system namespaces can be queried as well for such rules (if that is actually the case).
I'm sure @bburt-rh would know best how to phrase that.

@simonpasquier simonpasquier force-pushed the MON-3802 branch 2 times, most recently from f9d4f45 to 5b874d1 Compare April 16, 2024 06:25
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 16, 2024
@simonpasquier
Copy link
Contributor Author

summarizing offline discussions about the console issue

The problem is that cross-namespace rules won't be displayed in the developer console.

As an example, I've deployed a cross-namespace rule (NamespaceNotEnforcingRestrictedPolicy ) into the user-monitoring-shared namespace. The rule fires an alert for the ns1 namespace but the alert isn't visible in the dev console:

Screenshot from 2024-04-16 12-22-13

It is visible in the admin console though:

Screenshot from 2024-04-16 12-23-09

In terms of user experience, it is less than ideal since a user with only access to the ns1 project can't see the alert being active and they can't silence it if they receive an Alertmanager notification for it.

The reason behind the issue is that the console uses the /api/v1/rules endpoint exposed by prom-label-proxy which will only return alerting rules with a static namespace="<selected namespace>" label.

Possible options being discussed:

  • Do nothing in the short term and accept it as a know issue.
  • Have the dev console use the /api/v1/alerts endpoint instead. The problem is that we won't have access to the alert definition (including the PromQL expression) which makes it hard for the user to understand the cause and investigate further.
  • Modify prom-label-proxy to return any rule that matches the given namespace or that has an alert matching the given namespace. It looks like the most appropriate solution and something that also makes outside of OCP.

@simonpasquier
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2024
@simonpasquier
Copy link
Contributor Author

Modify prom-label-proxy to return any rule that matches the given namespace or that has an alert matching the given namespace. It looks like the most appropriate solution and something that also makes outside of OCP.

I tested this with openshift/prom-label-proxy#369 and it's almost working. When clicking on the alert link to open the PromQL expression in the metrics dashboard, prom-label-proxy replies with a 400 status code and label matcher value (namespace="user-monitoring-shared") conflicts with injected value (namespace!~"(openshift|kube).*|default"). This is because prom-label-proxy runs with -error-on-replace.

Screenshot from 2024-04-17 14-46-54

Screenshot from 2024-04-17 14-46-49

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jun 17, 2024

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.17.0" version, but no target version was set.

In response to this:

This change introduces a way to deploy user-defined rules which are not
scoped to their namespace of origin.

To enable the feature, a user-defined monitoring admin needs to
configure at least one namespace in the UWM ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: user-workload-monitoring-config
     namespace: openshift-user-workload-monitoring
   data:
     config.yaml: |-
       namespacesWithoutLabelEnforcement: [ user-monitoring-shared ]

For all PrometheusRule objects defined in the user-monitoring-shared
namespace, Prometheus and Thanos Ruler evaluate the PromQL expressions
without enforcing the namespace label of origin. It makes it possible to
have generic rules that get applied to all (or a subset of) the user
projects instead of having individual rule objects in each user project.

The capability is enabled by default but a cluster admin can decide to
disable it in the CMO ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: cluster-monitoring-config
     namespace: openshift-monitoring
   data:
     config.yaml: |-
       userWorkloadEnabled: true
       userWorkload:
        rulesWithoutLabelEnforcementAllowed: false

For example, a user-defined admin can create a single rule that fires
when a user namespace doesn't enforce the Restricted pod security
policy.

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: security
     namespace: user-monitoring-shared
   spec:
     groups:
       - name: pod-security-policy
         rules:
           - alert: "NamespaceNotEnforcingRestrictedPolicy"
             expr: kube_namespace_labels{namespace!~"(openshift|kube).*|default",label_pod_security_kubernetes_io_enforce!="restricted"}
             for: 5m
             annotations:
               summary: "Restricted policy not enforced"
               description: "Namespace {{ $labels.namespace }} doesn't enforce the Restricted pod security policy."
             labels:
               severity: warning
  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@simonpasquier
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 8, 2024
@simonpasquier
Copy link
Contributor Author

/retest-required

1 similar comment
@simonpasquier
Copy link
Contributor Author

/retest-required

@simonpasquier
Copy link
Contributor Author

/skip

This change introduces a way to deploy user-defined rules which are not
scoped to their namespace of origin.

To enable the feature, a user-defined monitoring admin needs to
configure at least one namespace in the UWM ConfigMap:

```
kind: ConfigMap
apiVersion: v1
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |-
    namespacesWithoutLabelEnforcement: [ user-monitoring-shared ]
```

For all `PrometheusRule` objects defined in the `user-monitoring-shared`
namespace, Prometheus and Thanos Ruler evaluate the PromQL expressions
without enforcing the namespace label of origin. It makes it possible to
have generic rules that get applied to all (or a subset of) the user
projects instead of having individual rule objects in each user project.
The rules can also query for platform namespaces only.

The capability is enabled by default but a cluster admin can decide to
disable it in the CMO ConfigMap:

```
kind: ConfigMap
apiVersion: v1
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |-
    userWorkloadEnabled: true
    userWorkload:
     rulesWithoutLabelEnforcementAllowed: false
```

For example, a user-defined monitoring admin can create a single rule
that fires when a user namespace doesn't enforce the Restricted pod
security policy:

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security
  namespace: user-monitoring-shared
spec:
  groups:
    - name: pod-security-policy
      rules:
        - alert: "NamespaceNotEnforcingRestrictedPolicy"
          expr: kube_namespace_labels{namespace!~"(openshift|kube).*|default",label_pod_security_kubernetes_io_enforce!="restricted"}
          for: 5m
          annotations:
            summary: "Restricted policy not enforced"
            description: "Namespace {{ $labels.namespace }} doesn't enforce the Restricted pod security policy."
          labels:
            severity: warning
```

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Copy link
Contributor

openshift-ci bot commented Aug 8, 2024

@simonpasquier: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@simonpasquier
Copy link
Contributor Author

/assign @machine424

@Tai-RedHat
Copy link

PR tested with cluster-bot, other user-defined namespace could trigger the alert in user-monitoring-shared namespace.
test case: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-75384
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Aug 13, 2024
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Aug 13, 2024

@simonpasquier: This pull request references MON-3802 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.18.0" version, but no target version was set.

In response to this:

This change introduces a way to deploy user-defined rules which are not
scoped to their namespace of origin.

To enable the feature, a user-defined monitoring admin needs to
configure at least one namespace in the UWM ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: user-workload-monitoring-config
     namespace: openshift-user-workload-monitoring
   data:
     config.yaml: |-
       namespacesWithoutLabelEnforcement: [ user-monitoring-shared ]

For all PrometheusRule objects defined in the user-monitoring-shared
namespace, Prometheus and Thanos Ruler evaluate the PromQL expressions
without enforcing the namespace label of origin. It makes it possible to
have generic rules that get applied to all (or a subset of) the user
projects instead of having individual rule objects in each user project.

The capability is enabled by default but a cluster admin can decide to
disable it in the CMO ConfigMap:

   kind: ConfigMap
   apiVersion: v1
   metadata:
     name: cluster-monitoring-config
     namespace: openshift-monitoring
   data:
     config.yaml: |-
       userWorkloadEnabled: true
       userWorkload:
        rulesWithoutLabelEnforcementAllowed: false

For example, a user-defined admin can create a single rule that fires
when a user namespace doesn't enforce the Restricted pod security
policy.

   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: security
     namespace: user-monitoring-shared
   spec:
     groups:
       - name: pod-security-policy
         rules:
           - alert: "NamespaceNotEnforcingRestrictedPolicy"
             expr: kube_namespace_labels{namespace!~"(openshift|kube).*|default",label_pod_security_kubernetes_io_enforce!="restricted"}
             for: 5m
             annotations:
               summary: "Restricted policy not enforced"
               description: "Namespace {{ $labels.namespace }} doesn't enforce the Restricted pod security policy."
             labels:
               severity: warning
  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Tai-RedHat
Copy link

If I set two or even more user-monitoring-shared namespaces, and then create same prometheusrules to each namespaces. This will result in repeated alerts. Should we remind or restrict users in this situation?
config.yaml: 'namespacesWithoutLabelEnforcement: [ns1, ns2]'

Screenshot 2024-08-13 at 12 55 49

@simonpasquier
Copy link
Contributor Author

If I set two or even more user-monitoring-shared namespaces, and then create same prometheusrules to each namespaces. This will result in repeated alerts. Should we remind or restrict users in this situation?

Thanks for the testing @Tai-RedHat! I don't see a strong reason to prevent this situation (and it would be quite hard to detect). But it would be good to mention in the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants