Deploy ingress-nginx ServiceMonitor in Rancher v2.5 for scraping by Monitoring v2 #30126

Closed
axeal opened this issue Nov 18, 2020 · 17 comments
Labels: internal, kind/enhancement, QA/S

axeal (Contributor) commented Nov 18, 2020

What kind of request is this (question/bug/enhancement/feature request):

Enhancement

Steps to reproduce (least amount of steps as possible):

ingress-nginx-controller metrics are not scraped by default by Monitoring v2 in Rancher v2.5. For Rancher-provisioned clusters with the ingress-nginx ingress controller, would it be possible to auto-deploy a ServiceMonitor and metrics Service similar to the ones below, which would enable scraping by Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  endpoints:
  - interval: 30s
    port: metrics
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app: ingress-nginx
---
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller-metrics
  namespace: ingress-nginx
  labels:
    app: ingress-nginx
spec:
  ports:
  - name: metrics
    port: 9913
    protocol: TCP
    targetPort: 10254
  selector:
    app: ingress-nginx
  type: ClusterIP
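
For reference, a minimal sketch of how these resources could be applied and verified (this assumes kubectl access to the cluster and that the Prometheus Operator CRDs from Monitoring v2 are already installed; the file name is illustrative):

# Apply the ServiceMonitor and metrics Service above
kubectl apply -f ingress-nginx-metrics.yaml

# Confirm both objects exist
kubectl -n ingress-nginx get servicemonitor ingress-nginx-controller
kubectl -n ingress-nginx get service ingress-nginx-controller-metrics

# The Service should have endpoints, i.e. its selector matches the controller pods
kubectl -n ingress-nginx get endpoints ingress-nginx-controller-metrics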

Environment information

  • Rancher v2.5.2

gz#11621

gz#13981

aiyengar2 (Contributor) commented Feb 24, 2021

@axeal, yes this is possible. Thanks for providing the Service / ServiceMonitor!

This issue will also encompass adding Grafana Dashboards, sourced from nginx.json and request-handling-performance.json

k8w commented Mar 1, 2021

@aiyengar2 Where should these JSON files be added?
Can this be done in the Grafana GUI?

aiyengar2 (Contributor) commented

@k8w you can add these JSON files via the Rancher UI; the relevant docs are here: https://rancher.com/docs/rancher/v2.x/en/monitoring-alerting/v2.5/persist-grafana/.

Since you already have the JSON, all you need to do is the last step.
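
For reference, that last step boils down to getting the dashboard JSON into a ConfigMap that Grafana's sidecar discovers. A minimal sketch, assuming the defaults used by Rancher Monitoring v2 (the cattle-dashboards namespace and the grafana_dashboard: "1" label; check the linked docs for the exact values in your version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-ingress-dashboard   # illustrative name
  namespace: cattle-dashboards    # namespace watched by the Grafana sidecar by default
  labels:
    grafana_dashboard: "1"        # label the sidecar uses to pick up dashboards
data:
  nginx.json: |-
    { "...": "paste the contents of nginx.json here" }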

thehejik commented

@aiyengar2 @sowmyav27 could you please point me to how to install the enhanced chart? IIUC it should be available in the dev-v2.5 branch of rancher/charts.git, which is present in my cluster (well, it is https://git.rancher.io/charts - is it the same?), but the latest monitoring chart version in the v2 dropdown is 9.4.203-rc04 even after a repo refresh. I was trying with rancher:v2.5-head. The new monitoring chart version should be 9.4.204-rc01.

Should I clone charts.git locally and deploy it manually, or is there a way to do it from the web UI?

thehejik commented

I just successfully installed the testing monitoring chart from my local git checkout.

git clone https://git.rancher.io/charts --branch dev-v2.5
cd charts/charts/rancher-monitoring/rancher-monitoring-crd/9.4.204-rc01
helm install rancher-monitoring-crd ./
cd ../../rancher-monitoring/9.4.204-rc01/
helm install rancher-monitoring ./

aiyengar2 (Contributor) commented

@thehejik weird, I see that https://github.com/rancher/charts/tree/dev-v2.5/assets/rancher-monitoring contains 9.4.204-rcXX versions, so it should be working. But the approach you took via bash works too.

sowmyav27 (Contributor) commented

@thehejik The RC version of the chart is not available on 2.5-head because the Rancher charts point to release-v2.5 by default. We can manually change it to point to dev-v2.5 for now and proceed with testing. I have asked the devs to look into it.

thehejik commented Mar 16, 2021

Thanks, I had some problems with DNS but now it works correctly.

The feature is working as described, but so far the user has to set ingressNginx.enabled=true (rancher/dashboard#2485) in the chart values YAML (Apps -> Monitoring -> General -> Edit as YAML); a minimal values sketch follows the list below.

  • The corresponding ServiceMonitor and Service are being created
  • New monitoring-ingress targets are all UP in Prometheus: ingress-nginx/rancher-monitoring-ingress-nginx/0 (3/3 up)
  • There is a new Grafana dashboard named NGINX / Ingress Controller available
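
For reference, a minimal sketch of the relevant values (keys as they appear later in this thread; defaults may differ by chart version):

ingressNginx:
  enabled: true
  namespace: ingress-nginx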

I was testing on Rancher 2.5.7 with the v2.5-head image tag and used the dev-v2.5 branch of http://git.rancher.io/charts. The monitoring Helm chart was version 9.4.204-rc04.

Edit: tested on rancher v2.5-868c715554ed7b2de7e7b89595487df66c428a41-head

h0jeZvgoxFepBQ2C commented May 6, 2021

I just updated our Rancher installation from 2.5.7 to 2.5.8 and also updated our rancher-monitoring, but somehow I can't get it running.

I've enabled ingressNginx.enabled: true and also set a different namespace (we installed the nginx ingress controller into the "default" namespace), but in the end I only get the following error:

Error: UPGRADE FAILED: cannot patch "rancher-monitoring-alertmanager.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-k8s.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kube-apiserver-availability.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kube-apiserver-slos" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kube-apiserver.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.
cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kube-prometheus-general.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kube-prometheus-node-recording.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kube-state-metrics" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kubelet.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kubernetes-apps" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kubernetes-resources" with kind PrometheusRule: Internal error occurred: failed
calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 
&& cannot patch "rancher-monitoring-kubernetes-storage" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://rancher-monitoring-operator.cattle-monitoring-system.svc:443/admission-prometheusrules/validate?timeout=30s": dial tcp 10.245.26.67:443: connect: connection refused 

quangthe commented Jun 29, 2021

You need to define a selector for the rancher-monitoring-ingress-nginx Service in order to collect metrics from the ingress-nginx controller pods:

ingressNginx:
  enabled: true
  namespace: ingress-nginx
  service:
    port: 9913
    targetPort: 10254
    selector:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: ingress-nginx
      app.kubernetes.io/name: ingress-nginx
  serviceMonitor:
    interval: 1m
    metricRelabelings: []
    relabelings: []
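
One way to check whether such a selector actually matches the controller pods in a given cluster (a sketch; it assumes the controller runs in the ingress-nginx namespace, and the Service name is taken from the targets mentioned earlier in this thread):

# Show the labels on the ingress controller pods; the Service selector must match them
kubectl -n ingress-nginx get pods --show-labels

# If the selector matches, the generated metrics Service has endpoints
kubectl -n ingress-nginx get endpoints rancher-monitoring-ingress-nginx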

aiyengar2 (Contributor) commented

> I just updated our 2.5.7 rancher installation to 2.5.8, and updated also our rancher monitoring, but can't get it running somehow. [...] Error: UPGRADE FAILED: cannot patch "rancher-monitoring-alertmanager.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": [...] connect: connection refused [...] (full error quoted above)

@h0jeZvgoxFepBQ2C see #32416 (comment). If you re-trigger the install, it should complete successfully.
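
The "connection refused" errors above indicate that the Prometheus Operator's admission webhook Service was unreachable while the upgrade ran. A quick sanity check before re-triggering the install could look like this (a sketch; the Service and namespace names are taken from the error message above):

# Verify the webhook's backing Service exists and has endpoints
kubectl -n cattle-monitoring-system get service rancher-monitoring-operator
kubectl -n cattle-monitoring-system get endpoints rancher-monitoring-operator

# Verify the operator pod is running again
kubectl -n cattle-monitoring-system get pods | grep operator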

h0jeZvgoxFepBQ2C commented

> You need to define a selector for the rancher-monitoring-ingress-nginx Service in order to collect metrics from the ingress-nginx controller pods: [...] (values quoted above)

Do you have a link to where this is documented? Thanks!

aiyengar2 (Contributor) commented

Hmm, @thehejik, when we tested this, did we need to update the selector?

@quangthe sorry for the late ask, but could you share the environment information that required you to provide a selector in #30126 (comment), so we can attempt to reproduce? Specifically:

  • Rancher version: ``
  • Kubernetes Version: ``
  • Type of cluster
    • Custom: Running a docker command on a node
    • Imported: Running kubectl apply onto an existing k8s cluster
    • EKS
    • GKE
    • AKS
    • Infrastructure Provider: Rancher provisioning the nodes using different node drivers (e.g. AWS, Digital Ocean, etc)
    • Other: ``
  • Monitoring Chart Version: ``

Based on the labels shown in https://github.com/kubernetes/ingress-nginx/blob/8b3a6f02526e8bcf8b6fff2bf8d3613e20211b15/deploy/static/provider/cloud/deploy.yaml#L298, it seems you are right that the default should be app.kubernetes.io/name: ingress-nginx instead of app: ingress-nginx, as used in Rancher's chart.

But I wonder if RKE / RKE2 still deploy ingress-nginx with the other label, which is why we left it as app: ingress-nginx.

thehejik commented Sep 9, 2021

@aiyengar2 I don't think I was using the proposed selector; on the other hand, I'm not sure whether metrics from the ingress were shown.

aiyengar2 (Contributor) commented

Hmm, in that case @sowmyav27, I'm re-opening this issue and putting it back to test.

We need to confirm that setting ingressNginx.enabled: true shows the ServiceMonitor in the Prometheus targets UI and that it has active targets emitting metrics.

We should also double-check that the Grafana dashboards are eventually populated with metrics once an ingress is added to the cluster.

thehejik commented Sep 10, 2021

Retested on 2.5.9 with v2 monitoring 14.5.100 from the official release-v2.5 chart repo on an RKE1 cluster.

I was using the default values for the chart installation, where ingressNginx.enabled: true is enabled by default (I switched to Edit as YAML to validate). I didn't change any selector value - there is still only the original label app: nginx-ingress set on the nginx-ingress-controller-* pods.

This is what was used for monitoring installation:

ingressNginx:
  enabled: true
  namespace: ingress-nginx
  service:
    port: 9913
    targetPort: 10254
  serviceMonitor:
    interval: ''
    metricRelabelings: []
    relabelings: []

Grafana
For testing purposes I deployed an nginx Deployment and created an L7 Ingress entry for it using the default nginx-ingress as the load balancer, then ran curl <nginx_lb> in an endless loop while checking the Grafana dashboard NGINX / Ingress Controller and its graphs. The metrics data were being processed there:
[screenshot: NGINX / Ingress Controller Grafana dashboard]
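
For anyone reproducing this check, a rough sketch of the test setup and traffic loop described above (names and host are illustrative; kubectl create ingress requires a reasonably recent kubectl):

# Test workload behind the ingress controller
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80
kubectl create ingress nginx --rule="nginx.example.com/*=nginx:80"

# Generate traffic through the ingress while watching the NGINX / Ingress Controller dashboard
while true; do curl -s -H "Host: nginx.example.com" http://<nginx_lb>/ > /dev/null; sleep 1; done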

Prometheus

  • The ServiceMonitor for rancher-monitoring-ingress-nginx exists [screenshot]

  • The ingress-related targets are up, healthy, and returning data [screenshot]

I'd say everything is working as expected with default values.

aiyengar2 (Contributor) commented

Perfect! Thanks for checking @thehejik.

@h0jeZvgoxFepBQ2C @quangthe if you are still encountering this issue, please let us know your environment information! Perhaps there is something specific to the environment you are deploying Monitoring into that requires selectors and that we could document, but by default updating them should not be required.
