eks-prow-build-cluster: create simple monitoring stack #5011

Merged 9 commits on Mar 27, 2023
1 change: 1 addition & 0 deletions infra/aws/terraform/prow-build-cluster/resources/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
[Installing AWS Node Termination Handler](https://github.com/aws/aws-node-termination-handler/tree/main/config/helm/aws-node-termination-handler#installing-the-chart)
@@ -0,0 +1,51 @@
# EKS monitoring

## Setting up monitoring

```bash
# Create monitoring namespace:
kubectl apply -f ../namespaces.yaml

# Install CRDs for Prometheus Operator:
# (server-side apply is required because the CRDs are too large for the client-side last-applied-configuration annotation)
kubectl apply --server-side -f ./prometheus-operator-crds

# Install Prometheus Operator:
kubectl apply -f ./prometheus-operator

# Install kube-state-metrics
kubectl apply -f ./kube-state-metrics

# Install node-exporter
kubectl apply -f ./node-exporter

# Install cadvisor
kubectl apply -f ./cadvisor

# Install dashboards for Grafana
kubectl apply --server-side -f ./grafana/dashboards

# Install Grafana
kubectl apply -f ./grafana
```

[Prometheus Operator CRDs](https://github.com/prometheus-operator/prometheus-operator/tree/v0.63.0/example/prometheus-operator-crd-full)
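
After applying the manifests, a quick check along these lines can confirm that the stack came up. This is a minimal sketch, assuming the `monitoring` namespace and the `cadvisor` DaemonSet name used by the manifests in this PR; the helper name `check_monitoring_stack` and the 120s timeout are illustrative and not part of the PR.

```bash
# Hypothetical helper (not part of the PR): checks that the Prometheus Operator
# CRDs are registered and that the cadvisor DaemonSet finished rolling out.
check_monitoring_stack() {
  local ns="${1:-monitoring}"

  # The ServiceMonitor CRD is installed together with the Prometheus Operator CRDs.
  kubectl get crd servicemonitors.monitoring.coreos.com || return 1

  # Wait for the cadvisor DaemonSet rollout (the timeout value is an assumption).
  kubectl --namespace "$ns" rollout status daemonset/cadvisor --timeout=120s || return 1

  # List the pods in the namespace for a visual check.
  kubectl --namespace "$ns" get pods
}
```

Calling `check_monitoring_stack monitoring` from a shell with access to the cluster runs the three checks in order and stops at the first failure.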

## Local access

```bash
# Prometheus
kubectl --namespace monitoring port-forward svc/prometheus-operated 9090

# Grafana
kubectl --namespace monitoring port-forward svc/grafana 3000
```

Review comment (Member): Should we expose this Grafana instance publicly? If so, let's create a follow-up issue to track that.

Reply (Contributor Author, @pkprzekwas, Mar 27, 2023): I wasn't sure whether this is something we want to maintain in the long term. IMO we can expose it for now, until we have figured out which monitoring solution works best for both GCP and AWS clusters.

I wonder whether there are other things we will want to expose over time, in which case an ingress controller would be handy.
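
Once the port-forwards from the Local access section are running, the forwarded endpoints can be probed from another terminal. The sketch below assumes the default ports 9090 and 3000 from the commands in that section; the helper name `check_local_endpoints` is hypothetical. It relies on Prometheus's `/-/ready` endpoint and Grafana's `/api/health` endpoint.

```bash
# Hypothetical helper (not part of the PR): probes the locally forwarded
# Prometheus and Grafana ports.
check_local_endpoints() {
  local prom_port="${1:-9090}"
  local grafana_port="${2:-3000}"

  # Prometheus readiness endpoint: succeeds once Prometheus can serve traffic.
  curl --silent --fail "http://localhost:${prom_port}/-/ready" || return 1

  # Grafana health endpoint: returns a small JSON status document.
  curl --silent --fail "http://localhost:${grafana_port}/api/health" || return 1
}
```

With `--fail`, `curl` exits non-zero on HTTP error responses, so the function's exit status can be used directly in scripts.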

## Debugging

- [Troubleshooting Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md)

Checking Prometheus configuration:
```bash
kubectl -n monitoring get secret prometheus-main -ojson | jq -r '.data["prometheus.yaml.gz"]' | base64 -d | gunzip | less
```
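
The decode steps in the command above can be tried without a cluster. The sketch below round-trips a made-up configuration through the same encoding the secret uses (gzip, then base64) and decodes it with the same `base64 -d | gunzip` tail. The sample YAML is a stand-in, not real operator output, and the `jq` extraction step is left out so the example only needs standard tools.

```bash
# Stand-in for the generated Prometheus configuration (illustrative only).
sample_config='global:
  scrape_interval: 30s'

# Encode the way the secret stores it: gzip first, then base64 (newlines stripped).
encoded=$(printf '%s' "$sample_config" | gzip -c | base64 | tr -d '\n')

# Decode with the same tail of the pipeline used for the real secret.
decoded=$(printf '%s' "$encoded" | base64 -d | gunzip)

printf '%s\n' "$decoded"
```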
@@ -0,0 +1,12 @@
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cadvisor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cadvisor
subjects:
  - kind: ServiceAccount
    name: cadvisor
    namespace: monitoring
@@ -0,0 +1,10 @@
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cadvisor
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    verbs: ['use']
    resourceNames:
      - cadvisor
@@ -0,0 +1,63 @@
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: cadvisor
  template:
    metadata:
      labels:
        name: cadvisor
    spec:
      serviceAccountName: cadvisor
      containers:
        - name: cadvisor
          image: gcr.io/cadvisor/cadvisor:v0.45.0
          resources:
            requests:
              memory: 400Mi
              cpu: 400m
            limits:
              memory: 2000Mi
              cpu: 800m
          volumeMounts:
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: var-run
              mountPath: /var/run
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: docker
              mountPath: /var/lib/docker
              readOnly: true
            - name: disk
              mountPath: /dev/disk
              readOnly: true
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
      automountServiceAccountToken: false
      terminationGracePeriodSeconds: 30
      volumes:
        - name: rootfs
          hostPath:
            path: /
        - name: var-run
          hostPath:
            path: /var/run
        - name: sys
          hostPath:
            path: /sys
        - name: docker
          hostPath:
            path: /var/lib/docker
        - name: disk
          hostPath:
            path: /dev/disk
@@ -0,0 +1,5 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cadvisor
  namespace: monitoring
@@ -0,0 +1,13 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cadvisor
  namespace: monitoring
  labels:
    prometheus: main
spec:
  selector:
    matchLabels:
      name: cadvisor
  endpoints:
    - port: http
@@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
  name: cadvisor
  namespace: monitoring
  labels:
    name: cadvisor
spec:
  type: ClusterIP
  ports:
    - port: 8080
      protocol: TCP
      name: http
  selector:
    name: cadvisor
@@ -0,0 +1,33 @@
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: monitoring
  name: grafana
data:
  grafana.ini: |-
    [analytics]
    reporting_enabled = false
    check_for_updates = true

    [grafana_net]
    url = https://grafana.net

    [log]
    mode = console

    [auth.anonymous]
    enabled = true

    [paths]
    data = /var/lib/grafana/data
    logs = /var/log/grafana
    plugins = /var/lib/grafana/plugins
    provisioning = /etc/grafana/provisioning

    [server]
    root_url = "https://my-grafana.example.pvt"
    [auth.anonymous]
    enabled = false
    [metrics]
    enabled = true
    disable_total_stats = false
@@ -0,0 +1,17 @@
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: monitoring
  name: dashboards
data:
  dashboardproviders.yaml: |-
    apiVersion: 1
    providers:
      - disableDeletion: false
        editable: false
        folder: Kubernetes
        name: kubernetes
        options:
          path: /var/lib/grafana/dashboards/kubernetes
        orgId: 1
        type: file