Monitoring: add alert rules and alert handling (deduplicate, group, route, silence, inhibit) #59

Merged
merged 29 commits into from
Jul 11, 2020
Changes from 26 commits
97b93b2
monitoring: add AlertManager to handle alerts
tlvu Jul 3, 2020
dec745a
monitoring: script to send dummy alert to test alertmanager independe…
tlvu Jul 3, 2020
f6e3c1f
monitoring: add alertmanager template to hopefully get email notif to…
tlvu Jul 3, 2020
4e45d28
monitoring: enable debug logging for alertmanager for alerts received…
tlvu Jul 3, 2020
28f75b2
monitoring: add alert rules to prometheus
tlvu Jul 3, 2020
915456f
monitoring: remove my redundant self crafted High CPU usage alert
tlvu Jul 4, 2020
6fcdd25
monitoring: fix HostMemoryUnderMemoryPressure rate range, probably du…
tlvu Jul 4, 2020
fd28fc3
monitoring: fix HostOutOfDiskSpace to include all filesystems, mountp…
tlvu Jul 4, 2020
99652e1
monitoring: fix HostOutOfInodes to include all filesystems, mountpoin…
tlvu Jul 4, 2020
396bab2
monitoring: fix HostUnusualDiskReadLatency and HostUnusualDiskWriteLa…
tlvu Jul 4, 2020
b25e97a
monitoring: fix prometheus rules not clickable
tlvu Jul 6, 2020
7ef5ec4
monitoring: fix prometheus can not scrape anymore, it thinks cadvisor…
tlvu Jul 6, 2020
819f344
monitoring: fix alertmanager wrong self reference url
tlvu Jul 6, 2020
90565ff
monitoring: note node_hwmon_* metric requirements
tlvu Jul 6, 2020
3a8f3dd
monitoring: note node_systemd_* metric not working by default
tlvu Jul 6, 2020
0789eaf
monitoring: fix HostNodeOvertemperatureAlarm, node_hwmon_temp_alarm r…
tlvu Jul 6, 2020
6eb43f7
monitoring: node_vmstat_oom_kill (HostOomKillDetected) seems to need …
tlvu Jul 6, 2020
cc62399
monitoring: fix HostEdacCorrectableErrorsDetected, HostEdacUncorrecta…
tlvu Jul 6, 2020
afccd7c
monitoring: remove accidental Postgres, Apache, jvm, Speedtest alerts…
tlvu Jul 6, 2020
acf1b2c
monitoring: remove accidental Postgres, Haproxy, Sidekiq, Consul aler…
tlvu Jul 8, 2020
214d5cb
monitoring: fix ContainerVolumeUsage alert, was giving negative numbe…
tlvu Jul 8, 2020
c67a4a9
monitoring: fix ContainerCpuUsage alert to avoid false positive when …
tlvu Jul 8, 2020
52956d1
monitoring: fix ContainerMemoryUsage alert to avoid false positive
tlvu Jul 8, 2020
606e047
monitoring: fix ContainerKilled alert to avoid false positive with te…
tlvu Jul 8, 2020
53c7e90
monitoring: tweak HostContextSwitching alert for less false positive,…
tlvu Jul 8, 2020
6c4f3e4
env.local: better example for ALERTMANAGER_ADMIN_EMAIL_RECEIVER and S…
tlvu Jul 8, 2020
45c50b4
monitoring: add README description for the monitoring component
tlvu Jul 11, 2020
c2c1339
README: update reference to monitoring system
tlvu Jul 11, 2020
4f9aa2d
scheduler: add documentation to the new component README for consistency
tlvu Jul 11, 2020
2 changes: 2 additions & 0 deletions birdhouse/components/monitoring/.gitignore
@@ -1,3 +1,5 @@
prometheus.yml
grafana_datasources.yml
grafana_dashboards.yml
alertmanager.yml
prometheus.rules
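The two new ignore entries suggest that `alertmanager.yml` and `prometheus.rules` are generated locally, presumably rendered from template files such as `alertmanager.yml.template` in this PR. The rendering mechanism itself is not shown in the diff, so the Python sketch below only illustrates one way `${VAR}` placeholders could be filled in. The `render_template` helper and the sample values are hypothetical.

```python
from string import Template


def render_template(text: str, values: dict) -> str:
    """Fill ${VAR} placeholders; unknown placeholders are left untouched."""
    return Template(text).safe_substitute(values)


# Two lines taken from alertmanager.yml.template in this PR.
template_text = (
    "smtp_smarthost: '${SMTP_SERVER}'\n"
    "smtp_from: 'alertmanager@${PAVICS_FQDN}'\n"
)

# Hypothetical values, assumed here for illustration only.
values = {
    "SMTP_SERVER": "smtp.example.com:25",
    "PAVICS_FQDN": "pavics.example.com",
}

rendered = render_template(template_text, values)
print(rendered)
```

Leaving unknown placeholders untouched makes a missing variable easy to spot in the rendered file instead of silently producing an empty value.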
219 changes: 219 additions & 0 deletions birdhouse/components/monitoring/alertmanager.tmpl

Large diffs are not rendered by default.

74 changes: 74 additions & 0 deletions birdhouse/components/monitoring/alertmanager.yml.template
@@ -0,0 +1,74 @@
# https://prometheus.io/docs/alerting/latest/configuration/
# http://${PAVICS_FQDN}:9093/#/status
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: '${SMTP_SERVER}'
  smtp_from: 'alertmanager@${PAVICS_FQDN}'
  smtp_hello: '${PAVICS_FQDN}'
  ${ALERTMANAGER_EXTRA_GLOBAL}
  # Example candidates for ALERTMANAGER_EXTRA_GLOBAL:
  # smtp_auth_username: 'alertmanager'
  # smtp_auth_password: 'password'
  # smtp_require_tls: false

# The directory from which notification templates are read.
templates:
- '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  #
  # To aggregate by all possible labels use '...' as the sole label name.
  # This effectively disables aggregation entirely, passing through all
  # alerts as-is. This is unlikely to be what you want, unless you have
  # a very low alert volume or your upstream notification system performs
  # its own grouping. Example: group_by: [...]
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start
  # firing shortly after one another are batched together in the first
  # notification.
  group_wait: 30s

  # After the first notification has been sent, wait 'group_interval' before
  # sending a batch of new alerts that started firing for that group.
  group_interval: 5m

  # If a notification for an alert has already been sent successfully, wait
  # 'repeat_interval' before resending it.
  repeat_interval: 6h

  # The default receiver.
  receiver: admin-emails

  ${ALERTMANAGER_EXTRA_ROUTES}

# Inhibition rules mute a set of alerts while another alert is firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname, cluster and service labels are the same.
  # CAUTION:
  #   If all label names listed in `equal` are missing
  #   from both the source and target alerts,
  #   the inhibition rule will apply!
  equal: ['alertname', 'cluster', 'service']

${ALERTMANAGER_EXTRA_INHIBITION}

receivers:
- name: 'admin-emails'
  email_configs:
  - to: '${ALERTMANAGER_ADMIN_EMAIL_RECEIVER}'

${ALERTMANAGER_EXTRA_RECEIVERS}
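The inhibition rule in this template, including its CAUTION about missing labels, can be easier to reason about with a small model. The Python sketch below is a simplified illustration, not Alertmanager's implementation: the `matches` and `is_inhibited` helpers are hypothetical, only exact-value matchers are modelled, and the severity labels attached to the sample alerts are assumed for the example.

```python
def matches(labels: dict, matchers: dict) -> bool:
    """True if every matcher label is present with exactly the given value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """Simplified model of one inhibit rule applied to one target alert."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if not matches(source, rule["source_match"]):
            continue
        # A label missing on both sides compares as equal (empty == empty),
        # which is the situation the CAUTION comment warns about.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


# Same values as the inhibit rule in alertmanager.yml.template.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "cluster", "service"],
}

critical = {"alertname": "HostOutOfDiskSpace", "severity": "critical"}
warning_same = {"alertname": "HostOutOfDiskSpace", "severity": "warning"}
warning_other = {"alertname": "HostOutOfInodes", "severity": "warning"}

print(is_inhibited(warning_same, [critical], rule))   # True
print(is_inhibited(warning_other, [critical], rule))  # False
```

In this model neither sample alert carries a `cluster` or `service` label, so those two compare as equal and only `alertname` decides the outcome.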
28 changes: 28 additions & 0 deletions birdhouse/components/monitoring/docker-compose-extra.yml
@@ -38,6 +38,7 @@ services:
    container_name: prometheus
    volumes:
      - ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./components/monitoring/prometheus.rules:/etc/prometheus/prometheus.rules:ro
      - prometheus_persistence:/prometheus:rw
    ports:
      - 9090:9090
@@ -49,6 +50,8 @@ services:
      - --web.console.templates=/usr/share/prometheus/consoles
      # https://prometheus.io/docs/prometheus/latest/storage/
      - --storage.tsdb.retention.time=90d
      # wrong default was http://container-hash:9090/
      - --web.external-url=http://${PAVICS_FQDN}:9090/
    restart: unless-stopped

  # https://grafana.com/docs/grafana/latest/installation/docker/
@@ -68,12 +71,37 @@ services:
      - 3001:3000
    restart: unless-stopped

  # https://github.com/prometheus/alertmanager
  # https://prometheus.io/docs/alerting/latest/overview/
  # Handle alerts: deduplicate, group, route, silence, inhibit
  alertmanager:
    image: prom/alertmanager:v0.21.0
    container_name: alertmanager
    volumes:
      - ./components/monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - ./components/monitoring/alertmanager.tmpl:/etc/alertmanager/template/default.tmpl:ro
      - alertmanager_persistence:/alertmanager:rw
    command:
      # restore original CMD from image
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --storage.path=/alertmanager
      # enable debug logging
      - --log.level=debug
      # wrong default was http://container-hash:9093/
      - --web.external-url=http://${PAVICS_FQDN}:9093/
    ports:
      - 9093:9093
    restart: unless-stopped

volumes:
  prometheus_persistence:
    external:
      name: prometheus_persistence
  grafana_persistence:
    external:
      name: grafana_persistence
  alertmanager_persistence:
    external:
      name: alertmanager_persistence

# vi: tabstop=8 expandtab shiftwidth=2 softtabstop=2
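With the `alertmanager` service published on port 9093, a hand-crafted alert can be pushed to it to exercise routing and email delivery without waiting for Prometheus to fire one. The Python sketch below is not the test script added in this PR (that file is not shown in the diff). It is a minimal illustration that assumes Alertmanager's v2 HTTP API at `/api/v2/alerts`; the `build_alert` and `post_alerts` helpers and the `DummyTestAlert` name are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_alert(name: str, severity: str, minutes: int = 5) -> dict:
    """Build one alert in the shape accepted by Alertmanager's v2 API."""
    now = datetime.now(timezone.utc)
    return {
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "Dummy alert for testing notifications"},
        "startsAt": now.isoformat(),
        # After endsAt the alert is treated as resolved.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }


def post_alerts(base_url: str, alerts: list) -> int:
    """POST alerts to <base_url>/api/v2/alerts and return the HTTP status."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Only build and print the payload here; nothing is sent over the network.
alert = build_alert("DummyTestAlert", "warning")
print(json.dumps([alert], indent=2))
```

To actually send it, something like `post_alerts("http://localhost:9093", [alert])` could be run on the Docker host, assuming the port mapping above is in effect. With `--log.level=debug` enabled, the received alert should then show up in the container logs.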