diff --git a/birdhouse/README.md b/birdhouse/README.md index eb92befce..146e3d71a 100644 --- a/birdhouse/README.md +++ b/birdhouse/README.md @@ -35,20 +35,11 @@ below and the variable `AUTODEPLOY_EXTRA_REPOS` in The automatic deployment of the PAVICS platform, of the Jupyter tutorial notebooks and of the automatic deployment mechanism itself can all be -enabled and configured in the `env.local` file (a copy from -[`env.local.example`](env.local.example)). +enabled by following the instructions [here](components/README.rst#scheduler). -* Add `./components/scheduler` to `EXTRA_CONF_DIRS`. -* Set `AUTODEPLOY_EXTRA_REPOS`, `AUTODEPLOY_DEPLOY_KEY_ROOT_DIR`, - `AUTODEPLOY_PLATFORM_FREQUENCY`, `AUTODEPLOY_NOTEBOOK_FREQUENCY` as - desired, full documentation in [`env.local.example`](env.local.example). -* Run once [`fix-write-perm`](deployment/fix-write-perm), see doc in script. - -Resource usage monitoring (CPU, memory, ..) for the host and each of the containers -can be enabled by enabling the `./components/monitoring` in `env.local` file. - -* Add `./components/monitoring` to `EXTRA_CONF_DIRS`. -* Change `GRAFANA_ADMIN_PASSWORD` value. +Resource usage monitoring (CPU, memory, ...) and alerting for the host and each +of the containers can be enabled by following the instructions +[here](components/README.rst#monitoring). To launch all the containers, use the following command: ``` @@ -94,62 +85,6 @@ postgres instance. See [`scripts/create-wps-pgsql-databases.sh`](scripts/create- * Click "Add User". -## Mostly automated unattended continuous deployment - -***NOTE***: this section about automated unattended continuous deployment is -superseded by the new `./components/scheduler` that can be entirely -enabled/disabled via the `env.local` file. See the part about automatic -deployment of the PAVICS platform in the "Docker instructions" section -above for how to configure it. 
- -Automated unattended continuous deployment means if code change in the checkout -of this repo, on the same currently checkout branch (ex: config changes, -`docker-compose.yml` changes) a deployment will be performed automatically -without human intervention. - -The trigger for the deployment is new code change on the server on the current -branch (PR merged, push). New code change locally will not trigger deployment -so local development workflow is also supported. - -Note: there are still cases where a human intervention is needed. See note in -script [`deployment/deploy.sh`](deployment/deploy.sh). - -Configure logrotate for all following automations to prevent disk full: -``` -deployment/install-logrotate-config .. $USER -``` - -To enable continuous deployment of PAVICS: - -``` -deployment/install-automated-deployment.sh .. $USER [daily|5-mins] -# read the script for more options/details -``` - -If you want to manually force a deployment of PAVICS (note this might not use -latest version of deploy.sh script): -``` -deployment/deploy.sh . -# read the script for more options/details -``` - -To enable continuous deployment of tutorial Jupyter notebooks: - -``` -deployment/install-deploy-notebook .. $USER -# read the script for more details -``` - -To trigger tutorial Jupyter notebooks deploy manually: -``` -# configure logrotate before because this script will log to -# /var/log/PAVICS/notebookdeploy.log - -deployment/trigger-deploy-notebook -# read the script for more details -``` - - ## Vagrant instructions Vagrant allows us to quickly spin up a VM to easily reproduce the runtime diff --git a/birdhouse/components/README.rst b/birdhouse/components/README.rst new file mode 100644 index 000000000..e36bd0fc7 --- /dev/null +++ b/birdhouse/components/README.rst @@ -0,0 +1,233 @@ +################# +PAVICS Components +################# + + +.. 
contents:: + + +Scheduler +========= + +This component provides automated unattended continuous deployment for the +"PAVICS stack", for the tutorial notebooks on the Jupyter environment and for the +automated deployment mechanism itself. + +It can also be used to schedule other tasks on the PAVICS physical host. + +Everything is dockerized; the deployment runs inside a container that updates +all the other containers. + +Automated unattended continuous deployment means that if code changes in the remote +repo on the currently checked-out branch (ex: config changes, +``docker-compose.yml`` changes), a deployment will be performed automatically +without human intervention. + +The trigger for a deployment is a new code change on the server on the current +branch (merged PR, push). Local code changes will not trigger a deployment, +so the local development workflow is still supported. + +Multiple remote repos are supported so the "PAVICS stack" can be made of +multiple checkouts for modularity and extensibility. The autodeploy will +trigger if any of the checkouts (configured in ``AUTODEPLOY_EXTRA_REPOS``) is +not up-to-date with its remote repo. + +A suggested "PAVICS stack" is made of at least 2 repos: this repo and another +private repo containing the source-controlled ``env.local`` file and any other +docker-compose overrides, for true infrastructure-as-code. + +Note: there are still cases where human intervention is needed. See the note in +script deploy.sh_. + + +Usage +----- + +Given the unattended nature, there is no UI; logs are used to keep a trace of each run. + +- ``/var/log/PAVICS/autodeploy.log`` is for the PAVICS deployment. + +- ``/var/log/PAVICS/notebookdeploy.log`` is for the tutorial notebooks deployment. + +- logrotate is enabled for ``/var/log/PAVICS/*.log`` to avoid filling up the + disk. Any new ``.log`` files in that folder will be rotated by logrotate as well. 
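For example, to follow a platform deployment as it happens, tail the log listed above::

    tail -f /var/log/PAVICS/autodeploy.log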
+ + +How to Enable the Component +--------------------------- + +- Edit ``env.local`` (a copy of env.local.example_) + + - Add "./components/scheduler" to ``EXTRA_CONF_DIRS``. + - Set ``AUTODEPLOY_EXTRA_REPOS``, ``AUTODEPLOY_DEPLOY_KEY_ROOT_DIR``, + ``AUTODEPLOY_PLATFORM_FREQUENCY``, ``AUTODEPLOY_NOTEBOOK_FREQUENCY`` as desired, + full documentation in env.local.example_. + - Run fix-write-perm_ once, see doc in the script. + + +Old way to deploy the automatic deployment +------------------------------------------ + +Superseded by this new ``scheduler`` component. Kept for reference only. + +Doing it this old way does not need the ``scheduler`` component, but loses the +ability for the autodeploy system to update itself. + +Configure logrotate for all the following automations to prevent the disk from filling up:: + + deployment/install-logrotate-config .. $USER + +To enable continuous deployment of PAVICS:: + + deployment/install-automated-deployment.sh .. $USER [daily|5-mins] + # read the script for more options/details + +If you want to manually force a deployment of PAVICS (note this might not use +the latest version of the deploy.sh script):: + + deployment/deploy.sh . + # read the script for more options/details + +To enable continuous deployment of tutorial Jupyter notebooks:: + + deployment/install-deploy-notebook .. $USER + # read the script for more details + +To trigger the tutorial Jupyter notebooks deploy manually:: + + # configure logrotate before because this script will log to + # /var/log/PAVICS/notebookdeploy.log + + deployment/trigger-deploy-notebook + # read the script for more details + +Migrating to the new mechanism requires manual deletion of all the artifacts +created by the old install scripts: ``sudo rm /etc/cron.d/PAVICS-deploy +/etc/cron.hourly/PAVICS-deploy-notebooks /etc/logrotate.d/PAVICS-deploy +/usr/local/sbin/triggerdeploy.sh``. The two mechanisms cannot co-exist at the same time. 
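As a recap of the new-style setup, the ``env.local`` entries described in "How to
Enable the Component" above could look as follows. The repo URL, key directory and
frequency values below are illustrative placeholders only; env.local.example_
documents the accepted values::

    EXTRA_CONF_DIRS="./components/scheduler"
    AUTODEPLOY_EXTRA_REPOS="https://example.com/org/pavics-private-overrides.git"
    AUTODEPLOY_DEPLOY_KEY_ROOT_DIR="/path/to/deploy-keys"
    AUTODEPLOY_PLATFORM_FREQUENCY="@daily"
    AUTODEPLOY_NOTEBOOK_FREQUENCY="@hourly"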
+ + +Comparison between the old and new autodeploy mechanism +------------------------------------------------------- + +Maximum backward-compatibility has been kept with the old install scripts style: + +* Still logs to the same existing log files under ``/var/log/PAVICS``. +* The old single ssh deploy key is still compatible, but the new mechanism allows for a different ssh deploy key for each extra repo (again, public repos should use the https clone path to avoid dealing with ssh deploy keys in the first place). +* The old install scripts are kept and can still deploy the old way. + +Features missing in the old install scripts, or ways the new mechanism improves on them: + +* Autodeploy of the autodeploy itself! This is the biggest win. Previously, if the ``triggerdeploy.sh`` or ``PAVICS-deploy-notebooks`` scripts changed, they had to be deployed manually, which was very annoying. Now they are volume-mounted in, so they are fresh on each run. +* ``env.local`` now drives absolutely everything; source control that file and we have a true DevOps pipeline. +* Configurable platform and notebook autodeploy frequency. Previously this meant manually editing the generated cron file, which was less ideal. +* No support is needed on the local host other than ``docker`` and ``docker-compose``. The ``cron/logrotate/git/ssh`` versions are all locked down in the docker images used by the autodeploy. Recall that previously we had to deal with git versions that were too old on some hosts. +* Each cron job runs in its own docker image, meaning the runtime environment is traceable and reproducible. +* The newly introduced scheduler component is made extensible so other jobs can be added to it as well (ex: backup) via ``env.local``, which should be source-controlled, meaning all surrounding maintenance-related tasks are also traceable and reproducible. + + +Monitoring +========== + +This component provides monitoring and alerting for the PAVICS physical host +and containers. 
+ +The Prometheus stack is used: + +* Node-exporter to collect host metrics. +* cAdvisor to collect container metrics. +* Prometheus to scrape metrics, to store them and to query them. +* AlertManager to manage alerts: deduplicate, group, route, silence, inhibit. +* Grafana to provide visualization dashboards for the metrics. + + +Usage +----- + +- Grafana to view metric graphs: http://PAVICS_FQDN:3001/d/pf6xQMWGz/docker-and-system-monitoring +- Prometheus alert rules: http://PAVICS_FQDN:9090/rules +- AlertManager to manage alerts: http://PAVICS_FQDN:9093 + + +How to Enable the Component +--------------------------- + +- Edit ``env.local`` (a copy of env.local.example_) + + - Add "./components/monitoring" to ``EXTRA_CONF_DIRS`` + - Set ``GRAFANA_ADMIN_PASSWORD`` to log in to Grafana + - Set ``ALERTMANAGER_ADMIN_EMAIL_RECEIVER`` for receiving alerts + - Set ``SMTP_SERVER`` for sending alerts + - Optionally set + + - ``ALERTMANAGER_EXTRA_GLOBAL`` to further configure AlertManager + - ``ALERTMANAGER_EXTRA_ROUTES`` to add more routes than the default email notification + - ``ALERTMANAGER_EXTRA_INHIBITION`` to mute given rules + - ``ALERTMANAGER_EXTRA_RECEIVERS`` to add more receivers in addition to the admin emails + + +Grafana Dashboard +----------------- + +.. image:: grafana-dashboard.png + +For the host, using Node-exporter to collect metrics: + +- uptime +- number of containers +- used disk space +- used memory, available memory, used swap memory +- load +- CPU usage +- in and out network traffic +- disk I/O + +For each container, using cAdvisor to collect metrics: + +- in and out network traffic +- CPU usage +- memory and swap memory usage +- disk usage + +Useful visualisation features: + +- zooming in on one graph updates all the other graphs to the same time range, so we can correlate events +- each graph can be viewed independently for more details +- mousing over a data point shows its value at that moment + + +Prometheus Alert Rules +---------------------- + +.. 
image:: prometheus-alert-rules.png + + +AlertManager for Alert Dashboard and Silencing +---------------------------------------------- + +.. image:: alertmanager-dashboard.png +.. image:: alertmanager-silence-alert.png + + +Customizing the Component +------------------------- + +- To add more Grafana dashboards, volume-mount more ``*.json`` files to the + grafana container. + +- To add more Prometheus alert rules, volume-mount more ``*.rules`` files to + the prometheus container. + +- To disable existing Prometheus alert rules, add more Alertmanager inhibition + rules using ``ALERTMANAGER_EXTRA_INHIBITION`` via the ``env.local`` file. + +- Other possible Alertmanager configs via ``env.local``: + ``ALERTMANAGER_EXTRA_GLOBAL``, ``ALERTMANAGER_EXTRA_ROUTES`` (can route to + Slack or other services accepting webhooks), ``ALERTMANAGER_EXTRA_RECEIVERS``. + + + + +.. _env.local.example: ../env.local.example +.. _fix-write-perm: ../deployment/fix-write-perm +.. _deploy.sh: ../deployment/deploy.sh diff --git a/birdhouse/components/alertmanager-dashboard.png b/birdhouse/components/alertmanager-dashboard.png new file mode 100644 index 000000000..9ea24c000 Binary files /dev/null and b/birdhouse/components/alertmanager-dashboard.png differ diff --git a/birdhouse/components/alertmanager-silence-alert.png b/birdhouse/components/alertmanager-silence-alert.png new file mode 100644 index 000000000..d87e1116b Binary files /dev/null and b/birdhouse/components/alertmanager-silence-alert.png differ diff --git a/birdhouse/components/grafana-dashboard.png b/birdhouse/components/grafana-dashboard.png new file mode 100644 index 000000000..5f83733b1 Binary files /dev/null and b/birdhouse/components/grafana-dashboard.png differ diff --git a/birdhouse/components/monitoring/.gitignore b/birdhouse/components/monitoring/.gitignore index 39987cfbe..2528e8083 100644 --- a/birdhouse/components/monitoring/.gitignore +++ b/birdhouse/components/monitoring/.gitignore @@ -1,3 +1,5 @@ prometheus.yml 
grafana_datasources.yml grafana_dashboards.yml +alertmanager.yml +prometheus.rules diff --git a/birdhouse/components/monitoring/alertmanager.tmpl b/birdhouse/components/monitoring/alertmanager.tmpl new file mode 100644 index 000000000..b26eebbf2 --- /dev/null +++ b/birdhouse/components/monitoring/alertmanager.tmpl @@ -0,0 +1,219 @@ +{{ define "__alertmanager" }}AlertManager{{ end }} +{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }} + +{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }} +{{ define "__description" }}{{ end }} + +{{ define "__text_alert_list" }}{{ range . }}Labels: +{{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }} +{{ end }}Annotations: +{{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }} +{{ end }}Source: {{ .GeneratorURL }} +{{ end }}{{ end }} + + +{{ define "slack.default.title" }}{{ template "__subject" . }}{{ end }} +{{ define "slack.default.username" }}{{ template "__alertmanager" . }}{{ end }} +{{ define "slack.default.fallback" }}{{ template "slack.default.title" . }} | {{ template "slack.default.titlelink" . }}{{ end }} +{{ define "slack.default.callbackid" }}{{ end }} +{{ define "slack.default.pretext" }}{{ end }} +{{ define "slack.default.titlelink" }}{{ template "__alertmanagerURL" . }}{{ end }} +{{ define "slack.default.iconemoji" }}{{ end }} +{{ define "slack.default.iconurl" }}{{ end }} +{{ define "slack.default.text" }}{{ end }} +{{ define "slack.default.footer" }}{{ end }} + + +{{ define "pagerduty.default.description" }}{{ template "__subject" . }}{{ end }} +{{ define "pagerduty.default.client" }}{{ template "__alertmanager" . 
}}{{ end }} +{{ define "pagerduty.default.clientURL" }}{{ template "__alertmanagerURL" . }}{{ end }} +{{ define "pagerduty.default.instances" }}{{ template "__text_alert_list" . }}{{ end }} + + +{{ define "opsgenie.default.message" }}{{ template "__subject" . }}{{ end }} +{{ define "opsgenie.default.description" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }} +{{ if gt (len .Alerts.Firing) 0 -}} +Alerts Firing: +{{ template "__text_alert_list" .Alerts.Firing }} +{{- end }} +{{ if gt (len .Alerts.Resolved) 0 -}} +Alerts Resolved: +{{ template "__text_alert_list" .Alerts.Resolved }} +{{- end }} +{{- end }} +{{ define "opsgenie.default.source" }}{{ template "__alertmanagerURL" . }}{{ end }} + + +{{ define "wechat.default.message" }}{{ template "__subject" . }} +{{ .CommonAnnotations.SortedPairs.Values | join " " }} +{{ if gt (len .Alerts.Firing) 0 -}} +Alerts Firing: +{{ template "__text_alert_list" .Alerts.Firing }} +{{- end }} +{{ if gt (len .Alerts.Resolved) 0 -}} +Alerts Resolved: +{{ template "__text_alert_list" .Alerts.Resolved }} +{{- end }} +AlertmanagerUrl: +{{ template "__alertmanagerURL" . }} +{{- end }} +{{ define "wechat.default.to_user" }}{{ end }} +{{ define "wechat.default.to_party" }}{{ end }} +{{ define "wechat.default.to_tag" }}{{ end }} +{{ define "wechat.default.agent_id" }}{{ end }} + + + +{{ define "victorops.default.state_message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }} +{{ if gt (len .Alerts.Firing) 0 -}} +Alerts Firing: +{{ template "__text_alert_list" .Alerts.Firing }} +{{- end }} +{{ if gt (len .Alerts.Resolved) 0 -}} +Alerts Resolved: +{{ template "__text_alert_list" .Alerts.Resolved }} +{{- end }} +{{- end }} +{{ define "victorops.default.entity_display_name" }}{{ template "__subject" . }}{{ end }} +{{ define "victorops.default.monitoring_tool" }}{{ template "__alertmanager" . }}{{ end }} + +{{ define "email.default.subject" }}{{ template "__subject" . 
}}{{ end }} +{{ define "email.default.html" }} + + + + + + +{{ template "__subject" . }} + + + + + + + + + + + +
+
+ + + + + + + +
+ {{ .Alerts | len }} alert{{ if gt (len .Alerts) 1 }}s{{ end }} for {{ range .GroupLabels.SortedPairs }} + {{ .Name }}={{ .Value }} + {{ end }} +
+ + + + + {{ if gt (len .Alerts.Firing) 0 }} + + + + {{ end }} + {{ range .Alerts.Firing }} + + + + {{ end }} + + {{ if gt (len .Alerts.Resolved) 0 }} + {{ if gt (len .Alerts.Firing) 0 }} + + + + {{ end }} + + + + {{ end }} + {{ range .Alerts.Resolved }} + + + + {{ end }} +
+ View in {{ template "__alertmanager" . }} +
+ [{{ .Alerts.Firing | len }}] Firing +
+ Labels
+ {{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}
{{ end }} + {{ if gt (len .Annotations) 0 }}Annotations
{{ end }} + {{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}
{{ end }} + Source
+
+
+
+
+
+ [{{ .Alerts.Resolved | len }}] Resolved +
+ Labels
+ {{ range .Labels.SortedPairs }}{{ .Name }} = {{ .Value }}
{{ end }} + {{ if gt (len .Annotations) 0 }}Annotations
{{ end }} + {{ range .Annotations.SortedPairs }}{{ .Name }} = {{ .Value }}
{{ end }} + Source
+
+
+ +
+
+ + + + +{{ end }} + +{{ define "pushover.default.title" }}{{ template "__subject" . }}{{ end }} +{{ define "pushover.default.message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }} +{{ if gt (len .Alerts.Firing) 0 }} +Alerts Firing: +{{ template "__text_alert_list" .Alerts.Firing }} +{{ end }} +{{ if gt (len .Alerts.Resolved) 0 }} +Alerts Resolved: +{{ template "__text_alert_list" .Alerts.Resolved }} +{{ end }} +{{ end }} +{{ define "pushover.default.url" }}{{ template "__alertmanagerURL" . }}{{ end }} diff --git a/birdhouse/components/monitoring/alertmanager.yml.template b/birdhouse/components/monitoring/alertmanager.yml.template new file mode 100644 index 000000000..4221feaea --- /dev/null +++ b/birdhouse/components/monitoring/alertmanager.yml.template @@ -0,0 +1,74 @@ +# https://prometheus.io/docs/alerting/latest/configuration/ +# http://${PAVICS_FQDN}:9093/#/status +global: + # The smarthost and SMTP sender used for mail notifications. + smtp_smarthost: '${SMTP_SERVER}' + smtp_from: 'alertmanager@${PAVICS_FQDN}' + smtp_hello: '${PAVICS_FQDN}' +${ALERTMANAGER_EXTRA_GLOBAL} +# Below example of candidates for ALERTMANAGER_EXTRA_GLOBAL +# smtp_auth_username: 'alertmanager' +# smtp_auth_password: 'password' +# smtp_require_tls: false + +# The directory from which notification templates are read. +templates: +- '/etc/alertmanager/template/*.tmpl' + +# The root route on which each incoming alert enters. +route: + # The labels by which incoming alerts are grouped together. For example, + # multiple alerts coming in for cluster=A and alertname=LatencyHigh would + # be batched into a single group. + # + # To aggregate by all possible labels use '...' as the sole label name. + # This effectively disables aggregation entirely, passing through all + # alerts as-is. This is unlikely to be what you want, unless you have + # a very low alert volume or your upstream notification system performs + # its own grouping. Example: group_by: [...] 
+ group_by: ['alertname', 'cluster', 'service'] + + # When a new group of alerts is created by an incoming alert, wait at + # least 'group_wait' to send the initial notification. + # This way ensures that you get multiple alerts for the same group that start + # firing shortly after another are batched together on the first + # notification. + group_wait: 30s + + # When the first notification was sent, wait 'group_interval' to send a batch + # of new alerts that started firing for that group. + group_interval: 5m + + # If an alert has successfully been sent, wait 'repeat_interval' to + # resend them. + repeat_interval: 6h + + # A default receiver + receiver: admin-emails + +${ALERTMANAGER_EXTRA_ROUTES} + +# Inhibition rules allow to mute a set of alerts given that another alert is +# firing. +# We use this to mute any warning-level notifications if the same alert is +# already critical. +inhibit_rules: +- source_match: + severity: 'critical' + target_match: + severity: 'warning' + # Apply inhibition if the alertname is the same. + # CAUTION: + # If all label names listed in `equal` are missing + # from both the source and target alerts, + # the inhibition rule will apply! 
+ equal: ['alertname', 'cluster', 'service'] + +${ALERTMANAGER_EXTRA_INHIBITION} + +receivers: +- name: 'admin-emails' + email_configs: + - to: '${ALERTMANAGER_ADMIN_EMAIL_RECEIVER}' + +${ALERTMANAGER_EXTRA_RECEIVERS} diff --git a/birdhouse/components/monitoring/docker-compose-extra.yml b/birdhouse/components/monitoring/docker-compose-extra.yml index c593a715f..f27b52644 100644 --- a/birdhouse/components/monitoring/docker-compose-extra.yml +++ b/birdhouse/components/monitoring/docker-compose-extra.yml @@ -38,6 +38,7 @@ services: container_name: prometheus volumes: - ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro + - ./components/monitoring/prometheus.rules:/etc/prometheus/prometheus.rules:ro - prometheus_persistence:/prometheus:rw ports: - 9090:9090 @@ -49,6 +50,8 @@ services: - --web.console.templates=/usr/share/prometheus/consoles # https://prometheus.io/docs/prometheus/latest/storage/ - --storage.tsdb.retention.time=90d + # wrong default was http://container-hash:9090/ + - --web.external-url=http://${PAVICS_FQDN}:9090/ restart: unless-stopped # https://grafana.com/docs/grafana/latest/installation/docker/ @@ -68,6 +71,28 @@ services: - 3001:3000 restart: unless-stopped + # https://github.com/prometheus/alertmanager + # https://prometheus.io/docs/alerting/latest/overview/ + # Handle alerts: deduplicate, group, route, silence, inhibit + alertmanager: + image: prom/alertmanager:v0.21.0 + container_name: alertmanager + volumes: + - ./components/monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro + - ./components/monitoring/alertmanager.tmpl:/etc/alertmanager/template/default.tmpl:ro + - alertmanager_persistence:/alertmanager:rw + command: + # restore original CMD from image + - --config.file=/etc/alertmanager/alertmanager.yml + - --storage.path=/alertmanager + # enable debug logging + - --log.level=debug + # wrong default was http://container-hash:9093/ + - --web.external-url=http://${PAVICS_FQDN}:9093/ + ports: + - 9093:9093 + 
restart: unless-stopped + volumes: prometheus_persistence: external: @@ -75,5 +100,8 @@ volumes: grafana_persistence: external: name: grafana_persistence + alertmanager_persistence: + external: + name: alertmanager_persistence # vi: tabstop=8 expandtab shiftwidth=2 softtabstop=2 diff --git a/birdhouse/components/monitoring/prometheus.rules.template b/birdhouse/components/monitoring/prometheus.rules.template new file mode 100644 index 000000000..46d1a277a --- /dev/null +++ b/birdhouse/components/monitoring/prometheus.rules.template @@ -0,0 +1,347 @@ +# https://awesome-prometheus-alerts.grep.to/rules +groups: + +- name: NodeUsage + rules: + + - alert: HostOutOfMemory + expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 + for: 5m + labels: + severity: warning + annotations: + summary: "Host out of memory (instance {{ $labels.instance }})" + description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostMemoryUnderMemoryPressure + expr: rate(node_vmstat_pgmajfault[2m]) > 1000 + for: 5m + labels: + severity: warning + annotations: + summary: "Host memory under memory pressure (instance {{ $labels.instance }})" + description: "The node is under heavy memory pressure. 
High rate of major page faults\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostUnusualNetworkThroughputIn + expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100 + for: 5m + labels: + severity: warning + annotations: + summary: "Host unusual network throughput in (instance {{ $labels.instance }})" + description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostUnusualNetworkThroughputOut + expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100 + for: 5m + labels: + severity: warning + annotations: + summary: "Host unusual network throughput out (instance {{ $labels.instance }})" + description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostUnusualDiskReadRate + expr: sum by (instance) (irate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50 + for: 5m + labels: + severity: warning + annotations: + summary: "Host unusual disk read rate (instance {{ $labels.instance }})" + description: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostUnusualDiskWriteRate + expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50 + for: 5m + labels: + severity: warning + annotations: + summary: "Host unusual disk write rate (instance {{ $labels.instance }})" + description: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostOutOfDiskSpace + expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 + for: 5m + labels: + severity: warning + annotations: + summary: "Host out of disk space (instance {{ $labels.instance }})" + description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + 
+ - alert: HostDiskWillFillIn4Hours + expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Host disk will fill in 4 hours (instance {{ $labels.instance }})" + description: "Disk will fill in 4 hours at current write rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostOutOfInodes + expr: node_filesystem_files_free / node_filesystem_files * 100 < 10 + for: 5m + labels: + severity: warning + annotations: + summary: "Host out of inodes (instance {{ $labels.instance }})" + description: "Disk is almost running out of available inodes (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostUnusualDiskReadLatency + expr: rate(node_disk_read_time_seconds_total[2m]) / rate(node_disk_reads_completed_total[2m]) > 0.1 + for: 5m + labels: + severity: warning + annotations: + summary: "Host unusual disk read latency (instance {{ $labels.instance }})" + description: "Disk latency is growing (read operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostUnusualDiskWriteLatency + expr: rate(node_disk_write_time_seconds_total[2m]) / rate(node_disk_writes_completed_total[2m]) > 0.1 + for: 5m + labels: + severity: warning + annotations: + summary: "Host unusual disk write latency (instance {{ $labels.instance }})" + description: "Disk latency is growing (write operations > 100ms)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostHighCpuLoad + expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 + for: 5m + labels: + severity: warning + annotations: + summary: "Host high CPU load (instance {{ $labels.instance }})" + description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + # 2000 context switches is an arbitrary number. + # Alert threshold depends on nature of application. 
+ # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58 + - alert: HostContextSwitching + expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 2000 + for: 5m + labels: + severity: warning + annotations: + summary: "Host context switching (instance {{ $labels.instance }})" + description: "Context switching is growing on node (> 2000 / s)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostSwapIsFillingUp + expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80 + for: 5m + labels: + severity: warning + annotations: + summary: "Host swap is filling up (instance {{ $labels.instance }})" + description: "Swap is filling up (>80%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + +# node_systemd_* not enabled by default due to kernel configuration and +# security settings, +# https://github.com/prometheus/node_exporter#disabled-by-default + + - alert: HostSystemdServiceCrashed + expr: node_systemd_unit_state{state="failed"} == 1 + for: 5m + labels: + severity: warning + annotations: + summary: "Host SystemD service crashed (instance {{ $labels.instance }})" + description: "SystemD service crashed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + +# node_hwmon_* requires lm_sensors package + + - alert: HostPhysicalComponentTooHot + expr: node_hwmon_temp_celsius > 75 + for: 5m + labels: + severity: warning + annotations: + summary: "Host physical component too hot (instance {{ $labels.instance }})" + description: "Physical hardware component too hot\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostNodeOvertemperatureAlarm + expr: node_hwmon_temp_crit_alarm_celsius == 1 + for: 5m + labels: + severity: critical + annotations: + summary: "Host node overtemperature alarm (instance {{ $labels.instance }})" + description: "Physical node temperature alarm triggered\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: 
HostRaidArrayGotInactive + expr: node_md_state{state="inactive"} > 0 + for: 5m + labels: + severity: critical + annotations: + summary: "Host RAID array got inactive (instance {{ $labels.instance }})" + description: "RAID array {{ $labels.device }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostRaidDiskFailure + expr: node_md_disks{state="fail"} > 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Host RAID disk failure (instance {{ $labels.instance }})" + description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostKernelVersionDeviations + expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1 + for: 5m + labels: + severity: warning + annotations: + summary: "Host kernel version deviations (instance {{ $labels.instance }})" + description: "Different kernel versions are running\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + +# node_vmstat_oom_kill seems to need kernel 4.10.15 and newer +# https://github.com/prometheus/node_exporter/pull/874#issuecomment-377333109 + + - alert: HostOomKillDetected + expr: increase(node_vmstat_oom_kill[5m]) > 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Host OOM kill detected (instance {{ $labels.instance }})" + description: "OOM kill detected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostEdacCorrectableErrorsDetected + expr: increase(node_edac_correctable_errors_total[5m]) > 0 + for: 5m + labels: + severity: info + annotations: + summary: "Host EDAC Correctable Errors detected (instance {{ $labels.instance }})" + description: "{{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable 
memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostEdacUncorrectableErrorsDetected + expr: node_edac_uncorrectable_errors_total > 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})" + description: "{{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostNetworkReceiveErrors + expr: increase(node_network_receive_errs_total[5m]) > 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Host Network Receive Errors (instance {{ $labels.instance }})" + description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + + - alert: HostNetworkTransmitErrors + expr: increase(node_network_transmit_errs_total[5m]) > 0 + for: 5m + labels: + severity: warning + annotations: + summary: "Host Network Transmit Errors (instance {{ $labels.instance }})" + description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + + +- name: ContainerUsage + rules: + + - alert: ContainerKilled + expr: time() - container_last_seen{name!~"deploy_tutorial_notebooks|notebookdeploy|autodeploy|deploy_README_ipynb"} > 60 + for: 5m + labels: + severity: warning + annotations: + summary: "Container killed (instance {{ $labels.instance }})" + description: "A container has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + - alert: ContainerCpuUsage + expr: (sum(rate(container_cpu_usage_seconds_total{name=~".+"}[3m])) BY (instance, name) * 100) > 80 + for: 5m + labels: + severity: warning + annotations: 
+ summary: "Container CPU usage (instance {{ $labels.instance }})" + description: "Container CPU usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + - alert: ContainerMemoryUsage + expr: (sum(container_memory_usage_bytes{name=~".+"}) BY (instance, name) / sum(container_spec_memory_limit_bytes{name=~".+"} > 0) BY (instance, name) * 100) > 80 + for: 5m + labels: + severity: warning + annotations: + summary: "Container Memory usage (instance {{ $labels.instance }})" + description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + - alert: ContainerVolumeUsage + expr: (1 - sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance)) * 100 > 80 + for: 5m + labels: + severity: warning + annotations: + summary: "Container Volume usage (instance {{ $labels.instance }})" + description: "Container Volume usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + - alert: ContainerVolumeIoUsage + expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80 + for: 5m + labels: + severity: warning + annotations: + summary: "Container Volume IO usage (instance {{ $labels.instance }})" + description: "Container Volume IO usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" + + - alert: ContainerHighThrottleRate + expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1 + for: 5m + labels: + severity: warning + annotations: + summary: "Container high throttle rate (instance {{ $labels.instance }})" + description: "Container is being throttled\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" diff --git a/birdhouse/components/monitoring/prometheus.yml.template b/birdhouse/components/monitoring/prometheus.yml.template index be6130d0d..0398c4e8e 100644 --- a/birdhouse/components/monitoring/prometheus.yml.template +++ b/birdhouse/components/monitoring/prometheus.yml.template @@ -17,3 +17,13 @@ scrape_configs: static_configs: - targets: - ${PAVICS_FQDN}:9100 + 
+rule_files: +- "/etc/prometheus/*.rules" + +alerting: + alertmanagers: + - scheme: http + static_configs: + - targets: + - "${PAVICS_FQDN}:9093" diff --git a/birdhouse/components/prometheus-alert-rules.png b/birdhouse/components/prometheus-alert-rules.png new file mode 100644 index 000000000..fcd5a0d83 Binary files /dev/null and b/birdhouse/components/prometheus-alert-rules.png differ diff --git a/birdhouse/env.local.example b/birdhouse/env.local.example index 3569a98a6..216f8d4fc 100644 --- a/birdhouse/env.local.example +++ b/birdhouse/env.local.example @@ -208,6 +208,14 @@ export POSTGRES_MAGPIE_PASSWORD=postgres-qwerty # # Below are Mandatory if monitoring component is enabled: export GRAFANA_ADMIN_PASSWORD=changeme! +#export ALERTMANAGER_ADMIN_EMAIL_RECEIVER="user1@example.com,user2@example.com" +#export SMTP_SERVER="smtp.example.com:25" + +# Below are optional for monitoring component +#export ALERTMANAGER_EXTRA_GLOBAL="" +#export ALERTMANAGER_EXTRA_ROUTES="" +#export ALERTMANAGER_EXTRA_INHIBITION="" +#export ALERTMANAGER_EXTRA_RECEIVERS="" ############################################################################# # Emu optional vars diff --git a/birdhouse/pavics-compose.sh b/birdhouse/pavics-compose.sh index 55af30480..87289ecf8 100755 --- a/birdhouse/pavics-compose.sh +++ b/birdhouse/pavics-compose.sh @@ -25,6 +25,8 @@ VARS=' $POSTGRES_PAVICS_PASSWORD $POSTGRES_MAGPIE_USERNAME $POSTGRES_MAGPIE_PASSWORD + $ALERTMANAGER_ADMIN_EMAIL_RECEIVER + $SMTP_SERVER ' # list of vars to be substituted in template but they do not have to be set in @@ -48,6 +50,10 @@ OPTIONAL_VARS=' $AUTODEPLOY_EXTRA_SCHEDULER_JOBS $GENERIC_BIRD_PORT $GENERIC_BIRD_NAME + $ALERTMANAGER_EXTRA_GLOBAL + $ALERTMANAGER_EXTRA_ROUTES + $ALERTMANAGER_EXTRA_INHIBITION + $ALERTMANAGER_EXTRA_RECEIVERS ' # we switch to the real directory of the script, so it still works when used from $PATH @@ -144,6 +150,7 @@ if [ x"$1" = x"up" ]; then docker volume create thredds_persistence # logs, cache docker 
volume create prometheus_persistence # metrics db docker volume create grafana_persistence # dashboard and config db + docker volume create alertmanager_persistence # storage fi COMPOSE_CONF_LIST="-f docker-compose.yml" diff --git a/birdhouse/scripts/send-dummy-alert.sh b/birdhouse/scripts/send-dummy-alert.sh new file mode 100755 index 000000000..0edf503a3 --- /dev/null +++ b/birdhouse/scripts/send-dummy-alert.sh @@ -0,0 +1,45 @@ +#!/bin/bash +# https://gist.githubusercontent.com/cherti/61ec48deaaab7d288c9fcf17e700853a/raw/a69ddd1d96507f6d94059071d500fe499631e739/alert.sh +# Useful to test receving alert on UI and via email notif. + +name=$RANDOM +url='http://localhost:9093/api/v1/alerts' + +echo "firing up alert $name" + +# change url o +curl -XPOST $url -d "[{ + \"status\": \"firing\", + \"labels\": { + \"alertname\": \"$name\", + \"service\": \"my-service\", + \"severity\":\"warning\", + \"instance\": \"$name.example.net\" + }, + \"annotations\": { + \"summary\": \"High latency is high!\" + }, + \"generatorURL\": \"http://prometheus.int.example.net/\" +}]" + +echo "" + +echo "press enter to resolve alert" +read + +echo "sending resolve" +curl -XPOST $url -d "[{ + \"status\": \"resolved\", + \"labels\": { + \"alertname\": \"$name\", + \"service\": \"my-service\", + \"severity\":\"warning\", + \"instance\": \"$name.example.net\" + }, + \"annotations\": { + \"summary\": \"High latency is high!\" + }, + \"generatorURL\": \"http://prometheus.int.example.net/\" +}]" + +echo ""
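
The Alertmanager `/api/v1/alerts` endpoint used by `send-dummy-alert.sh` accepts a JSON list of alerts. As a side note, the same payload can be built by a small helper and inspected before POSTing; this is only a sketch (`make_alert` is a hypothetical helper, not part of the repo):

```shell
#!/bin/sh
# Build the same alert payload as send-dummy-alert.sh without POSTing it,
# so the JSON can be eyeballed (or piped to a validator) first.
make_alert() {
    # $1 = status (firing|resolved), $2 = alert name
    printf '[{"status":"%s","labels":{"alertname":"%s","service":"my-service","severity":"warning","instance":"%s.example.net"},"annotations":{"summary":"High latency is high!"},"generatorURL":"http://prometheus.int.example.net/"}]' \
        "$1" "$2" "$2"
}

payload=$(make_alert firing test42)
echo "$payload"
# To actually send it (requires a running Alertmanager on localhost:9093):
#   curl -XPOST 'http://localhost:9093/api/v1/alerts' -d "$payload"
```

Keeping the payload construction separate from the `curl` call makes it easy to test the resolve path as well, by calling `make_alert resolved "$name"` with the same alert name.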