Monitoring: add alert rules and alert handling (deduplicate, group, route, silence, inhibit) #59

Merged: 29 commits, Jul 11, 2020
97b93b2
monitoring: add AlertManager to handle alerts
tlvu Jul 3, 2020
dec745a
monitoring: script to send dummy alert to test alertmanager independe…
tlvu Jul 3, 2020
f6e3c1f
monitoring: add alertmanager template to hopefully get email notif to…
tlvu Jul 3, 2020
4e45d28
monitoring: enable debug logging for alertmanager for alerts received…
tlvu Jul 3, 2020
28f75b2
monitoring: add alert rules to prometheus
tlvu Jul 3, 2020
915456f
monitoring: remove my redundant self crafted High CPU usage alert
tlvu Jul 4, 2020
6fcdd25
monitoring: fix HostMemoryUnderMemoryPressure rate range, probably du…
tlvu Jul 4, 2020
fd28fc3
monitoring: fix HostOutOfDiskSpace to include all filesystems, mountp…
tlvu Jul 4, 2020
99652e1
monitoring: fix HostOutOfInodes to include all filesystems, mountpoin…
tlvu Jul 4, 2020
396bab2
monitoring: fix HostUnusualDiskReadLatency and HostUnusualDiskWriteLa…
tlvu Jul 4, 2020
b25e97a
monitoring: fix prometheus rules not clickable
tlvu Jul 6, 2020
7ef5ec4
monitoring: fix prometheus can not scrape anymore, it thinks cadvisor…
tlvu Jul 6, 2020
819f344
monitoring: fix alertmanager wrong self reference url
tlvu Jul 6, 2020
90565ff
monitoring: note node_hwmon_\* metric requirements
tlvu Jul 6, 2020
3a8f3dd
monitoring: note node_systemd_\* metric not working by default
tlvu Jul 6, 2020
0789eaf
monitoring: fix HostNodeOvertemperatureAlarm, node_hwmon_temp_alarm r…
tlvu Jul 6, 2020
6eb43f7
monitoring: node_vmstat_oom_kill (HostOomKillDetected) seems to need …
tlvu Jul 6, 2020
cc62399
monitoring: fix HostEdacCorrectableErrorsDetected, HostEdacUncorrecta…
tlvu Jul 6, 2020
afccd7c
monitoring: remove accidental Postgres, Apache, jvm, Speedtest alerts…
tlvu Jul 6, 2020
acf1b2c
monitoring: remove accidental Postgres, Haproxy, Sidekiq, Consul aler…
tlvu Jul 8, 2020
214d5cb
monitoring: fix ContainerVolumeUsage alert, was giving negative numbe…
tlvu Jul 8, 2020
c67a4a9
monitoring; fix ContainerCpuUsage alert to avoid false positive when …
tlvu Jul 8, 2020
52956d1
monitoring: fix ContainerMemoryUsage alert to avoid false positive
tlvu Jul 8, 2020
606e047
monitoring: fix ContainerKilled alert to avoid false positive with te…
tlvu Jul 8, 2020
53c7e90
monitoring: tweak HostContextSwitching alert for less false postiive,…
tlvu Jul 8, 2020
6c4f3e4
env.local: better example for ALERTMANAGER_ADMIN_EMAIL_RECEIVER and S…
tlvu Jul 8, 2020
45c50b4
monitoring: add README description for the monitoring component
tlvu Jul 11, 2020
c2c1339
README: update reference to monitoring system
tlvu Jul 11, 2020
4f9aa2d
scheduler: add documentation to the new component README for consistency
tlvu Jul 11, 2020
73 changes: 4 additions & 69 deletions birdhouse/README.md
@@ -35,20 +35,11 @@ below and the variable `AUTODEPLOY_EXTRA_REPOS` in

 The automatic deployment of the PAVICS platform, of the Jupyter tutorial
 notebooks and of the automatic deployment mechanism itself can all be
-enabled and configured in the `env.local` file (a copy from
-[`env.local.example`](env.local.example)).
+enabled by following instructions [here](components/README.rst#scheduler).

-* Add `./components/scheduler` to `EXTRA_CONF_DIRS`.
-* Set `AUTODEPLOY_EXTRA_REPOS`, `AUTODEPLOY_DEPLOY_KEY_ROOT_DIR`,
-`AUTODEPLOY_PLATFORM_FREQUENCY`, `AUTODEPLOY_NOTEBOOK_FREQUENCY` as
-desired, full documentation in [`env.local.example`](env.local.example).
-* Run once [`fix-write-perm`](deployment/fix-write-perm), see doc in script.
-
-Resource usage monitoring (CPU, memory, ..) for the host and each of the containers
-can be enabled by enabling the `./components/monitoring` in `env.local` file.
-
-* Add `./components/monitoring` to `EXTRA_CONF_DIRS`.
-* Change `GRAFANA_ADMIN_PASSWORD` value.
+Resource usage monitoring (CPU, memory, ..) and alerting for the host and each
+of the containers can be enabled by following instructions
+[here](components/README.rst#monitoring).

 To launch all the containers, use the following command:
 ```
@@ -94,62 +85,6 @@ postgres instance. See [`scripts/create-wps-pgsql-databases.sh`](scripts/create-
 * Click "Add User".


-## Mostly automated unattended continuous deployment
-
-***NOTE***: this section about automated unattended continuous deployment is
-superseded by the new `./components/scheduler` that can be entirely
-enabled/disabled via the `env.local` file. See the part about automatic
-deployment of the PAVICS platform in the "Docker instructions" section
-above for how to configure it.
-
-Automated unattended continuous deployment means if code change in the checkout
-of this repo, on the same currently checkout branch (ex: config changes,
-`docker-compose.yml` changes) a deployment will be performed automatically
-without human intervention.
-
-The trigger for the deployment is new code change on the server on the current
-branch (PR merged, push). New code change locally will not trigger deployment
-so local development workflow is also supported.
-
-Note: there are still cases where a human intervention is needed. See note in
-script [`deployment/deploy.sh`](deployment/deploy.sh).
-
-Configure logrotate for all following automations to prevent disk full:
-```
-deployment/install-logrotate-config .. $USER
-```
-
-To enable continuous deployment of PAVICS:
-
-```
-deployment/install-automated-deployment.sh .. $USER [daily|5-mins]
-# read the script for more options/details
-```
-
-If you want to manually force a deployment of PAVICS (note this might not use
-latest version of deploy.sh script):
-```
-deployment/deploy.sh .
-# read the script for more options/details
-```
-
-To enable continuous deployment of tutorial Jupyter notebooks:
-
-```
-deployment/install-deploy-notebook .. $USER
-# read the script for more details
-```
-
-To trigger tutorial Jupyter notebooks deploy manually:
-```
-# configure logrotate before because this script will log to
-# /var/log/PAVICS/notebookdeploy.log
-
-deployment/trigger-deploy-notebook
-# read the script for more details
-```
-
-
 ## Vagrant instructions

 Vagrant allows us to quickly spin up a VM to easily reproduce the runtime
233 changes: 233 additions & 0 deletions birdhouse/components/README.rst
@@ -0,0 +1,233 @@
#################
PAVICS Components
#################


.. contents::


Scheduler
=========

This component provides automated unattended continuous deployment for the
"PAVICS stack", for the tutorial notebooks on the Jupyter environment and for the
automated deployment itself.

It can also be used to schedule other tasks on the PAVICS physical host.

Everything is dockerized: the deployment runs inside a container that
updates all the other containers.

Automated unattended continuous deployment means that when code changes in the
remote repo, on the branch matching the current checkout (ex: config changes,
``docker-compose.yml`` changes), a deployment is performed automatically
without human intervention.

The trigger for the deployment is a new code change on the server on the current
branch (PR merged, push). Local code changes do not trigger a deployment,
so the local development workflow is also supported.

Multiple remote repos are supported so the "PAVICS stack" can be made of
multiple checkouts for modularity and extensibility. The autodeploy will
trigger if any of the checkouts (configured in ``AUTODEPLOY_EXTRA_REPOS``) is
not up-to-date with its remote repo.

A suggested "PAVICS stack" is made of at least 2 repos: this repo and another
private repo containing the source-controlled ``env.local`` file and any other
docker-compose overrides, for true infrastructure-as-code.

Note: there are still cases where human intervention is needed. See the note in
the deploy.sh_ script.


Usage
-----

Given the unattended nature, there is no UI. Logs are used to keep a trace of each run.

- ``/var/log/PAVICS/autodeploy.log`` is for the PAVICS deployment.

- ``/var/log/PAVICS/notebookdeploy.log`` is for the tutorial notebooks deployment.

- logrotate is enabled for ``/var/log/PAVICS/*.log`` to avoid filling up the
  disk. Any new ``.log`` file in that folder is rotated automatically.
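
As a quick way to inspect those logs, the sketch below prints the tail of each
file named in the list above. The function name ``show_deploy_logs`` and the
20-line tail are illustrative choices, not part of this component.

```shell
# Print the last lines of each deployment log listed above.
# The function name is hypothetical; the default directory is the documented one.
show_deploy_logs() {
    log_dir="${1:-/var/log/PAVICS}"
    for name in autodeploy.log notebookdeploy.log; do
        file="$log_dir/$name"
        if [ -r "$file" ]; then
            echo "== $file =="
            tail -n 20 "$file"
        else
            # Reported instead of failing, e.g. before the first deployment run.
            echo "missing or unreadable: $file"
        fi
    done
}
```

Calling ``show_deploy_logs`` with no argument reads from ``/var/log/PAVICS``;
passing another directory is only useful for testing.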


How to Enable the Component
---------------------------

- Edit ``env.local`` (a copy of env.local.example_)

  - Add "./components/scheduler" to ``EXTRA_CONF_DIRS``.
  - Set ``AUTODEPLOY_EXTRA_REPOS``, ``AUTODEPLOY_DEPLOY_KEY_ROOT_DIR``,
    ``AUTODEPLOY_PLATFORM_FREQUENCY``, ``AUTODEPLOY_NOTEBOOK_FREQUENCY`` as desired,
    full documentation in env.local.example_.
  - Run once fix-write-perm_, see doc in script.
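
The steps above might translate into an ``env.local`` excerpt like the sketch
below. Only the variable names come from this document. The ``export`` syntax,
every path, and the frequency values are assumptions made for illustration;
env.local.example_ remains the authoritative reference for the accepted formats.

```shell
# Hypothetical excerpt of env.local; every value below is a placeholder.

# Enable the scheduler component (keep any directories already listed there).
export EXTRA_CONF_DIRS="./components/scheduler"

# Extra checkouts that autodeploy should keep in sync with their remotes.
export AUTODEPLOY_EXTRA_REPOS="/path/to/private-config-checkout"

# Directory holding the ssh deploy keys for private repos.
export AUTODEPLOY_DEPLOY_KEY_ROOT_DIR="/path/to/ssh-deploy-keys"

# How often each autodeploy job runs (value format is assumed, not confirmed).
export AUTODEPLOY_PLATFORM_FREQUENCY="@every 5m"
export AUTODEPLOY_NOTEBOOK_FREQUENCY="@hourly"
```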


Old way to deploy the automatic deployment
------------------------------------------

Superseded by this new ``scheduler`` component. Kept for reference only.

Deploying this old way does not need the ``scheduler`` component but loses the
ability for the autodeploy system to update itself.

Configure logrotate for all following automations to prevent disk full::

    deployment/install-logrotate-config .. $USER

To enable continuous deployment of PAVICS::

    deployment/install-automated-deployment.sh .. $USER [daily|5-mins]
    # read the script for more options/details

If you want to manually force a deployment of PAVICS (note this might not use
latest version of deploy.sh script)::

    deployment/deploy.sh .
    # read the script for more options/details

To enable continuous deployment of tutorial Jupyter notebooks::

    deployment/install-deploy-notebook .. $USER
    # read the script for more details

To trigger tutorial Jupyter notebooks deploy manually::

    # configure logrotate before because this script will log to
    # /var/log/PAVICS/notebookdeploy.log

    deployment/trigger-deploy-notebook
    # read the script for more details

Migrating to the new mechanism requires manually deleting all the artifacts
created by the old install scripts: ``sudo rm /etc/cron.d/PAVICS-deploy
/etc/cron.hourly/PAVICS-deploy-notebooks /etc/logrotate.d/PAVICS-deploy
/usr/local/sbin/triggerdeploy.sh``. The old and new mechanisms cannot co-exist.
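
Before enabling the new component, it can help to check which of those old
artifacts are still present. The sketch below only lists them and deletes
nothing. The four paths are copied from the paragraph above; the variable and
function names are hypothetical.

```shell
# Paths copied from the migration note above; names are hypothetical.
OLD_AUTODEPLOY_ARTIFACTS="/etc/cron.d/PAVICS-deploy
/etc/cron.hourly/PAVICS-deploy-notebooks
/etc/logrotate.d/PAVICS-deploy
/usr/local/sbin/triggerdeploy.sh"

# Print each old artifact that still exists on this host.
list_old_autodeploy_artifacts() {
    for path in $OLD_AUTODEPLOY_ARTIFACTS; do
        if [ -e "$path" ]; then
            echo "$path"
        fi
    done
}
```

Empty output from ``list_old_autodeploy_artifacts`` suggests nothing from the
old install scripts remains on the host.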


Comparison between the old and new autodeploy mechanism
-------------------------------------------------------

Maximum backward-compatibility has been kept with the old install scripts style:

* Still logs to the same existing log files under ``/var/log/PAVICS``.
* The old single ssh deploy key is still compatible, but the new mechanism allows a different ssh deploy key for each extra repo (again, public repos should use the https clone path to avoid dealing with ssh deploy keys in the first place).
* Old install scripts are kept and can still deploy the old way.

Features missing in the old install scripts, or how the new mechanism improves on them:

* Autodeploy of the autodeploy itself! This is the biggest win. Previously, if the ``triggerdeploy.sh`` or ``PAVICS-deploy-notebooks`` script changed, it had to be redeployed manually, which was very annoying. Now they are volume-mounted in, so they are fresh on each run.
* ``env.local`` now drives absolutely everything: source control that file and we have a true DevOps pipeline.
* Configurable platform and notebook autodeploy frequency. Previously, this meant manually editing the generated cron file, which is less ideal.
* No support is needed on the local host other than ``docker`` and ``docker-compose``. The ``cron/logrotate/git/ssh`` versions are all locked down in the docker images used by the autodeploy. Recall that previously we had to deal with a git version that was too old on some hosts.
* Each cron job runs in its own docker image, meaning the runtime environment is traceable and reproducible.
* The newly introduced scheduler component is extensible, so other jobs (ex: backup) can be added to it via ``env.local``, which should be source controlled, meaning all surrounding maintenance tasks can also be traceable and reproducible.


Monitoring
==========

This component provides monitoring and alerting for the PAVICS physical host
and containers.

The Prometheus stack is used:

* Node-exporter to collect host metrics.
* cAdvisor to collect container metrics.
* Prometheus to scrape, store and query metrics.
* AlertManager to manage alerts: deduplicate, group, route, silence, inhibit.
* Grafana to provide visualization dashboards for the metrics.


Usage
-----

- Grafana to view metric graphs: http://PAVICS_FQDN:3001/d/pf6xQMWGz/docker-and-system-monitoring
- Prometheus alert rules: http://PAVICS_FQDN:9090/rules
- AlertManager to manage alerts: http://PAVICS_FQDN:9093


How to Enable the Component
---------------------------

- Edit ``env.local`` (a copy of env.local.example_)

  - Add "./components/monitoring" to ``EXTRA_CONF_DIRS``
  - Set ``GRAFANA_ADMIN_PASSWORD`` to log in to Grafana
  - Set ``ALERTMANAGER_ADMIN_EMAIL_RECEIVER`` for receiving alerts
  - Set ``SMTP_SERVER`` for sending alerts
  - Optionally set

    - ``ALERTMANAGER_EXTRA_GLOBAL`` to further configure AlertManager
    - ``ALERTMANAGER_EXTRA_ROUTES`` to add more routes than email notification
    - ``ALERTMANAGER_EXTRA_INHIBITION`` to add inhibition rules that mute selected alerts
    - ``ALERTMANAGER_EXTRA_RECEIVERS`` to add more receivers than the admin emails
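
A minimal ``env.local`` excerpt for the required settings above could look like
the following sketch. The variable names come from this document. The
``export`` syntax, the comma-separated receiver list and the ``host:port`` form
of the SMTP server are assumptions; env.local.example_ documents the real
formats.

```shell
# Hypothetical excerpt of env.local; every value below is a placeholder.

# Enable the monitoring component (keep any directories already listed there).
export EXTRA_CONF_DIRS="./components/monitoring"

# Password for the Grafana admin account.
export GRAFANA_ADMIN_PASSWORD="change-me-to-a-real-password"

# Who receives alert emails (list format is assumed).
export ALERTMANAGER_ADMIN_EMAIL_RECEIVER="admin1@example.com,admin2@example.com"

# Mail server used to send the alerts (host:port form is assumed).
export SMTP_SERVER="smtp.example.com:25"
```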


Grafana Dashboard
-----------------

.. image:: grafana-dashboard.png

For the host, using Node-exporter to collect metrics:

- uptime
- number of containers
- used disk space
- used memory, available memory, used swap memory
- load
- cpu usage
- in and out network traffic
- disk I/O

For each container, using cAdvisor to collect metrics:

- in and out network traffic
- cpu usage
- memory and swap memory usage
- disk usage

Useful visualisation features:

- zooming in on one graph updates all other graphs to the same "time range" so events can be correlated
- each graph can be viewed independently for more details
- hovering over a data point shows the value at that moment


Prometheus Alert Rules
----------------------

.. image:: prometheus-alert-rules.png


AlertManager for Alert Dashboard and Silencing
----------------------------------------------

.. image:: alertmanager-dashboard.png
.. image:: alertmanager-silence-alert.png
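
To see an alert appear in the AlertManager dashboard without waiting for a real
incident, a test alert can be posted to Alertmanager's v2 HTTP API
(``/api/v2/alerts``). The sketch below is not the test script added by this PR.
The function names, the alert name and the label values are made up, and
``PAVICS_FQDN`` is assumed to be set in the shell to the real host name.

```shell
# Build the JSON body for one test alert; label and annotation values are made up.
dummy_alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"warning"},"annotations":{"summary":"manual test alert"}}]' "$1"
}

# POST the payload to Alertmanager on port 9093 (the port listed under Usage).
send_dummy_alert() {
    dummy_alert_payload "${1:-DummyTestAlert}" |
        curl -sS -X POST -H 'Content-Type: application/json' \
             --data @- "http://${PAVICS_FQDN:?set PAVICS_FQDN first}:9093/api/v2/alerts"
}
```

After ``send_dummy_alert``, the alert should show up in the dashboard and can be
used to try out silencing.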


Customizing the Component
-------------------------

- To add more Grafana dashboards, volume-mount more ``*.json`` files to the
  grafana container.

- To add more Prometheus alert rules, volume-mount more ``*.rules`` files to
  the prometheus container.

- To mute existing Prometheus alert rules, add more Alertmanager inhibition
  rules using ``ALERTMANAGER_EXTRA_INHIBITION`` via the ``env.local`` file.

- Other possible Alertmanager configs via ``env.local``:
  ``ALERTMANAGER_EXTRA_GLOBAL``, ``ALERTMANAGER_EXTRA_ROUTES`` (can route to
  Slack or other services accepting webhooks), ``ALERTMANAGER_EXTRA_RECEIVERS``.
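
As an illustration of an extra ``*.rules`` file, the sketch below writes a
single rule in the standard Prometheus rule-file format. The file name, group
name, alert name and threshold are hypothetical; ``node_load1`` is a standard
Node-exporter metric. Where the file must be mounted inside the prometheus
container depends on this component's configuration and is not shown here.

```shell
# Write a minimal Prometheus rules file; file name, alert name and
# threshold are hypothetical examples.
RULES_FILE="${RULES_FILE:-./custom-host.rules}"

# Quoted heredoc delimiter keeps the {{ $labels.instance }} template verbatim.
cat > "$RULES_FILE" <<'EOF'
groups:
  - name: custom-host-alerts
    rules:
      - alert: CustomHostHighLoad
        expr: node_load1 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High 1-minute load on {{ $labels.instance }}"
EOF
```

If ``promtool`` is available, ``promtool check rules`` can validate the file
before it is mounted.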




.. _env.local.example: ../env.local.example
.. _fix-write-perm: ../deployment/fix-write-perm
.. _deploy.sh: ../deployment/deploy.sh
Binary file added birdhouse/components/alertmanager-dashboard.png
Binary file added birdhouse/components/grafana-dashboard.png
2 changes: 2 additions & 0 deletions birdhouse/components/monitoring/.gitignore
@@ -1,3 +1,5 @@
prometheus.yml
grafana_datasources.yml
grafana_dashboards.yml
alertmanager.yml
prometheus.rules