Date: 2020-05-26
Accepted
We need to be able to monitor our cluster / apps, and be notified by automated alerts if problems occur.
We will base our solution on Azure Monitor and its related services. Because it is a managed service:
- Scaling, updates and data storage are taken care of as part of the solution, resulting in lower running costs.
- State remains outside the cluster. Together with GitOps, which stores Kubernetes resource manifests in an external git repository, this makes it easy to recreate the cluster at any point in time.
We will use Prometheus to gather metrics for the following reasons:
- Both Linkerd and Istio (our potential service mesh choices) have an internal Prometheus instance.
- Proven and popular, wide compatibility, Cloud Native Computing Foundation project.
- We can leverage some of the knowledge and existing work done by the MHRA team.
- Close integration with Grafana should we decide to use it (pre-existing dashboard templates for exporters etc).
Prometheus support within Azure Monitor for Containers means that no separate Prometheus server is required, while still allowing us to scrape /metrics endpoints and make use of Prometheus queries for custom alerts etc.
An example Prometheus query (PromQL) can be found here: https://github.com/MHRA/deployments/blob/e42c0a9ee320294c1d691392c5b8525703cdd524/observability/prometheus-configmap.yaml#L15
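To illustrate what a scrapeable /metrics endpoint looks like from the application side, here is a minimal sketch using only the Python standard library. The metric name `app_requests_total`, the `REQUEST_COUNTS` dictionary and the port are hypothetical and chosen for illustration; a real service would more likely use a Prometheus client library than hand-render the text format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters; a real app would update these per request.
REQUEST_COUNTS = {"GET": 0, "POST": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total requests handled, by HTTP method.",
        "# TYPE app_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the scraper would then need to be pointed at that port and path through its own scrape configuration.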
Instrumentation for our apps will follow the OpenTelemetry API.
The resulting telemetry will be consumed by Azure Application Insights or Jaeger. A spike at a later date will determine what we require here.
Azure Monitor pulls in logs via its Metrics API, meaning:
- We can use Log Analytics in the Azure portal to write log queries and interactively analyze log data.
- We can use the Application Insights analytics console in the Azure portal to write log queries and interactively analyze log data from Application Insights.
Alerts will be configured through Azure Monitor.
Azure Monitor log queries use the Kusto Query Language (KQL). An example can be found here.
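As a rough idea of the kind of Kusto query a log-based alert might run, the sketch below builds one as a string in Python. It assumes the workspace exposes the `ContainerLog` table with `TimeGenerated` and `LogEntry` columns, as populated by Azure Monitor for Containers; the function name, the time windows and the "error" match are illustrative choices, not settled decisions.

```python
def error_count_query(window_minutes: int = 15, bin_minutes: int = 5) -> str:
    """Build a KQL query counting container log lines that mention "error".

    Assumes the ContainerLog table and its TimeGenerated / LogEntry columns
    are present in the Log Analytics workspace.
    """
    return "\n".join([
        "ContainerLog",
        f"| where TimeGenerated > ago({window_minutes}m)",
        '| where LogEntry contains "error"',
        f"| summarize ErrorCount = count() by bin(TimeGenerated, {bin_minutes}m)",
        "| order by TimeGenerated asc",
    ])


if __name__ == "__main__":
    # Print the query so it can be pasted into Log Analytics for a trial run.
    print(error_count_query())
```

An alert rule could then fire when `ErrorCount` exceeds a threshold agreed by the team.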
Action Groups can be associated with alerts to handle SMS / email notifications for people on support duty.
Note: Prometheus also has an "Alertmanager" which can handle some or much of the above. We should be able to compare the two fairly easily once both are up and running. Separately, there may be additional value in combining technical issues with the existing user support system. A spike to investigate this has been added to the backlog.
Azure Dashboard offers basic functionality and may be sufficient. However, Grafana may well be worth setting up because of the pre-existing templates available for a number of our needs (e.g. the node exporter dashboard) and its greater flexibility. Spike to follow.
The choices above present no particular risks to mitigate at this point; we are still at a very early stage and have ample room to manoeuvre. Pricing for Azure monitoring can be found here.