Collection of Terraform modules to deploy the Prometheus ecosystem to Cloud Foundry.
- Prometheus exporters are deployed as paas applications to provide prometheus metrics for paas applications and services, paas billing, redis, postgres...
- Internal applications metrics can be exposed too
- Prometheus collects the metrics: applications, databases, cpu, memory, cost, custom metrics, etc.
- Metrics are then persisted to InfluxDB
- Metrics-based alerts can be created in prometheus and processed by alertmanager to send to Slack, email, pagerduty, etc
- Finally, the metrics are available in grafana to build dashboards, help troubleshooting and create alerts.
The prometheus_all module is a good starting point as it includes all the other modules. Check the variables in prometheus_all for a description of all configuration options.
github.com/DFE-Digital/cf-monitoring
- Prerequisites
- prometheus_all
- Minimal configuration
- Use a specific cf-monitoring version
- Retention policy
- Enable specific modules
- Grafana
- PostgreSQL
- Generic PostgreSQL alerting
- Generic Application alerting
- Redis Services
- External exporters
- Internal applications
- alertmanager
- Dockerhub pull rate limit
- By default, the influxdb database service must be present (as it is on GOV.UK PaaS). If not, another backend can be used and the influxdb module disabled.
- The paas-prometheus-exporter requires a cf username and password to connect and read metrics. It is recommended to create a service account and set it up as SpaceAuditor on each monitored space as well as BillingManager on the whole organisation.
- Terraform (tested with version 0.14)
- Terraform cloudfoundry provider
Wrapper module abstracting all the other modules. It should be sufficient for most use cases but underlying modules can also be used directly.
The prometheus_all module creates two instances of the Prometheus application:
- Scraper: it has the configuration required to collect metrics from their given locations and raise alerts sending them to the alertmanager
- Read Only: used by Grafana as a data source and prevents large queries from impacting the metric collection process of the scraper
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  monitoring_instance_name = "teaching-vacancies"
  monitoring_org_name      = "dfe"
  monitoring_space_name    = "teaching-vacancies-monitoring"
  paas_exporter_username   = var.paas_exporter_username
  paas_exporter_password   = var.paas_exporter_password
  grafana_admin_password   = var.grafana_admin_password
}
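The var.* references above need matching variable declarations in the calling configuration. A minimal sketch, assuming Terraform 0.14 or later (where input variables can be marked as sensitive); the descriptions are illustrative and not taken from the modules:

```hcl
# Input variables consumed by the prometheus_all module block.
# "sensitive = true" hides the values in plan and apply output.
variable "paas_exporter_username" {
  description = "Username of the service account used by paas-prometheus-exporter"
  type        = string
  sensitive   = true
}

variable "paas_exporter_password" {
  description = "Password of the service account used by paas-prometheus-exporter"
  type        = string
  sensitive   = true
}

variable "grafana_admin_password" {
  description = "Password of the Grafana admin account"
  type        = string
  sensitive   = true
}
```

The values can then be supplied outside version control, for example through TF_VAR_paas_exporter_username, TF_VAR_paas_exporter_password and TF_VAR_grafana_admin_password environment variables.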
The git reference can be changed. For example, for the dev branch:
source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all?ref=dev"
The default retention policy in influxdb is 30 days, after which all the metrics are deleted. It is possible to keep some metrics for 12 months by using influxdb downsampling and enabling the yearly prometheus instance.
- Install influxdb client 1.8 from https://portal.influxdata.com/downloads/
- Install cf conduit plugin (min version 0.13):
  cf install-plugin conduit
- Connect to the influxdb instance:
  cf conduit <influxdb instance> -- influx
- Create the one_year retention policy:
  CREATE RETENTION POLICY one_year ON defaultdb DURATION 52w REPLICATION 1
- Create the continuous query to aggregate data automatically. For the billing data enter:
  CREATE CONTINUOUS QUERY cost_1y ON defaultdb BEGIN SELECT max(value) AS value INTO defaultdb.one_year.cost FROM defaultdb.default_retention_policy.cost GROUP BY time(1d),* END
Prometheus-yearly is an extra prometheus instance reading data from the one_year retention policy in influxdb. It is disabled by default. To enable it, set enable_prometheus_yearly to true:
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"
  ...
  enable_prometheus_yearly = true
}
It is possible to include modules selectively to help onboarding to prometheus_all step-by-step. See the list of modules in enabled_modules.
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  enabled_modules          = ["prometheus", "influxdb"]
  monitoring_instance_name = "teaching-vacancies"
  monitoring_org_name      = "dfe"
  monitoring_space_name    = "teaching-vacancies-monitoring"
}
By default, authentication is only via username/password for the admin account. Authentication via Google single sign-on can be configured. It provides read-only access to users by default. Additional permissions are not persisted.
It provides several datasources:
- prometheus: for Cloud Foundry metrics and any other prometheus exporter
- influxdb: for influxDB internal metrics as well as all the metrics above, using the influxDB query language
- elasticsearch (optional): to query elasticsearch, extract data and generate metrics
A number of Grafana dashboards are included and are usable out-of-the-box to monitor your apps and services. By default they show all your resources, which you can then filter via drop-down menus.
You can add your own dashboards via the grafana_json_dashboards parameter.
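A minimal sketch of passing a custom dashboard. It assumes that grafana_json_dashboards accepts a list of dashboard JSON documents as strings (check the variable description in prometheus_all to confirm) and that a hypothetical dashboards/my_app.json file, exported from Grafana, sits next to the Terraform configuration:

```hcl
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  # ... other arguments as in the minimal configuration ...

  # Assumption: each element is the raw JSON of one exported Grafana dashboard.
  # The file path below is hypothetical.
  grafana_json_dashboards = [
    file("${path.module}/dashboards/my_app.json"),
  ]
}
```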
See Grafana README
Basic metrics are available in the CF databases dashboard. The PostgreSQL advanced dashboard provides more advanced metrics via the postgres_prometheus_exporter module.
See postgres_prometheus_exporter README
Generic Postgres alerting can be enabled for selected databases.
This adds alerts that trigger under the following conditions:
- Memory avail < 512MB
- Cpu usage > 60%
- Storage space avail < 1GB
Prerequisites:
- Monitoring must be configured for the postgres instances
- Alerting must already be configured for your service (alertmanager)
Set the following variables in tf or env.tfvars.json file as per your configuration to enable generic alerting.
postgres_dashboard_url (string): the grafana url for the cf-databases dashboard
alertable_postgres_services (map): a map of the postgres instances to have alerting enabled, with optional alert thresholds. Any threshold that is not listed defaults to the value below:
- max_cpu = 60 (%)
- min_mem = 1 (in GB)
- min_stg = 1 (in GB)
e.g. (in JSON format):
"postgres_dashboard_url": "https://grafana-service.london.cloudapps.digital/d/azzzBNMz",
"alertable_postgres_services": {
  "bat-qa/apply-postgres-qa": {
    "max_cpu": 65,
    "min_mem": 0.5,
    "min_stg": 2
  },
  "bat-qa/register-postgres-qa": {},
  "bat-qa/teacher-training-api-postgres-qa": {
    "min_mem": 0.5
  }
}
Generic Application alerting can be enabled for selected apps.
Set the following variables in tf or env.tfvars.json file as per your configuration to enable generic alerting.
apps_dashboard_url (string): the grafana url for the cf-apps dashboard
alertable_apps (map): a map of the app instances to have alerting enabled, with optional alert thresholds. Any threshold that is not listed defaults to the value below:
- max_cpu = 50 (%)
- max_mem = 60 (%)
- max_disk = 60 (%)
- max_crash_count = 1
- max_elevated_req_failure_count = 0.1 (10%)
- response_threshold = 1 (second)
Prerequisites:
- Monitoring must be configured for the app instances
- Alerting must already be configured for your service (alertmanager)
e.g. (in JSON format):
"apps_dashboard_url": "https://grafana-service.london.cloudapps.digital/d/azzzBNMz",
"alertable_apps": {
  "tra-dev/find-a-lost-trn-dev": {},
  "tra-dev/qualified-teachers-api-dev": {
    "response_threshold": 5
  }
}
If your application uses Redis you may want to include a Redis metrics exporter for each instance of Redis you use. This is accomplished by passing in an array of strings. Each string takes the form "space/service", for example:
redis_services = [ "get_into_teaching/redis_service_one" , "get_into_teaching/redis_service_two" , ... ]
List of external endpoints which can be queried via /metrics. It can be used for apps deployed to Cloud Foundry or for any external service.
They must be accessible via https.
Pass a list of applications deployed to Cloud Foundry and prometheus will find each individual instance and scrape metrics from them. The format is:
["<app1_name>.<internal_domain>[:port]", "<app2_name>.<internal_domain>[:port]"]
If the port is not specified, the default Cloud Foundry port will be used (8080).
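As an illustration of the format above, the list might be passed to prometheus_all as follows. This is a sketch only: the argument name internal_apps and the app names are assumptions that are not confirmed here, so check the variables in prometheus_all for the actual name. apps.internal is the default Cloud Foundry internal domain.

```hcl
module "prometheus_all" {
  source = "git::https://github.com/DFE-Digital/cf-monitoring.git//prometheus_all"

  # ... other arguments as in the minimal configuration ...

  # Assumed argument name; hypothetical app names.
  internal_apps = [
    "my-api.apps.internal",         # no port given: default port 8080 is used
    "my-worker.apps.internal:9394", # explicit metrics port
  ]
}
```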
Internal routing must be configured so that prometheus can access them.
prometheus_all outputs both the prometheus app name and id to help create the network policy.
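A minimal sketch of such a network policy, using the cloudfoundry_network_policy resource from the Terraform cloudfoundry provider. Both variables are hypothetical: prometheus_app_id stands for the prometheus app id output by prometheus_all (the exact output name is not given here), and monitored_app_id for the id of the application to scrape.

```hcl
# Hypothetical inputs: wire these to the prometheus_all output and to your app.
variable "prometheus_app_id" {
  description = "Id of the prometheus app, as output by prometheus_all"
  type        = string
}

variable "monitored_app_id" {
  description = "Id of the Cloud Foundry app exposing /metrics on its internal route"
  type        = string
}

# Allow prometheus to reach the monitored app over container-to-container networking.
resource "cloudfoundry_network_policy" "prometheus_to_app" {
  policy {
    source_app      = var.prometheus_app_id
    destination_app = var.monitored_app_id
    port            = "8080" # must match the port used in the internal apps list
    protocol        = "tcp"
  }
}
```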
To allow useful aggregation and optimise time series storage, the applications should decorate the metrics with a label called app_instance representing the id of the Cloud Foundry app instance. It can be obtained at runtime from the CF_INSTANCE_INDEX environment variable.
For Ruby applications, yabeda is a powerful framework to expose custom metrics. Plugins such as yabeda-rails and yabeda-sidekiq provide a lot of metrics out of the box.
It is recommended to decorate the yabeda metrics as such:
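require 'json'

# VCAP_APPLICATION is only set when the app runs on Cloud Foundry
if ENV.key?('VCAP_APPLICATION')
  vcap_config = JSON.parse(ENV['VCAP_APPLICATION'])

  Yabeda.configure do
    # Tags added to every metric so they can be aggregated per app and instance
    default_tag :app, vcap_config['name']
    default_tag :app_instance, ENV['CF_INSTANCE_INDEX']
    default_tag :organisation, vcap_config['organization_name']
    default_tag :space, vcap_config['space_name']
  end
end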
A default configuration is provided but it doesn't send any notifications. You can configure alertmanager to publish to a Slack webhook or provide your own configuration.
Deploying apps that depend on pulling images from Dockerhub can fail with the error "You have reached your pull rate limit" if the pulls are not authenticated to Dockerhub.
Dockerhub credentials can be passed into the modules as follows:
docker_credentials = {
  username = var.dockerhub_username
  password = var.dockerhub_password
}