Setting up cluster monitoring
This is a set of instructions for setting up a basic Collectd -> Graphite -> Grafana monitoring system for a DCOS cluster.

Graphite is the center of the monitoring system. It gives Collectd a place to send data, and Grafana a place to pull it from.

The Docker image chosen for Graphite is located at nickstenning/graphite. This is a basic image consisting of Graphite, the Carbon backend, and an NGINX web server that provides data through a web interface.

The following ports need to be mapped:
- 80 - for the web interface and API
- 2003 - for receiving data from Collectd

You will want to map the following volumes as well:
- /var/lib/graphite/conf - for persistent configuration
- /var/lib/graphite/storage/whisper - for the actual data being collected

You can start a new instance in DCOS with the following JSON:
```json
{
  "id": "/graphite",
  "backoffFactor": 1.15,
  "backoffSeconds": 1,
  "container": {
    "portMappings": [
      {
        "containerPort": 80,
        "hostPort": 0,
        "labels": {
          "VIP_0": "/graphite2:80"
        },
        "protocol": "tcp",
        "servicePort": 10154
      },
      {
        "containerPort": 2003,
        "hostPort": 0,
        "labels": {
          "VIP_1": "/graphite2:2003"
        },
        "protocol": "tcp",
        "servicePort": 10155
      }
    ],
    "type": "DOCKER",
    "volumes": [
      {
        "containerPath": "/var/lib/graphite/conf",
        "hostPath": "<your/host-path/here>",
        "mode": "RW"
      },
      {
        "containerPath": "/var/lib/graphite/storage/whisper",
        "hostPath": "<your/host-path/here>",
        "mode": "RW"
      }
    ],
    "docker": {
      "image": "nickstenning/graphite",
      "forcePullImage": false,
      "privileged": false,
      "parameters": []
    }
  },
  "cpus": 4,
  "disk": 0,
  "instances": 1,
  "maxLaunchDelaySeconds": 3600,
  "mem": 2056,
  "gpus": 0,
  "networks": [
    {
      "mode": "container/bridge"
    }
  ],
  "requirePorts": false,
  "upgradeStrategy": {
    "maximumOverCapacity": 1,
    "minimumHealthCapacity": 1
  },
  "killSelection": "YOUNGEST_FIRST",
  "unreachableStrategy": {
    "inactiveAfterSeconds": 0,
    "expungeAfterSeconds": 0
  },
  "healthChecks": [],
  "fetch": [],
  "constraints": []
}
```
In DCOS, make sure to check "Enable load balanced service address" to get a service address for both ports. The CPU and memory required will vary with how many nodes you plan to monitor, but I found 4 CPUs and 2 GB of memory was plenty for 16 nodes.

You can now verify Graphite is up and running by going to ip_address:80/dashboard.
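Port 2003 is Carbon's plaintext listener, which accepts one metric per line in the form `path value timestamp`. As an extra sanity check that doesn't depend on Collectd, you can hand-craft such a line and push it over TCP. This is a small sketch; the helper names and the test metric path are our own, and the host is a placeholder for your load-balanced service address.

```python
import socket
import time

def carbon_line(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext protocol: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d" % (path, value, timestamp)

# A hand-crafted test metric; the path is arbitrary.
line = carbon_line("collectd.testhost.gauge-sanity", 42, 1500000000)
print(line)  # collectd.testhost.gauge-sanity 42 1500000000

def send_metric(graphite_host, line, port=2003):
    """Push a single plaintext metric to Carbon over TCP (newline-terminated)."""
    with socket.create_connection((graphite_host, port), timeout=5) as sock:
        sock.sendall((line + "\n").encode("ascii"))

# send_metric("<your-graphite-host>", line)  # uncomment with your real host
```

If the push works, the test path shows up in the dashboard's metrics tree a few seconds later.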
To adjust retention settings, open storage-schemas.conf, located wherever you set up your volume for /var/lib/graphite/conf, and add the following between the [carbon] and [default_1min_for_1day] blocks:
```
[collectd]
pattern = ^collectd.*
retentions = 10s:1h,1m:1d,10m:2w
```
This will retain 10-second data points for 1 hour, 1-minute data points for 1 day, and 10-minute data points for 2 weeks for any data coming from a Collectd source. After adjusting the conf file, restart the Graphite instance.
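Before committing to a retention string, it can help to estimate what each Whisper file will cost on disk. Whisper stores roughly 12 bytes per data point (a 4-byte timestamp plus an 8-byte value) plus small archive headers, so the sizes follow directly from the retention string. The parser below is our own sketch of that arithmetic, not part of Graphite.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def parse_duration(token):
    """'10s' -> 10 seconds, '2w' -> 1209600 seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]

def retention_points(retentions):
    """Data points stored per archive for a storage-schemas retention string."""
    points = []
    for archive in retentions.split(","):
        precision, duration = archive.split(":")
        points.append(parse_duration(duration) // parse_duration(precision))
    return points

points = retention_points("10s:1h,1m:1d,10m:2w")
print(points)            # [360, 1440, 2016]
print(sum(points) * 12)  # ~45792 bytes of point data per metric file
```

At roughly 45 KB per metric, even a few thousand Collectd metrics stay comfortably small with this schema.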
Collectd collects data based on which plugins you choose to enable. There are plugins for everything you would ever want to monitor, and even more for things you wouldn't. In this case, we are only going to monitor CPU, memory, and disk.

Here is our basic collectd.conf:
```
FQDNLookup false
Interval 10
Timeout 2
ReadThreads 5

LoadPlugin cpu
LoadPlugin disk
LoadPlugin memory
LoadPlugin write_graphite

<Plugin disk>
  Disk "/^[hs]d[a-f][0-9]?$/"
  IgnoreSelected false
</Plugin>

<Plugin "write_graphite">
  <Node "endpoint">
    Host "${EP_HOST}"
    Port "${EP_PORT}"
    Protocol "tcp"
    LogSendErrors true
    EscapeCharacter "_"
    Prefix "${PREFIX}"
  </Node>
</Plugin>
```
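For reference when browsing metrics later: write_graphite builds each path from the Prefix, the escaped hostname, and the plugin/type parts, and EscapeCharacter "_" replaces the dots in a fully qualified hostname so they don't create extra path segments. The helper below is our own approximation of that naming scheme, not collectd code.

```python
def graphite_path(host, plugin_instance, type_instance,
                  prefix="collectd.", escape="_"):
    """Approximate the metric path write_graphite emits for one value."""
    escaped_host = host.replace(".", escape)  # EscapeCharacter at work
    return prefix + escaped_host + "." + plugin_instance + "." + type_instance

# With FQDNLookup false collectd uses the short hostname, but an FQDN
# would be escaped like this:
print(graphite_path("node1.example.com", "disk-sda1", "disk_ops"))
# collectd.node1_example_com.disk-sda1.disk_ops
```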
A simple installation script for CentOS 7 follows:
```bash
#!/bin/bash
EP_HOST=<the host given by DCOS>
EP_PORT=2003
PREFIX="collectd."

yum -y install epel-release
yum -y install collectd collectd-utils

sed -e "s/\${EP_HOST}/$EP_HOST/" -e "s/\${EP_PORT}/$EP_PORT/" -e "s/\${PREFIX}/$PREFIX/" collectd.conf > /etc/collectd.conf

systemctl enable collectd
systemctl start collectd
```
This assumes the script file and collectd.conf are in the same directory. Simply run the script on any node you wish to monitor and it will start reporting data to Graphite right away.

You can confirm data is coming in by going back to ip_address:80/dashboard; if the collectd. prefix shows up in the metrics tree, it is working.
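If you'd rather check from a script than from the dashboard, Graphite exposes an HTTP API: /metrics/find returns the children of a metric path as JSON, so each monitored node should appear under the collectd. prefix. A standard-library sketch; the host argument is a placeholder for your load-balanced address.

```python
import json
import urllib.request

def find_url(graphite_host, query="collectd.*"):
    """Build a /metrics/find query URL against a Graphite web host."""
    return "http://%s/metrics/find?query=%s" % (graphite_host, query)

def find_metrics(graphite_host, query="collectd.*"):
    """Return the metric names matching a query, e.g. one per monitored node."""
    with urllib.request.urlopen(find_url(graphite_host, query), timeout=10) as resp:
        return [node["text"] for node in json.load(resp)]

# print(find_metrics("<your-graphite-host>"))  # uncomment with your real host
```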
Grafana Docker image info is available at https://hub.docker.com/r/grafana/grafana/.

You'll want to expose port 3000 for web access. For DCOS, here is an example configuration:
```json
{
  "id": "/grafana",
  "backoffFactor": 1.15,
  "backoffSeconds": 1,
  "container": {
    "portMappings": [
      {
        "containerPort": 3000,
        "hostPort": 3000,
        "labels": {
          "VIP_0": "/grafana:3000"
        },
        "protocol": "tcp",
        "servicePort": 10153
      }
    ],
    "type": "DOCKER",
    "volumes": [
      {
        "containerPath": "/var/lib/grafana",
        "hostPath": "hostpath",
        "mode": "RW"
      }
    ],
    "docker": {
      "image": "grafana/grafana",
      "forcePullImage": false,
      "privileged": false,
      "parameters": []
    }
  },
  "cpus": 2,
  "disk": 0,
  "instances": 1,
  "maxLaunchDelaySeconds": 3600,
  "mem": 512,
  "gpus": 0,
  "networks": [
    {
      "mode": "container/bridge"
    }
  ],
  "requirePorts": false,
  "upgradeStrategy": {
    "maximumOverCapacity": 1,
    "minimumHealthCapacity": 1
  },
  "killSelection": "YOUNGEST_FIRST",
  "unreachableStrategy": {
    "inactiveAfterSeconds": 0,
    "expungeAfterSeconds": 0
  },
  "healthChecks": [],
  "fetch": [],
  "constraints": []
}
```
Once it's up and running, you can access the web UI via ip:3000 and log in with admin/admin.

Next we need to set up a data source. Under Configuration -> Data Sources, create a new data source. Name it whatever you want, but set the type to Graphite. For the URL, give it the load-balanced URL from DCOS. Keep Access set to proxy. Hit Save & Test; if everything works, you will see a green "Data source is working" box.
From here you should create a new dashboard. Click the + symbol on the left and select Dashboard. Click Graph and a new empty graph will appear. Click "Panel Title" and select Edit, then set the data source to the Graphite data source we just set up.

Now you can start drilling down into which metric you want displayed. For example, to show all disk write ops of a node, you would use a series like `collectd.<node>.disk-<device>.disk_ops.write`, and then add a sum function to add all the data points together.
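That sum function corresponds to Graphite's sumSeries(): it adds the matched series together point by point, so the per-device write ops become one total per timestamp. A minimal sketch of that point-wise addition, with made-up sample data (our own helper, not Grafana code):

```python
def sum_series(series_list):
    """Point-wise sum of equally sized series, as Graphite's sumSeries() does."""
    return [sum(points) for points in zip(*series_list)]

sda_writes = [5, 7, 6]  # hypothetical per-interval write ops for disk-sda
sdb_writes = [2, 1, 4]  # ...and for disk-sdb
print(sum_series([sda_writes, sdb_writes]))  # [7, 8, 10]
```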