Prometheus and grafana improvements based on load testing experience #501
Conversation
Build Failed 😱 Build Id: dbf35576-d5a4-46d2-8b28-65f1b16c51ed Build Logs
Force-pushed from af3f2c9 to 7f95a5d
Build Failed 😱 Build Id: c23aea82-bea8-4324-be28-87c74b38554a Build Logs
Build Succeeded 👏 Build Id: 8c8e2af7-6e86-4630-bb79-8550962f9e6e The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
Build Succeeded 👏 Build Id: 55c119a3-2784-4f15-9fcb-a2a5db15b9b3 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
Awesome! Should we update the docs? I really like the gcloud command, and we should also update the command to install Prometheus.
Should we also edit https://github.com/GoogleCloudPlatform/agones/blob/master/build/gke-test-cluster/cluster.yml.jinja to include the Prometheus+Grafana node pool by default? Or maybe as a conditional, defaulting to on?
How about we add both node pools and Makefile targets in a separate PR? This way we'll know that Agones works both with and without special node pools.
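For illustration only, a conditional node pool entry in `cluster.yml.jinja` could look roughly like the sketch below. The `metricsNodePool` property name, the machine type, and the default pool's node count are hypothetical; the real template may structure its node pools differently:

```
# Hypothetical excerpt from a Deployment Manager GKE template; the
# "metricsNodePool" property is an assumption, not the actual contents
# of build/gke-test-cluster/cluster.yml.jinja.
nodePools:
  - name: "default"
    initialNodeCount: 4
  {% if properties["metricsNodePool"] %}
  - name: "agones-metrics"
    initialNodeCount: 1
    config:
      machineType: n1-standard-4
      # Same label/taint convention as the rest of this PR, so the
      # Prometheus and Grafana pods prefer (and tolerate) this pool.
      labels:
        stable.agones.dev/agones-metrics: "true"
      taints:
        - key: stable.agones.dev/agones-metrics
          value: "true"
          effect: NO_EXECUTE
  {% endif %}
```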
Force-pushed from 7f95a5d to c3a5389
Build Failed 😱 Build Id: 304fc0fc-5059-4af8-9785-903a755257e1 Build Logs
At the very least, this should have documentation in the metrics page for best practices with Prometheus. I'd still be inclined to include the node pool in the Deployment Manager script (not for e2e), so we can test both. For personal testing I'm always going to default to wanting this node pool; I assume everyone else will too?
Force-pushed from c3a5389 to cc29603
Build Succeeded 👏 Build Id: d359f464-792d-425f-ae2c-c21ebca71b40 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
Force-pushed from f66938f to 0fdc72a
updated DM template and docs
Build Succeeded 👏 Build Id: 8f20f52a-144f-46ce-aa30-2d2809994767 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
Build Failed 😱 Build Id: 9d617a68-dac7-4456-99bb-2d4e4910c00a Build Logs
Force-pushed from 0fdc72a to 6b80915
Build Succeeded 👏 Build Id: f96a4d64-baa3-4c59-869e-5279751d6a8d The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
Build Failed 😱 Build Id: 8f4554b5-fdd1-4461-90c0-f5ebcf1b9242 Build Logs
@@ -119,6 +119,17 @@ helm install --wait --name prom stable/prometheus --namespace metrics \
or `make kind-setup-prometheus` and `make minikube-setup-prometheus` for {{< ghlink href="/build/README.md#running-a-test-kind-cluster" branch="master" >}}Kind{{< /ghlink >}}
and {{< ghlink href="/build/README.md#running-a-test-minikube-cluster" branch="master" >}}Minikube{{< /ghlink >}}.
For resiliency, it is recommended to run Prometheus on a dedicated node, separate from the nodes where Game Servers are scheduled. If you use `make setup-prometheus` to set up Prometheus, it will schedule Prometheus pods on nodes tainted with `stable.agones.dev/agones-metrics=true:NoExecute` and labeled with `stable.agones.dev/agones-metrics=true`, if available.
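For illustration, scheduling overrides of this kind could live in the values file passed to helm via `-f`. This is a sketch assuming the stable/prometheus chart exposes `server.tolerations` and `server.affinity`, and it covers only the Prometheus server, not Grafana; it is not the literal contents of `./build/prometheus.yaml` from this PR:

```
# Sketch only: tolerate the dedicated metrics node pool's taint, and
# prefer (but do not require) nodes labeled for metrics workloads.
server:
  tolerations:
    - key: stable.agones.dev/agones-metrics
      operator: Equal
      value: "true"
      effect: NoExecute
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: stable.agones.dev/agones-metrics
                operator: In
                values: ["true"]
```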
@kuqd remind me what's our standard here for stating how we want to do this? Point to the yaml in GitHub?
Not sure if this should be a `feature` shortcode here to hide this until 0.8.0?
Can you update line 113 with the new command using the yaml, and here just explain it the same way, with a link to your yaml config.
new command being (note the original comment set `pushgateway.enabled=false` twice; the duplicate flag is dropped here):

```
helm upgrade --install --wait prom stable/prometheus --namespace metrics \
  --set alertmanager.enabled=false,pushgateway.enabled=false \
  --set kubeStateMetrics.enabled=false,nodeExporter.enabled=false \
  --set server.global.scrape_interval=30s,server.persistentVolume.enabled=true,server.persistentVolume.size=64Gi \
  -f ./build/prometheus.yaml
```
Awesome. I think once this is there, this is good for approval 👍
done
Build Succeeded 👏 Build Id: 7a06a9ab-b6b3-4e78-8d38-f0ccdf705f5b The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
Force-pushed from 6b80915 to 344db01
Build Succeeded 👏 Build Id: 00d8e3a5-0111-4291-8b43-6dc3c7392fd1 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
- made prometheus PVC size configurable on the command line
- moved the majority of Prometheus config overrides to a separate yaml file
- removed scraping of container stats from the prometheus config; otherwise clusters of 10K pods very quickly consume tons of space
- added taints and tolerations to prometheus and grafana

They will now prefer (but not require) to be scheduled on nodes labeled with `stable.agones.dev/agones-metrics: true`. They will also tolerate the taint `stable.agones.dev/agones-metrics=true:NoExecute`. Creating a node pool dedicated to monitoring is as simple as:

```
gcloud container node-pools create agones-metrics ... \
  --node-taints stable.agones.dev/agones-metrics=true:NoExecute \
  --node-labels stable.agones.dev/agones-metrics=true \
  --num-nodes=1
```
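As a quick sanity check after creating the pool, one option (assuming Prometheus was installed into the `metrics` namespace, as in the helm command earlier in this thread) is:

```
# Confirm the dedicated node pool's nodes exist and carry the label.
kubectl get nodes -l stable.agones.dev/agones-metrics=true

# See which nodes the Prometheus and Grafana pods actually landed on.
kubectl get pods --namespace metrics -o wide
```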
Force-pushed from 344db01 to 5445101
Build Succeeded 👏 Build Id: f38a53dd-e879-4320-aa9c-ab84ece2baa3 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website will exist for the next 30 days. To install this version:
LGTM
> clusters of 10K pods are very quickly consuming tons of space
>
> They will now prefer (but not require) to be scheduled on nodes labeled with `stable.agones.dev/agones-metrics: true`. They will also tolerate the taint `stable.agones.dev/agones-metrics=true:NoExecute`. Creating a node pool dedicated to monitoring is as simple as: