Autoscale deployments based on PubSub queue depth #484
If I understand this right, we should be able to do this without any code changes to …
tgeoghegan added a commit that referenced this issue on Mar 11, 2021
We got paged on 2021/3/10 because the `us-ut/intake-batch-apple` deployment wasn't keeping up with incoming tasks during a spike in traffic. I resolved that by manually scaling the deployment up to 8 workers, from 5. We observed another similar spike in incoming ingestion batches around 4 AM PST 2021/3/11, so I want to keep the 8 workers until we get around to autoscaling the workers (#484).
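For reference, this kind of manual scaling can be done with kubectl along the lines below. The namespace and deployment names are assumptions based on the `us-ut/intake-batch-apple` reference above, not confirmed resource names.

```sh
# Manually bump the intake-batch-apple deployment from 5 to 8 replicas.
# Namespace and deployment names are illustrative guesses, not confirmed values.
kubectl --namespace us-ut scale deployment intake-batch-apple --replicas=8

# Confirm the new replica count and that the extra pods come up.
kubectl --namespace us-ut get deployment intake-batch-apple
```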
tgeoghegan added a commit that referenced this issue on Mar 18, 2021
To spare ourselves the hassle of manually resizing deployments, we configure Kubernetes to automatically resize `intake-batch` and `aggregate` worker pools. This requires deploying the Stackdriver custom metrics adapter to make PubSub metrics visible to a Kubernetes Horizontal Pod Autoscaler, then configuring an HPA for each deployment that consults the PubSub `num_undelivered_messages` metric. See `terraform/README.md` for more details and discussion of config parameter choices. This commit also modifies the `integration-tester` deployment so that it emits more batches, forcing dev and staging clusters to exercise the autoscaling feature. We also amend the alert for `intake-batch` task queue size: we no longer expect that queue to periodically empty since we have configured Kubernetes to keep it at a steady state of ~150 messages. Resolves #484
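As a rough illustration of the HPA configuration described here, a manifest along the following lines ties a deployment to the PubSub `num_undelivered_messages` metric exposed through the Stackdriver custom metrics adapter. The deployment and subscription names, replica bounds, and target value are placeholders rather than the values actually used; see `terraform/README.md` for the real parameter choices.

```yaml
# Sketch of an HPA driven by PubSub queue depth via the Stackdriver
# custom metrics adapter. Names and numbers are illustrative only.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: intake-batch
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: intake-batch
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              # Placeholder subscription ID; the real one is set by Terraform.
              resource.labels.subscription_id: intake-batch-tasks
        target:
          # Aim for roughly 150 undelivered messages per replica. The ~150
          # figure comes from the commit message above; whether it applies
          # per replica or in aggregate is an assumption here.
          type: AverageValue
          averageValue: "150"
```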
tgeoghegan added a commit that referenced this issue on Mar 18, 2021
tgeoghegan added a commit that referenced this issue on Mar 22, 2021
tgeoghegan added a commit that referenced this issue on Apr 2, 2021
Reopening since I had to revert the associated PR.
We also need to figure out a solution that works on AWS with SQS, since we operate an instance on AWS, and to help out our colleagues at NCI. SQS exposes a rich set of metrics in CloudWatch, but we still have to figure out which metrics adapter is needed to expose them to a Kubernetes HPA.
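One candidate worth evaluating, though not something this issue settles on, is KEDA, which ships an SQS scaler and would avoid wiring CloudWatch metrics into an HPA by hand. A minimal sketch, assuming a KEDA installation and an illustrative queue URL, region, and deployment name:

```yaml
# Hypothetical KEDA ScaledObject scaling an intake-batch deployment on SQS
# queue depth. Queue URL, region, names, and thresholds are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: intake-batch-sqs
spec:
  scaleTargetRef:
    name: intake-batch
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-west-2.amazonaws.com/123456789012/intake-batch-tasks
        queueLength: "150"
        awsRegion: us-west-2
```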
Occasionally we get pages because we have not provisioned enough workers to keep up with incoming work. If it's just a matter of insufficient worker capacity, these are easy to resolve: just scale up the relevant deployment to more replicas. But we can make Kubernetes do this for us automatically if we can make it aware of PubSub subscription queue depth.
Kubernetes has a notion of a Horizontal Pod Autoscaler, which can add replicas to a deployment in response to observed metrics. Google Kubernetes Engine provides a custom metrics adapter that makes PubSub metrics from Stackdriver visible to the Kubernetes autoscaler.
We should deploy the custom metrics adapter, and then configure HPAs to automatically scale the `intake-batch-*` and `aggregate-*` worker deployment sizes.
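For the first step, GKE's documentation at the time pointed at deploying the Stackdriver custom metrics adapter by applying the upstream manifest, roughly as below. The exact manifest URL and any additional IAM setup (the adapter needs permission to read Stackdriver metrics) should be confirmed against current GKE docs rather than taken from this sketch.

```sh
# Deploy the Stackdriver custom metrics adapter (manifest path as documented
# upstream at the time; verify against current GKE docs before relying on it).
kubectl apply -f \
  https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
```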