This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

Autoscale deployments based on PubSub queue depth #484

Closed
tgeoghegan opened this issue Mar 11, 2021 · 3 comments · Fixed by #507 or #1042
@tgeoghegan
Contributor

Occasionally we get pages because we have not provisioned enough workers to keep up with incoming work. If it's just a matter of insufficient worker capacity, these are easy to resolve: just scale up the relevant deployment to more replicas. But we can make Kubernetes do this for us automatically if we can make it aware of PubSub subscription queue depth.

Kubernetes has a notion of a Horizontal Pod Autoscaler (HPA), which can add replicas to a deployment in response to observed metrics. Google Kubernetes Engine provides a custom metrics adapter that makes PubSub metrics from Stackdriver visible to the Kubernetes autoscaler.

We should deploy the custom metrics adapter, then configure HPAs to automatically scale the `intake-batch-*` and `aggregate-*` worker deployments.
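For illustration, a minimal sketch of what one of those HPAs might look like once the Stackdriver custom metrics adapter is installed. The deployment name, subscription ID, replica bounds, and target value below are placeholders, not a final configuration:

```yaml
# Sketch only: names and numbers are illustrative placeholders.
# Assumes the GKE custom-metrics-stackdriver-adapter is already deployed,
# which surfaces Stackdriver metrics as Kubernetes external metrics.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: intake-batch-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: intake-batch
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          # PubSub queue depth, as exposed by the Stackdriver adapter
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              # Scope the metric to this worker's task subscription
              resource.labels.subscription_id: intake-batch-tasks
        target:
          # Add replicas when the backlog per replica exceeds this value
          type: AverageValue
          averageValue: "100"
```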

@tgeoghegan
Contributor Author

If I understand this right, we should be able to do this without any code changes to facilitator or workflow-manager. The solution I sketched out is specific to GCP and PubSub, as it relies on the Google-provided metrics adapter. I'm sure something analogous is supported on AWS/EKS, but I haven't looked into it.

tgeoghegan added a commit that referenced this issue Mar 11, 2021
We got paged on 2021/3/10 because the `us-ut/intake-batch-apple`
deployment wasn't keeping up with the incoming tasks because of a spike
in traffic. I resolved that by manually scaling the deployment up to 8
workers, from 5. We observed another similar spike in incoming ingestion
batches around 4 AM PST 2021/3/11, so I want to keep the 8 workers until
we get around to autoscaling the workers (#484).
tgeoghegan added a commit that referenced this issue Mar 18, 2021
To spare ourselves the hassle of manually resizing deployments, we
configure Kubernetes to automatically resize `intake-batch` and
`aggregate` worker pools. This requires deploying the Stackdriver custom
metrics adapter to make PubSub metrics visible to a Kubernetes
Horizontal Pod Autoscaler, then configuring an HPA for each deployment
that consults the PubSub `num_undelivered_messages` metric. See
`terraform/README.md` for more details and discussion of config
parameter choices.

This commit also modifies the `integration-tester` deployment so that it
emits more batches, forcing dev and staging clusters to exercise the
autoscaling feature. We also amend the alert for `intake-batch` task
queue size: we no longer expect that queue to periodically empty since
we have configured Kubernetes to keep it at a steady state of ~150
messages.

Resolves #484
tgeoghegan added a commit that referenced this issue Mar 18, 2021
To spare ourselves the hassle of manually resizing deployments, we
configure Kubernetes to automatically resize `intake-batch` and
`aggregate` worker pools. This requires deploying the Stackdriver custom
metrics adapter to make PubSub metrics visible to a Kubernetes
Horizontal Pod Autoscaler, then configuring an HPA for each deployment
that consults the PubSub `num_undelivered_messages` metric. See
`terraform/README.md` for more details and discussion of config
parameter choices.

This commit also modifies the `integration-tester` deployment so that it
emits more batches, forcing dev and staging clusters to exercise the
autoscaling feature. We also amend the alert for `intake-batch` task
queue size: we no longer expect that queue to periodically empty since
we have configured Kubernetes to keep it at a steady state of ~150
messages.

Resolves #484
tgeoghegan added a commit that referenced this issue Mar 22, 2021
To spare ourselves the hassle of manually resizing deployments, we
configure Kubernetes to automatically resize `intake-batch` and
`aggregate` worker pools. This requires deploying the Stackdriver custom
metrics adapter to make PubSub metrics visible to a Kubernetes
Horizontal Pod Autoscaler, then configuring an HPA for each deployment
that consults the PubSub `num_undelivered_messages` metric. See
`terraform/README.md` for more details and discussion of config
parameter choices.

This commit also modifies the `integration-tester` deployment so that it
emits more batches, forcing dev and staging clusters to exercise the
autoscaling feature. We also amend the alert for `intake-batch` task
queue size: we no longer expect that queue to periodically empty since
we have configured Kubernetes to keep it at a steady state of ~150
messages.

Resolves #484
tgeoghegan added a commit that referenced this issue Apr 2, 2021
To spare ourselves the hassle of manually resizing deployments, we
configure Kubernetes to automatically resize `intake-batch` and
`aggregate` worker pools. This requires deploying the Stackdriver custom
metrics adapter to make PubSub metrics visible to a Kubernetes
Horizontal Pod Autoscaler, then configuring an HPA for each deployment
that consults the PubSub `num_undelivered_messages` metric. See
`terraform/README.md` for more details and discussion of config
parameter choices.

This commit also modifies the `integration-tester` deployment so that it
emits more batches, forcing dev and staging clusters to exercise the
autoscaling feature. We also amend the alert for `intake-batch` task
queue size: we no longer expect that queue to periodically empty since
we have configured Kubernetes to keep it at a steady state of ~150
messages.

Resolves #484
@tgeoghegan
Contributor Author

Reopening since I had to revert the associated PR.

@tgeoghegan tgeoghegan reopened this Apr 2, 2021
@tgeoghegan tgeoghegan added this to the Spring 2021 reliability milestone Apr 5, 2021
@tgeoghegan
Contributor Author

We also need to figure out a solution that works on AWS with SQS, since we operate an instance on AWS and want to help out our colleagues at NCI. SQS has a rich set of metrics in CloudWatch, but we still have to figure out which metrics adapter is needed to expose them to a Kubernetes HPA.
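One possibility (not evaluated here, just a sketch) is something like KEDA, whose `aws-sqs-queue` scaler polls the queue's approximate message count and drives an HPA under the hood. The queue URL, region, deployment name, and threshold below are placeholders, and the IAM/credentials setup is omitted:

```yaml
# Sketch only: one candidate approach using KEDA's aws-sqs-queue scaler.
# Queue URL, region, deployment name, and queueLength are placeholders;
# authentication (IAM role or TriggerAuthentication) is not shown.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: intake-batch-scaler
spec:
  scaleTargetRef:
    name: intake-batch            # deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-west-2.amazonaws.com/123456789012/intake-batch-tasks
        queueLength: "100"        # target messages per replica
        awsRegion: us-west-2
```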
