This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

Autoscale deployments based on PubSub queue depth #484

Closed
tgeoghegan opened this issue Mar 11, 2021 · 3 comments · Fixed by #507 or #1042
@tgeoghegan
Contributor

Occasionally we get pages because we have not provisioned enough workers to keep up with incoming work. If it's just a matter of insufficient worker capacity, these are easy to resolve: just scale up the relevant deployment to more replicas. But we can make Kubernetes do this for us automatically if we can make it aware of PubSub subscription queue depth.

Kubernetes has a notion of a Horizontal Pod Autoscaler (HPA), which can add replicas to a deployment in response to observed metrics. Google Kubernetes Engine provides a custom metrics adapter that makes PubSub metrics from Stackdriver visible to the Kubernetes autoscaler.

We should deploy the custom metrics adapter, then configure HPAs to automatically scale the `intake-batch-*` and `aggregate-*` worker deployments.
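For illustration, a minimal sketch of what one of those HPAs might look like once the Stackdriver custom metrics adapter is installed. The deployment name, subscription ID, replica bounds, and target value below are placeholders, not a final configuration:

```yaml
# Sketch only: names and numbers are illustrative placeholders.
# Assumes the GKE custom-metrics-stackdriver-adapter is already deployed,
# which surfaces Stackdriver metrics as Kubernetes external metrics.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: intake-batch-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: intake-batch
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          # PubSub queue depth, as exposed by the Stackdriver adapter
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              # Scope the metric to this worker's task subscription
              resource.labels.subscription_id: intake-batch-tasks
        target:
          # Add replicas when the backlog per replica exceeds this value
          type: AverageValue
          averageValue: "100"
```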

@tgeoghegan
Contributor Author

If I understand this right, we should be able to do this without any code changes to facilitator or workflow-manager. The solution I sketched out is specific to GCP and PubSub, as it relies on the Google-provided metrics adapter. I'm sure something analogous is supported on AWS/EKS, but I haven't looked into it.

tgeoghegan added a commit that referenced this issue Mar 11, 2021
We got paged on 2021/3/10 because the `us-ut/intake-batch-apple`
deployment wasn't keeping up with the incoming tasks because of a spike
in traffic. I resolved that by manually scaling the deployment up to 8
workers, from 5. We observed another similar spike in incoming ingestion
batches around 4 AM PST 2021/3/11, so I want to keep the 8 workers until
we get around to autoscaling the workers (#484).
tgeoghegan added a commit that referenced this issue Mar 18, 2021
To spare ourselves the hassle of manually resizing deployments, we
configure Kubernetes to automatically resize `intake-batch` and
`aggregate` worker pools. This requires deploying the Stackdriver custom
metrics adapter to make PubSub metrics visible to a Kubernetes
Horizontal Pod Autoscaler, then configuring an HPA for each deployment
that consults the PubSub `num_undelivered_messages` metric. See
`terraform/README.md` for more details and discussion of config
parameter choices.

This commit also modifies the `integration-tester` deployment so that it
emits more batches, forcing dev and staging clusters to exercise the
autoscaling feature. We also amend the alert for `intake-batch` task
queue size: we no longer expect that queue to periodically empty since
we have configured Kubernetes to keep it at a steady state of ~150
messages.

Resolves #484
tgeoghegan added a commit that referenced this issue Mar 18, 2021
To spare ourselves the hassle of manually resizing deployments, we
configure Kubernetes to automatically resize `intake-batch` and
`aggregate` worker pools. This requires deploying the Stackdriver custom
metrics adapter to make PubSub metrics visible to a Kubernetes
Horizontal Pod Autoscaler, then configuring an HPA for each deployment
that consults the PubSub `num_undelivered_messages` metric. See
`terraform/README.md` for more details and discussion of config
parameter choices.

This commit also modifies the `integration-tester` deployment so that it
emits more batches, forcing dev and staging clusters to exercise the
autoscaling feature. We also amend the alert for `intake-batch` task
queue size: we no longer expect that queue to periodically empty since
we have configured Kubernetes to keep it at a steady state of ~150
messages.

Resolves #484
tgeoghegan added a commit that referenced this issue Mar 22, 2021
To spare ourselves the hassle of manually resizing deployments, we
configure Kubernetes to automatically resize `intake-batch` and
`aggregate` worker pools. This requires deploying the Stackdriver custom
metrics adapter to make PubSub metrics visible to a Kubernetes
Horizontal Pod Autoscaler, then configuring an HPA for each deployment
that consults the PubSub `num_undelivered_messages` metric. See
`terraform/README.md` for more details and discussion of config
parameter choices.

This commit also modifies the `integration-tester` deployment so that it
emits more batches, forcing dev and staging clusters to exercise the
autoscaling feature. We also amend the alert for `intake-batch` task
queue size: we no longer expect that queue to periodically empty since
we have configured Kubernetes to keep it at a steady state of ~150
messages.

Resolves #484
tgeoghegan added a commit that referenced this issue Apr 2, 2021
To spare ourselves the hassle of manually resizing deployments, we
configure Kubernetes to automatically resize `intake-batch` and
`aggregate` worker pools. This requires deploying the Stackdriver custom
metrics adapter to make PubSub metrics visible to a Kubernetes
Horizontal Pod Autoscaler, then configuring an HPA for each deployment
that consults the PubSub `num_undelivered_messages` metric. See
`terraform/README.md` for more details and discussion of config
parameter choices.

This commit also modifies the `integration-tester` deployment so that it
emits more batches, forcing dev and staging clusters to exercise the
autoscaling feature. We also amend the alert for `intake-batch` task
queue size: we no longer expect that queue to periodically empty since
we have configured Kubernetes to keep it at a steady state of ~150
messages.

Resolves #484
@tgeoghegan
Contributor Author

Reopening since I had to revert the associated PR.

@tgeoghegan tgeoghegan reopened this Apr 2, 2021
@tgeoghegan tgeoghegan added this to the Spring 2021 reliability milestone Apr 5, 2021
@tgeoghegan
Contributor Author

We also need to figure out a solution that works on AWS with SQS, since we operate an instance on AWS and want to help out our colleagues at NCI. SQS has a rich set of metrics in CloudWatch, but we still have to figure out which metrics adapter is needed to expose them to a Kubernetes HPA.
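One possibility (not evaluated here, just a sketch) is something like KEDA, whose `aws-sqs-queue` scaler polls the queue's approximate message count and drives an HPA under the hood. The queue URL, region, deployment name, and threshold below are placeholders, and the IAM/credentials setup is omitted:

```yaml
# Sketch only: one candidate approach using KEDA's aws-sqs-queue scaler.
# Queue URL, region, deployment name, and queueLength are placeholders;
# authentication (IAM role or TriggerAuthentication) is not shown.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: intake-batch-scaler
spec:
  scaleTargetRef:
    name: intake-batch            # deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-west-2.amazonaws.com/123456789012/intake-batch-tasks
        queueLength: "100"        # target messages per replica
        awsRegion: us-west-2
```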
