This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

Add autoscaling for intake/aggregate tasks. #1042

Merged: branlwyd merged 5 commits into main from autoscaling on Oct 20, 2021


Conversation

@branlwyd (Member) commented Oct 18, 2021

Autoscaling is based on the queue depth: if there is a queue, we'll
scale up. The current policy is to stabilize over 5 minutes, then
add/remove one replica per minute.

Fixes #484.
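
As a rough illustration of that policy, not the PR's exact configuration: with the Terraform Kubernetes provider's kubernetes_horizontal_pod_autoscaler_v2 resource, a 5-minute stabilization window plus a one-pod-per-minute scaling rate could be expressed as below. The deployment name, metric name, replica bounds, and target value are placeholders.

resource "kubernetes_horizontal_pod_autoscaler_v2" "aggregate" {
  metadata {
    name = "aggregate" # placeholder name
  }

  spec {
    min_replicas = 1 # placeholder bounds
    max_replicas = 10

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "aggregate" # placeholder deployment name
    }

    # Scale on an external queue-depth metric exposed by the cloud-specific
    # metrics adapter; the metric name and target here are placeholders.
    metric {
      type = "External"
      external {
        metric {
          name = "queue_depth"
        }
        target {
          type          = "AverageValue"
          average_value = "1"
        }
      }
    }

    # Stabilize over 5 minutes, then add/remove at most one replica per minute.
    behavior {
      scale_up {
        stabilization_window_seconds = 300
        policy {
          type           = "Pods"
          value          = 1
          period_seconds = 60
        }
      }
      scale_down {
        stabilization_window_seconds = 300
        policy {
          type           = "Pods"
          value          = 1
          period_seconds = 60
        }
      }
    }
  }
}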

@codecov-commenter commented Oct 18, 2021

Codecov Report

Merging #1042 (ecdf560) into main (d8d61f4) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main    #1042   +/-   ##
=======================================
  Coverage   59.33%   59.33%           
=======================================
  Files          35       35           
  Lines        6596     6596           
=======================================
  Hits         3914     3914           
  Misses       2632     2632           
  Partials       50       50           
Flag                     Coverage Δ
deploy_operator_tests    76.78% <ø> (ø)
deploy_tool_tests        38.46% <ø> (ø)
facilitator_tests        64.09% <ø> (ø)
manifest_updater_tests    6.81% <ø> (ø)
task_replayer_tests      ∅ <ø> (∅)
workflow_manager_tests   38.07% <ø> (ø)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

This is required to use the "kubernetes_manifest" resource.
@tgeoghegan (Contributor) left a comment:

A few nits but I'm really happy with how simple the target rule is. I'll be curious to see how this contends with a production locality like us-ca-apple.

## AWS-specific resources
##
resource "kubernetes_cluster_role" "external_metrics_reader" {
  count = var.use_aws ? 1 : 0 # This cluster role already exists in GKE deployments.
tgeoghegan (Contributor):

Oof, gotta love these little impedance mismatches between EKS and GKE.

branlwyd (Member Author):

Yup. TBH I'm surprised the setup for extracting metrics from CloudWatch/Stackdriver is so ... manual, considering how (presumably) useful it would be to so many k8s deployments.
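
For reference, the cluster role in question grants read access to the external metrics API. A minimal sketch of what the AWS-only resource might look like follows; the metadata name and rule set shown here are assumptions, not the PR's exact contents.

resource "kubernetes_cluster_role" "external_metrics_reader" {
  count = var.use_aws ? 1 : 0 # This cluster role already exists in GKE deployments.

  metadata {
    name = "external-metrics-reader" # assumed name
  }

  # Assumed rule: allow reading any metric served by the external metrics API.
  rule {
    api_groups = ["external.metrics.k8s.io"]
    resources  = ["*"]
    verbs      = ["get", "list", "watch"]
  }
}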

}
selector = { app = "custom-metrics-adapter" }
}
}
tgeoghegan (Contributor):

nit: missing newline

branlwyd (Member Author):

Done.

Comment on lines 388 to 392
port {
  name        = "http"
  port        = 80
  target_port = 8080
}
tgeoghegan (Contributor):

I think it doesn't matter because we have no public internet access to any of this, but I'm curious: why does this metrics adapter have to listen on both HTTP and HTTPS? I forget how exactly this works, but I'm guessing the service exists at all so that the k8s API server can scrape metrics from the adapter, but I would have guessed it would use either always HTTPS or always HTTP.

branlwyd (Member Author):

I believe this page gives a description of the two ports. tl;dr is: the secure port (443) is used for "normal" serving, uses TLS, has all the needed authn/authz functionality, etc; the insecure port (80) is used for turnup, bypasses a lot of security, but only binds to localhost (ie loopback) by default.

Given that last bit, I suspect that we don't actually need to expose port 80, because the metrics server won't be listening over the network on that port anyway. I tried removing it from my dev-env deployment and things seem to work OK (at least, the metric adapter logs look fine, and the horizontal-pod-autoscalers can still get their metrics). So I dropped port 80 entirely.

[and to close the loop on why it was there in the first place: I copied it blindly from Amazon's suggested YAML config]
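
With port 80 dropped, the adapter's kubernetes_service is left with just the secure listener; roughly like the fragment below. The port name and numbers are assumptions rather than values from the PR.

port {
  name        = "https"
  port        = 443
  target_port = 443 # assumed; the adapter's actual secure serving port may differ
}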

@@ -558,6 +657,96 @@ resource "kubernetes_deployment" "aggregate" {
}
}
}

resource "kubernetes_manifest" "aggregate_queue_depth_metric" {
tgeoghegan (Contributor):

There's no GCP equivalent to this. Does the GCP metrics adapter automagically create the manifest?

@branlwyd (Member Author) commented Oct 19, 2021:

It does: specifically it creates metrics called something like pubsub.googleapis.com|subscription|num_undelivered_messages, and the specific pubsub subscription to examine is decided by using a selector, e.g. matching on resource.labels.subscription_id in this case. You can see an example in the config for the autoscalers. I guess GCP's method requires a little less plumbing to set up, while AWS' method potentially saves on scraping unused metrics?

It's sort of unfortunate that the setup is so different between the two different cloud providers, but I think this difference is baked into the respective metric scrapers that we're now running. I thought about creating a module for this to abstract away the details, but right now we have exactly one use case (implemented on two different deployments), so the proper general abstraction isn't clear to me. If we somehow end up needing to wire this up in many more places, we'd probably want an abstraction.
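
To make the GKE side concrete: inside an HPA spec like the one sketched earlier, the adapter-provided metric is referenced by name and narrowed with a label selector. A rough fragment follows; the subscription ID and target value are made up for illustration.

# Hypothetical metric block for a GKE HPA: the Stackdriver adapter exposes the
# PubSub backlog metric automatically, so the HPA only needs to select the
# subscription it cares about.
metric {
  type = "External"
  external {
    metric {
      name = "pubsub.googleapis.com|subscription|num_undelivered_messages"
      selector {
        match_labels = {
          "resource.labels.subscription_id" = "aggregate-tasks" # made-up subscription ID
        }
      }
    }
    target {
      type          = "AverageValue"
      average_value = "1"
    }
  }
}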

tgeoghegan (Contributor):

This reminds me of network configuration, which mostly Just Works in GKE, but EKS requires you to jump through a ton of hoops to construct subnets and security groups and so on. Anyway, I agree with your approach here.

* Stop exposing port 80 for the AWS metrics server. (It's not required.)
* Don't specify a replica count for HPA-controlled deployments. (It's not required.)
* Switch from scaling on the total number of available messages to scaling on the number of available messages per replica (see the sketch below).
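
For context on that last commit: the change is in the external metric's target type. A Value target compares the raw backlog against a fixed threshold, while an AverageValue target divides the metric by the current replica count first, so the threshold applies per replica. Roughly (threshold values are illustrative):

# Scale on total backlog:
target {
  type  = "Value"
  value = "1"
}

# Scale on backlog per replica (what the later commit switches to):
target {
  type          = "AverageValue"
  average_value = "1"
}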
@branlwyd (Member Author) commented:

BTW, I wanted to address the use of insecure_skip_tls_verify in the API service definitions.

This initially got pulled in because it was specified in both Google's & Apple's recommended configuration for their custom metric adapters.

The best resource I found addressing the use of insecure_skip_tls_verify was this GitHub issue, where one of the authors of the metrics server suggests that:

  1. This flag controls authentication of the metrics server to the API server.
  2. Removing this flag would require us running our own private CA, creating a cert for the metrics server, and then configuring both it & the API server to use the cert. (and configuring the API server may not be possible at all in "Kubernetes-as-a-Service" offerings such as EKS/GKE)
  3. The downside of specifying this flag is that an attacker in a position to act as a MITM in front of the metrics server could impersonate the metrics server, potentially serving incorrect metrics which could cause incorrect scaling decisions by the HPAs.

Given the relatively mild downside of leaving this flag in place, and the relatively high complexity of removing this flag, I decided leaving it in place was acceptable for an initial revision. But I do think it's worth raising for posterity & in case folks disagree.
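
For reference, the flag lives on the APIService registration that points the Kubernetes API server at the metrics adapter. A minimal sketch using the provider's kubernetes_api_service resource follows; the adapter's service name and namespace are assumed rather than taken from the PR.

resource "kubernetes_api_service" "external_metrics" {
  metadata {
    name = "v1beta1.external.metrics.k8s.io"
  }

  spec {
    group   = "external.metrics.k8s.io"
    version = "v1beta1"

    service {
      # Assumed name/namespace for the metrics adapter's service.
      name      = "custom-metrics-adapter"
      namespace = "custom-metrics"
    }

    # The API server will not verify the adapter's serving certificate; see the
    # trade-off discussed above.
    insecure_skip_tls_verify = true

    group_priority_minimum = 100
    version_priority       = 100
  }
}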

@tgeoghegan (Contributor) left a comment:

I appreciate the thorough research here! I agree with you on the question of insecure_skip_tls_verify. This seems safe to me since our application's deployment is on a purely private network.

@branlwyd branlwyd merged commit 42dd9da into main Oct 20, 2021
@branlwyd branlwyd deleted the autoscaling branch October 20, 2021 20:26

Successfully merging this pull request may close these issues.

Autoscale deployments based on PubSub queue depth (#484)
3 participants