This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

Add autoscaling for intake/aggregate tasks. #1042

Merged: branlwyd merged 5 commits into main from autoscaling on Oct 20, 2021


Conversation

@branlwyd (Member) commented Oct 18, 2021

Autoscaling is based on the queue depth: if there is a queue, we'll
scale up. The current policy is to stabilize over 5 minutes, then
add/remove one replica per minute.

Fixes #484.
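
As a rough illustration of that policy, not the PR's exact configuration: with the Terraform Kubernetes provider's kubernetes_horizontal_pod_autoscaler_v2 resource, a 5-minute stabilization window plus a one-pod-per-minute scaling rate could be expressed as below. The deployment name, metric name, replica bounds, and target value are placeholders.

resource "kubernetes_horizontal_pod_autoscaler_v2" "aggregate" {
  metadata {
    name = "aggregate" # placeholder name
  }

  spec {
    min_replicas = 1 # placeholder bounds
    max_replicas = 10

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "aggregate" # placeholder deployment name
    }

    # Scale on an external queue-depth metric exposed by the cloud-specific
    # metrics adapter; the metric name and target here are placeholders.
    metric {
      type = "External"
      external {
        metric {
          name = "queue_depth"
        }
        target {
          type          = "AverageValue"
          average_value = "1"
        }
      }
    }

    # Stabilize over 5 minutes, then add/remove at most one replica per minute.
    behavior {
      scale_up {
        stabilization_window_seconds = 300
        policy {
          type           = "Pods"
          value          = 1
          period_seconds = 60
        }
      }
      scale_down {
        stabilization_window_seconds = 300
        policy {
          type           = "Pods"
          value          = 1
          period_seconds = 60
        }
      }
    }
  }
}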

@codecov-commenter commented Oct 18, 2021

Codecov Report

Merging #1042 (ecdf560) into main (d8d61f4) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main    #1042   +/-   ##
=======================================
  Coverage   59.33%   59.33%           
=======================================
  Files          35       35           
  Lines        6596     6596           
=======================================
  Hits         3914     3914           
  Misses       2632     2632           
  Partials       50       50           
Flag                     Coverage Δ
deploy_operator_tests    76.78% <ø> (ø)
deploy_tool_tests        38.46% <ø> (ø)
facilitator_tests        64.09% <ø> (ø)
manifest_updater_tests    6.81% <ø> (ø)
task_replayer_tests      ∅ <ø> (∅)
workflow_manager_tests   38.07% <ø> (ø)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

This is required to use the "kubernetes_manifest" resource.
@tgeoghegan (Contributor) left a comment:

A few nits but I'm really happy with how simple the target rule is. I'll be curious to see how this contends with a production locality like us-ca-apple.

## AWS-specific resources
##
resource "kubernetes_cluster_role" "external_metrics_reader" {
  count = var.use_aws ? 1 : 0 # This cluster role already exists in GKE deployments.
tgeoghegan (Contributor):

Oof, gotta love these little impedance mismatches between EKS and GKE.

branlwyd (Member Author):

Yup. TBH I'm surprised the setup for extracting metrics from CloudWatch/Stackdriver is so ... manual, considering how (presumably) useful it would be to so many k8s deployments.
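
For reference, the cluster role in question grants read access to the external metrics API. A minimal sketch of what the AWS-only resource might look like follows; the metadata name and rule set shown here are assumptions, not the PR's exact contents.

resource "kubernetes_cluster_role" "external_metrics_reader" {
  count = var.use_aws ? 1 : 0 # This cluster role already exists in GKE deployments.

  metadata {
    name = "external-metrics-reader" # assumed name
  }

  # Assumed rule: allow reading any metric served by the external metrics API.
  rule {
    api_groups = ["external.metrics.k8s.io"]
    resources  = ["*"]
    verbs      = ["get", "list", "watch"]
  }
}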

}
selector = { app = "custom-metrics-adapter" }
}
}
tgeoghegan (Contributor):

nit: missing newline

branlwyd (Member Author):

Done.

Comment on lines 388 to 392
port {
  name        = "http"
  port        = 80
  target_port = 8080
}
tgeoghegan (Contributor):

I think it doesn't matter because we have no public internet access to any of this, but I'm curious: why does this metrics adapter have to listen on both HTTP and HTTPS? I forget how exactly this works, but I'm guessing the service exists at all so that the k8s API server can scrape metrics from the adapter, but I would have guessed it would use either always HTTPS or always HTTP.

branlwyd (Member Author):

I believe this page gives a description of the two ports. tl;dr is: the secure port (443) is used for "normal" serving, uses TLS, has all the needed authn/authz functionality, etc; the insecure port (80) is used for turnup, bypasses a lot of security, but only binds to localhost (ie loopback) by default.

Given that last bit, I suspect that we don't actually need to expose port 80, because the metrics server won't be listening over the network on that port anyway. I tried removing it from my dev-env deployment and things seem to work OK (at least, the metric adapter logs look fine, and the horizontal-pod-autoscalers can still get their metrics). So I dropped port 80 entirely.

[and to close the loop on why it was there in the first place: I copied it blindly from Amazon's suggested YAML config]
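
With port 80 dropped, the adapter's kubernetes_service is left with just the secure listener; roughly like the fragment below. The port name and numbers are assumptions rather than values from the PR.

port {
  name        = "https"
  port        = 443
  target_port = 443 # assumed; the adapter's actual secure serving port may differ
}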

@@ -558,6 +657,96 @@ resource "kubernetes_deployment" "aggregate" {
}
}
}

resource "kubernetes_manifest" "aggregate_queue_depth_metric" {
tgeoghegan (Contributor):

There's no GCP equivalent to this. Does the GCP metrics adapter automagically create the manifest?

@branlwyd (Member Author) commented Oct 19, 2021:

It does: specifically it creates metrics called something like pubsub.googleapis.com|subscription|num_undelivered_messages, and the specific pubsub subscription to examine is decided by using a selector, e.g. matching on resource.labels.subscription_id in this case. You can see an example in the config for the autoscalers. I guess GCP's method requires a little less plumbing to set up, while AWS' method potentially saves on scraping unused metrics?

It's sort of unfortunate that the setup is so different between the two different cloud providers, but I think this difference is baked into the respective metric scrapers that we're now running. I thought about creating a module for this to abstract away the details, but right now we have exactly one use case (implemented on two different deployments), so the proper general abstraction isn't clear to me. If we somehow end up needing to wire this up in many more places, we'd probably want an abstraction.
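
To make the GKE side concrete: inside an HPA spec like the one sketched earlier, the adapter-provided metric is referenced by name and narrowed with a label selector. A rough fragment follows; the subscription ID and target value are made up for illustration.

# Hypothetical metric block for a GKE HPA: the Stackdriver adapter exposes the
# PubSub backlog metric automatically, so the HPA only needs to select the
# subscription it cares about.
metric {
  type = "External"
  external {
    metric {
      name = "pubsub.googleapis.com|subscription|num_undelivered_messages"
      selector {
        match_labels = {
          "resource.labels.subscription_id" = "aggregate-tasks" # made-up subscription ID
        }
      }
    }
    target {
      type          = "AverageValue"
      average_value = "1"
    }
  }
}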

tgeoghegan (Contributor):

This reminds me of network configuration, which mostly Just Works in GKE, but EKS requires you to jump through a ton of hoops to construct subnets and security groups and so on. Anyway, I agree with your approach here.

* Stop exposing port 80 for the AWS metrics server. (It's not required.)
* Don't specify a replica count for HPA-controlled deployments. (It's not required.)
* Switch from scaling on the total number of available messages to scaling on the number of available messages per replica (see the sketch below).
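
For context on that last commit: the change is in the external metric's target type. A Value target compares the raw backlog against a fixed threshold, while an AverageValue target divides the metric by the current replica count first, so the threshold applies per replica. Roughly (threshold values are illustrative):

# Scale on total backlog:
target {
  type  = "Value"
  value = "1"
}

# Scale on backlog per replica (what the later commit switches to):
target {
  type          = "AverageValue"
  average_value = "1"
}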
@branlwyd (Member Author) commented:

BTW, I wanted to address the use of insecure_skip_tls_verify in the API service definitions.

This initially got pulled in because it was specified in both Google's & Apple's recommended configuration for their custom metric adapters.

The best resource I found addressing the use of insecure_skip_tls_verify was this GitHub issue, where one of the authors of the metrics server suggests that:

  1. This flag controls authentication of the metrics server to the API server.
  2. Removing this flag would require us running our own private CA, creating a cert for the metrics server, and then configuring both it & the API server to use the cert. (and configuring the API server may not be possible at all in "Kubernetes-as-a-Service" offerings such as EKS/GKE)
  3. The downside of specifying this flag is that an attacker in a position to act as a MITM in front of the metrics server could impersonate the metrics server, potentially serving incorrect metrics which could cause incorrect scaling decisions by the HPAs.

Given the relatively mild downside of leaving this flag in place, and the relatively high complexity of removing this flag, I decided leaving it in place was acceptable for an initial revision. But I do think it's worth raising for posterity & in case folks disagree.
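
For reference, the flag lives on the APIService registration that points the Kubernetes API server at the metrics adapter. A minimal sketch using the provider's kubernetes_api_service resource follows; the adapter's service name and namespace are assumed rather than taken from the PR.

resource "kubernetes_api_service" "external_metrics" {
  metadata {
    name = "v1beta1.external.metrics.k8s.io"
  }

  spec {
    group   = "external.metrics.k8s.io"
    version = "v1beta1"

    service {
      # Assumed name/namespace for the metrics adapter's service.
      name      = "custom-metrics-adapter"
      namespace = "custom-metrics"
    }

    # The API server will not verify the adapter's serving certificate; see the
    # trade-off discussed above.
    insecure_skip_tls_verify = true

    group_priority_minimum = 100
    version_priority       = 100
  }
}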

@tgeoghegan (Contributor) left a comment:

I appreciate the thorough research here! I agree with you on the question of insecure_skip_tls_verify. This seems safe to me since our application's deployment is on a purely private network.

@branlwyd branlwyd merged commit 42dd9da into main Oct 20, 2021
@branlwyd branlwyd deleted the autoscaling branch October 20, 2021 20:26

Successfully merging this pull request may close these issues.

Autoscale deployments based on PubSub queue depth (#484)
3 participants