
Hitting rate limits because ssm calls in runners.ts are not staggered or spread out #1841

Closed
devshah1 opened this issue Mar 10, 2022 · 11 comments
Labels: bug, help wanted

Comments

@devshah1

devshah1 commented Mar 10, 2022

Issue
We ran into rate limits yesterday when the pool configuration scaled up the number of EC2 instances.

After digging into the code, I saw that the calls to the ssm putParameter method are not staggered or spread out in runners.ts.

Concurrent calls to this method trigger the rate limits, since Parameter Store only allows 40 requests per second by default.

How to reproduce?
If the pool adjusts more than 40 instances at a time, you'll see throttling errors from the AWS SDK.

Proposed resolution
Use waits to stagger calls to the ssm parameter store in the specified file.

Root cause file reference:
https://github.com/philips-labs/terraform-aws-github-runner/blob/develop/modules/runners/lambdas/runners/src/aws/runners.ts#L205
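
For illustration, the staggering could look roughly like the sketch below. This assumes the aws-sdk v2 client used by the lambdas; the sleep helper, the loop, and the parameter name are hypothetical and not the actual runners.ts code.

```typescript
import { SSM } from 'aws-sdk';

const ssm = new SSM();

// Hypothetical helper: resolve after `ms` milliseconds.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical shape of the data; the real runners.ts is structured differently.
interface RunnerInstance {
  id: string;
  config: string;
}

async function storeRunnerConfigs(instances: RunnerInstance[]): Promise<void> {
  for (const instance of instances) {
    // Create the runner config parameter for this instance ...
    await ssm
      .putParameter({
        Name: `/github-runner/config/${instance.id}`, // illustrative key, not the real one
        Value: instance.config,
        Type: 'SecureString',
      })
      .promise();
    // ... then pause briefly so the calls are spread out instead of bursting
    // past the Parameter Store requests-per-second limit.
    await sleep(25);
  }
}
```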

@devshah1 devshah1 changed the title from "Rate limiting because ssm calls in pool config are not staggered or spread out" to "Hitting rate limits because ssm calls in runners.ts are not staggered or spread out" Mar 10, 2022
@npalm npalm added the bug and help wanted labels Mar 10, 2022
@devsh4
Contributor

devsh4 commented Mar 12, 2022

@npalm I added a fix for this potential blocker we're facing; please review the PR when you get a chance.

@npalm
Member

npalm commented Mar 12, 2022

According to the AWS docs, the rate limit should be 100 per second for non-mutating API calls and 200 for mutating ones. Several other API calls are made before the parameters are written, and other lambdas can also consume from the same API rate-limit bucket.

Also, during the boot process the instances retrieve their config from SSM via API calls, which could potentially hit the same issue. When introducing a delay we should keep in mind that the parameters need to be created before the instances start retrieving them, which typically takes at least 25 seconds. Furthermore, the Lambda has a default timeout, so the total delay has to fit within that as well.

In PR #1843 a wait of 25 ms is introduced after each call, which limits putParameter to no more than 40 API calls per second. Sounds reasonable to me.
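
For a rough sense of the numbers (illustrative arithmetic only, not figures taken from the PR):

```typescript
const delayMs = 25;                                       // wait after each putParameter call
const maxPutsPerSecond = 1000 / delayMs;                  // = 40, below the limits discussed above
const examplePoolSize = 250;                              // example scale-up
const addedSeconds = (examplePoolSize * delayMs) / 1000;  // = 6.25 s of extra Lambda run time
```

So even for a large pool the extra run time should fit comfortably within a typical Lambda timeout, but it is worth checking against the configured value.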

@devshah1 can you test the PR?

@devsh4
Contributor

devsh4 commented Mar 12, 2022

@npalm As far as the AWS rate limit goes, I confirmed with the AWS team that our current limit is 100 req/s, and that is after increasing the throughput for our account.

Good point about the other calls that happen underneath. However, I monitored the behavior extensively for our workloads, and whenever the pool topped up more than 50 EC2 instances (this number changed to 100 once we increased the throughput), the SSM calls were throttled, which means the parameters weren't added to the store and the EC2 instances couldn't fetch them from SSM on boot. This led to a scenario where we had 2-3x the EC2 instances but only half of them registered successfully as GH runners.

I'll test out the changes in PR #1843 and get back with the results.

(Sorry for posting from 2 accounts, I opened the issue with my work account)

@devsh4
Contributor

devsh4 commented Mar 12, 2022

@npalm I'm weirdly getting the error below when testing my changes with the 25 ms delay. The EC2 instance throws an error when trying to delete the SSM parameter during boot, even though it is able to fetch the parameters successfully:

An error occurred (ParameterNotFound) when calling the DeleteParameter operation

I also checked that there is enough of a gap between when the parameter is created and when the EC2 instance boots (roughly 1 minute). Would you know why this could happen?

@devsh4
Contributor

devsh4 commented Mar 12, 2022

I figured out what was happening with the deletion error: start-runner.sh doesn't have a thorough check to ensure that the parameter was retrieved successfully. This is why deletion fails with the ParameterNotFound error: the current fix I have is not creating the parameter properly, so there is nothing to retrieve in the first place.

@devsh4
Contributor

devsh4 commented Mar 13, 2022

@npalm I updated the fix to be simpler, as wrapping the call in setTimeout wasn't working as intended.
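
To make that concrete, below is a hedged reconstruction of the two patterns (assuming the aws-sdk v2 client; the names and values are illustrative, and this is not the actual PR diff):

```typescript
import { SSM } from 'aws-sdk';

const ssm = new SSM();
const params = {
  Name: '/github-runner/config/i-0123456789abcdef0', // illustrative key
  Value: 'example-runner-config',
  Type: 'SecureString',
};

// Pattern that did not work as intended: nothing awaits the timer or the request,
// so the Lambda handler can return before putParameter ever runs, and any error
// inside the callback is silently dropped; the parameter may never show up in SSM.
function createParameterWithTimer(): void {
  setTimeout(() => {
    void ssm.putParameter(params).promise();
  }, 25);
}

// Simpler pattern: await the call itself, then await a short pause before the
// next call, so the parameter is guaranteed to exist before moving on.
async function createParameterWithPause(): Promise<void> {
  await ssm.putParameter(params).promise();
  await new Promise<void>((resolve) => setTimeout(resolve, 25));
}
```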

I tested this with a production-level workload where, each time, the pool adjust function starts from 0 and scales up to n. The fix works perfectly and registers the GitHub runners properly without hitting the rate limits.

Test scenarios with varying pool size:

  1. Pool size: 1 runner
  2. Pool size: 5 runners
  3. Pool size: 25 runners
  4. Pool size: 50 runners
  5. Pool size: 100 runners
  6. Pool size: 150 runners
  7. Pool size: 250 runners

@npalm
Member

npalm commented Apr 13, 2022

Closing the issue; feel free to re-open when needed.

@npalm npalm closed this as completed Apr 13, 2022
@devsh4
Contributor

devsh4 commented Aug 2, 2022

@npalm Can you please reopen this issue? I am working on reopening the original PR. Thanks!

@devsh4
Contributor

devsh4 commented Aug 2, 2022

@npalm Please review the fix when you have time, and check the related PR I had opened a couple of months ago for more context.

@npalm
Member

npalm commented Aug 2, 2022

@devsh4 can you re-open the PR, or recreate it? It is closed and needs to be rebased.

@devsh4
Contributor

devsh4 commented Aug 2, 2022

@npalm I have opened a new PR, #2305; check the latest comment.
