-
Notifications
You must be signed in to change notification settings - Fork 628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hitting rate limits because ssm calls in runners.ts are not staggered or spread out #1841
Comments
@npalm I added a fix for this potential blocker we're facing, please review the PR when you get a chance. |
According this AWS docs the rate limit should be 100 per second for non mutating and 200 for mutation API calls. Several other API calles are done before parameters are writeen, and as well other lamba's can consume from the same bucket for API rate limits. Also during the boot process the instances is retrieving via API calls config via SSM. Those could potential have the same issue. By introducing a delay we should keep in mind that parameters should be created before the instances starts retrieving them. Typically at leas 25 seconds. Furthermore, the Lambda have default time out. So that should match as well. in PR #1843 a wait of 25 ms is introduced after each call. Which limits the put parameter to call no more than 40 times the API in a second. Sound to me reasonable. @devshah1 can you test the PR? |
@npalm As far as the rate limit for AWS goes I confirmed with the AWS team that the current rate limit was 100 req/s and this is after increasing the throughput for our account. Good point about the other calls that happen underneath however I monitored the behavior extensively for our workloads and whenever the pool topped up > 50 EC2 instances (this number changed to 100 once we increased the throughput), the SSM calls met with throttling which means the parameters weren't added to the store and thus EC2 instances couldn't fetch the parameters from SSM on boot. This led to a scenario where we had 2-3x EC2 instances but only half of them got registered successfully as GH runners. I'll test out the changes in PR #1843 and get back with the results. (Sorry for posting from 2 accounts, I opened the issue with my work account) |
@npalm Weirdly getting the below error when testing my changes with 25ms delay. The ec2 instance is throwing an error when trying to delete the ssm parameter during boot, however it is able to fetch the parameters successfully -
I also checked that there is enough gap between when the parameter is created and when the ec2 instance boots (about ~1 minute). Would you know why this could happen? |
I figured out what was happening with the deletion error, the |
@npalm I updated the fix to be simpler as passing the call in I tested this with production level workload where each time the pool adjust function starts from 0 & goes up till Test scenarios with varying pool size:
|
Closing the issue, feel free to re-open when needed |
@npalm Can you please reopen this issue? I am working on reopening the original PR. Thanks! |
@npalm Please review the fix when you have time and check the related PR I had opened a couple months ago for more context. |
@devsh4 can you re-open the PR, or recreate. It is closed and needs to re-based |
@npalm I have opened a new PR #2305, check the latest comment. |
Issue
We ran into rate limits yesterday when the pool configuration adjusts the size of EC2 instances by scaling up.
After digging into the code, I saw that the calls to the ssm
putParameter
method are not staggered or spread out inrunners.ts
.Concurrent calls to this method triggers the rate limits since the parameter store only allows 40 requests per second by default.
How to reproduce?
If the pool adjusts more than 40 instances at a time, you'll see throttling errors from the AWS SDK.
Proposed resolution
Use waits to stagger calls to the ssm parameter store in the specified file.
Root cause file reference:
https://github.com/philips-labs/terraform-aws-github-runner/blob/develop/modules/runners/lambdas/runners/src/aws/runners.ts#L205
The text was updated successfully, but these errors were encountered: