feat(logger): rate limit based on peer address #1351
Conversation
cargo-shuttle/src/lib.rs (outdated)

Err(err) => {
    debug!(error = %err, "failed to parse logs");

    let message = if line.contains("rate limit") {
I'm not too happy about how I handle this case (string matching), but it should be rare (it will only happen when the application is being rate limited and the user tries to request logs, and it won't always fail). Suggestions on how to improve the error handling here are most welcome.
What about having the error be actual JSON as well? You could deserialize this as an enum (it's either a log line from the app or a control message), and match on that instead.
Thanks for the suggestion! I don't think we can make an enum that matches on either a log line or a control message, since that would break the logs stream for all users who don't upgrade. But just sending a JSON error works: new clients will be able to deserialize the error, and older clients will receive an error that suggests upgrading their CLI.
Here is the error they get in the rare case that the logger client is rate limited + their logs request got rate limited:
Error: 429 Too Many Requests
Message: your application is producing too many logs. Interactions with the shuttle logger service will be rate limited
Left some notes 😄
deployer/src/handlers/error.rs (outdated)

@@ -26,6 +26,8 @@ pub enum Error {
    Internal(#[from] anyhow::Error),
    #[error("Missing header: {0}")]
    MissingHeader(String),
    #[error("{0}. Retry the command in a few minutes")]
Why is rate limiting telling the user to retry a command?
Currently, this is returned if the deployer's logger client is being rate limited and the user calls the get_logs or get_logs_stream endpoints. Since this API is also consumed from the console, instructing the user to run a command is a bit misleading; I can rename it to `Retry the request...`.
        logs.into_inner()
            .log_items
            .into_iter()
            .map(|l| l.to_log_item_with_id(deployment_id))
            .collect(),
    ))
} else {
    Err(Error::NotFound("deployment not found".to_string()))
Is the not found case no longer relevant?
No, I don't think it ever was. The https://github.com/shuttle-hq/shuttle/blob/main/logger/src/lib.rs#L89-L100 call can fail in the following ways:
- It can't extract the claim from the extensions. This shouldn't happen, since we check the claim in the deployer's scoped layers. This would return an internal error.
- The claim is missing the logs scope. This shouldn't happen either, because we check it in the deployer's scoped layer. This would return permission denied. I can add a match for this in the deployer handler; even though it is currently impossible, future changes in the deployer may make it reachable.
- The db query fails. This would return an internal error.
Hm, after thinking about this some more: if the caller has access to this endpoint but the logger client doesn't have access to get_logs, I think it makes sense to keep this as is, returning an internal error to the user from the deployer.
logger/tests/integration_tests.rs (outdated)

let governor_config = GovernorConfigBuilder::default()
    .per_millisecond(500)
    .burst_size(6)
    .use_headers()
    .key_extractor(TonicPeerIpKeyExtractor)
    .finish()
    .unwrap();
In theory we could make changes to the main code (rate limiter) and not catch it in the test because of this duplication. Maybe there should be a helper function in the main code that returns the governor? Or something that adds it to the server?
Ah, yes, I tried to extract the service builder into a utility function, but I couldn't get it to compile because of the complicated types. I tried again now to extract just the config creation into a util, but that was also difficult due to some private types. I will add constants for these values; it's not the cleanest solution, but it should do the trick.
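The constants approach could look roughly like this (the constant names are hypothetical; the builder calls mirror the ones used elsewhere in this PR):

```rust
// Shared in the logger crate so the server and the integration tests
// construct the limiter from the same values.
pub const RATE_LIMIT_REPLENISH_MS: u64 = 500;
pub const RATE_LIMIT_BURST_SIZE: u32 = 6;

// At both call sites:
let governor_config = GovernorConfigBuilder::default()
    .per_millisecond(RATE_LIMIT_REPLENISH_MS)
    .burst_size(RATE_LIMIT_BURST_SIZE)
    .use_headers()
    .key_extractor(TonicPeerIpKeyExtractor)
    .finish()
    .unwrap();
```

If the limits ever change, the test then follows the main code automatically.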
LGTM
LGTM
Description of change
Add rate limiting by peer address to the logger server. This sets the sustained-load LPS limit for projects to 512, while allowing bursts of up to 256 * 6. The rate limiter regenerates a slot every 0.5s.
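As a rough illustration of the burst/replenish math, here is a simplified token-bucket model (a sketch only, not the governor crate's actual implementation) with a burst of 6 slots and one slot regenerated every 500 ms:

```rust
// Simplified token-bucket model: `max` slots can be consumed in a burst,
// and one slot is regenerated every `replenish_ms` milliseconds.
struct Bucket {
    slots: u32,
    max: u32,
    replenish_ms: u64,
    last_ms: u64,
}

impl Bucket {
    fn new(max: u32, replenish_ms: u64) -> Self {
        Bucket { slots: max, max, replenish_ms, last_ms: 0 }
    }

    // `now_ms` is a monotonically increasing timestamp in milliseconds.
    // Returns true if the request is allowed, false if it is rate limited.
    fn try_acquire(&mut self, now_ms: u64) -> bool {
        let refilled = ((now_ms - self.last_ms) / self.replenish_ms) as u32;
        if refilled > 0 {
            self.slots = (self.slots + refilled).min(self.max);
            self.last_ms = now_ms;
        }
        if self.slots > 0 {
            self.slots -= 1;
            true
        } else {
            false
        }
    }
}
```

With these settings, six requests from one peer pass immediately, the seventh is limited, and one slot comes back each half second.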
How has this been tested? (if applicable)
Tested by running projects locally that went over the limit. I also created an integration test to verify the limits work as expected, and the returned tonic status and headers are as expected when rate limiting is applied.