Add delays to network retries. #11881
Conversation
r? @epage (rustbot has picked a reviewer for you, use r? to override)
```rust
} else {
    let min_timeout = Duration::new(1, 0);
    let timeout = self.set.multi.get_timeout()?.unwrap_or(min_timeout);
    let timeout = timeout.min(min_timeout);
```
Could this end up waiting longer than `delay`? Do we care?
It is possible to wait here past the deadline for retrying a sleeping request. However, this is at most 1 second, so it shouldn't matter too much, as the exact time something is retried isn't that important. This should usually return within much less than a second (usually 200ms), and I believe even a delay that long only happens when there is no network traffic.
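To illustrate the bound being discussed, here is a minimal sketch (not cargo's actual download loop; the function name is hypothetical) of how capping the poll timeout at 1 second limits how late a sleeping retry can fire:

```rust
use std::time::Duration;

// Whatever timeout curl suggests, cap the poll at 1 second, so a request
// sleeping in the retry queue is noticed at most 1 second after its deadline.
fn bounded_wait(curl_suggested: Option<Duration>) -> Duration {
    let cap = Duration::new(1, 0);
    curl_suggested.unwrap_or(cap).min(cap)
}

fn main() {
    assert_eq!(bounded_wait(Some(Duration::from_millis(200))), Duration::from_millis(200));
    assert_eq!(bounded_wait(Some(Duration::from_secs(30))), Duration::from_secs(1));
    assert_eq!(bounded_wait(None), Duration::from_secs(1));
}
```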
```rust
use std::time::{Duration, Instant};

pub struct SleepTracker<T> {
    heap: BinaryHeap<Sleeper<T>>,
```
Does this need to be a min heap?
I believe this is correct because the custom `PartialOrd` implementation reverses the comparison of the timestamps. Another option is to use `Reverse`, but since it already needed a custom impl (to pick just one field), I figured it wasn't necessary.
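For reference, a minimal standalone sketch of the trick under discussion (simplified from the PR; the real `Sleeper<T>` also carries a data payload): `BinaryHeap` is a max-heap, so reversing the comparison in the ordering impl makes the earliest wakeup time surface first.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

struct Sleeper {
    wakeup: Instant,
}

impl PartialEq for Sleeper {
    fn eq(&self, other: &Self) -> bool {
        self.wakeup == other.wakeup
    }
}
impl Eq for Sleeper {}

impl PartialOrd for Sleeper {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for Sleeper {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed on purpose: compare `other` to `self` so the heap's
        // "largest" element is the one with the smallest wakeup time.
        other.wakeup.cmp(&self.wakeup)
    }
}

fn main() {
    let now = Instant::now();
    let mut heap = BinaryHeap::new();
    heap.push(Sleeper { wakeup: now + Duration::from_secs(5) });
    heap.push(Sleeper { wakeup: now + Duration::from_secs(1) });
    // The entry that should wake up first is at the top of the heap.
    assert_eq!(heap.peek().unwrap().wakeup, now + Duration::from_secs(1));
}
```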
That all seems good. Should we have a comment on the `BinaryHeap` and the `PartialOrd` pointing out this important detail?

Also, did we want to add some limit on the total number of retries system-wide? Like, if 20 requests need a retry, then the registry is probably down/overloaded and retrying is just making it harder for the registry to come back up.
Can you say more about how you would envision that working? I'm not sure how that would work beyond what is implemented here. If it is currently being overwhelmed, cargo will back off and smear the requests over some period of time. If there is some hard limit for proactively giving up, I think that could be difficult to tune. Cargo inherently sends hundreds of requests all at once. In a situation where the server is having a temporary upset, it may very well recover within the existing retry schedule, even with hundreds of requests.

The current schedule is geared around the behaviors that I'm familiar with. Most recoverable transient errors can recover within a very short period of time. The second cliff is the set of errors that are recoverable within a moderately short period of time. Beyond that, most recovery won't happen for minutes/hours/days, which is longer than I think Cargo should bother waiting. One of the hypothetical scenarios this is intended for is a temporary spike that triggers a rate limit. After waiting for Cloudfront's 10s negative TTL, things should be cleared up if there isn't a sustained barrage of traffic.

There could be a slower resume, with some intelligence of "ok, requests are succeeding again, let them go through faster" vs "requests are still failing, let's slow down even more". This would be similar to a slow-start algorithm, but only engaging when errors are encountered. However, I'm concerned about the complexity of implementing such a solution.
All I had in mind was a global
That is a problem for the "simple" thing I had in mind. It's entirely possible that cargo has started 200 requests, all of which will fail and all of which will succeed on their retry. :-/ Perhaps every time we make a "first request" for a foreign asset we increase the
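A rough sketch of the counter idea being floated here might look like the following (entirely hypothetical; none of these names or behaviors are in the PR): each fresh request earns a fraction of a retry, and retries are only attempted while the accumulated budget allows.

```rust
// Hypothetical global retry budget: retries are limited to a fraction of the
// number of "first" requests, so a total outage can't multiply traffic.
struct RetryBudget {
    budget: f64,
    earn_per_request: f64, // e.g. 0.1 => retries capped near 10% of base traffic
}

impl RetryBudget {
    fn new(earn_per_request: f64) -> Self {
        RetryBudget { budget: 0.0, earn_per_request }
    }

    /// Call when a brand-new (non-retry) request is issued.
    fn on_first_request(&mut self) {
        self.budget += self.earn_per_request;
    }

    /// Returns true if a retry may proceed, consuming one unit of budget.
    fn try_spend_retry(&mut self) -> bool {
        if self.budget >= 1.0 {
            self.budget -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = RetryBudget::new(0.1);
    for _ in 0..200 {
        budget.on_first_request();
    }
    let allowed = (0..200).filter(|_| budget.try_spend_retry()).count();
    println!("retries allowed: {allowed}"); // ~20, i.e. about 10% of base traffic
}
```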
Crates.io is set up with infrastructure that tends to either be up or down, not infrastructure that might be running at degraded service. Not all registries will work that way. For example, a private registry might need to do a database call for every request that comes in.

One common pattern that happens with systems is that they are running entirely smoothly close to capacity. Then some small hiccup happens, say a small number of requests take longer to process than they should. Connections start timing out before a response is ready. By the time the server is ready to start processing the next request, it's even further into the connection's timeout. This isn't entirely tragic; there is excess capacity that the server can use to catch up. It was running at, say, 90% of theoretical capacity, but thanks to the backlog is now running flat out and only getting 99% of requests in on time.

The 1% of clients start receiving connection timeout errors and retry. The server is now receiving 1% more traffic, and it was already effectively overloaded. So now 2% of requests get timeouts. 2% of clients retry. So the server is even more overloaded and 4% retry. This pattern continues until no actual work is getting done. As the queue builds up, the requests the server is picking up have been waiting longer and are closer to timing out. The recurrence relation is something like the share of failing requests doubling each round.

There is also a problem with nested retries. If cargo will retry all of its requests three times, but cargo is run in a CI environment that will retry three times, then there is nine times as much traffic hitting the server when it's down than when it's running smoothly. Lots of systems would consider 9x normal traffic a DDoS-level event. If cargo limits the total number of retries to 10% of base, then even with a CI retry there is only a 3.3x increase (three CI runs at 1.1x traffic each).
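A toy model of that feedback loop, using the 1% → 2% → 4% numbers from the comment above (purely illustrative, not a claim about any real registry's behavior):

```rust
fn main() {
    let mut failing = 0.01; // 1% of requests time out after the initial hiccup
    for round in 1..=8 {
        println!("round {round}: {:.0}% of requests failing", failing * 100.0);
        // Every timed-out request is retried on top of normal traffic, pushing
        // an already-saturated server further behind, so in this toy model the
        // failing share roughly doubles each round.
        failing = (failing * 2.0_f64).min(1.0);
    }
}
```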
This got me wondering: when we have this 503 issue, is Cloudfront returning 503 for all endpoints or only specific packages? If most packages are still cached correctly and working, but a handful are being retried, then even my simple suggestion should work well. I just generated a bunch of words, but I don't know what we should do next. Sorry. Did anything I say make any sense?
Based on my reading of the documentation, I think it is not a simple answer. I believe the rate limit on S3 is per prefix. The prefix is everything up to the last slash. But the docs also say that the partitioning is adjusted dynamically based on the traffic patterns. So if everything is partitioned completely, then I think the rate limit would be based on the first 4 letters of the crate name (that is, the two prefix directories in the index path). I'm pretty sure Cloudfront's cache will be per endpoint. But these are just guesses based on the available docs.
Let's not delay this desperately needed PR for the "cap on retries" functionality. We can move it to an issue and come back to it later.
@epage I was wondering if you've had a chance to look at this? I think this is a relatively high priority change that I would like to see get into beta to help mitigate some risk with the registry change. |
Sorry, I saw @Eh2406 was looking at this and was leaving it to him.
I have some nits, but I'm ok merging as-is.
This is intended to help grow with more stuff.
This allows tests to generate dynamic URLs for custom responders.
@bors r+
☀️ Test successful - checks-actions
Update cargo

9 commits in 145219a9f089f8b57c09f40525374fbade1e34ae..0e474cfd7b16b018cf46e95da3f6a5b2f1f6a9e7
2023-03-27 01:56:36 +0000 to 2023-03-31 23:15:58 +0000

- Add delays to network retries. (rust-lang/cargo#11881)
- Add a note to `cargo logout` that it does not revoke the token. (rust-lang/cargo#11919)
- Sync external-tools JSON docs. (rust-lang/cargo#11918)
- Drop derive feature from serde in cargo-platform (rust-lang/cargo#11915)
- Disable test_profile test on windows-gnu (rust-lang/cargo#11916)
- src/doc/src/reference/build-scripts.md: a{n =>} benchmark target (rust-lang/cargo#11908)
- Documented working directory behaviour for `cargo test`, `cargo bench` and `cargo run` (rust-lang/cargo#11901)
- docs(contrib): Link to office hours doc (rust-lang/cargo#11903)
- chore: Upgrade to clap v4.2 (rust-lang/cargo#11904)
This adds a delay to network retries to help guard against short-lived transient errors.
The overview of how this works is: Requests that fail are placed into a queue with a timestamp of when they should be retried. The download code then checks for any requests that need to be retried, and re-injects them back into curl.
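As a rough sketch (hypothetical types and names; not cargo's actual downloader), the flow described above amounts to parking failed transfers with a retry deadline and periodically draining the ones that are due so they can be handed back to curl:

```rust
use std::time::{Duration, Instant};

struct ParkedRequest {
    url: String, // stand-in for whatever identifies the transfer
    wakeup: Instant,
}

#[derive(Default)]
struct RetryQueue {
    parked: Vec<ParkedRequest>,
}

impl RetryQueue {
    /// Park a failed request until its retry delay has elapsed.
    fn park(&mut self, url: String, delay: Duration) {
        self.parked.push(ParkedRequest { url, wakeup: Instant::now() + delay });
    }

    /// Remove and return every request whose delay has elapsed; the caller
    /// re-injects these into the curl multi handle.
    fn take_due(&mut self) -> Vec<ParkedRequest> {
        let now = Instant::now();
        let (due, waiting): (Vec<_>, Vec<_>) =
            self.parked.drain(..).partition(|p| p.wakeup <= now);
        self.parked = waiting;
        due
    }
}
```

The PR's actual `SleepTracker` keeps these entries in a `BinaryHeap` ordered by wakeup time rather than a flat `Vec`.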
Care must be taken in the download loops to track which requests are currently in flight as well as which requests are waiting to be retried.
This also adds jitter to the retry times so that multiple failed requests don't all flood the server when they are retried. This is a very primitive form of jitter which suffers from clumping, but I think it should be fine.
The max number of retries is raised to 3 in order to bring the total retry time to around 10 seconds. This is intended to address Cloudfront's default negative TTL of 10 seconds. The retry schedule looks like 0.5 seconds plus up to 1 second of random jitter, then 3.5 seconds, then 6.5 seconds, for a total of 10.5 to 11.5 seconds. If the user configures additional retries, each attempt afterwards has a max delay of 10 seconds.
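Reconstructed from that description (not copied from the PR; the exact constants and the use of the `rand` crate here are assumptions for illustration), the schedule is roughly:

```rust
use rand::Rng; // assumed dependency, used only to illustrate the jitter
use std::time::Duration;

fn retry_delay(attempt: u32) -> Duration {
    match attempt {
        // First retry: 0.5s plus up to 1s of random jitter, so a burst of
        // simultaneous failures doesn't come back in lockstep.
        1 => Duration::from_millis(500 + rand::thread_rng().gen_range(0..1_000u64)),
        2 => Duration::from_millis(3_500),
        3 => Duration::from_millis(6_500),
        // Any additional user-configured retries are capped at 10 seconds.
        _ => Duration::from_secs(10),
    }
}
```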
The retry times are currently not user-configurable. This could potentially be added in the future, but I don't think is particularly necessary for now.
There is an unfortunate amount of duplication between the crate download and HTTP index code. I think in the near future we should work on consolidating that code. That might be challenging, since their use patterns are different, but I think it is feasible.