linkerd-proxy crashes with "supplied instant is later than self" (AWS EC2/EKS) #7748
Comments
Possibly related to rust-lang/rust#86470 |
When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::Instant::{duration_since, elapsed, sub}` via clippy. Uses are ported to using `Instant::saturating_duration_since`. Related to linkerd/linkerd2#7748 Signed-off-by: Oliver Gould <ver@buoyant.io>
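As a rough illustration of the port described above (not the actual linkerd2-proxy diff), the sketch below swaps an `elapsed`-style subtraction for `Instant::saturating_duration_since`. The helper names `elapsed_since` and `is_idle` are hypothetical and exist only for this example; `now` is passed in explicitly so the out-of-order case can be exercised.

```rust
use std::time::{Duration, Instant};

/// Time elapsed since `start`, measured against an explicit `now`.
///
/// If the clock hands back a `now` that is earlier than `start`,
/// this clamps to zero instead of underflowing.
fn elapsed_since(start: Instant, now: Instant) -> Duration {
    now.saturating_duration_since(start)
}

/// Hypothetical caller: has something been unused for at least `max_idle`?
fn is_idle(last_used: Instant, now: Instant, max_idle: Duration) -> bool {
    elapsed_since(last_used, now) >= max_idle
}
```

With this shape, a clock that briefly steps backwards makes the elapsed time read as zero, so the caller sees "not idle yet" rather than a crashed process. As far as I know, newer Rust releases also made `duration_since`, `elapsed`, and `sub` saturate, so it is worth checking the `std::time::Instant` docs for the toolchain in use.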
@jberm Can you share the output of |
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in hyper, but hyperium#2385 reports a similar issue. Even though this is almost definitely a bug in Rust, it seems most prudent to actively avoid the uses of `Instant` that are prone to this bug. This change replaces uses of `Instant::elapsed` and `Instant::sub` with calls to `Instant::saturating_duration_since` to prevent this class of panic.
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in hyper, but #2385 reports a similar issue. Even though this is almost definitely a bug in Rust, it seems most prudent to actively avoid the uses of `Instant` that are prone to this bug. This change replaces uses of `Instant::elapsed` and `Instant::sub` with calls to `Instant::saturating_duration_since` to prevent this class of panic. These fixes should ultimately be made in the standard library, but this change lets us avoid this problem while we wait for those fixes. See also hyperium/hyper#2746
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in tower, but we have some potentially flawed `Instant` arithmetic that could panic in this way. Even though this is almost definitely a bug in Rust, it seems most prudent to actively avoid the uses of `Instant` that are prone to this bug. This change replaces uses of `Instant::elapsed` and `Instant::sub` with calls to `Instant::saturating_duration_since` to prevent this class of panic. These fixes should ultimately be made in the standard library, but this change lets us avoid this problem while we wait for those fixes. See also hyperium/hyper#2746
We have reports of runtime panics (linkerd/linkerd2#7748) that sound a lot like rust-lang/rust#86470. We don't have any evidence that these panics originate in h2, but there is one use of `Instant::sub` that could panic in this way. Even though this is almost definitely a bug in Rust, it seems most prudent to actively avoid the uses of `Instant` that are prone to this bug. These fixes should ultimately be made in the standard library, but this change lets us avoid this problem while we wait for those fixes. This change replaces uses of `Instant::elapsed` and `Instant::sub` with calls to `Instant::saturating_duration_since` to prevent this class of panic. See also hyperium/hyper#2746
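For the `Instant::sub` case specifically, here is a minimal sketch of what replacing `deadline - now` might look like. This is not the h2 code: `remaining` and `checked_gap` are hypothetical names. The second helper uses `Instant::checked_duration_since`, which is an alternative for callers that would rather detect the out-of-order clock reading than silently clamp it.

```rust
use std::time::{Duration, Instant};

/// Time left until `deadline`; zero once the deadline has passed.
/// Same intent as `deadline - now`, without the underflow.
fn remaining(deadline: Instant, now: Instant) -> Duration {
    deadline.saturating_duration_since(now)
}

/// Returns `None` when `later` is actually earlier than `earlier`,
/// so the caller can log or count the anomaly.
fn checked_gap(later: Instant, earlier: Instant) -> Option<Duration> {
    later.checked_duration_since(earlier)
}
```

The saturating form fits timers and timeouts, where "zero time left" is already a meaningful answer. The checked form could be useful when gathering evidence about how often a host's clock misbehaves.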
Rust's standard library provides an
rust-lang/rust#86470 describes a bug in There are a few ways we can attack this problem:
It will likely take a few weeks for all of these dependencies to release so that we can pick up a change in an edge release. Or we can take git dependencies on these repos to avoid waiting for a proper release. It will probably take a few months until we can use a Rust version that has better guards against this kind of panic. Still, it would be best if we can help the Rust team nail down more details about the environment where this error occurs. Perhaps it's possible to get the folks working on Amazon Linux involved. |
@jberm What AWS instance types are you using? I would guess it is either m5a or t3a. |
The pods were failing on a t3a.medium instance and only failing on that single instance. Unfortunately we upgraded our AMIs yesterday so I don't have the
Maybe someone at AWS can give you the kernel version for the previous AMI. |
When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::Instant::{duration_since, elapsed, sub}` via clippy. Uses are ported to using `Instant::saturating_duration_since`. Related to linkerd/linkerd2#7748 Signed-off-by: Oliver Gould <ver@buoyant.io> Co-authored-by: Eliza Weisman <eliza@buoyant.io>
We just had the same issue. |
Thanks @virenrshah. We've got a few workarounds that will become available as our dependencies release new versions. In the meantime, you could try engaging AWS support or reprovisioning impacted nodes. I'm told that AWS has reproduced the issue but I'm not aware of how long it will take for fixes to be available on their side. |
Thanks! Anyone know if this is something I can workaround by shifting to a different set of instance types? Looks like both @jberm and I had t3a instance types. |
Same here:
|
do we have any update on this? This is impacting our production clusters and we're even considering uninstalling linkerd until this gets sorted, something I'd love to avoid if there's any known workaround |
tokio & tower have been patched to avoid issues described in linkerd/linkerd2#7748, but they have not yet been released. This change pins these dependencies to Git to pick up the workarounds.
tokio & tower have been patched to avoid issues described in linkerd/linkerd2#7748, but they have not yet been released. This change pins these dependencies to Git to pick up the workarounds. Signed-off-by: Oliver Gould <ver@buoyant.io>
@fcrespofastly As mentioned previously, this is a bug between the Rust standard library and AWS Linux, which has a buggy time source. So it's going to be difficult for us to completely eliminate this issue until it is fixed upstream. That said, we've put in place workarounds in linkerd2-proxy and several ecosystem projects (tokio, tower, hyper) that should reduce the likelihood of encountering this bug. I've put up linkerd/linkerd2-proxy#1497 to take git dependencies while we wait for tokio & tower to do a proper release and I've published a proxy build with these changes. You can use this build by setting namespace/workload annotations:
    annotations:
      config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
      config.linkerd.io/proxy-version: instant.495a51ae
Or set it globally by upgrading with the appropriate helm values |
hey @olix0r thanks a lot, I knew it was more on Rust and AWS land, but it was also mentioned: We've got a few workarounds that will become available as our dependencies release new versions. Hence I was asking about this. Thanks again! |
We're seeing this on a different rust application, but also on t3.2xlarge |
When comparing instances, we should use saturating varieties to help ensure that we can't hit panics. This change bans uses of `std::time::Instant::{duration_since, elapsed, sub}` via clippy. Uses are ported to using `Instant::saturating_duration_since`. Related to linkerd/linkerd2#7748 Signed-off-by: Oliver Gould <ver@buoyant.io> Co-authored-by: Eliza Weisman <eliza@buoyant.io> (cherry picked from commit bffdb1a) Signed-off-by: Oliver Gould <ver@buoyant.io>
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions. |
Recent versions of the proxy should be immune to this class of panic. |
What is the issue?
Linkerd proxy crashes intermittently with the following error message:
How can it be reproduced?
Deploy linkerd 2.11.1-stable to AWS EKS and wait for crashes.
Logs, error output, etc
output of
linkerd check -o short
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response