Close connection on read timeout #1253

Closed

Conversation


@Dav1dde Dav1dde commented Jul 4, 2024

Potential fix for #1252.

This only deals with the timeout case, not other IO errors. For clustered clients, a timeout does not cause a retry; the user probably wanted to limit the time it takes until the request returns.
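Roughly what I have in mind, as a minimal sketch (the names and structure are illustrative only, not the actual redis-rs internals):

```rust
use std::io::{self, ErrorKind};

// Illustrative stand-in for the sync connection's error handling,
// not the actual redis-rs code.
struct Connection {
    // When true, the next use of the connection triggers a reconnect.
    broken: bool,
}

impl Connection {
    fn handle_read_error(&mut self, err: &io::Error) {
        // A read timeout typically surfaces as WouldBlock on Unix and TimedOut
        // on Windows; in both cases it is unclear how much of the response was
        // already consumed, so the connection is marked broken and reconnected.
        if matches!(err.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) {
            self.broken = true;
        }
    }
}
```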


nihohit commented Jul 5, 2024

Why fail on "would block"? It just means that the socket is currently busy. I'm also not sure about timeouts - they're perfectly safe in a connection that maintains its internal state. I understand that this fixes the situation for the sync connection, but it impairs the async connections.

I'm not convinced that there isn't a simpler solution in this situation. For example, maybe we can just maintain an internal counter of the requests and replies, and drop answers that don't have matching requests.


Dav1dde commented Jul 5, 2024

Fail on WOULDBLOCK because that's the behaviour the user intended by setting the timeout.

I'm not convinced that there isn't a simpler solution in this situation. For example, maybe we can just maintain an internal counter of the requests and replies, and drop answers that don't have matching requests.

That's the alternative: the connection internally tracks requests/responses and bytes read. It needs to track bytes because the read can fail in the middle of a response (unlikely but possible). That's extra work for something the user isn't interested in anymore, but it avoids the reconnect.
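A rough sketch of what that bookkeeping could look like (hypothetical types, not existing redis-rs code):

```rust
// Track how many replies the server still owes us and how far into a reply the
// last read got, so stale responses can be drained later instead of reconnecting.
#[derive(Default)]
struct PendingReplies {
    // Replies still owed for requests we gave up waiting on.
    outstanding: usize,
    // Bytes of a partially read reply that must still be skipped, because a
    // read can time out in the middle of a response.
    partial_bytes: usize,
}

impl PendingReplies {
    fn request_abandoned(&mut self, bytes_already_read: usize) {
        self.outstanding += 1;
        self.partial_bytes = bytes_already_read;
    }

    fn stale_reply_drained(&mut self) {
        self.outstanding = self.outstanding.saturating_sub(1);
        self.partial_bytes = 0;
    }

    fn in_sync(&self) -> bool {
        self.outstanding == 0 && self.partial_bytes == 0
    }
}
```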

but it impairs the async connections.

In which way? The code only touches the sync parts.


nihohit commented Jul 5, 2024

Fail on WOULDBLOCK because that's the behaviour the user intended by setting the timeout.

Are you certain that WOULDBLOCK only happens by user settings? I think it can also happen if the OS can't push more bytes to the socket's buffer.

In which way? The code only touches the sync parts.

you're touching types.rs, which affects all connections.

That's the alternative: the connection internally tracks requests/responses and bytes read. It needs to track bytes because the read can fail in the middle of a response (unlikely but possible). That's extra work for something the user isn't interested in anymore, but it avoids the reconnect.

You're right, if the timeout can happen mid-message it requires more work.

So, we need to find the simplest solution that doesn't affect async connections.


Dav1dde commented Jul 5, 2024

Are you certain that WOULDBLOCK only happens by user settings?

I went by the recv man page and the TcpStream docs.

Platforms may return a different error code whenever a read times out as a result of setting this option. For example Unix typically returns an error of the kind WouldBlock, but Windows may return TimedOut.

But assuming it can happen for other reasons, at least in the sync code path, it would lead to the same problem.
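For illustration, a minimal standalone sketch of how a socket-level read timeout surfaces with std::net::TcpStream; the error kinds are the ones from the docs quoted above (this is just an example helper, not redis-rs code):

```rust
use std::io::{ErrorKind, Read};
use std::net::TcpStream;
use std::time::Duration;

// Sketch only: shows where the platform-dependent error kinds come from when a
// read timeout is set directly on the socket, as the sync code path presumably does.
fn read_with_timeout(stream: &mut TcpStream, buf: &mut [u8]) -> std::io::Result<usize> {
    stream.set_read_timeout(Some(Duration::from_millis(500)))?;
    match stream.read(buf) {
        // Per the TcpStream docs: Unix typically reports WouldBlock, Windows TimedOut.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => Err(e),
        other => other,
    }
}
```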

I think it can also happen if the OS can't push more bytes to the socket's buffer.

Since we're only reading in this codepath, that should be fine.

you're touching types.rs, which affects all connections.

Right, I missed that, thanks!

Dav1dde added a commit to getsentry/relay that referenced this pull request Jul 5, 2024
Updates Redis to PR redis-rs/redis-rs#1253 which addresses redis-rs/redis-rs#1252 which we've been seeing occasionally in production.

Dav1dde commented Jul 5, 2024

I was looking through the code trying to find where best to make the distinction between async and sync. From what it looks like, the async code uses timeouts on futures, not directly on the socket (which makes sense, the socket is non-blocking there anyway). So the change should have no effect on the async codepath.

I haven't really found a good spot to deal with this, though - maybe you have some pointers for me?


nihohit commented Jul 6, 2024

So the change should have no effect on the async codepath.

Sadly, that's not true - you can get "internal" timeouts in the async cluster connection, or the connection manager. Any change to general error handling is a change that affects those types, too.

maybe you have some pointers for me?

You're the one experiencing the issue, and frankly, I don't have production experience with the sync connection; your guess is as good as mine :). Why isn't the change to the connection sufficient?

But assuming it can happen for other reasons, at least in the sync code path, it would lead to the same problem.

Not sure what you mean here. Are you claiming that any WOULDBLOCK error means that a connection is in a broken state? Because I don't think that's true for async connections.


Dav1dde commented Jul 8, 2024

You're the one experiencing the issue, and frankly, I don't have production experience with the sync connection; your guess is as good as mine :). Why isn't the change to the connection sufficient?

You mean why the cluster impl also needs a change? The cluster impl shouldn't retry a timeout; the user set the timeout for a reason and that should be honored. Also, some operations shouldn't be retried at all, like INCR: retrying it will just double count, and this is even worse for Redis scripts. A hypothetical sketch of that hazard is below.
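This uses the crate's high-level Commands API; the naive retry logic is made up purely for illustration and is not how redis-rs behaves:

```rust
use redis::Commands;

// Hypothetical illustration: why blindly retrying a timed-out INCR double counts.
// The first INCR may have reached the server even though reading its reply timed out.
fn incr_with_naive_retry(con: &mut redis::Connection) -> redis::RedisResult<i64> {
    match con.incr("counter", 1) {
        Ok(v) => Ok(v),
        // If this error was a read timeout, the server may already have applied
        // the increment; retrying applies it a second time.
        Err(_) => con.incr("counter", 1),
    }
}
```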

Not sure what you mean here. Are you claiming that any WOULDBLOCK error means that a connection is in a broken state? Because I don't think that's true for async connections.

I don't think you'd ever get a WOULDBLOCK from the async connection; tokio doesn't have a way of setting a read timeout on the stream/socket in the first place, you'd always use a tokio::time::Timeout. The socket itself will be set to nonblocking and the runtime/event loop should deal with this.
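For comparison, a rough sketch of how an async read deadline looks with tokio (an illustrative helper, assuming the net, io-util and time features; not redis-rs code):

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;
use tokio::time::{timeout, Duration};

// The deadline wraps the read *future*; the socket stays nonblocking, and a hit
// deadline never shows up as a WouldBlock error on the socket itself.
async fn read_with_deadline(stream: &mut TcpStream, buf: &mut [u8]) -> std::io::Result<usize> {
    match timeout(Duration::from_millis(500), stream.read(buf)).await {
        // The read finished (or failed) before the deadline.
        Ok(result) => result,
        // The deadline elapsed; surface it as a TimedOut IO error.
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "read timed out",
        )),
    }
}
```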


nihohit commented Jul 9, 2024

Also, some operations shouldn't be retried at all, like INCR: retrying it will just double count, and this is even worse for Redis scripts.

Yes, this is an ongoing source of debate - the operations should be retried if they failed before reaching the server, but not after. ATM we don't have a way to distinguish that.

The cluster impl shouldn't retry a timeout; the user set the timeout for a reason and that should be honored.

fair enough. @jaymell, thoughts?

I don't think you'd ever get a WOULDBLOCK from the async connection; tokio doesn't have a way of setting a read timeout on the stream/socket in the first place, you'd always use a tokio::time::Timeout. The socket itself will be set to nonblocking and the runtime/event loop should deal with this.

Tokio isn't the only supported runtime, and as mentioned before, WOULDBLOCK can be sent by the OS for nonblocking sockets, regardless of timeouts - I'm not sure how it would be filtered by the runtime. I'd want concrete data on this before changing the behavior.


Dav1dde commented Jul 9, 2024

Running this PR in production solved all the inconsistency errors we saw, but there are still a very small number of EAGAIN errors. I am wondering if those are also caused by the timeout; according to man recv this would make sense: "Enables nonblocking operation; if the operation would block, the call fails with the error EAGAIN or EWOULDBLOCK."

Tokio isn't the only supported runtime, and as mentioned before, WOULDBLOCK can be sent by the OS for nonblocking sockets, regardless of timeouts - I'm not sure how it would be filtered by the runtime. I'd want concrete data on this before changing the behavior.

That's when the future becomes pending, but I'm no expert in this and am just going off general understanding. I dug a bit through tokio and you can see it explicitly being handled (in the codepath AsyncRead leads to) and documented.

0Calories pushed a commit to getsentry/relay that referenced this pull request Jul 11, 2024
Updates Redis to PR redis-rs/redis-rs#1253 which addresses redis-rs/redis-rs#1252 which we've been seeing occasionally in production.

nihohit commented Aug 13, 2024

Hi, sorry that I was less available. I would like to examine a solution with less splash damage potential - how about #1290?


Dav1dde commented Aug 13, 2024

Hey, thanks for picking this up.

I was going to report back some findings and conclusions but never got around to it. We've been running this fork in production for about a month now and it solved all of the initial issues we observed.

I also had some more thoughts about this. Whenever WouldBlock or TimedOut is propagated to the level where this condition now matches, the connection is in an uncertain state, because it is simply not clear how much data was written to or read from the connection before the failure. The underlying layer has to correctly bookkeep this, and if it does, it shouldn't propagate the errors in the first place. Curious if you think that makes sense.

Additionally, what I noticed is that we're still getting a (comparatively) very tiny number of Resource temporarily unavailable (os error 11) errors from the Redis client. This doesn't seem to lead to any (observed) data corruption like the initially reported issue, though (it may not be related at all).

I'll check out your PR now!


nihohit commented Aug 13, 2024

Additionally, what I noticed is that we're still getting a (comparatively) very tiny number of Resource temporarily unavailable (os error 11) errors from the Redis client. This doesn't seem to lead to any (observed) data corruption like the initially reported issue, though (it may not be related at all).

Well, this is what fails the new tests in my PR on CI (but not locally), so I'd like to understand what causes it.


Dav1dde commented Sep 19, 2024

Superseded by #1290
