Close connection on read timeout #1253

Closed

Conversation


@Dav1dde Dav1dde commented Jul 4, 2024

Potential fix for #1252.

This only deals with the timeout case, not other IO errors. For clustered clients, a timeout does not cause a retry; the user probably wanted to limit the time it takes until the request returns.
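Roughly what I have in mind, as a minimal sketch (the names and structure are illustrative only, not the actual redis-rs internals):

```rust
use std::io::{self, ErrorKind};

// Illustrative stand-in for the sync connection's error handling,
// not the actual redis-rs code.
struct Connection {
    // When true, the next use of the connection triggers a reconnect.
    broken: bool,
}

impl Connection {
    fn handle_read_error(&mut self, err: &io::Error) {
        // A read timeout typically surfaces as WouldBlock on Unix and TimedOut
        // on Windows; in both cases it is unclear how much of the response was
        // already consumed, so the connection is marked broken and reconnected.
        if matches!(err.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) {
            self.broken = true;
        }
    }
}
```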


nihohit commented Jul 5, 2024

Why fail on "would block"? It just means that the socket is currently busy. I'm also not sure about timeouts - they're perfectly safe in a connection that maintains its internal state. I understand that this fixes the situation for the sync connection, but it impairs the async connections.

I'm not convinced that there isn't a simpler solution in this situation. For example, maybe we can just maintain an internal counter of the requests and replies, and drop answers that don't have matching requests.


Dav1dde commented Jul 5, 2024

Fail on WOULDBLOCK because that's the behaviour the user intended by setting the timeout.

I'm not convinced that there isn't a simpler solution in this situation. For example, maybe we can just maintain an internal counter of the requests and replies, and drop answers that don't have matching requests.

That's the alternative: the connection internally tracks requests/responses and bytes read. It needs to track bytes because the read can fail in the middle of a response (unlikely but possible). That's extra work for something the user isn't interested in anymore, but it avoids the reconnect.
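A rough sketch of what that bookkeeping could look like (hypothetical types, not existing redis-rs code):

```rust
// Track how many replies the server still owes us and how far into a reply the
// last read got, so stale responses can be drained later instead of reconnecting.
#[derive(Default)]
struct PendingReplies {
    // Replies still owed for requests we gave up waiting on.
    outstanding: usize,
    // Bytes of a partially read reply that must still be skipped, because a
    // read can time out in the middle of a response.
    partial_bytes: usize,
}

impl PendingReplies {
    fn request_abandoned(&mut self, bytes_already_read: usize) {
        self.outstanding += 1;
        self.partial_bytes = bytes_already_read;
    }

    fn stale_reply_drained(&mut self) {
        self.outstanding = self.outstanding.saturating_sub(1);
        self.partial_bytes = 0;
    }

    fn in_sync(&self) -> bool {
        self.outstanding == 0 && self.partial_bytes == 0
    }
}
```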

but it impairs the async connections.

In which way? The code only touches the sync parts.


nihohit commented Jul 5, 2024

Fail on WOULDBLOCK because that's the behaviour the user intended by setting the timeout.

Are you certain that WOULDBLOCK only happens by user settings? I think it can also happen if the OS can't push more bytes to the socket's buffer.

In which way? The code only touches the sync parts.

you're touching types.rs, which affects all connections.

That's the alternative: the connection internally tracks requests/responses and bytes read. It needs to track bytes because the read can fail in the middle of a response (unlikely but possible). That's extra work for something the user isn't interested in anymore, but it avoids the reconnect.

You're right, if the timeout can happen mid-message it requires more work.

So, we need to find the simplest solution that doesn't affect async connections.


Dav1dde commented Jul 5, 2024

Are you certain that WOULDBLOCK only happens by user settings?

I went by the recv man page and the TcpStream docs.

Platforms may return a different error code whenever a read times out as a result of setting this option. For example Unix typically returns an error of the kind WouldBlock, but Windows may return TimedOut.

But assuming it can happen for other reasons, at least in the sync code path, it would lead to the same problem.
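For illustration, a minimal standalone sketch of how a socket-level read timeout surfaces with std::net::TcpStream; the error kinds are the ones from the docs quoted above (this is just an example helper, not redis-rs code):

```rust
use std::io::{ErrorKind, Read};
use std::net::TcpStream;
use std::time::Duration;

// Sketch only: shows where the platform-dependent error kinds come from when a
// read timeout is set directly on the socket, as the sync code path presumably does.
fn read_with_timeout(stream: &mut TcpStream, buf: &mut [u8]) -> std::io::Result<usize> {
    stream.set_read_timeout(Some(Duration::from_millis(500)))?;
    match stream.read(buf) {
        // Per the TcpStream docs: Unix typically reports WouldBlock, Windows TimedOut.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => Err(e),
        other => other,
    }
}
```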

I think it can also happen if the OS can't push more bytes to the socket's buffer.

Since we're only reading in this codepath, that should be fine.

you're touching types.rs, which affects all connections.

Right, I missed that, thanks!

Dav1dde added a commit to getsentry/relay that referenced this pull request Jul 5, 2024
Updates Redis to PR redis-rs/redis-rs#1253 which addresses redis-rs/redis-rs#1252 which we've been seeing occasionally in production.

Dav1dde commented Jul 5, 2024

I was looking through the code trying to find where best to make the distinction between async and sync. From what it looks like, the async code uses timeouts on futures, not directly on the socket (which makes sense, the socket is non-blocking there anyway). So the change should have no effect on the async codepath.

I haven't really found a good spot to deal with this, though - maybe you have some pointers for me?


nihohit commented Jul 6, 2024

So the change should have no effect on the async codepath.

Sadly, that's not true - you can get "internal" timeouts in the async cluster connection, or the connection manager. Any change to general error handling is a change that affects those types, too.

maybe you have some pointers for me?

You're the one experiencing the issue, and frankly, I don't have production experience with the sync connection; your guess is as good as mine :). Why isn't the change to the connection sufficient?

But assuming it can happen for other reasons, at least in the sync code path, it would lead to the same problem.

Not sure what you mean here. Are you claiming that any WOULDBLOCK error means that a connection is in a broken state? Because I don't think that's true for async connections.


Dav1dde commented Jul 8, 2024

You're the one experiencing the issue, and frankly, I don't have production experience with the sync connection; your guess is as good as mine :). Why isn't the change to the connection sufficient?

You mean why the cluster impl also needs a change? The cluster impl shouldn't retry a timeout; the user set the timeout for a reason and that should be honored. Also, some operations shouldn't be retried at all, like INCR: retrying it will just double count, and this is even worse for Redis scripts. A hypothetical sketch of that hazard is below.
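This uses the crate's high-level Commands API; the naive retry logic is made up purely for illustration and is not how redis-rs behaves:

```rust
use redis::Commands;

// Hypothetical illustration: why blindly retrying a timed-out INCR double counts.
// The first INCR may have reached the server even though reading its reply timed out.
fn incr_with_naive_retry(con: &mut redis::Connection) -> redis::RedisResult<i64> {
    match con.incr("counter", 1) {
        Ok(v) => Ok(v),
        // If this error was a read timeout, the server may already have applied
        // the increment; retrying applies it a second time.
        Err(_) => con.incr("counter", 1),
    }
}
```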

Not sure what you mean here. Are you claiming that any WOULDBLOCK error means that a connection is in a broken state? Because I don't think that's true for async connections.

I don't think you'd ever get a WOULDBLOCK from the async connection; tokio doesn't have a way of setting a read timeout on the stream/socket in the first place, you'd always use a tokio::time::Timeout. The socket itself will be set to nonblocking and the runtime/event loop should deal with this.
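For comparison, a rough sketch of how an async read deadline looks with tokio (an illustrative helper, assuming the net, io-util and time features; not redis-rs code):

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;
use tokio::time::{timeout, Duration};

// The deadline wraps the read *future*; the socket stays nonblocking, and a hit
// deadline never shows up as a WouldBlock error on the socket itself.
async fn read_with_deadline(stream: &mut TcpStream, buf: &mut [u8]) -> std::io::Result<usize> {
    match timeout(Duration::from_millis(500), stream.read(buf)).await {
        // The read finished (or failed) before the deadline.
        Ok(result) => result,
        // The deadline elapsed; surface it as a TimedOut IO error.
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "read timed out",
        )),
    }
}
```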


nihohit commented Jul 9, 2024

Also, some operations shouldn't be retried at all, like INCR: retrying it will just double count, and this is even worse for Redis scripts.

Yes, this is an ongoing source of debate - the operations should be retried if they failed before reaching the server, but not after. ATM we don't have a way to distinguish that.

The cluster impl shouldn't retry a timeout; the user set the timeout for a reason and that should be honored.

fair enough. @jaymell, thoughts?

I don't think you'd ever get a WOULDBLOCK from the async connection; tokio doesn't have a way of setting a read timeout on the stream/socket in the first place, you'd always use a tokio::time::Timeout. The socket itself will be set to nonblocking and the runtime/event loop should deal with this.

Tokio isn't the only supported runtime, and as mentioned before, WOULDBLOCK can be sent by the OS for nonblocking sockets, regardless of timeouts - I'm not sure how it would be filtered by the runtime. I'd want concrete data on this before changing the behavior.


Dav1dde commented Jul 9, 2024

Running this PR in production solved all the inconsistency errors we saw, but there are still a very small number of EAGAIN errors. I am wondering if those are also caused by the timeout; according to man recv this would make sense: "Enables nonblocking operation; if the operation would block, the call fails with the error EAGAIN or EWOULDBLOCK."

Tokio isn't the only supported runtime, and as mentioned before, WOULDBLOCK can be sent by the OS for nonblocking sockets, regardless of timeouts - I'm not sure how it would be filtered by the runtime. I'd want concrete data on this before changing the behavior.

That's when the future becomes pending, but I'm no expert in this and am just going off general understanding. I dug a bit through tokio and you can see it explicitly being handled (in the codepath AsyncRead leads to) and documented.

0Calories pushed a commit to getsentry/relay that referenced this pull request Jul 11, 2024
Updates Redis to PR redis-rs/redis-rs#1253 which addresses redis-rs/redis-rs#1252 which we've been seeing occasionally in production.

nihohit commented Aug 13, 2024

Hi, sorry that I was less available. I would like to examine a solution with less splash damage potential - how about #1290?


Dav1dde commented Aug 13, 2024

Hey, thanks for picking this up.

I was going to report back some findings and conclusions but never got around to it. We've been running this fork in production for about a month now and it solved all of the initial issues we observed.

I also had some more thoughts about this. Whenever WouldBlock or TimedOut is propagated to the level where this condition now matches, the connection is in an uncertain state, because it is simply not clear how much data was written to or read from the connection before the failure. The underlying layer has to correctly bookkeep this, and if it does, it shouldn't propagate the errors in the first place. Curious if you think that makes sense.

Additionally, what I noticed is that we're still getting a (comparatively) very tiny number of Resource temporarily unavailable (os error 11) errors from the Redis client. This doesn't seem to lead to any (observed) data corruption like the initially reported issue, though (it may not be related at all).

I'll check out your PR now!


nihohit commented Aug 13, 2024

Additionally, what I noticed is that we're still getting a (comparatively) very tiny number of Resource temporarily unavailable (os error 11) errors from the Redis client. This doesn't seem to lead to any (observed) data corruption like the initially reported issue, though (it may not be related at all).

Well, this is what fails the new tests in my PR on CI (but not locally), so I'd like to understand what causes it.


Dav1dde commented Sep 19, 2024

Superseded by #1290
