Close connection on read timeout #1253
Conversation
Force-pushed from fcbc1b0 to a8cdd36.
Why fail on "would block"? It just means that the socket is currently busy. I'm also not sure about timeouts - they're perfectly safe in a connection that maintains its internal state. I understand that this fixes the situation for the sync connection, but it impairs the async connections. I'm not convinced that there isn't a simpler solution here. For example, maybe we can just maintain an internal counter of the requests and replies, and drop answers that don't have matching requests.
> Fail on …
That's the alternative: the connection internally tracks requests/responses and bytes read. It needs to track bytes because a read can fail in the middle of a response (unlikely, but possible). It's extra work for something the user isn't interested in anymore, but it avoids the reconnect.
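For illustration, a minimal sketch of the bookkeeping that alternative implies - all names here (`PendingState`, `stale_replies`, etc.) are hypothetical, not redis-rs API:

```rust
// Hypothetical bookkeeping for the "track requests/replies" alternative.
struct PendingState {
    // Replies the caller gave up on (timed out) that must be drained and
    // dropped instead of being matched to the next request.
    stale_replies: usize,
    // Bytes already consumed of a reply whose read failed midway; tracked so
    // the remainder of that reply can be skipped correctly.
    partial_reply_bytes: usize,
}

impl PendingState {
    // Called when a read times out partway through a response.
    fn on_timeout(&mut self, bytes_already_read: usize) {
        self.stale_replies += 1;
        self.partial_reply_bytes = bytes_already_read;
    }

    // Before returning the next reply to the caller, drain this many
    // stale replies off the wire.
    fn replies_to_drop(&self) -> usize {
        self.stale_replies
    }
}
```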
In which way? The code only touches the sync parts.
Are you certain that WOULDBLOCK only happens because of user settings? I think it can also happen if the OS can't push more bytes into the socket's buffer.
You're touching types.rs, which affects all connections.
You're right - if the timeout can happen mid-message, it requires more work. So we need to find the simplest solution that doesn't affect async connections.
I went by the …
But assuming it can happen for other reasons, at least in the sync code path, it would lead to the same problem.
Since we're only reading in this codepath, that should be fine.
Right, I missed that, thanks!
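For reference, the behavior discussed in this exchange, sketched against plain `std` (not redis-rs code): per the `set_read_timeout` docs, Unix typically reports a timed-out read as `WouldBlock`, while Windows reports `TimedOut`.

```rust
use std::io::{ErrorKind, Read};
use std::net::TcpStream;
use std::time::Duration;

// Standalone sketch, not redis-rs code.
fn read_with_timeout(stream: &mut TcpStream, buf: &mut [u8]) -> std::io::Result<usize> {
    stream.set_read_timeout(Some(Duration::from_secs(1)))?;
    match stream.read(buf) {
        // A user-set read timeout surfaces as WouldBlock (Unix) or TimedOut
        // (Windows). Part of a reply may already have been consumed by this
        // point, which is why the stream can no longer be trusted for
        // request/response matching afterwards.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => Err(e),
        other => other,
    }
}
```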
Updates Redis to PR redis-rs/redis-rs#1253, which addresses redis-rs/redis-rs#1252, which we've been seeing occasionally in production.
I was looking through the code trying to find where best to make the distinction between async and sync. From what it looks like, the async code uses timeouts on futures, not directly on the socket (which makes sense - the socket is non-blocking there anyway). So the change should have no effect on the async codepath. I haven't really found a good spot to deal with this, though - maybe you have some pointers for me?
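From the outside, that looks roughly like the sketch below, written against redis-rs's public async API with tokio's `timeout`; `get_with_deadline` is a hypothetical helper, and the internals may differ:

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical helper, not a redis-rs function.
async fn get_with_deadline(
    conn: &mut redis::aio::MultiplexedConnection,
    key: &str,
) -> redis::RedisResult<Option<String>> {
    let mut cmd = redis::cmd("GET");
    cmd.arg(key);
    // The deadline is enforced by racing the future against a timer; the
    // non-blocking socket itself carries no read timeout, so the WouldBlock
    // handling in the sync path is never involved here.
    match timeout(Duration::from_secs(1), cmd.query_async(conn)).await {
        Ok(result) => result,
        Err(_elapsed) => Err((redis::ErrorKind::IoError, "deadline elapsed").into()),
    }
}
```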
Sadly, that's not true - you can get "internal" timeouts in the async cluster connection, or the connection manager. Any change to general error handling is a change that affects those types, too.
You're the one experiencing the issue, and frankly, I don't have production experience with the sync connection, so your guess is as good as mine :). Why isn't the change to the connection sufficient?
Not sure what you mean here. Are you claiming that any WOULDBLOCK error means that a connection is in a broken state? Because I don't think that's true for async connections.
You mean why the cluster impl also needs a change? The cluster impl shouldn't retry a timeout; the user set the timeout for a reason, and that should be honored. Also, some operations shouldn't be retried at all, like …
I don't think you'd ever get a …
Yes, this is an ongoing source of debate - the operations should be retried if they failed before reaching the server, but not after. ATM we don't have a way to distinguish the two cases.
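A hypothetical sketch of that distinction - redis-rs has no such flag today, so `sent_to_server` is purely an assumption for illustration:

```rust
use std::io::{Error, ErrorKind};

// Whether a failed request may be retried depends on whether it could have
// reached the server; `sent_to_server` is an assumed flag, not existing API.
fn should_retry(err: &Error, sent_to_server: bool) -> bool {
    match err.kind() {
        // A connect failure definitely never reached the server: safe to retry.
        ErrorKind::ConnectionRefused => true,
        // A timeout or reset after the request was written is ambiguous: the
        // server may have executed it, so retrying could duplicate the effect.
        ErrorKind::WouldBlock | ErrorKind::TimedOut | ErrorKind::ConnectionReset => {
            !sent_to_server
        }
        _ => false,
    }
}
```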
fair enough. @jaymell, thoughts?
Tokio isn't the only supported runtime, and as mentioned before, WOULDBLOCK can be sent by the OS for nonblocking sockets, regardless of timeouts - I'm not sure how it would be filtered by the runtime. I'd want concrete data on this before changing the behavior.
Running this PR in production solved all the inconsistency errors we saw, but there are still very, very few …
That's when the future becomes pending, but I'm no expert in this and just going off general understanding. I dug a bit through tokio, and you can see it explicitly being handled (codepath …).
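A simplified illustration of that pattern (not tokio's actual source): a runtime wrapping a non-blocking socket translates `WouldBlock` into `Poll::Pending`, so it never surfaces as an error on the async path.

```rust
use std::io::{self, ErrorKind, Read};
use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::{AsyncRead, ReadBuf};

// Hypothetical wrapper around a non-blocking source; not tokio's real code.
struct NonBlockingWrapper<S: Read + Unpin>(S);

impl<S: Read + Unpin> AsyncRead for NonBlockingWrapper<S> {
    fn poll_read(
        self: Pin<&mut Self>,
        _cx: &mut Context<'_>,
        buf: &mut ReadBuf<'_>,
    ) -> Poll<io::Result<()>> {
        let this = self.get_mut(); // fine: the wrapper is Unpin
        match this.0.read(buf.initialize_unfilled()) {
            Ok(n) => {
                buf.advance(n);
                Poll::Ready(Ok(()))
            }
            // The runtime swallows WouldBlock and yields; a real implementation
            // would also register the waker so the task is polled again once
            // the socket becomes readable.
            Err(e) if e.kind() == ErrorKind::WouldBlock => Poll::Pending,
            Err(e) => Poll::Ready(Err(e)),
        }
    }
}
```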
Hi, sorry that I was less available. I would like to examine a solution with less potential for splash damage - how about #1290?
Hey, thanks for picking this up. I was going to report back some findings and conclusions and never got around to it. We've been running this fork in production for about a month now, and it solved all of the initial issues we observed.
I also had some more thoughts about this. Whenever …
Additionally, what I noticed was that we're still getting a (comparatively) very, very tiny amount of …
I'll check out your PR now!
Well, this is what fails the new tests in my PR on CI (but not locally), so I'd like to understand the cause of this.
Superseded by #1290
Potential fix for #1252.
This only deals with the timeout case, not other IO errors. For clustered clients, a timeout does not cause a retry; the user probably wanted to limit the time it takes until the request returns.
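Conceptually, the change amounts to something like the following sketch (an illustration of the idea, not the actual patch; names like `SyncConnection` and `broken` are assumptions): on a timed-out read, mark the sync connection as broken so the next use reconnects instead of reading a stale, mismatched reply.

```rust
use std::io::{ErrorKind, Read};

// Hypothetical stand-in for the sync connection; not redis-rs's actual type.
struct SyncConnection {
    stream: std::net::TcpStream,
    // Once set, the connection refuses further use and must be re-established.
    broken: bool,
}

impl SyncConnection {
    fn read_response(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        match self.stream.read(buf) {
            Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {
                // The reply may still arrive later and would be attributed to
                // the wrong request, so poison the connection instead.
                self.broken = true;
                Err(e)
            }
            other => other,
        }
    }
}
```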