when a store becomes a bottleneck it shouldn't indefinitely stall #154
Any plan on how to approach this problem?
@Civil: ^^^
I'm currently testing whether it's possible to do the following:
macOS and Linux behave a little differently, though, and I'm still trying to find out how to make the solution portable. On macOS it's enough to make the socket non-blocking and check errno: if it's EWOULDBLOCK, the server is slow and can be disabled. But on Linux it results in an incomplete write without the appropriate errno, so carbon-c-relay reconnects on the next iteration. UPD: I was wrong, Linux is just much faster and my slow-server emulator is not slow enough ;)
I would just use the stall counter. Keep a counter in the struct and reset it as soon as you un-stall. (Alternatively, make the stall char wider than 1 bit.) Then you can have a stall max; when you reach it, you simply treat the node as dead. That's the least impact on the code and doesn't duplicate anything.
Well, that won't actually fix the problem: if a write() to a specific backend is stuck, it'll affect the whole relay until the write() times out. The idea of making it non-blocking is to let the relay detect that kind of problem as fast as possible and remove the server from the pool for a significant amount of time.
No, all servers are independent threads, the queue is in the middle to avoid this. |
When a server has a serious problem keeping up, give up after a short while and just start dropping data, so as not to keep slowing down the entire chain.
I suggest the above commit. It will break the vicious cycle. The idea of stalling is to signal senders that they should slow down; without it, massive amounts of data are dropped on bursts.
Regarding your idea:
This way, your idea of "removing a server from the pool for some significant amount of time" is already implemented. In fact, it is even better: a server actually needs to accept data before it is made "available" again, instead of waiting out some arbitrary timeout. The value of 4 in the code might need refinement after experimentation, I guess.
OK, your commit seems to solve the problem too. You are not completely correct about when carbon-c-relay will reconnect, though. The conditions themselves are correct, but the first one (nothing to send and a new connection succeeds) will happen much more often than you think: carbon-c-relay will basically reconnect almost immediately, a few hundred milliseconds after it drops the server. So when a server is too slow, you'll constantly get log messages about carbon-c-relay dropping it (incomplete writes) and adding it back again. That's why I thought about a cool-down period of tens of seconds at least.
But that means the write didn't succeed, and a subsequent connect + write /did/ succeed. If you want to increase the sleep time in between, that's possible of course. In another bug, someone mentioned this problem (of incomplete writes) is because of ports still floating around; I don't recall how it was connected. As far as I understand, incomplete writes are not a sign of a slow server, but of a server that disconnects/closes the connection while being written to.
So I guess I should retry instead of bailing out.
the manual seems to suggest we should retry writes due to flow control, so do that, as part of an artifact mentioned in issue #154
I'd appreciate some feedback on the above patch. I think it should reduce the log spam/reconnects a lot.
EDIT: Sorry, I hadn't seen that you are actually sending only what wasn't sent yet. Then I just need to find a way to test that reliably.
Yeah, I write from p, which is incremented by the number of bytes that were sent.
We keep on stalling if we can connect to a store but never manage to write to it. We should probably stall only for a bit, then give up on such a store, for it slows down the entire stack. We see this happening with stores that are about to die due to RAID controller failures.