when a store becomes a bottleneck it shouldn't indefinitely stall #154

Closed
grobian opened this issue Mar 8, 2016 · 14 comments
Comments

grobian (Owner) commented Mar 8, 2016

We keep on stalling if we can connect but never write to a store. We should probably stall only for a bit, and then give up on such a store, since it slows down the entire stack. We see this happening with stores that are about to die due to RAID controller failures.

grobian (Owner) commented Apr 15, 2016

any plan on how to approach this problem?

grobian (Owner) commented Apr 15, 2016

@Civil: ^^^

Civil commented Apr 15, 2016

I'm currently testing whether it's possible to do the following:

  1. Make the socket non-blocking.
  2. Check whether it's possible to write to the socket or whether the buffer is full; if it's full, disable the server for some period of time.

macOS and Linux behave a little differently, though, and I'm still trying to find a way to make the solution portable. On macOS it's enough to make the socket non-blocking and check errno: if it's EWOULDBLOCK, the server is slow and can be disabled. But on Linux it results in an incomplete write without an appropriate errno, so carbon-c-relay reconnects on the next iteration.

UPD: I was wrong, Linux is just much faster and my slow-server emulator is not slow enough ;)
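
For illustration, a minimal sketch of the non-blocking probe described above; the function name and exact handling are assumptions for the example, not carbon-c-relay code:

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Returns 1 if the backend looks slow (kernel send buffer full), 0 otherwise. */
    static int backend_looks_slow(int fd, const char *buf, size_t len)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);    /* 1. make the socket non-blocking */

        ssize_t n = write(fd, buf, len);           /* 2. probe the send buffer */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 1;   /* buffer full: candidate for temporarily disabling the server */

        /* 0 <= n < len is a short write: the remainder has to be retried,
         * which is the behaviour observed on Linux in this thread. */
        return 0;
    }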

grobian (Owner) commented Apr 15, 2016

I would just use the stall counter. Keep a counter in the struct, and reset it as soon as you un-stall. (Alternatively, make the stall char more than 1 bit.) Then you can have a stall maximum; when you hit it, you simply treat the server as a dead node. That has the least impact on the code and doesn't duplicate stuff.
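
A minimal sketch of that stall-counter idea; the field and constant names are made up for the example and are not the actual carbon-c-relay struct members:

    #define STALL_MAX 4          /* give up after this many consecutive stalls */

    typedef struct _server {
        unsigned char stallseq;  /* consecutive stalls seen so far */
        char failure;            /* non-zero when the server is treated as dead */
        /* ... */
    } server;

    /* called when the server's queue is full and the sender would be stalled */
    static void server_stall(server *s)
    {
        if (s->stallseq < STALL_MAX)
            s->stallseq++;       /* stall the sender a bit longer */
        else
            s->failure = 1;      /* treat as dead node: stop stalling, start dropping */
    }

    /* called when data was delivered again */
    static void server_unstall(server *s)
    {
        s->stallseq = 0;         /* reset the counter as soon as we un-stall */
        s->failure = 0;
    }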

Civil commented Apr 15, 2016

Well, that won't actually fix the problem, because if a write() to a specific backend is stuck, it'll affect the whole relay until the write() times out. The idea of making it non-blocking is to let the relay detect that kind of problem as fast as possible and remove the server from the pool for a significant amount of time.

grobian (Owner) commented Apr 15, 2016

No, all servers are independent threads; the queue sits in the middle precisely to avoid this.
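
Conceptually (with hypothetical type names, not the relay's actual structs), each server owns its own queue and writer thread, so a blocked write() only stalls that server's thread:

    #include <pthread.h>

    struct queue;                /* bounded buffer of metrics for one backend */

    typedef struct _server {
        struct queue *queue;     /* the router enqueues here, never writes directly */
        pthread_t tid;           /* dedicated writer thread draining the queue */
    } server;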

grobian added a commit that referenced this issue Apr 15, 2016
When a server has a serious problem keeping up, give up after a short
while, and just start dropping stuff, not to keep on slowing down an
entire chain.
grobian (Owner) commented Apr 15, 2016

I suggest the above commit. It will break the vicious cycle. The idea of stalling is to signal senders that they should slow down; without this, massive amounts of data are dropped during bursts.

grobian (Owner) commented Apr 15, 2016

Regarding your idea:

  • servers use an I/O timeout (10s) to break out of a write faster
  • servers are masked into failure when this happens, and only return to "OK" if
    • there is nothing to be sent and making a new connection succeeds, or
    • data was successfully delivered

This way, your idea of "removing a server from the pool for some significant amount of time" is already implemented. In fact, it is even better: a server really needs to accept data before it is made "available" again, instead of waiting out some arbitrary timeout.

The value of 4 in the code might need refinement after experimentation, I guess.
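
A minimal sketch of such a 10-second write timeout on the socket, assuming it is done with SO_SNDTIMEO (the relay may implement it differently):

    #include <sys/socket.h>
    #include <sys/time.h>

    static int set_write_timeout(int fd)
    {
        struct timeval tv = { .tv_sec = 10, .tv_usec = 0 };
        /* a blocking write()/send() on fd now gives up after roughly 10 seconds */
        return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
    }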

grobian closed this as completed Apr 15, 2016

Civil commented Apr 15, 2016

Ok, your commit also seems to solve the problem, though you are not completely correct about when carbon-c-relay will reconnect.

The conditions themselves are correct, but the first one (nothing to send and a new connection succeeds) will happen much more often than you think: carbon-c-relay will basically reconnect almost immediately (several hundred milliseconds after it drops the server). So when a server is too slow, you'll constantly get log messages about carbon-c-relay dropping it (incomplete writes) and adding it back again. That's why I thought about a cool-down period of at least tens of seconds.

grobian (Owner) commented Apr 16, 2016

But that means the write didn't succeed, and a subsequent connect + write /did/ succeed. If you want to increase the sleep time in between, that's possible of course. In another bug, someone mentioned this problem (of incomplete writes) is caused by ports still floating around; I don't recall how it was connected. As far as I understand, incomplete writes are not a sign of a slow server, but of a server that disconnects/closes the connection while being written to.

grobian (Owner) commented Apr 16, 2016

     When using non-blocking I/O on objects, such as sockets, that are subject
     to flow control, write() and writev() may write fewer bytes than
     requested; the return value must be noted, and the remainder of the
     operation should be retried when possible

So I guess I should retry instead of bailing out.

grobian added a commit that referenced this issue Apr 16, 2016
the manual seems to suggest we should retry writes due to flow control
things, so do it, as part of an artifact mentioned in issue #154
grobian (Owner) commented Apr 16, 2016

I'd appreciate some feedback on the above patch. I think it should reduce the log spam/reconnects a lot.

Civil commented Apr 19, 2016

EDIT: Sorry, I hadn't seen that you are actually sending only what wasn't sent. Then I just need to find a way to test that reliably.

grobian (Owner) commented Apr 19, 2016

Yeah, I write p, which is incremented by the number of bytes that were already sent.
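
As an illustration of that retry-from-p approach (the function name is made up for the example), a short-write retry loop could look like this:

    #include <errno.h>
    #include <unistd.h>

    /* Write the whole buffer, retrying after short writes; returns -1 on error. */
    static ssize_t write_all(int fd, const char *buf, size_t len)
    {
        const char *p = buf;
        size_t left = len;

        while (left > 0) {
            ssize_t n = write(fd, p, left);
            if (n < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted: retry the same chunk */
                return -1;         /* real error: let the caller reconnect */
            }
            p += n;                /* advance p past the bytes that were sent */
            left -= (size_t)n;
        }
        return (ssize_t)len;
    }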
