
Handling Unreachable Brokers #9

Closed
eapache opened this issue Aug 13, 2013 · 1 comment · Fixed by #10

eapache commented Aug 13, 2013

Spinning this off from #7 to track separately.

When the Client receives metadata from a broker, it runs that metadata through Client.update(), which tries to connect to any new brokers listed in the metadata. If any of those connections fail, it bails out immediately, which isn't the right choice; it also potentially leaks broker connections.
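
For illustration, here's a minimal sketch of the pattern being described, using made-up field and method names rather than the actual sarama code:

    package sketch

    // Illustrative stand-ins only; the real sarama types have more fields.
    type Broker struct{ id int32 }

    func (b *Broker) Connect() error { return nil } // would dial the broker

    type Client struct{ brokers map[int32]*Broker }

    // update sketches the problematic behaviour: the first failed Connect()
    // aborts the whole update, and brokers connected earlier in the loop are
    // never cleaned up.
    func (c *Client) update(newBrokers []*Broker) error {
        for _, broker := range newBrokers {
            if _, exists := c.brokers[broker.id]; exists {
                continue // already known and connected
            }
            if err := broker.Connect(); err != nil {
                return err // bails immediately; earlier connections may leak
            }
            c.brokers[broker.id] = broker
        }
        return nil
    }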

eapache commented Aug 13, 2013

I think the solution here is for the failure of Broker.Connect() to put the broker in a special 'unreachable' state. Then Client.update() can ignore broker connection errors (and even run the Connect()s in goroutines to avoid blocking if one of them is slow).

It pushes the errors up the stack a bit at least, so that if one broker is unreachable but it isn't one we care about, we continue to work.
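
A rough sketch of what that could look like (illustrative names only, including the hypothetical markUnreachable helper; not the actual sarama code):

    package sketch

    import "sync"

    // Illustrative stand-ins only; not the actual sarama structs.
    type Broker struct {
        id   int32
        lock sync.Mutex
        err  error
    }

    func (b *Broker) Connect() error { return nil } // would dial the broker

    // markUnreachable is a hypothetical helper that records why the broker
    // could not be reached; the stored error would be returned later, when
    // somebody actually tries to use the broker.
    func (b *Broker) markUnreachable(err error) {
        b.lock.Lock()
        defer b.lock.Unlock()
        b.err = err
    }

    type Client struct{ brokers map[int32]*Broker }

    // update registers every broker and connects in the background; a single
    // unreachable broker no longer aborts the update or affects the others.
    func (c *Client) update(newBrokers []*Broker) {
        for _, broker := range newBrokers {
            if _, exists := c.brokers[broker.id]; exists {
                continue
            }
            c.brokers[broker.id] = broker
            go func(b *Broker) {
                if err := b.Connect(); err != nil {
                    b.markUnreachable(err)
                }
            }(broker)
        }
    }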

Thoughts, @burke ?

eapache added a commit that referenced this issue Aug 14, 2013
Fixes #9.

This ended up being more complicated than I had hoped and touched several
different areas. TL;DR is that we now connect to the other brokers in the
cluster asynchronously. Errors connecting only show up when somebody tries to
use that broker.

This is better than the old behaviour since it means that if some brokers in a
cluster go down but the topics we care about are still available, we just keep
going instead of blowing up for no reason.

The complicated part is that simply calling `go broker.Connect()` doesn't do
what we want, so I had to write a `broker.AsyncConnect()`. The problem occurs if
you've got code like this:
    go broker.Connect()
    // do some stuff
    broker.SendSomeMessage()
What can happen is that SendSomeMessage() runs before the Connect() goroutine
ever gets scheduled, in which case it will simply return NotConnected. The
desired behaviour is that SendSomeMessage() waits for the connect to finish,
which means AsyncConnect() has to *synchronously* take the broker lock before
it launches the asynchronous connect call. Lots of fun.
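
For concreteness, here's a sketch of that pattern with guessed field names
(not the actual sarama implementation): the lock is acquired before the
goroutine is launched, so any later call that needs the connection blocks
until the dial has finished.

    package sketch

    import (
        "errors"
        "net"
        "sync"
        "time"
    )

    // Illustrative broker; the real sarama Broker has more fields.
    type Broker struct {
        addr    string
        lock    sync.Mutex
        conn    net.Conn
        connErr error
    }

    // AsyncConnect takes the lock *synchronously*, then dials in a goroutine.
    // Because the lock is already held when AsyncConnect returns, a caller
    // that immediately sends a request waits on the lock instead of racing
    // ahead and seeing a not-yet-connected broker.
    func (b *Broker) AsyncConnect() {
        b.lock.Lock() // acquired before the goroutine is scheduled
        go func() {
            defer b.lock.Unlock()
            conn, err := net.DialTimeout("tcp", b.addr, 30*time.Second)
            if err != nil {
                b.connErr = err // surfaced only when the broker is used
                return
            }
            b.conn = conn
        }()
    }

    // SendSomeMessage stands in for any request method: it takes the same
    // lock, so it waits for a pending connect and then sees either the live
    // connection or the stored connection error.
    func (b *Broker) SendSomeMessage() error {
        b.lock.Lock()
        defer b.lock.Unlock()
        if b.connErr != nil {
            return b.connErr
        }
        if b.conn == nil {
            return errors.New("not connected")
        }
        // ... write the request to b.conn and read the response ...
        return nil
    }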

And a bonus change in this commit: rather than special-casing leader == -1 in
`client.cachedLeader` and adding a big long comment to the LEADER_NOT_AVAILABLE
case explaining the fallthrough statement, just delete that partition from the
hash. So much easier to follow; I must have been on crack when I wrote it the
old way.
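
A tiny sketch of the resulting shape, with guessed cache layouts (not the
real code):

    package sketch

    // Illustrative cache layout only; the real client differs.
    type Broker struct{ id int32 }

    type Client struct {
        brokers map[int32]*Broker          // broker id -> broker
        leaders map[string]map[int32]int32 // topic -> partition -> leader id
    }

    // On LEADER_NOT_AVAILABLE, just drop the cached entry instead of storing
    // a leader of -1; cachedLeader then only ever sees real leaders or a
    // plain cache miss.
    func (c *Client) dropLeader(topic string, partition int32) {
        delete(c.leaders[topic], partition)
    }

    func (c *Client) cachedLeader(topic string, partition int32) *Broker {
        if id, ok := c.leaders[topic][partition]; ok {
            return c.brokers[id]
        }
        return nil // miss; the caller falls back to refreshing metadata
    }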