
Handling Unreachable Brokers #9

Closed
eapache opened this issue Aug 13, 2013 · 1 comment · Fixed by #10

eapache commented Aug 13, 2013

Spinning this off from #7 to track separately.

When the Client receives metadata from a broker, it runs that metadata through Client.update(), which tries to connect to any new brokers listed in the metadata. If any of those connections fail, it bails out immediately, which isn't the right choice; it also potentially leaks broker connections.
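
For illustration, here's a minimal sketch of the pattern being described, using made-up field and method names rather than the actual sarama code:

    package sketch

    // Illustrative stand-ins only; the real sarama types have more fields.
    type Broker struct{ id int32 }

    func (b *Broker) Connect() error { return nil } // would dial the broker

    type Client struct{ brokers map[int32]*Broker }

    // update sketches the problematic behaviour: the first failed Connect()
    // aborts the whole update, and brokers connected earlier in the loop are
    // never cleaned up.
    func (c *Client) update(newBrokers []*Broker) error {
        for _, broker := range newBrokers {
            if _, exists := c.brokers[broker.id]; exists {
                continue // already known and connected
            }
            if err := broker.Connect(); err != nil {
                return err // bails immediately; earlier connections may leak
            }
            c.brokers[broker.id] = broker
        }
        return nil
    }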

eapache commented Aug 13, 2013

I think the solution here is for the failure of Broker.Connect() to put the broker in a special 'unreachable' state. Then Client.update() can ignore broker connection errors (and even run the Connect()s in goroutines to avoid blocking if one of them is slow).

It pushes the errors up the stack a bit at least, so that if one broker is unreachable but it isn't one we care about, we continue to work.
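
A rough sketch of what that could look like (illustrative names only, including the hypothetical markUnreachable helper; not the actual sarama code):

    package sketch

    import "sync"

    // Illustrative stand-ins only; not the actual sarama structs.
    type Broker struct {
        id   int32
        lock sync.Mutex
        err  error
    }

    func (b *Broker) Connect() error { return nil } // would dial the broker

    // markUnreachable is a hypothetical helper that records why the broker
    // could not be reached; the stored error would be returned later, when
    // somebody actually tries to use the broker.
    func (b *Broker) markUnreachable(err error) {
        b.lock.Lock()
        defer b.lock.Unlock()
        b.err = err
    }

    type Client struct{ brokers map[int32]*Broker }

    // update registers every broker and connects in the background; a single
    // unreachable broker no longer aborts the update or affects the others.
    func (c *Client) update(newBrokers []*Broker) {
        for _, broker := range newBrokers {
            if _, exists := c.brokers[broker.id]; exists {
                continue
            }
            c.brokers[broker.id] = broker
            go func(b *Broker) {
                if err := b.Connect(); err != nil {
                    b.markUnreachable(err)
                }
            }(broker)
        }
    }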

Thoughts, @burke ?

eapache added a commit that referenced this issue Aug 14, 2013
Fixes #9.

This ended up being more complicated than I had hoped and touched several
different areas. TL;DR is that we now connect to the other brokers in the
cluster asynchronously. Errors connecting only show up when somebody tries to
use that broker.

This is better than the old behaviour since it means that if some brokers in a
cluster go down but the topics we care about are still available, we just keep
going instead of blowing up for no reason.

The complicated part is that simply calling `go broker.Connect()` doesn't do
what we want, so I had to write a `broker.AsyncConnect()`. The problem occurs if
you've got code like this:
    go broker.Connect()
    // do some stuff
    broker.SendSomeMessage()
What can happen is that SendSomeMessage() runs before the Connect() goroutine
ever gets scheduled, in which case it will simply return NotConnected. The
desired behaviour is that SendSomeMessage() waits for the connect to finish,
which means AsyncConnect() has to *synchronously* take the broker lock before
it launches the asynchronous connect call. Lots of fun.
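
For concreteness, here's a sketch of that pattern with guessed field names
(not the actual sarama implementation): the lock is acquired before the
goroutine is launched, so any later call that needs the connection blocks
until the dial has finished.

    package sketch

    import (
        "errors"
        "net"
        "sync"
        "time"
    )

    // Illustrative broker; the real sarama Broker has more fields.
    type Broker struct {
        addr    string
        lock    sync.Mutex
        conn    net.Conn
        connErr error
    }

    // AsyncConnect takes the lock *synchronously*, then dials in a goroutine.
    // Because the lock is already held when AsyncConnect returns, a caller
    // that immediately sends a request waits on the lock instead of racing
    // ahead and seeing a not-yet-connected broker.
    func (b *Broker) AsyncConnect() {
        b.lock.Lock() // acquired before the goroutine is scheduled
        go func() {
            defer b.lock.Unlock()
            conn, err := net.DialTimeout("tcp", b.addr, 30*time.Second)
            if err != nil {
                b.connErr = err // surfaced only when the broker is used
                return
            }
            b.conn = conn
        }()
    }

    // SendSomeMessage stands in for any request method: it takes the same
    // lock, so it waits for a pending connect and then sees either the live
    // connection or the stored connection error.
    func (b *Broker) SendSomeMessage() error {
        b.lock.Lock()
        defer b.lock.Unlock()
        if b.connErr != nil {
            return b.connErr
        }
        if b.conn == nil {
            return errors.New("not connected")
        }
        // ... write the request to b.conn and read the response ...
        return nil
    }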

And a bonus change in this commit: rather than special-casing leader == -1 in
`client.cachedLeader` and adding a big long comment to the LEADER_NOT_AVAILABLE
case explaining the fallthrough statement, just delete that partition from the
hash. So much easier to follow; I must have been on crack when I wrote it the
old way.
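
A tiny sketch of the resulting shape, with guessed cache layouts (not the
real code):

    package sketch

    // Illustrative cache layout only; the real client differs.
    type Broker struct{ id int32 }

    type Client struct {
        brokers map[int32]*Broker          // broker id -> broker
        leaders map[string]map[int32]int32 // topic -> partition -> leader id
    }

    // On LEADER_NOT_AVAILABLE, just drop the cached entry instead of storing
    // a leader of -1; cachedLeader then only ever sees real leaders or a
    // plain cache miss.
    func (c *Client) dropLeader(topic string, partition int32) {
        delete(c.leaders[topic], partition)
    }

    func (c *Client) cachedLeader(topic string, partition int32) *Broker {
        if id, ok := c.leaders[topic][partition]; ok {
            return c.brokers[id]
        }
        return nil // miss; the caller falls back to refreshing metadata
    }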