Clean up ZooKeeper group data on last member leave #155

Merged
merged 9 commits into from
Nov 1, 2018
Merged

Conversation

@horkhe (Member) commented on Oct 16, 2018

Consumer group data was never removed from ZooKeeper before, which could cause issues with the disposable groups that we use to implement broadcast events. With this PR, when the last member leaves a group, the group's data structure in ZooKeeper is deleted. The logic is careful to survive the data race that can occur when the last member leaves while a new one joins the group at the same time.
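
To make the intended behavior concrete, here is a minimal sketch of "delete the group when the last member leaves", assuming the go-zookeeper client and the standard /consumers/<group>/{ids,owners} layout shown later in this thread. The helper name deleteGroupIfEmpty and its error handling are illustrative only, not this repository's actual code.

```go
package main

import (
	"fmt"
	"path"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

// deleteGroupIfEmpty removes /consumers/<group> and its children, but only
// while they stay empty. If another member registers or claims a partition
// concurrently, ZooKeeper rejects the delete with ErrNotEmpty and the group
// is simply left alone.
func deleteGroupIfEmpty(conn *zk.Conn, group string) error {
	groupPath := path.Join("/consumers", group)
	ownersPath := path.Join(groupPath, "owners")

	// Collect nodes to delete, leaves first: owners/<topic>, then owners,
	// then ids, and finally the group node itself.
	topics, _, err := conn.Children(ownersPath)
	if err != nil && err != zk.ErrNoNode {
		return fmt.Errorf("while listing %s: %v", ownersPath, err)
	}
	toDelete := make([]string, 0, len(topics)+3)
	for _, topic := range topics {
		toDelete = append(toDelete, path.Join(ownersPath, topic))
	}
	toDelete = append(toDelete, ownersPath, path.Join(groupPath, "ids"), groupPath)

	for _, p := range toDelete {
		switch err := conn.Delete(p, -1); err { // -1 matches any node version
		case nil, zk.ErrNoNode:
			// Deleted, or already gone: keep going.
		case zk.ErrNotEmpty:
			// Someone joined the group (or claimed a partition) in parallel:
			// the group is live again, so stop without treating it as an error.
			return nil
		default:
			return fmt.Errorf("while deleting %s: %v", p, err)
		}
	}
	return nil
}

func main() {
	conn, _, err := zk.Connect([]string{"localhost:2181"}, 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	if err := deleteGroupIfEmpty(conn, "scout"); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}
```

The key point of the sketch is that ZooKeeper itself arbitrates the race: a delete of a node that just gained a child fails with ErrNotEmpty, which is treated as "group back in use" rather than as a failure.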

Besides:

  • Dropped the dependency on github.com/wvanbergen/kazoo-go in favor of an internal implementation that provides just what we need, makes fewer queries to ZooKeeper, and is resilient to data races.
  • Fixed the long-standing Rebalancing fails due to non existent topic #54: attempts to consume from missing topics no longer disrupt consumption from existing ones.

@thrawn01 (Contributor) commented
I was unable to get kafka-pixy to clean up /consumers in any situation.

  • With 1 consumer connected, I shut down kafka-pixy and the /consumers ZooKeeper entry remains.
  • With 1 consumer connected, I shut down the consumer and waited for cleanup; the /consumers ZooKeeper entry is never removed.
$ zkCli --servers localhost -c lsr /consumers
scout
scout/ids
scout/owners
scout/owners/catchall_legacy
scout/owners/catchall_unicast
scout/owners/scout_recount
scout_5938c8f68bb347db88e531db25b8b678
scout_5938c8f68bb347db88e531db25b8b678/ids
scout_5938c8f68bb347db88e531db25b8b678/owners
scout_5938c8f68bb347db88e531db25b8b678/owners/catchall_broadcast

Here is the log from when I started a single consumer and then stopped the consumer.
https://gist.github.com/thrawn01/be1388b742efb47c33fe5838682dae74

I didn't have time to dig into why it wasn't cleaning up. If you want, I should have more time tomorrow.

@horkhe (Member, Author) commented on Oct 16, 2018

@thrawn01 It looks like in your case the subscriber stopped before partitioncsm released the partition. I wonder how the respective test managed to pass. I will look into it tomorrow.

2018-10-16 14:07:19.705718 -05 error </_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/member.0> "Failed to delete empty group" error="while deleting /consumers/scout_5938c8f68bb347db88e531db25b8b678/owners/catchall_broadcast: zk: node has children" kafka.group="scout_5938c8f68bb347db88e531db25b8b678"
2018-10-16 14:07:19.705753 -05 info </_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/member.0> Stopped kafka.group="scout_5938c8f68bb347db88e531db25b8b678"
...
2018-10-16 14:07:19.708201 -05 info </_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/catchall_broadcast.p0.0> "Partition released: via=/_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/member.0, retries=0, took=2.02916ms" kafka.group="scout_5938c8f68bb347db88e531db25b8b678" kafka.partition=0 kafka.topic="catchall_broadcast"
2018-10-16 14:07:19.708226 -05 info </_.0/cons.0/scout_5938c8f68bb347db88e531db25b8b678.0/catchall_broadcast.p0.0> Stopped kafka.group="scout_5938c8f68bb347db88e531db25b8b678" kafka.partition=0 kafka.topic="catchall_broadcast"
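
Purely as an illustration of the ordering problem above, and not the actual change that landed in this PR: group cleanup could wait until every partition claim under owners/<topic>/ is released before attempting deletion. This builds on the sketch in the PR description (same go-zookeeper imports and connection); waitForClaimsReleased and the polling interval are hypothetical.

```go
// waitForClaimsReleased polls the group's owners subtree until no partition
// claims remain, or until the timeout expires. Only then is it safe to try
// deleting the group without hitting "zk: node has children".
func waitForClaimsReleased(conn *zk.Conn, group string, timeout time.Duration) error {
	ownersPath := path.Join("/consumers", group, "owners")
	deadline := time.Now().Add(timeout)
	for {
		topics, _, err := conn.Children(ownersPath)
		if err == zk.ErrNoNode {
			return nil // no owners node at all, nothing to wait for
		}
		if err != nil {
			return err
		}
		claimed := false
		for _, topic := range topics {
			claims, _, err := conn.Children(path.Join(ownersPath, topic))
			if err != nil && err != zk.ErrNoNode {
				return err
			}
			if len(claims) > 0 {
				claimed = true
				break
			}
		}
		if !claimed {
			return nil // all partition claims released, safe to delete the group
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("group %s still has partition claims", group)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```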

@horkhe (Member, Author) commented on Oct 17, 2018

@thrawn01 fixed.

@thrawn01 (Contributor) left a review comment


Verified the entries are cleaned up after all consumers leave. Reviewed the code; I think I understand partition claims better now. Looks good to me!

@horkhe merged commit 63886e9 into master on Nov 1, 2018