
Some consumers in a group reverted to older offsets shortly after deploy #198

Open
Hammadk opened this issue Oct 16, 2020 · 10 comments

@Hammadk (Contributor) commented Oct 16, 2020

We had an incident where 2 consumers in the same consumer group reprocessed messages with older offsets (they moved back around 500,000 messages each). This happened shortly after a deploy. I am posting about this here in case someone has ideas on why this could have happened, or in case it points to a bug in racecar / rdkafka-ruby. It could be a bug in how rebalancing is handled, but I haven't been able to identify a code path that would cause this.

Our Racecar setup

  • Racecar version 2.1.0
  • config.session_timeout is 5.minutes (the default is 30 seconds)
  • We do a rolling update on deploys instead of Recreate. We know that the docs suggest recreating, but we have opted for higher availability at the cost of double processing during rebalancing. This may be the root cause, but it could also be a red herring.
  • We have a topic with 8 partitions, and a Racecar consumer that only subscribes to this topic (a sketch of this setup follows the list).
  • We deploy this consumer with Kubernetes, and it has 8 replicas (the same as the number of partitions)
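
For concreteness, here is a minimal sketch of a setup like ours; the topic name and broker address are placeholders, and the session timeout is written in seconds rather than with ActiveSupport's 5.minutes:

```ruby
# config/racecar.rb -- minimal sketch; names and addresses are placeholders
Racecar.configure do |config|
  config.brokers = ["kafka-broker:9092"]   # placeholder broker address
  config.session_timeout = 300             # our 5-minute override (Racecar default: 30)
  # offset_commit_interval is left at the Racecar default of 10 seconds
end

# The single consumer subscribed to the 8-partition topic (topic name is a placeholder).
class EventsConsumer < Racecar::Consumer
  subscribes_to "events"

  def process(message)
    # handle one message; Racecar handles offset commits for us
  end
end
```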

Details of what happened

Last week, we saw that the consumer lag spiked by 1 million shortly after a deploy. We noticed that consumers working on partition 3 and partition 6 were re-processing old messages.

Partition 3 behaviour

  • Pod A was initially working on partition 3
  • A rolling deploy started
  • At 7:18:00 AM, Pod A was working on offset 166981170
  • Here Pod B took over, and it started working on a message roughly 500,000 offsets behind (166981170 - 166429818 ≈ 551,000)
  • Then, between 7:18:00 AM and 7:18:02 AM, Pod B worked on offsets 166429818–166429915
  • At 7:18:02 AM, Pod B logged that it was gracefully shutting down
  • After that, another pod started working on the partition

Partition 6 behaviour

  • Pod A was initially working on partition 6 at offset 167800011
  • Pod A was shut down at 7:17:49 AM
  • Here Pod B took over, and it started working on a message roughly 500,000 offsets behind (167800011 - 167276532 ≈ 523,000)
  • Between 7:18:30 AM and 7:18:31 AM, Pod B worked on offsets 167276532–167276679

For Partition 3, the offset issue happened when another consumer took over. For Partition 6, it happened after a proper shutdown. We have not noticed any broker issues that would explain this, so the current theory is that it is related to our Kafka consumer. Would love some thoughts / ideas.

@dasch (Contributor) commented Oct 19, 2020

If your session timeout is 5 minutes, that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed.

What is the offset commit config for your consumers?

@dasch (Contributor) commented Oct 19, 2020

Also: by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.

@Hammadk (Contributor, Author) commented Oct 19, 2020

> If your session timeout is 5 minutes, that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed.

The odd thing is that the consumer reverted to an offset that it had successfully processed 17 hours earlier. A jump that large is outside standard re-processing during rebalancing. I wonder if the consumer is going back to the beginning of the partition. Is there a bug or flow that could cause this? Could you point me to the relevant rebalancing logic?

Note that I don't have a good way of identifying the first offset for this partition at the time of the incident. Technically, our retention policy is 3 days, so if the consumer had gone back to the beginning, it should have gone even further back.
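
As a rough way to check a partition's current earliest/latest offsets, something like this with rdkafka-ruby should work (broker, group id, topic, and partition below are placeholders, and it only shows the watermarks as of now, not at the time of the incident):

```ruby
require "rdkafka"

# Sketch: print the earliest/latest offsets for one partition (placeholder values).
config = Rdkafka::Config.new(
  "bootstrap.servers" => "kafka-broker:9092",
  "group.id"          => "offset-inspector"
)
consumer = config.consumer

low, high = consumer.query_watermark_offsets("events", 3)
puts "partition 3: earliest=#{low} latest=#{high}"

consumer.close
```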

> What is the offset commit config for your consumers?

We don't have a custom offset commit config; we use Racecar's default offset_commit_interval of 10 seconds. Is that the info you were looking for?

> session timeout is 5 minutes

I'll remove our override and change this back to the default of 30 seconds. We added this config for Racecar versions < 2.0.0 to avoid cases where the whole batch could not be processed in time.

> by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.

We were expecting that having some of the consumers working correctly and committing their offsets, and making slight progress, would be worth it. If you think this makes bigger issues more likely, then it makes sense to recreate pods instead.

@dasch (Contributor) commented Oct 20, 2020

Try to simplify your config and see if that doesn't resolve the issue. It's tricky to debug these things, unfortunately. The problem could be in Racecar, rdkafka-ruby, or in librdkafka.

@tjwp (Contributor) commented Oct 20, 2020

With the rebalancing during a rolling restart, could something like this be the issue: confluentinc/librdkafka#2782

I have not run the rdkafka-based Racecar v2, but from looking at the source it appears that offsets are manually committed as in the steps described in the issue above.

The issue above was fixed in librdkafka v1.4.2, but the current rdkafka gem uses v1.4.0.
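
If it helps, here is a quick way to confirm which librdkafka a given rdkafka install bundles (assuming I'm reading the gem's version constants right):

```ruby
require "rdkafka"

# Print the rdkafka gem version and the librdkafka version it was built against.
puts "rdkafka gem: #{Rdkafka::VERSION}"
puts "librdkafka:  #{Rdkafka::LIBRDKAFKA_VERSION}"
```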

@Hammadk (Contributor, Author) commented Oct 20, 2020

Great find @tjwp! Looking through confluentinc/librdkafka#2782, confluentinc/librdkafka#1767, and confluentinc/confluent-kafka-go#209, this does seem like the same issue. I guess we now wait for rdkafka-ruby to update to librdkafka v1.4.2 before this is resolved in racecar as well.

I have removed the session_timeout override in our config (we haven't needed it since racecar version > 2.0.0).

Thanks for looking into it, and please close this unless you want to keep it open as a reference.

@Draiken commented Aug 3, 2022

We had an incident with this yesterday, but without any deployment. The code had been running for weeks (with perhaps the occasional dyno restart).

A single partition was reset and suddenly reported a 1.6M-message lag. I checked, and we didn't have a spike of messages being produced, so it was definitely the offset for that partition somehow getting reset to a very old value.

Given there was no deployment event, I suspect it might indeed be confluentinc/librdkafka#2782 or a similar issue caused by rebalancing.

So is there no recourse for this, aside from manually resetting the offset, until we get the upstream bug fix into rdkafka-ruby and finally into racecar?
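
For the manual reset, I'm assuming something along these lines with rdkafka-ruby would do it, if I have the method names right (all values below are placeholders, and the group's consumers would need to be stopped first so the commit isn't immediately overwritten):

```ruby
require "rdkafka"

# Sketch: commit a chosen offset for one partition of a consumer group.
# Run with the group's consumers stopped; all values are placeholders.
config = Rdkafka::Config.new(
  "bootstrap.servers" => "kafka-broker:9092",
  "group.id"          => "my-consumer-group"
)
consumer = config.consumer

list = Rdkafka::Consumer::TopicPartitionList.new
list.add_topic_and_partitions_with_offsets("events", { 3 => 167_800_011 })

consumer.commit(list, false)  # synchronous commit of the chosen offset
consumer.close
```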

@mensfeld commented Aug 3, 2022

@Draiken but racecar uses librdkafka 1.9.0, which has offset reset capabilities: https://github.com/edenhill/librdkafka/releases/tag/v1.9.0, and as far as I remember there is an error when an offset cannot be stored.

@Draiken commented Aug 3, 2022

Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw v1.9.2, and assumed the fix was only in that release...
You are correct, it was released in 1.9.0, so I can update racecar and get this fix.

TYVM @mensfeld

@mensfeld commented Aug 3, 2022

> Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw v1.9.2, and assumed the fix was only in that release...

I made the same mistake and started updating rdkafka to 1.9.2 🙄

The patch for this is already in place in rdkafka, so after the upgrade you should see the issue go away.
