
Kafka controller uses old version of Sarama client with known bug which leads to log truncation error in data plane and triggers re-processing of all topic data from beginning #3909

Closed
fos1be opened this issue May 29, 2024 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


fos1be commented May 29, 2024

Describe the bug
The Knative Eventing control plane in versions >1.12.0 uses Sarama client version 1.41.2. However, this version of the Sarama client contains a known bug that writes incorrect leader-epoch metadata for the initial offset commit of new consumer groups.

IBM/sarama#2705

The problem has been fixed in Sarama client version 1.42.1, so the control plane dependency should be bumped to at least that version. A fix would be to backport the dependency version from main, which is already at v1.43.1: https://github.com/knative-extensions/eventing-kafka-broker/blob/main/go.mod#L6
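
A minimal sketch of the bump, assuming a local checkout of eventing-kafka-broker on the affected release branch (the module path is taken from the go.mod linked above; the exact target version is up to the maintainers, 1.42.1 or later), followed by whatever vendoring/update step the repository normally uses:

    go get github.com/IBM/sarama@v1.42.1
    go mod tidy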

This bug hits the data plane when a new consumer group is started. Because the committed metadata (leader epoch) does not match what the Kafka protocol expects, the Kafka client in the data plane detects a partition truncation error. The log truncation error shows up once a leader switch has occurred at least once for the topic partition, because the committed metadata then no longer matches the cluster state:

{"@timestamp":"2024-05-28T09:46:29.893Z","@version":"1","message":"[Consumer clientId=xxx-0.f7215124-d6c3-4eff-9ab7-c79ee947dde0-5, groupId=xxx.active-monitoring-kn-sequence-0.f7215124-d6c3-4eff-9ab7-c79ee947dde0] Truncation detected for partition xxx.active-monitoring-kn-sequence-0-0 at offset FetchPosition{offset=183705, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[xxx.com:9092 (id: 11 rack: 2)], epoch=37}}, resetting offset to the first offset known to diverge FetchPosition{offset=176697, offsetEpoch=Optional[0], currentLeader=LeaderAndEpoch{leader=Optional[xxx.com:9092 (id: 11 rack: 2)], epoch=37}}","logger_name":"org.apache.kafka.clients.consumer.internals.SubscriptionState","thread_name":"vert.x-kafka-consumer-thread-4","level":"INFO","level_value":20000}

The impact is that a new consumer starts consuming all messages from the earliest offset rather than from the latest offset as expected. This is a high risk for production systems, since it generates a huge load and harms data consistency through duplicate processing. It can even be triggered by changing sequence step configs, because that causes new consumer groups to be created.

Expected behavior
A consumer is created for the new consumer group and consumes messages from the latest offset. No log truncation error occurs that would cause messages to be processed from the earliest offset.
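
For illustration only (the data plane in this report is the Java Vert.x Kafka client shown in the log above, and none of the names below come from this project), here is a minimal Sarama consumer group configured with the start-from-latest semantics a new consumer group is expected to follow:

    package main

    import (
        "context"
        "log"

        "github.com/IBM/sarama"
    )

    // handler is a minimal sarama.ConsumerGroupHandler; Sarama calls
    // ConsumeClaim once per assigned partition.
    type handler struct{}

    func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
    func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

    func (handler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
        for msg := range claim.Messages() {
            log.Printf("partition=%d offset=%d", msg.Partition, msg.Offset)
            // MarkMessage records the consumed position; with the buggy Sarama
            // versions the resulting offset commit carries incorrect
            // leader-epoch metadata.
            sess.MarkMessage(msg, "")
        }
        return nil
    }

    func main() {
        cfg := sarama.NewConfig()
        cfg.Version = sarama.V2_8_0_0
        // A brand-new consumer group should start at the latest offset instead
        // of re-reading retained messages from the beginning of the topic.
        cfg.Consumer.Offsets.Initial = sarama.OffsetNewest

        // Broker address, group ID and topic name are placeholders.
        group, err := sarama.NewConsumerGroup([]string{"localhost:9092"}, "example-group", cfg)
        if err != nil {
            log.Fatal(err)
        }
        defer group.Close()

        if err := group.Consume(context.Background(), []string{"example-topic"}, handler{}); err != nil {
            log.Fatal(err)
        }
    }

Per the description above, with Sarama 1.42.1 or later the initial commit for such a group carries correct leader-epoch metadata, so a later leader switch does not trigger the spurious truncation detection shown in the log.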

To Reproduce
Create a new trigger for an existing topic with retained messages. If the topic partitions have a leader epoch > 0 (which is the case after at least one leader change), the consumer starts with a log truncation error. The consumer offset is then reset to earliest, and all messages in the topic are consumed from the beginning.

Knative release version
Tested on Knative Eventing version 1.13.0 with Sarama client version 1.41.2.
This Sarama client version is used in all Knative Eventing versions >1.12.0.
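
For reference, one way to check which Sarama version a given control-plane branch pins, assuming a local clone of eventing-kafka-broker checked out at that branch:

    go list -m github.com/IBM/sarama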


fos1be added the kind/bug label May 29, 2024

Cali0707 (Member) commented Jun 5, 2024

@matzew can we close this now?

matzew (Contributor) commented Jun 21, 2024

/close

referenced PRs were fixing the issue

knative-prow bot closed this as completed Jun 21, 2024

knative-prow bot commented Jun 21, 2024

@matzew: Closing this issue.

In response to this:

/close

referenced PRs were fixing the issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
