Update Kafka sarama package to 1.41 #3104

Closed
KrylixZA opened this issue Aug 27, 2023 · 7 comments · Fixed by #3108 or #3109
Labels
kind/bug Something isn't working

@KrylixZA

Expected Behavior

When consumer pods scale out, pods should automatically be included in the consumer groups and immediately begin consuming once all health checks have passed and the sidecar states that it has joined the consumer group.

Actual Behavior

Pods often never join the consumer group. These pods report as healthy and log that they have joined the consumer group, but they do not pick up any work and do not appear in Kafka as members of the group. This appears to be related to various bugs in the Shopify/sarama Go library for Kafka, which by all accounts are fixed in version 1.41.0, released last week.

Related issues:
IBM/sarama#1516
IBM/sarama#2621

Fix version:
https://github.com/IBM/sarama/releases/tag/v1.41.0

Steps to Reproduce the Problem

  1. Create a topic in Kafka with a high number of partitions (30 or more)
  2. Deploy a consumer that is subscribed to the topic, with a low default number of pods and a flexible HPA config
  3. Use some kind of publisher to generate a high volume of traffic (at least 1000 events per second)
  4. Observe in K8s that pods have scaled out while consumer count in the consumer group does not match the number of pods.

The likely side effect is that consumer lag grows dramatically while consumption stays confined to a handful of pods. In my case, I am running a topic with 50 partitions and 50 pods, but only 29 consumers are in the consumer group.
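To put numbers on the mismatch described above, here is a minimal Go sketch. The function names are hypothetical, and the even spread of partitions across members is an assumption (the actual assignment depends on the configured rebalance strategy). It only does the arithmetic for the 50 partitions / 50 pods / 29 consumers case.

```go
package main

import "fmt"

// missingConsumers reports how many running pods are absent from the
// consumer group (hypothetical helper, for illustration only).
func missingConsumers(pods, members int) int {
	if members >= pods {
		return 0
	}
	return pods - members
}

// partitionSpread returns how many partitions the busiest and the least busy
// member own, ASSUMING partitions are spread as evenly as possible.
func partitionSpread(partitions, members int) (most, least int) {
	if members <= 0 {
		return 0, 0
	}
	least = partitions / members
	most = least
	if partitions%members != 0 {
		most++
	}
	return most, least
}

func main() {
	const partitions, pods, members = 50, 50, 29
	most, least := partitionSpread(partitions, members)
	fmt.Printf("pods missing from group: %d\n", missingConsumers(pods, members))
	fmt.Printf("partitions per member: between %d and %d\n", least, most)
}
```

Under that assumption, 21 pods sit idle while some of the 29 active members carry two partitions each, which is consistent with lag growing even though the deployment has scaled out.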

Release Note

RELEASE NOTE:

@KrylixZA KrylixZA added the kind/bug Something isn't working label Aug 27, 2023
@KrylixZA
Author

KrylixZA commented Aug 29, 2023

And while we're at it, the package should probably change from Shopify/sarama to IBM/sarama, as mentioned in release v1.41.0.
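As a rough illustration of what the module path change means for source files, the sketch below rewrites the old import path to the new one in a piece of Go source text. The helper name `migrateImports` is hypothetical, and a plain string replacement is an assumption that is good enough for illustration; it is not how the actual change was made.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	oldPath = "github.com/Shopify/sarama" // module path before the ownership move
	newPath = "github.com/IBM/sarama"     // module path as of v1.41.0
)

// migrateImports replaces every occurrence of the old module path with the
// new one. Subpackage imports are covered because they share the same prefix.
func migrateImports(src string) string {
	return strings.ReplaceAll(src, oldPath, newPath)
}

func main() {
	src := "import (\n\t\"github.com/Shopify/sarama\"\n\t\"github.com/Shopify/sarama/mocks\"\n)\n"
	fmt.Print(migrateImports(src))
}
```

The same replacement would also need to be reflected in the `require` line of `go.mod`, pointing at the new path and a version of v1.41.0 or later.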


dnwe added a commit to dnwe/components-contrib that referenced this issue Aug 29, 2023
Also pin to v1.37.2 as that shouldn't be necessary since
dapr/docs#3474 added documentation to add a
version pin for Azure EventHubs users.

Note: the module path has changed to github.com/IBM/sarama since
ownership transitioned away from Shopify

Fixes dapr#3104

Signed-off-by: Dominic Evans <dominic.evans@uk.ibm.com>
dnwe added a commit to dnwe/components-contrib that referenced this issue Aug 29, 2023
Also remove old module replace pin to v1.37.2 as that shouldn't be
necessary since dapr/docs#3474 added
documentation to add a config version pin for Azure EventHubs users
instead.

Note: the module path has changed to github.com/IBM/sarama since
ownership transitioned away from Shopify

Fixes dapr#3104

Signed-off-by: Dominic Evans <dominic.evans@uk.ibm.com>
@berndverst
Member

We cannot upgrade sarama because it causes issues with Azure EventHubs Kafka compatibility. We will not be upgrading this library ahead of the 1.12 release unless we can find sufficient time to perform manual tests against Azure EventHubs. In particular, this broke SASL Password Auth.

@berndverst
Member

See: #2874
and #2755

@berndverst
Member

In release notes they state the default version is changing: https://github.com/IBM/sarama/releases/tag/v1.41.0

DefaultVersion V2_1_0_0

This could be a breaking change for Dapr. Fortunately it is not, because we pin version V2_0_0_0 in our component code (unless someone overrides it with a metadata property).
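The pin-with-override behavior described above can be sketched roughly as follows. This is not the actual component code: the function name, the `version` metadata key, and the use of plain strings in place of sarama's version type are all assumptions made for illustration.

```go
package main

import "fmt"

// pinnedVersion stands in for the V2_0_0_0 pin as a plain string
// (the real code would use the library's version type).
const pinnedVersion = "2.0.0"

// resolveVersion returns the Kafka protocol version to configure: the value
// of the hypothetical "version" metadata key if set, otherwise the pin.
// Because a value is always set explicitly, a change in the library's
// default version does not affect the result.
func resolveVersion(metadata map[string]string) string {
	if v, ok := metadata["version"]; ok && v != "" {
		return v
	}
	return pinnedVersion
}

func main() {
	fmt.Println(resolveVersion(nil))
	fmt.Println(resolveVersion(map[string]string{"version": "1.0.0"}))
}
```

The point of the design is that the library default only matters when the caller leaves the version unset, which this resolution step never does.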

It's very important to examine the release notes carefully. Dapr integrates with many Kafka-compatible services, so any change in default behavior could break users. We must be conservative here.

Unless we evaluate EventHubs compatibility with this new library version we cannot get this into the 1.12 release. We are also already beyond our official code freeze and the team is stretched thin.

Let's see what we can do.

@berndverst
Member

For now I will mark this as 1.13 release - but we will see if we can get to it for 1.12

@berndverst berndverst added this to the v1.13 milestone Aug 29, 2023
@berndverst
Member

Alright, after looking into this further - let's proceed with the upgrade for 1.12.

@KrylixZA
Author

Hey @dnwe / @berndverst.

Just letting you know we've been running 1.11.3-rc.1 in our dev, test, and staging environments for most of the day with no issues. In fact, our Kafka consumers are noticeably more performant and responsive than before.

Pretty confident there are no regressions to worry about here 👍

Thanks. Appreciate your efforts.

@artursouza artursouza assigned KrylixZA and unassigned KrylixZA Oct 5, 2023