Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple consumer group offset problems with 0.4.36 #1021

Closed
dselans opened this issue Nov 2, 2022 · 16 comments
Closed

Multiple consumer group offset problems with 0.4.36 #1021

dselans opened this issue Nov 2, 2022 · 16 comments
Assignees
Labels

Comments

@dselans
Copy link

dselans commented Nov 2, 2022

Describe the bug

Several consumer group issues with ReadMessage and FetchMessage when using lib version 0.4.36:

  1. Offsets are not committed at all when CommitMessages is called manually after a successful FetchMessage
  2. Offsets are partially committed when using ReadInterval + CommitInterval set to 0
  3. Downgrading to 0.4.35 fixes both of the issues

Kafka Version

2.4.1.1

To Reproduce

groupID := "foo"

readerConfig := kafka.ReaderConfig{
	Brokers: k.Options.Brokers,
	GroupID: groupID,
	Topic:   topic,
	MaxWait: k.Options.ReaderMaxWait,
	Dialer:  k.Dialer,
	CommitInterval: 0,
}

reader := kafka.NewReader(readerConfig) 

// Try to read some messages
m, err := reader.ReadMessage(context.Background())
if err != nil {
    // err
}

// Close reader, create new one w/ same GroupID
m, err := reader.ReadMessage(context.Background())
if err != nil {
    // err
}

// A dupe read will occur

// If you view kafdrop or use another utility to view lag - the consumer group will likely have lag as well.

// When using FetchMessage and calling CommitMessages - offsets do not appear to be committed at all.

Thank you!

@dselans dselans added the bug label Nov 2, 2022
@moogacs
Copy link
Contributor

moogacs commented Nov 2, 2022

+1, with the new version the consumer group lag increases

@moogacs
Copy link
Contributor

moogacs commented Nov 2, 2022

I am suspecting this PR #947

@rhansen2 rhansen2 self-assigned this Nov 2, 2022
@rhansen2
Copy link
Collaborator

rhansen2 commented Nov 2, 2022

Hi @dselans,

Thanks for reporting these issues. Based on your Kafka version it looks like you're using MSK is that correct? If so are you able to reproduce your issues against non-MSK clusters?

I'm struggling to reproduce this behavior, are you able to provide logs from your readers? They may be helpful in determining what's going on.

https://pkg.go.dev/github.com/segmentio/kafka-go#readme-logging

Thanks!

@RO-29
Copy link

RO-29 commented Nov 2, 2022

I am facing partially the same issue and it caused a problem on our side during this week due to duplicates that were generated in large quantity

offsets are committed partially

If it helps I started receiving multiple warnings also after this upgrade

I am suspecting this PR #947

I agree on this aspect too as I see memberID being introduced in this PR but not entirely sure of the change

 Failed to join group identifier-ingester: [79] Member ID Required: the group member needs to have a valid member id before actually entering a consumer group

@dselans
Copy link
Author

dselans commented Nov 2, 2022

@rhansen2 yes, we're using MSK; we also observed similar issues + duplicates in GCP running in k8s ( kafka:2.8.0-debian-10-r43).

I forgot to mention the dupe issue that @RO-29 posted about. We saw dupes and the offset issues in Kafka 2.8.0 but I assume that's because of the funky offset problems. I think we did not see the dupe issue in MSK but that might be because of insufficient testing.

@dselans
Copy link
Author

dselans commented Nov 2, 2022

@rhansen2 We already moved off of 0.4.36 to .35 so I do not have a way to test it again right away. I can do it if you cannot replicate but it'll have to wait until tomorrow. LMK

@rhansen2
Copy link
Collaborator

rhansen2 commented Nov 2, 2022

@dselans I'm still not having any luck reproducing. Could you share the config for the dialer you're using in your code sample? Maybe there's something there that I'm missing.

@RO-29 @moogacs Are you able to provide any additional context/code samples/logs or a reproduction?

Thanks!

@moogacs
Copy link
Contributor

moogacs commented Nov 2, 2022

my config is likse what @dselans mentioned and on top ReadBatchTimeout: 30 * time.Second

i get errors like

Member ID Required: the group member needs to have a valid member id before actually entering a consumer group

kafka version: 3.2.0

@phoenix2x
Copy link

We had multiple issues with 0.4.36 as well:

  1. Our small DEV cluster got overwhelmed(100% CPU consumption ) with FIND_COORDINATOR requests.
  2. Our larger QA cluster was able to keep up with increased load but we've seen most of the consumer group lags were just growing indefinitely.

Both issues were fixed with the downgrade to 0.4.35
kafka version: 2.2.1

@rhansen2
Copy link
Collaborator

rhansen2 commented Nov 3, 2022

#1022 Contains some potential fixes for some of the issues 0.4.36 but I'm still not able to reproduce any issues with committing offsets in MSK, Strimzi or running Kafka locally. Some of the issues seem related to interaction with other members of consumer groups.

Is anyone using a mixture of clients within a single consumer group?

As always, logs are very helpful for debugging.

Thanks!

@vitaly-m
Copy link

vitaly-m commented Nov 8, 2022

Hello, I have the same problem here some logs, maybe they will help to localize the issue.

[2022-11-08T15:24:58.757][level=INFO]: joined group app-name as member app@app-name-v1-8574bff57b-485fg (github.com/segmentio/kafka-go)-fc9ebfda-bed0-4f88-96a2-3d9b84c9b4ef in generation 3735
[2022-11-08T15:24:58.757][level=INFO]: joinGroup succeeded for response, app-name.  generationID=3735, memberID=app@app-name-v1-8574bff57b-485fg (github.com/segmentio/kafka-go)-fc9ebfda-bed0-4f88-96a2-3d9b84c9b4ef
[2022-11-08T15:24:58.757][level=INFO]: Joined group app-name as member app@app-name-v1-8574bff57b-485fg (github.com/segmentio/kafka-go)-fc9ebfda-bed0-4f88-96a2-3d9b84c9b4ef in generation 3735
[2022-11-08T15:24:58.757][level=INFO]: joined group app-name as member app@app-name-v1-8574bff57b-485fg (github.com/segmentio/kafka-go)-a8cd61a3-dfeb-4591-8849-a3802f339a21 in generation 3735
[2022-11-08T15:24:58.758][level=INFO]: joinGroup succeeded for response, app-name.  generationID=3735, memberID=app@app-name-v1-8574bff57b-485fg (github.com/segmentio/kafka-go)-a8cd61a3-dfeb-4591-8849-a3802f339a21
[2022-11-08T15:24:58.758][level=INFO]: Joined group app-name as member app@app-name-v1-8574bff57b-485fg (github.com/segmentio/kafka-go)-a8cd61a3-dfeb-4591-8849-a3802f339a21 in generation 3735
[2022-11-08T15:24:58.774][level=ERROR]: Failed to sync group app-name: kafka.(*Client).SyncGroup: unexpected EOF
[2022-11-08T15:24:58.774][level=INFO]: Leaving group app-name, member app@app-name-v1-8574bff57b-485fg (github.com/segmentio/kafka-go)-a8cd61a3-dfeb-4591-8849-a3802f339a21
[2022-11-08T15:24:58.774][level=ERROR]: Failed to sync group app-name: kafka.(*Client).SyncGroup: unexpected EOF
[2022-11-08T15:24:58.774][level=INFO]: Leaving group app-name, member app@app-name-v1-8574bff57b-485fg (github.com/segmentio/kafka-go)-fc9ebfda-bed0-4f88-96a2-3d9b84c9b4ef
[2022-11-08T15:24:58.777][level=ERROR]: kafka.(*Client).SyncGroup: unexpected EOF
[2022-11-08T15:24:58.778][level=ERROR]: kafka.(*Client).SyncGroup: unexpected EOF

@vitaly-m
Copy link

vitaly-m commented Nov 8, 2022

I have also tried the fix:

replace github.com/segmentio/kafka-go v0.4.36 => github.com/rhansen2/kafka-go v0.4.17-0.20221107055315-105fbb979829

It works fine for me, here are some logs:

[2022-11-08T15:48:38.981][level=INFO]: entering loop for consumer group, app-name
[2022-11-08T15:48:39.017][level=INFO]: entering loop for consumer group, app-name
[2022-11-08T15:48:46.736][level=INFO]: joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075 in generation 3958
[2022-11-08T15:48:46.736][level=INFO]: joinGroup succeeded for response, app-name.  generationID=3958, memberID=app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075
[2022-11-08T15:48:46.736][level=INFO]: Joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075 in generation 3958
[2022-11-08T15:48:46.737][level=INFO]: joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2 in generation 3958
[2022-11-08T15:48:46.737][level=INFO]: joinGroup succeeded for response, app-name.  generationID=3958, memberID=app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2
[2022-11-08T15:48:46.737][level=INFO]: Joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2 in generation 3958
[2022-11-08T15:48:46.797][level=INFO]: joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075 in generation 3959
[2022-11-08T15:48:46.798][level=INFO]: joinGroup succeeded for response, app-name.  generationID=3959, memberID=app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075
[2022-11-08T15:48:46.797][level=INFO]: joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2 in generation 3959
[2022-11-08T15:48:46.798][level=INFO]: Joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075 in generation 3959
[2022-11-08T15:48:46.798][level=INFO]: joinGroup succeeded for response, app-name.  generationID=3959, memberID=app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2
[2022-11-08T15:48:46.798][level=INFO]: Joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2 in generation 3959
[2022-11-08T15:48:46.84][level=INFO]: joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075 in generation 3960
[2022-11-08T15:48:46.84][level=INFO]: joinGroup succeeded for response, app-name.  generationID=3960, memberID=app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075
[2022-11-08T15:48:46.84][level=INFO]: Joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-c991f21b-edcc-4dc8-9cc4-a69e67164075 in generation 3960
[2022-11-08T15:48:46.841][level=INFO]: joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2 in generation 3960
[2022-11-08T15:48:46.841][level=INFO]: selected as leader for group, app-name
[2022-11-08T15:48:46.842][level=INFO]: using 'range' balancer to assign group, app-name
[2022-11-08T15:48:46.842][level=INFO]: found member: app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2/[]byte(nil)
....
[2022-11-08T15:48:46.843][level=INFO]: found member: app@app-name-v1-7f56b9dfc6-p8n7k (github.com/segmentio/kafka-go)-ca6c9793-6ca0-4637-9383-1f5fb87a3290/[]byte(nil)
[2022-11-08T15:48:46.843][level=INFO]: found topic/partition: namespace-0-topic-1/0
[2022-11-08T15:48:46.843][level=INFO]: found topic/partition: namespace-0-topic-1/1
[2022-11-08T15:48:46.843][level=INFO]: found topic/partition: namespace-0-topic-1/2
[2022-11-08T15:48:46.843][level=INFO]: found topic/partition: namespace-0-topic-2/0
[2022-11-08T15:48:46.843][level=INFO]: found topic/partition: namespace-0-topic-2/1
[2022-11-08T15:48:46.843][level=INFO]: found topic/partition: namespace-0-topic-2/2
...
[2022-11-08T15:48:46.844][level=INFO]: found topic/partition: namespace-9-topic-1/0
[2022-11-08T15:48:46.844][level=INFO]: found topic/partition: namespace-9-topic-1/1
[2022-11-08T15:48:46.844][level=INFO]: found topic/partition: namespace-9-topic-1/2
[2022-11-08T15:48:46.844][level=INFO]: found topic/partition: namespace-9-topic-2/0
[2022-11-08T15:48:46.844][level=INFO]: found topic/partition: namespace-9-topic-2/1
[2022-11-08T15:48:46.844][level=INFO]: found topic/partition: namespace-9-topic-2/2

[2022-11-08T15:48:46.844][level=INFO]: assigned member/topic/partitions app@app-name-v1-5f84c45798-lmn6m (github.com/segmentio/kafka-go)-531f9847-1327-454f-88fc-3bcddc0e196a/namespace-1-topic-2/[0]
...
[2022-11-08T15:48:46.845][level=INFO]: assigned member/topic/partitions app@app-name-v1-b6c7f75cd-mmwtj (github.com/segmentio/kafka-go)-fc9656ad-c952-4cfe-afb0-f62e20b6c246/namespace-9-topic-1/[0 1 2]
[2022-11-08T15:48:46.845][level=INFO]: joinGroup succeeded for response, app-name.  generationID=3960, memberID=app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2
[2022-11-08T15:48:46.845][level=INFO]: Joined group app-name as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2 in generation 3960
[2022-11-08T15:48:46.845][level=INFO]: Syncing 22 assignments for generation 3960 as member app@app-name-v1-d99cb9954-nl4jf (github.com/segmentio/kafka-go)-7868f911-5e24-4a9a-84bd-e9442ca170b2
[2022-11-08T15:48:46.895][level=INFO]: sync group finished for group, app-name
[2022-11-08T15:48:46.895][level=INFO]: sync group finished for group, app-name

@rhansen2
Copy link
Collaborator

rhansen2 commented Nov 8, 2022

Hi @vitaly-m,
Thanks for submitting your logs and testing my branch!

I've merged that branch and released it as https://github.com/segmentio/kafka-go/releases/tag/v0.4.37 . Please let me know if it works for you or if you experience any other issues, thanks!

@mangoleaf
Copy link

v0.4.37 did not resolve this issue for my team. We upgraded our go dependencies and all kafka communication within our k8s cluster has since stopped working. After a lot of investigation we have determined we need to downgrade to v0.4.35 to fix the problem.

I recommend v0.4.36 and v0.4.37 be reverted to bring the latest release back to a reliable working state while this issue is being worked out.

@rhansen2
Copy link
Collaborator

rhansen2 commented Nov 8, 2022

Hi @mangoleaf Are you able to provide any information about the issues you're seeing or how to reproduce them?

@rhansen2
Copy link
Collaborator

rhansen2 commented Nov 9, 2022

Consumer group changes have been reverted in https://github.com/segmentio/kafka-go/releases/tag/v0.4.38

@rhansen2 rhansen2 closed this as completed Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants