-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix brokers continually allocating new Session IDs #1644
Conversation
I've sent a request to our legal department to get the CLA signed. |
d2343ed
to
0c4ec76
Compare
Thanks for contributing. I'm going to need to wait for CLA thing to pass before proceeding but LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KJTsanaktsidis thanks for finding this mistake and submitting the PR 👍
The changes look good to me. I'll raise a separate issue to track fully implementing incremental fetching (KIP-227) in the future. FYI the Kafka APIs page maintained by our own @mimaison is a good source of info when trying to determine what version of each api was supported by which Kafka version.
Note, I had also recently seen apache/kafka#7640 in this area which you might be interested in. I don't believe it has been backported from the (unreleased) 2.5 on trunk currently though.
Ah cool - yeah the kafka APIs page says that 1.1 has fetch API v7, so I think I've got it rigfht in this PR. apache/kafka#7640 is interesting - I believe it would make the effect of this bug worse, because it's evicting the oldest sessions (opened by correctly-behaving consumers) in preference to least-recently-used ones (sessions created by the bug in this PR would not actually be ever used, so we'd like them to get evicted). We've compared our broker logs with & without this patch, and this change definitely makes pretty much all of the FETCH_SESSION_ID_NOT_FOUND errors go away, so I don't think KAFKA-9137 is a huge problem for us except for how it interacts with this bug. Thanks for the reviews! I'll let you know when we've executed the CLA. |
So, bad news guys - I spoke to our legal people and they said that there's a bit of a process to go through, and COVID-19 is definitely slowing things down, so they reckon an ETA of about a month to get the CLA signed. I don't know if you guys want to just wait, or independently fix the problem based on my PR description maybe? It's hard to see how you could actually fix it in a different way to what I did, but as long as the person who fixes it doesn't look at my code, it shouldn't matter if they come up with the same solution, right? Again, really sorry about the legal troubles here. |
Just an update on the CLA process - our legal people have OK'd it, just needs to be signed by leadership in the next couple of days and then we should be good to go! |
Thanks for expediting this, look forward to getting this merged 🎉 |
KIP-227 says that if a client provides a session ID of 0 and a session epoch of 0 in a FetchRequest, the broker should allocate a new sessionID. Then, the consumer can use this ID to avoid repeating the list of partitions to consume on every call. Sarama is currently sending 0 for both values, but totally ignoring the session ID generated by the broker. This means that the broker is allocating a new session ID on every request, filling its session ID cache and leading to FETCH_SESSION_ID_NOT_FOUND logs being written by the broker because other _actual_ session IDs are evicted. Instead, we should set the epoch to -1, which instructs the broker not to allocate a new session ID that's just going to be ignored anyway.
0c4ec76
to
b9b3d09
Compare
🎂 @bai we got the CLA signed, this change should be ready to rock! Thanks for your patience. I amended the commit to tickle CI into running the CLA check again. |
Hey, will there soon be a bugfix release version with this included? Currently this is causing quite some problems for our platform and i prefer referencing SemVer releases over commithashes. Thanks :) |
@bai This is also causing us issues. Is it possible to get a new release out? |
KIP-227 says that if a client provides a session ID of 0 and a session epoch of 0 in a FetchRequest, the broker should allocate a new sessionID. Then, the consumer can use this ID to avoid repeating the list of partitions to consume on every call. I figured this out from the table in KIP-227 here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability#KIP-227:IntroduceIncrementalFetchRequeststoIncreasePartitionScalability-FetchRequestMetadatameaning (let me know if I got this wrong!)
Sarama is currently sending 0 for both values (I validated this with a packet capture), but totally ignoring the session ID generated by the broker. This means that the broker is allocating a new session ID on every request, filling its session ID cache and leading to FETCH_SESSION_ID_NOT_FOUND logs being written by the broker because other actual session IDs are evicted (from other consumers). These look like this
Instead, we should set the epoch to -1, which instructs the broker not to allocate a new session ID.
I'm 90% sure I got the versioning right (using
request.Verison = 7
for broker versions >= 1.1.0) - I got the info for this from here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75957546