Consumer offsets discussion #26
* Update the "public" offset before yielding the message
* Add an option to SimpleConsumer.commit that excludes the current offset

Ref #26
@mahendra check out https://github.com/mumrah/kafka-python/tree/issue-26 if you can. If we update the offset before yielding the message, then the only thing we need to be careful about is committing an offset for a message that might not be "done". I've added an option to `SimpleConsumer.commit` for this. LMK what you think, and thanks for all the contributions so far!
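As a rough illustration of the commit option described above, here is a minimal sketch (the function name `commit_offsets` and its signature are hypothetical, not the actual kafka-python API). The idea: if `offsets` tracks the *next* offset to fetch per partition, the message at `offsets[p] - 1` may still be in flight, so an "exclude current" commit backs off by one:

```python
# Hypothetical sketch of a commit that can exclude the in-flight message.
# `offsets` maps partition -> next offset to fetch.

def commit_offsets(offsets, include_current=True):
    """Return the per-partition offsets that would be committed."""
    if include_current:
        return dict(offsets)
    # Commit only up to (but not including) the current in-flight message.
    return {p: off - 1 for p, off in offsets.items()}

print(commit_offsets({0: 5, 1: 4}, include_current=False))
# {0: 4, 1: 3}
```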
@mumrah I just had a look and was a bit confused, so I tried to see if it fixes the problem that I noticed. In the example code that you mentioned, just try this:

```python
consumer = SimpleConsumer(kafka, "my-group", "my-topic")

# This should print all pending messages
for message in consumer:
    print(message)

# No more messages should be printed.
for message in consumer:
    print(message)  # Last message is printed again

for message in consumer:
    print(message)  # Last message is printed again
```

The second for loop will repeat the last message.
@mahendra Is this relying on 0.8.1 offset commits, or just the internal state of SimpleConsumer? I do see how this would be a problem using the internal state (notice how the commit function adds one to the current offset). I wonder what the use case is for repeatedly iterating across the consumer like this?
@mumrah it is relying on the internal state of SimpleConsumer. This is for the case where I need to use just a single consumer instance and keep fetching messages periodically. Creating a new consumer instance every time may be overkill, especially with having to spawn threads for timer-based commits.

So, I create a consumer, iterate over it to get a set of messages, wait for some time, then iterate over it again. I had implemented an API for get_messages which did this, but I just noticed that you had raised reservations about it.

An example of this is using kafka as a broker for celery. Do have a look at this example. I would even prefer to have a blocking API, get_messages(block=True, timeout=None), if we can pull it off.
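To make the proposed `get_messages(block=True, timeout=None)` semantics concrete, here is a minimal client-side polling sketch. This is only an illustration, not the kafka-python implementation; `fetch` stands in for whatever call retrieves a batch of pending messages:

```python
import time

def get_messages(fetch, block=True, timeout=None, poll_interval=0.1):
    """Hypothetical blocking fetch: call fetch() (which returns a possibly
    empty list of messages) until something arrives or the timeout expires.

    block=False returns immediately with whatever is pending;
    timeout=None blocks forever.
    """
    deadline = None if timeout is None else time.time() + timeout
    while True:
        messages = fetch()
        if messages or not block:
            return messages
        if deadline is not None and time.time() >= deadline:
            return []
        time.sleep(poll_interval)
```

A blocking API like this avoids the busy wait-then-iterate pattern described above; in practice the polling can also be pushed server-side (see the FetchRequest-based approach mentioned later in this thread).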
@mumrah I had a look at that code. The code in queue.py refers to a standalone kafka queue implementation. What I am trying to do is to make this fit into other queueing/messaging frameworks (like kombu, which will make kafka usable under frameworks like celery). I also referred to the following doc on designing a kafka consumer: https://cwiki.apache.org/KAFKA/consumer-group-example.html. Quoting one point that is relevant to our discussion.

Also, if you look at the consumer code there:

```java
ConsumerIterator<byte[], byte[]> it = m_stream.iterator();
while (it.hasNext())
    System.out.println(new String(it.next().message()));
System.out.println("Shutting down Thread: " + m_threadNumber);
```

So, going by that logic, we need our iterator (over a consumer stream) to be
In the current implementation, there is a bug if the iterator is used more than once. If someone has a copy of the iterator and uses it once, then from the second iteration onward, the last message of the previous iteration repeats.
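The repeat-last-message bug can be demonstrated with a toy model of the old iterator (this `BuggyConsumer` class is a simplification for illustration, not the real SimpleConsumer): the stored offset is only advanced when the generator *resumes* after a `yield`, so after iteration ends it still points at the last message consumed.

```python
class BuggyConsumer:
    """Toy model of the old iterator: self.offset records the offset of
    the message just yielded, not the next one to fetch."""
    def __init__(self, log):
        self.log = log       # list of messages; index == offset
        self.offset = 0      # stored "public" offset

    def __iter__(self):
        offset = self.offset
        while offset < len(self.log):
            yield self.log[offset]   # generator pauses here...
            self.offset = offset     # ...stored offset lags by one
            offset += 1

consumer = BuggyConsumer(["a", "b", "c"])
first = list(consumer)    # ["a", "b", "c"]
second = list(consumer)   # ["c"]  -- last message repeats
```

Because `self.offset` ends up pointing at the last-consumed message rather than the next one, a fresh iteration starts one message too early.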
@mumrah - as part of the multiprocess consumer work, I tried a blocking API for SimpleConsumer without making use of Queues or Events. This has been done by using FetchRequest parameters. The code is available at https://github.com/mahendra/kafka-python/compare/partition
The iterator and offset increments have changed drastically in the 7 months since this was filed. This should behave as expected now. Reopen if you think it is not resolved.
I think we need to look at the way offsets are advanced and stored by the client, and then fix things properly rather than making little one-off changes. I'm also doing this for my own sake, to refresh myself with this code (it's been a while).
Suppose we have the following data in Kafka, and our starting (partition, offset) values are: (0, 4), (1, 3).

When the consumer's `__iter__` is called, it sets up a list of sub-iterators for each partition with their current offsets. So at this point, we have `__iter_partition__(0, 4)` and `__iter_partition__(1, 3)`.

Here is an annotated version of the loop:

Values of offset, next_offset, and self.offsets[partition]

From a consumer point of view, the state of `self.offsets` reflects the offset of the previous message. Since the generator will pause execution immediately after the `yield`, the state of `self.offsets` will not be updated until the next iteration of the generator. The initial thinking behind this was that you only advance the offsets after the message has been "processed" (whatever that means). However, this means that if you commit the offsets in `self.offsets`, they will lag behind.
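Continuing the toy model above, the fix sketched in the issue-26 branch is to advance the stored offset to the *next* message before yielding (again, `FixedConsumer` is an illustrative simplification, not the actual code):

```python
class FixedConsumer:
    """Toy model of the fix: advance the stored offset past the message
    *before* yielding it, so a fresh iteration does not repeat it."""
    def __init__(self, log):
        self.log = log       # list of messages; index == offset
        self.offset = 0      # stored "public" offset (next to fetch)

    def __iter__(self):
        offset = self.offset
        while offset < len(self.log):
            self.offset = offset + 1   # advance before yielding
            yield self.log[offset]
            offset += 1

consumer = FixedConsumer(["a", "b", "c"])
assert list(consumer) == ["a", "b", "c"]
assert list(consumer) == []   # no repeat on re-iteration
```

The trade-off is exactly the one raised earlier in the thread: with the offset advanced eagerly, committing `self.offsets` may commit a message that is not yet "done", hence the option to exclude the current offset from a commit.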