KafkaProducer produces corrupt "double-compressed" messages on retry when compression is enabled. KafkaConsumer gets "stuck" consuming them #718
This can happen in a Kafka system if there is a message larger than your consumer's max_partition_fetch_bytes. Have you tried tuning that? |
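(A minimal sketch of tuning that setting; the broker address and topic name below are hypothetical, and the kafka-python default is 1048576 bytes per partition:)

```python
from kafka import KafkaConsumer

# Raise the per-partition fetch limit so a single large message can still be
# returned by the broker (hypothetical broker/topic, example value).
consumer = KafkaConsumer(
    'my_topic',
    bootstrap_servers='localhost:9092',
    max_partition_fetch_bytes=10 * 1048576,
)
```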
Thanks. The max_partition_fetch_bytes is already set pretty high (10*1048576), so I don't think that is the issue, but I will try to increase it even further to verify. |
Unfortunately, as expected, max_partition_fetch_bytes did not make any difference; even increasing it 1000-fold did not make the consumer read the message from the partition. So with those settings the issue is the same as above. |
Can you get debug logs for a consumer in this "stuck" state? |
Sure, here is the debug log of the consumer when it is in this "stuck" state. It seems it cannot find the message it is looking for (message #812899910).
|
This is useful, thanks! I'll investigate. |
This is strange -- it appears that the compressed messageset that we get back from the broker has packed incorrect message offsets. The logs suggest that the outer offset is correct:
But the inner offsets appear to be 0:
You are running 0.9 brokers with the v0 message format, correct? If so, compressed messages should always include absolute offsets as calculated by the broker. What are you using to produce these messages? |
Yes, I am running 0.9.0.1 brokers from Confluent (confluent-kafka-2.11.7-0.9.0.1-1.noarch). Not sure about the v0 message format. How could I check that? I use kafka-python to produce the messages, see (simplified) code below:
|
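(The simplified producer snippet referenced above was not captured in this copy of the thread; a rough sketch of such a kafka-python producer, with a hypothetical broker address and topic, and snappy compression as mentioned later in the thread, might look like this:)

```python
from kafka import KafkaProducer

# Hypothetical reconstruction, not the reporter's actual code.
producer = KafkaProducer(
    bootstrap_servers='broker1:9092',   # hypothetical broker address
    compression_type='snappy',          # compression is the key detail in this thread
    retries=3,
)
producer.send('my_topic', b'some payload')
producer.flush()
```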
Can you try grabbing the specific message using something like this: https://gist.github.com/dpkp/452ea2080d54bb615ae4779851ada689 Point that at the host / port for the leader of the topic-partition that has the "stuck" message and send me the output. |
I hope you don't mind but I have emailed the output to you. The same is happening across multiple topics now - CPU is at 100%, but kafka-python gets stuck reading the messages at a given point and can't move any further. |
Thanks -- specifically, which offset are you seeing as "stuck" ? |
It is 813721754 |
I have the same issue, and as far as I can see all requests go out with the default group_id kafka-python-default-group, even though I set a different one (zinc_group). Could this be the reason why we don't see any new messages? I use kafka-python 1.2.1 |
I believe my issue was related to snappy compression. Since I turned snappy compression off in the producers a few days ago, the issue is gone. |
Great, I also found the issue. Now it works. Thanks. |
I haven't been able to identify the root cause here. The debug logs show that it is related to the fetcher discarding compressed messages that have offsets that appear too low. In normal operation this can happen if you request a message offset that falls in the middle of a compressed message set. In that case the fetch correctly scans through the compressed message set, dropping messages until it gets to the requested offset. If messages are not compressed, this code is not necessary because kafka can always return the exact message offset requested. But something is happening here that is causing the compressed message set scan to fail. I assume this bug is on the client side, but I have been unable to track it down. Help appreciated! If anyone can provide a reproducible test case, that would also be very useful. |
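(A simplified sketch, not the actual fetcher code, of the scan described above: messages with offsets below the requested fetch offset are dropped, because a compressed message set can only be returned whole:)

```python
def drop_before_fetch_offset(records, fetch_offset):
    # A compressed message set is returned whole, so it can contain messages
    # earlier than the offset the consumer asked for; skip those.
    for record in records:
        if record.offset < fetch_offset:
            continue
        yield record
```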
Could it be related to this? https://issues.apache.org/jira/browse/KAFKA-3789 Although that seems to only apply to Kafka 0.10, so maybe it isn't applicable. |
I saw this behavior today. I'm using snappy compression. In my case, the problem message was not decoded. I'm running kafka-python 1.2.1, python-snappy 0.5, kafka 0.9.0.1 |
Please disregard what I said about problems in decompression -- the message eventually shows up decompressed. I haven't quite wrapped my head around the recursive _unpack_message_set calls and the relative vs. absolute offsets, but my stuck message ends up being yielded as a ConsumerRecord with a (relative) offset 0, and the scan at https://github.com/dpkp/kafka-python/blob/master/kafka/consumer/fetcher.py#L440 is checking for an absolute offset:
Hope this is useful. |
Thanks for the details. The _unpack_message_set should only be recursive when dealing with compressed messages. To encode "compressed" messages, kafka writes the messages into a message set structure, encodes the message set into bytes, compresses those bytes, writes the compressed bytes as a new, single message with the compression flag set, and then writes this new message as the only item in a new message set. It is this new "compressed" message set that is sent to and received from the broker. So in practice decoding a compressed messageset requires decoding the wrapper message set, getting the underlying message, decompressing the message into a new messageset, and decoding the new messageset into a list of uncompressed messages.

The relative / absolute offset change happened in kafka 0.10 clients/brokers. Older messages had encoded the underlying messages with absolute offsets. But this required the broker to decompress / recompress every compressed messageset as it was produced. The new approach only writes the absolute offset to the outer messageset, and uses relative offsets in the inner messageset in order to avoid the broker decompress/recompress step when handling produced messages.

I'm not sure whether that leaves you more confused or less, but hopefully this adds some useful background. |
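(A schematic sketch of the wrapping described above, not kafka-python's real encoder; gzip and JSON stand in for the actual wire format:)

```python
import gzip
import json

def encode_message(offset, value, compressed=False):
    # Stand-in for the real v0 message wire format (offset, attributes, value).
    return {'offset': offset, 'compressed': compressed, 'value': value}

def build_compressed_message_set(values, base_offset=0):
    # 1. write the plain messages into an inner message set
    inner_set = [encode_message(base_offset + i, v) for i, v in enumerate(values)]
    # 2. serialize the inner set and compress those bytes
    compressed_bytes = gzip.compress(json.dumps(inner_set).encode())
    # 3. wrap the compressed bytes in a single message with the compression flag set
    wrapper = encode_message(inner_set[-1]['offset'], compressed_bytes, compressed=True)
    # 4. the wrapper is the only item in the outer message set sent to / from the broker
    return [wrapper]

def unpack_message_set(message_set):
    # Decoding reverses the steps: decode the wrapper, decompress, decode the inner set.
    for msg in message_set:
        if msg['compressed']:
            for inner_msg in json.loads(gzip.decompress(msg['value'])):
                yield inner_msg
        else:
            yield msg
```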
Yes, that helps -- explains why I had a compressed message even after I decompressed it once. I can poke around in the debugger again to try to understand how that 0 offset is getting out to that level. |
Thanks -- it could be a bug in _unpack_message_set or it could be that there is an edge case where a compressed message set gets returned with 0 as the first offset and maybe the java client handles this case? I tried to look at raw data from zoltan to see if it was the 0 offset case, but the data looked normal and I was unable to reproduce in a repl. If you can get this to happen in a debug session, please post details! I have not been able to reproduce so far. |
Here's a narrative from my debug session:

```
# we get to the top level _unpack_message_set with the right offset
> msg.is_compressed()
True
> offset
1057322
> relative_offset
0
# after mset = msg.decompress()
> mset[0][0]
1057322
# call _unpack_message_set a second time
> msg.is_compressed()
True
> offset
1057322
> relative_offset
0
# after msg.decompress()
ipdb> mset[0][0]
0
# _unpack_message_set third time around
> msg.is_compressed()
False
> offset
0
> relative_offset
0
# finally we yield ConsumerRecord(... offset + relative_offset) <-- with the undesirable zero
```

Whereas in the normal, working case:

```
# _unpack_message_set
> msg.is_compressed()
True
> offset
1057321
> mset[0][0]
1057321
# _unpack_message_set again
ipdb> msg.is_compressed()
False
> offset
1057321
> relative_offset
0
# yield ConsumerRecord(... offset + relative_offset) <-- 1057321
```

I haven't busted into the raw response yet... |
I got the same issue. After I changed "log.cleanup.policy = compact" in the server.properties to "log.cleanup.policy = delete", it works well now! |
Hey, seems I got the same problem. Here's my consumer:

```python
from kafka import KafkaConsumer, TopicPartition

my_topic = 'xxx'  # placeholder topic name

consumer = KafkaConsumer(
    bootstrap_servers = 'xxx',
    group_id = 'my.group',
    max_partition_fetch_bytes = 1048576*1000,
)
consumer.assign([
    TopicPartition(topic=my_topic, partition=0),
])
consumer.seek(TopicPartition(topic=my_topic, partition=0), 706)
for msg in consumer:
    print msg.topic, msg.partition, msg.offset, msg.key
```

The "stuck" offset is … and here comes the log (I replaced the …): https://gist.github.com/ayiis/b5d5738722a0bfb184bf21c8230f4776

~PS I upgrade kafka-python to …: https://gist.github.com/ayiis/45e7489f7e8792888f3f799fc9652ad0

~PS I upgrade kafka-python to … The offset … First I got … |
Are any of you consuming from topics that are being produced by MirrorMaker ? |
I'm still unable to reproduce this issue, but it would be great if anyone who can could test against PR #755 |
I tried PR 755 and it worked fine! Now I can consume … btw, I use … Thank you.

~PS I looked into the … and it is exactly what has happened. Still don't know why, but I guess my message in kafka is somehow broken or invalid. |
I reproduced this one. In my case, it was caused by a producer on a bad network.

```python
from kafka import KafkaProducer
import time

for x in xrange(100):
    try:
        producer = KafkaProducer(
            bootstrap_servers = 'xxx',
            compression_type = 'gzip',
            retries = 5,
            retry_backoff_ms = 1000,
        )
        print 'create producer success'
        # Start sending RST to this producer whatever producer.send(), to cause a network RESET.
        # To cause producer retries
        # PS: This will not cause [BrokerConnection reset] Exception
        time.sleep(5)
        record_metadata = producer.send(topic = 'test.ay.1', value = 'Hello kafka!' + str(x)).get()
        producer.flush()
        producer.close()
    except Exception, e:
        print 'Exception:', e
```

Keep responding: …

Stop sending: … |
aha! that makes sense. KafkaProducer may actually be the culprit here by double-compressing messages if there is a failure + retry. |
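(A conceptual sketch of that failure mode, not kafka-python's actual producer code: if a retry re-enqueues the already-wrapped batch, the compression step runs a second time over the wrapper:)

```python
import gzip

def compress_batch(messages):
    # Wrap a list of encoded messages in a single "compressed" wrapper message.
    return [b'WRAPPER:' + gzip.compress(b''.join(messages))]

batch = [b'msg-1', b'msg-2']

wire_batch = compress_batch(batch)      # first attempt: one compressed wrapper message

# Buggy retry path: the already-wrapped batch is re-enqueued and compressed again,
# so the wrapper itself ends up inside another wrapper...
wire_batch = compress_batch(wire_batch)

# ...and a consumer that decompresses once still finds a compressed message inside,
# with inner offsets starting at 0, matching the debug session earlier in the thread.
```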
I can reproduce -- excellent work, @ayiis ! |
I was able to force this behavior in KafkaProducer by hacking the retry code (this forces retries even on success):

using a producer configured like this:

I will submit a patch to fix KafkaProducer. The current behavior of KafkaConsumer is to skip these double-compressed messages. I will think about whether we can or should attempt to decompress them now that we know how they are constructed. Unfortunately these messages are likely incompatible with other clients and so I'm actually hesitant to add code to handle them here. So I am currently leaning towards skip w/ warning. Thoughts? |
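(A minimal sketch of the "skip w/ warning" option being discussed, not the actual kafka-python code:)

```python
import logging

log = logging.getLogger(__name__)

def unpack_inner_message(inner_msg, absolute_offset):
    # If a message found inside a decompressed message set is itself compressed,
    # it was double-compressed by a buggy producer retry: warn and drop it.
    if inner_msg.is_compressed():
        log.warning('Skipping double-compressed message at offset %d', absolute_offset)
        return None
    return inner_msg
```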
Yeah, if a message is malformed, discarding while making noise about it seems appropriate. |
Quite agree with |
I made some changes to the consumer side of compressed message sets in #755. I have tried a few different approaches and at this point I am leaning towards making the default behavior to return the corrupt message to the user. I added a configuration parameter named … Does this sound like a reasonable solution? |
I can abide that; returning the inner compressed data would be what you'd expect kafka-python to do if the producer bug was in some other library's producer. |
For these double-compressed messages, sure, the consumer could return the corrupt message or have a setting to suppress them (step over them). |
Absolutely! I've already landed a fix for the producer to master.
|
That would be great! |
Fixes to both KafkaProducer and KafkaConsumer have been released in 1.2.5 -- please reopen if this issue resurfaces! |
Thanks again to everyone for all the hard work tracking this one down!! |
This is an interesting one.
Every day, in all of our topics, a handful of partitions get "stuck".
Basically, the reading of the partition stops at a given message and kafka-python reports that there are no more messages in the given partition (as if it had consumed all messages), while there are still unconsumed messages.
The only way to get the consumers moving again is to manually seek the offset forward, stepping over the "stuck" messages; consumption then works again for a few million records and gets stuck again at some later offset.
I have multiple consumers consuming from the same topic and they all get stuck at the same messages of the same topics. A random number of partitions is affected day-to-day.
We are using Kafka broker version 0.9.0.1, kafka-python 1.2.1 (had the same issue with 1.1.1).
The consumer code is very simple (the below code is trying to read only partition #1, which is currently "stuck"):
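(The original snippet was not captured in this copy of the thread; a rough sketch of such a consumer, with hypothetical broker address, topic name, and a consumer_timeout_ms so the loop can end, might look like this:)

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers='broker1:9092',   # hypothetical broker address
    group_id='my-group',                # hypothetical group id
    consumer_timeout_ms=10000,          # assumed, so iteration stops when no more messages arrive
)
consumer.assign([TopicPartition('my_topic', 1)])   # read only partition 1

for msg in consumer:
    print(msg.offset, msg.value)

print('Completed')
```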
The above code prints "Completed", but not the messages, while there is a 5M offset lag in partition 1, so there should be plenty of messages to read.
After seeking the consumer offset forward, the code works again until it gets "stuck" again at a later offset.