Extremely high memory usage for zstd decompression #1831
Also tagging @klauspost in case you have any insight about this. It seems like sarama switched from using a cgo-based zstd library to `klauspost/compress` in between these versions.

Also, I forgot to mention: we also have applications that interact with Kafka through librdkafka.
It will allocate here if the dst slice capacity is too small. If the compression was done with the frame content size included, the output can be sized up front. If the encode doesn't contain the frame content size and a 0-capacity slice is passed, it will allocate a default-sized buffer.

Either way, it will hand back control of the dst slice when done, so it seems like it is kept referenced internally somewhere.

@JoshKCarroll could you add a memory trace so we can see where the decompressed output is being retained?
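For readers following the allocation discussion, here is a minimal sketch of the two `DecodeAll` paths being described, assuming the `klauspost/compress/zstd` API; the payload and buffer sizes are illustrative, not what sarama actually passes:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Encode a small payload so the decode below has real input to work on.
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		panic(err)
	}
	payload := bytes.Repeat([]byte("kafka message "), 100)
	compressed := enc.EncodeAll(payload, nil)
	enc.Close()

	// A stateless decoder; passing a nil reader is the DecodeAll usage pattern.
	dec, err := zstd.NewReader(nil)
	if err != nil {
		panic(err)
	}
	defer dec.Close()

	// Case 1: nil dst. If the frame carries no content size, the decoder has to
	// pick an output capacity itself and can over-allocate for small messages.
	out1, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		panic(err)
	}

	// Case 2: caller-supplied dst. DecodeAll appends to it, so a reused buffer
	// with enough capacity avoids a fresh allocation per message.
	buf := make([]byte, 0, 2*len(compressed))
	out2, err := dec.DecodeAll(compressed, buf)
	if err != nil {
		panic(err)
	}

	fmt.Println(len(out1), len(out2))
}
```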
Thank you for this information! I will see if we can get this trace and report back. For context, I believe in all our cases where we see this behavior, the compression would have been done by librdkafka.

In our particular case the impact of the issue is not huge, since we only have this interaction on a couple of topics, and so we just switched those to snappy. But it's still interesting, and we're happy to see whether we can help debug.
It could be that the compressed data doesn't contain the uncompressed size and that sarama gives a slice with 0-byte capacity, so 1MB is allocated for each output. If sarama wants to limit that, it can allocate a different size.
@klauspost that's a good thought. I see the rdkafka producer has a conditional around whether to include the uncompressed size, based on the linked zstd version: https://github.com/edenhill/librdkafka/blob/3b73ceb2049777331f3e544270bd8d0cdcd41de5/src/rdkafka_zstd.c#L149-L156
Actually, it could be tweaked to allocate "compressed size * 2", or at most 1MiB, up front. That should make this case behave more predictably. It may have an unreasonably large window size for the content.
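A rough sketch of the sizing heuristic described above ("compressed size * 2", capped at 1MiB); this only illustrates the idea and is not the actual code from the library change:

```go
package main

import "fmt"

// estimateDst returns a destination buffer for decoding a frame that does not
// declare its content size: twice the compressed size, capped at 1 MiB.
// This sketches the sizing idea only; the real library logic may differ.
func estimateDst(compressedLen int) []byte {
	const maxUpfront = 1 << 20 // 1 MiB cap
	hint := compressedLen * 2
	if hint > maxUpfront {
		hint = maxUpfront
	}
	return make([]byte, 0, hint)
}

func main() {
	fmt.Println(cap(estimateDst(64)), cap(estimateDst(4<<20))) // 128, 1048576
}
```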
Added klauspost/compress#295
@BradLugo The PR above has been released as compress v1.11.3.
Hi, just following up. The v1.11.3 version did not stem this issue. Things still have the same basic RAM explosion, all in the zstd decoder.
This issue is affected by the sarama setting Consumer.Fetch.Default.

If the value of Consumer.Fetch.Default is in the 4MB range and the message volume is in the 1000s/second, one goes from 0 -> 30GB in a matter of seconds. If the value of Consumer.Fetch.Default is in the 8kB range and the message volume is in the 1000s/second, one goes from 0 -> 30GB in a matter of 15 min.

This issue does seem to affect the producer side of things as well, not as violently, but we have shown at a message rate of ~100/second that the RAM requirements increase dramatically (from 200MB to 2-4GB) and keep growing.

Using the confluent go lib (librdkafka) does not have this issue under the same conditions (100k msg/sec consuming zstd-compressed messages takes ~500MB-1GB of RAM total).
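For orientation, a hedged sketch of where these knobs live on sarama's `Config` (using the `github.com/Shopify/sarama` import path from that era); the values are the ones discussed in this thread, not recommendations:

```go
package main

import (
	"fmt"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	// Default fetch size per broker request, in bytes; the comment above
	// contrasts a ~4MB setting with a ~8kB setting.
	cfg.Consumer.Fetch.Default = 4 * 1024 * 1024

	// Buffer size of sarama's internal channels; the original report also
	// experimented with this value.
	cfg.ChannelBufferSize = 256

	fmt.Println(cfg.Consumer.Fetch.Default, cfg.ChannelBufferSize)
}
```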
@wyndhblb Then I think you are looking in the wrong place. Something else is holding on to the references of the output.
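For anyone trying to pin down what retains the decompressed slices, a minimal sketch of exposing a heap profile with the standard library; the listen address is an assumption, and the consuming application would need to add something like this itself:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
	// With this running inside the consumer, something like
	//   go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap
	// shows which call sites are keeping the decompressed buffers alive.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```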
@klauspost I'll clarify a bit.

The issue really appears when consuming a "full backlog" (say many millions of messages, like a full topic read). If the consumed messages are simply /dev/null'ed, the issue still appears in full force.

This issue is still very present, and only under zstd; gzip, snappy, lz4, and none do not suffer this same large increase in memory usage, and all pprof investigations point to the zstd decoder.
I'm still unclear if it's the zstd decoder or sarama holding on to slices somewhere; just pointing out this "issue" is not resolved by simply using the v1.11.3 release of the compression lib.
More follow-up: this issue appears to be a GC one. Memory is not leaking, but the GC cannot keep up under normal circumstances. Some testing specs:
Some results:
- @ sarama's default settings (1MB fetch size) w/ GOGC=100
- @ sarama's default settings (1MB fetch size) w/ GOGC=10
- @ sarama's default settings but with 8kB fetch size w/ GOGC=100
- @ sarama's default settings but with 8kB fetch size w/ GOGC=10
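For reference, the GOGC values used in these runs can be set via the environment (e.g. `GOGC=10`) or programmatically; a minimal sketch of the programmatic form, not taken from the application under test:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOGC=10: trigger a collection once the heap
	// grows 10% over the live set, trading CPU time for a smaller peak heap.
	prev := debug.SetGCPercent(10)
	fmt.Println("previous GOGC percent:", prev)
}
```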
Just in case this is relevant: the "producer" of these messages is using the JVM Kafka lib 2.6.0 with zstd, via the `com.github.luben:zstd-jni:1.4.4-7` lib.
This has given me some interesting times today. We use Kafka to transport systems metrics and consume them with Telegraf into InfluxDB. We changed one of our metrics producers to use zstd compression, and Telegraf has exploded. We'll revert the compression, and maybe change to lz4, but in the meantime I have a topic full of compressed messages that I can't consume.

The only direct control that we have in Telegraf to change Sarama's behaviour is to select the Kafka version support. Is there any non-programmatic way to change settings such as the above (fetch size, GOGC)?
@hackery so @wyndhblb saw an improvement (I believe) after @klauspost released v1.11.5 of compress containing klauspost/compress#306, which I believe we've been using since Sarama v1.28.0.

Can you confirm which version of Sarama you were testing with? It looked like from your linked issue that Telegraf was on Sarama v1.27.1 at the time, and looking at the main branch it is currently at v1.27.2: https://github.com/influxdata/telegraf/blob/3eebfd2f0fd4d1768936ae98f601c827f6a271a2/go.mod#L35
@dnwe sorry, missed seeing your update; thanks for the pointer, we'll take note in our upgrade plans. I had to do a bit of code and update spelunking to find what version we were actually using at the time, which was Telegraf 1.13.3, using Sarama 1.24.0. Looks like the first version of Telegraf using Sarama >=1.28.0 is 1.20.3. We're currently on 1.20.2, which is the latest in our repo.
The memory usage with zstd is still quite high (with Sarama 1.32 and compress 1.15.9). In my application, I see 1GB allocated during zstd decompression.
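As a hedged aside, `klauspost/compress/zstd` exposes decoder options that trade throughput for a smaller footprint; whether sarama passes any of these through is a separate question, so this sketch only shows the library-level knobs:

```go
package main

import (
	"fmt"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Decoder options that reduce the memory footprint at some cost in speed.
	// This shows what the compression library itself exposes, not anything
	// sarama is known to configure.
	dec, err := zstd.NewReader(nil,
		zstd.WithDecoderConcurrency(1),    // a single decompressor state
		zstd.WithDecoderLowmem(true),      // prefer smaller internal buffers
		zstd.WithDecoderMaxMemory(64<<20), // refuse frames that would exceed 64 MiB
	)
	if err != nil {
		panic(err)
	}
	defer dec.Close()
	fmt.Println("decoder ready")
}
```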
@JoshuaC215 where are we with this issue today?
Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. |
Versions
When I ran with the same config / same cluster / same data on sarama 1.21.0 / Golang 1.12 (CGO), I did NOT encounter this issue. (I also tested briefly with sarama 1.27.0 / Go 1.15 (CGO) and saw the same issue.)

Configuration
What configuration values are you using for Sarama and Kafka?
Using defaults from here, except for setting ChannelBufferSize to a variety of values including 256, 16, 4, 1, and 0; we see the same result regardless.

Logs
Unable to grab logs easily since we're using sarama in an existing dockerized open source application. Depending on the need, we may be able to delve further to get them.
Problem Description
We have been using sarama within the benthos (v3.22) stream processing tool for over a year. We have run it for an extended period in our environment with compression set to None and Snappy extensively, and never had any issues.
Recently we switched over to using zstd compression in our main data pipeline. Under normal load (5-10MB/s on the topic) we see no issue; our application runs under 200MB of memory usage.
When running under a consumer backlog, the exact same config will suddenly start using 2.3GB+ of memory. We run it on nodes with only 4GB of memory, and Kubernetes rapidly kills the pod in this state.
Looking at a memory trace (again, this is within a few seconds of launch), we see all the memory used by the zstd decoder.
Memory traces
When we switch to an older sarama version (via benthos v1.20), the issue goes away. When we switch to messages being written to the topic with snappy instead of zstd, the issue goes away.