compress/flate: very memory intensive #32371

nhooyr · 2019-05-31T19:41:30Z

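package websocket_test // package name is assumed; the output below only shows the import path

import (
	"bytes"
	"compress/flate"
	"compress/gzip"
	"testing"
)

// BenchmarkCompress measures the cost of constructing (not using)
// flate and gzip writers and readers.
func BenchmarkCompress(b *testing.B) {
	b.Run("writer", func(b *testing.B) {
		b.Run("flate", func(b *testing.B) {
			b.ReportAllocs()

			for i := 0; i < b.N; i++ {
				flate.NewWriter(nil, flate.BestSpeed)
			}
		})
		b.Run("gzip", func(b *testing.B) {
			b.ReportAllocs()

			for i := 0; i < b.N; i++ {
				// gzip.BestSpeed has the same value as zlib.BestSpeed.
				gzip.NewWriterLevel(nil, gzip.BestSpeed)
			}
		})
	})
	b.Run("reader", func(b *testing.B) {
		b.Run("flate", func(b *testing.B) {
			b.ReportAllocs()

			for i := 0; i < b.N; i++ {
				flate.NewReader(nil)
			}
		})
		b.Run("gzip", func(b *testing.B) {
			b.ReportAllocs()

			// bb holds the output of a single empty gzip write.
			bb := &bytes.Buffer{}
			gzip.NewWriter(bb).Write(nil)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				gzip.NewReader(bb)
			}
		})
	})
}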

If you run that, you'll get:

$ go test -bench=Compress -run=^$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkCompress/writer/flate-8   	   10000	    135813 ns/op	 1200010 B/op	      15 allocs/op
BenchmarkCompress/writer/gzip-8    	20000000	        73.1 ns/op	     176 B/op	       1 allocs/op
BenchmarkCompress/reader/flate-8   	  300000	      3785 ns/op	   44672 B/op	       6 allocs/op
BenchmarkCompress/reader/gzip-8    	10000000	       230 ns/op	     704 B/op	       1 allocs/op
PASS
ok  	nhooyr.io/websocket	6.674s

1.2 MB per writer and 45 KB per reader is a lot, especially for use with WebSockets, where most messages are rather small, often averaging around 512 bytes. Why is compress/flate allocating so much, and is there a way to reduce it?

gzip (along with zlib, though I didn't include it in the benchmark) uses much less memory.

josharian · 2019-06-01T03:05:13Z

On my phone, but most gzip (and maybe flate?) types have a Reset method that you can call to allow their re-use. That should help considerably. If there’s a reason that you can’t use Reset, or a critical type is missing a Reset option, please detail that.

nhooyr · 2019-06-01T03:22:54Z

> On my phone, but most gzip (and maybe flate?) types have a Reset method that you can call to allow their re-use. That should help considerably. If there’s a reason that you can’t use Reset, or a critical type is missing a Reset option, please detail that.

Yes they both do have a Reset method and it does help.

However, the amount of memory used by flate is still very excessive compared to compress/gzip or compress/zlib.

klauspost · 2019-06-01T09:13:53Z

> However, the amount of memory used by flate is still very excessive compared to compress/gzip or compress/zlib.

Since both gzip and zlib use deflate and store a *flate.Writer, your benchmark is misleading. gzip and zlib don't allocate the compressor until data is written to the stream, so if you do a single write, you will see the same numbers. For practical use the cost will be the same.
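
One way to observe the lazy allocation described here is to compare the heap bytes allocated by constructing a gzip writer with those allocated once a single byte has been written. This is a rough sketch: the helper names are hypothetical, and runtime.MemStats.TotalAlloc is only an approximate measure in a program with other goroutines running.

```go
package main

import (
	"compress/gzip"
	"io"
	"runtime"
)

// allocatedBytes reports roughly how many heap bytes fn allocates,
// using the cumulative TotalAlloc counter from runtime.MemStats.
func allocatedBytes(fn func()) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	fn()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

// gzipConstructOnly creates a gzip writer and never writes to it.
func gzipConstructOnly() {
	_, _ = gzip.NewWriterLevel(io.Discard, gzip.BestSpeed)
}

// gzipSingleWrite creates a gzip writer and performs one small write,
// which is the point at which the underlying flate compressor is created.
func gzipSingleWrite() {
	zw, _ := gzip.NewWriterLevel(io.Discard, gzip.BestSpeed)
	_, _ = zw.Write([]byte("x"))
}
```

Comparing allocatedBytes(gzipConstructOnly) with allocatedBytes(gzipSingleWrite) should show the second figure in the same range as a bare flate.NewWriter.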

But let's put this into context. Deflate compression does do a lot of upfront allocations. They are done so that standard operation needs no additional allocations, which is also why Reset is available for reuse.

As I wrote on the gorilla ticket: for compression level 1 (fastest), this also means a lot of unneeded allocations. I made an experiment a couple of years ago to be more selective about allocations. It is mainly meant for cases where Reset isn't used, but it has the side effect of fewer allocations overall.
Unfinished PR here: klauspost/compress#70 - note that "level 2" is equivalent to stdlib level 1.

A simpler optimization could be: for levels 1, 0 and -2 the following arrays in the compressor are not needed: hashHead (512KB) and hashPrev (128KB). It should be rather simple to add these to a struct that is only allocated when needed. hashMatch (1KB) is also not needed, but now we are getting to the small things.
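
The "struct that is only allocated when needed" idea could look roughly like the sketch below. The type and function names are hypothetical stand-ins, not the actual compress/flate internals; the array sizes are chosen to match the 512KB and 128KB figures quoted above.

```go
package main

const (
	hashSize   = 1 << 17 // 131072 entries * 4 bytes = 512 KB
	windowSize = 1 << 15 // 32768 entries * 4 bytes = 128 KB
)

// chainTables groups the tables that only the hash-chain levels use.
type chainTables struct {
	hashHead [hashSize]uint32
	hashPrev [windowSize]uint32
}

// compressor is a stand-in showing only the fields relevant here.
type compressor struct {
	level  int
	chains *chainTables // nil for levels that do not need hash chains
}

// needsChains reports whether a level uses the hash-chain tables.
// Per the comment above, levels 1, 0 and -2 do not.
func needsChains(level int) bool {
	switch level {
	case 1, 0, -2:
		return false
	}
	return true
}

// newCompressor allocates the 640 KB of chain tables only on demand.
func newCompressor(level int) *compressor {
	c := &compressor{level: level}
	if needsChains(level) {
		c.chains = new(chainTables)
	}
	return c
}
```

With this layout a level 1 compressor carries only a nil pointer where the two tables used to be.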

nhooyr · 2019-06-01T09:45:04Z

> Since both gzip and zlib use deflate and store a *flate.Writer, your benchmark is misleading. gzip and zlib don't allocate the compressor until data is written to the stream, so if you do a single write, you will see the same numbers. For practical use the cost will be the same.

Makes sense.

> But let's put this into context. Deflate compression does do a lot of upfront allocations. They are done so that standard operation needs no additional allocations, which is also why Reset is available for reuse.

For writing such small messages, I think 1.2 MB is a very large price to pay. I'm not an expert in compression algorithms but can't the buffers just be adjusted to grow as needed dynamically instead of always allocating so much?

> A simpler optimization could be: for levels 1, 0 and -2 the following arrays in the compressor are not needed: hashHead (512KB) and hashPrev (128KB). It should be rather simple to add these to a struct that is only allocated when needed. hashMatch (1KB) is also not needed, but now we are getting to the small things.

That would make a massive difference but still leaves 560 KB per writer. That still seems very excessive to me for the WebSocket use case.

klauspost · 2019-06-01T09:45:08Z

Here is a simpler version of the PR above: klauspost/compress#107

It is fairly risk-free (compared to the other), so it should be feasible for a merge soon.

klauspost · 2019-06-01T10:00:36Z

> can't the buffers just be adjusted to grow as needed dynamically

That would come at a massive performance cost. The big allocations are a hash table and a chain table. The hash table is sort of a map[uint16]int32 lookup, but it is a sparsely populated fixed-size array, which allows very fast lookups with no bounds checks since the compiler knows its size.

In the stdlib "level 1" uses its own (smaller, 128KB) hash table, so the allocations made for the more expensive levels are not used.

> That would make a massive difference but still leaves 560 KB per writer. That still seems very excessive to me for the WebSocket use case.

Let's break down the rest:

64KB is allocated in d.window. This is to accumulate input until we have enough for a block. This could be allocating less up front, but would mean allocations as content gets written.
64KB is allocated for 'tokens', the output of the compression stage. This could be less, but would also result in allocations during compression.
Huffman trees and histograms are needed no matter how big your input is. Only "level -2" (HuffmanOnly) could use slightly less space.

Finally the last thing I can think of is the output buffer which is 256 bytes. Not much to gain there.
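
The figures quoted so far can be added up as a rough budget. The sketch below uses only numbers stated in this thread (with KB taken as 1024 bytes, which is an assumption); the thread does not give an exact split of the remainder, so it is simply reported as unaccounted.

```go
package main

// Figures quoted in the thread, in bytes.
const (
	totalPerWriter = 1200010   // B/op reported for flate.NewWriter
	hashHead       = 512 << 10 // not needed for levels 1, 0 and -2
	hashPrev       = 128 << 10 // not needed for levels 1, 0 and -2
	hashMatch      = 1 << 10
	window         = 64 << 10 // d.window input accumulation
	tokens         = 64 << 10 // output of the compression stage
	outputBuf      = 256
)

// afterDroppingChains is what remains per writer if hashHead and
// hashPrev are not allocated.
func afterDroppingChains() int {
	return totalPerWriter - hashHead - hashPrev
}

// unaccounted is what remains after also subtracting the other
// itemized buffers; it covers the Huffman structures, the level 1
// hash table and anything else not itemized above.
func unaccounted() int {
	return afterDroppingChains() - hashMatch - window - tokens - outputBuf
}
```

afterDroppingChains() comes to 544650 bytes, in line with the roughly 540 to 560 KB figures mentioned in the discussion, and unaccounted() comes to 412298 bytes.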

Let me see if I can fix up your benchmark to get some real numbers.

klauspost · 2019-06-01T10:08:51Z

The io.Writer interface kind of makes more detailed optimizations hard. Even if you only send a few bytes, we have no way of knowing that you will not be sending more.

A reasonable addition would be an Encode(src, dst []byte) []byte function that lets you pass in the entire content you want compressed. That would allow the compressor to choose a suitable compression scheme and also to share compressors.

klauspost · 2019-06-01T11:14:47Z

I have updated klauspost/compress#107 with the actual numbers remaining and here is a gist to a realistic benchmark: https://gist.github.com/klauspost/f5df3a3522ac4bcb3bcde448872dffe6

Most of the remaining allocations are for the Huffman table generators, which are pretty much unavoidable no matter your input size. Again, note that "level 2" in my lib is the "level 1" in stdlib. So yes, the baseline for stdlib is about 540 KB. If you switch to my lib, it is about 340 KB for level 1.

nhooyr · 2019-11-22T00:26:00Z

Some exciting updates from @klauspost in gorilla/websocket#203 (comment)

nhooyr mentioned this issue May 31, 2019

Increase in memory usage when compression is on gorilla/websocket#203

Open

dmitshur added the NeedsInvestigation label Jun 6, 2019

dmitshur added this to the Go1.14 milestone Jun 6, 2019

cyriltovena mentioned this issue Jun 20, 2019

Compression improvement. grafana/loki#689

Closed

rsc modified the milestones: Go1.14, Backlog Oct 9, 2019

nhooyr mentioned this issue Apr 14, 2020

compress/flate: Allow resetting writer with new dictionary #36919

Open

MariusVanDerWijden mentioned this issue Mar 24, 2021

Frequent OOMs and High CPU Usage While Serving Eth Calls ethereum/go-ethereum#22567

Closed

gwenaskell mentioned this issue Sep 17, 2021

remote.Write unbounded memory use google/go-containerregistry#1123

Closed

seankhliao mentioned this issue Apr 22, 2023

compress/flate, archive/zip: large memory allocations reading flate zip archive with many files #59774

Closed

du5 mentioned this issue Sep 9, 2023

Geth memory leak on json-rpc response gzip compression bnb-chain/bsc#1801

Closed

egonelbre mentioned this issue Dec 11, 2023

eventkit is allocating lots of heap storj/eventkit#23

Closed

bcmills added the Performance label Jan 16, 2024

andrein mentioned this issue Jun 17, 2024

Websocket with compression enabled causes high memory usage nats-io/nats-server#5553

Open
