Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use const-generics #19

Closed
gendx opened this issue Dec 10, 2019 · 4 comments · Fixed by #79
Closed

Use const-generics #19

gendx opened this issue Dec 10, 2019 · 4 comments · Fixed by #79
Labels
blocked performance Make lzma-rs faster!

Comments

@gendx
Copy link
Owner

gendx commented Dec 10, 2019

For now, the DecoderState and BitTree structs use memory allocations (in Vec) to store their state, despite having sizes known at compile time. This is in the most part to avoid code duplication.

Once const generics are stable enough, they should be used instead to remove the need for dynamic allocation, and make the whole state struct more compact and contiguous (therefore potentially more cache-friendly).

This should hopefully improve performance a bit.

@gendx gendx added performance Make lzma-rs faster! blocked labels Dec 10, 2019
@npmccallum
Copy link

Const generics are now stable in 1.51.

@gendx
Copy link
Owner Author

gendx commented May 3, 2021

I had a brief look, the min_const_generics feature was indeed stabilized in 1.51, but we still need the more complex const_evaluatable_checked feature due to the 1 << num_bits expression here.

In the meantime, it might be worth experimenting with it on nightly to see if benchmarks would improve with const generics.

@chyyran
Copy link
Contributor

chyyran commented Aug 6, 2022

I've been doing some experiments in https://github.com/chyyran/lzma-rs/commits/feature-perf-experiments and found that const generics in BitTree and RangeCoder doesn't actually contribute that much performance benefit, except for large dictionaries.

Without const generics (but with some additional optimizations) SnowflakePowered@ecc71a0

running 8 tests
test compress_65536                  ... bench:   1,372,700 ns/iter (+/- 32,501)
test compress_empty                  ... bench:         713 ns/iter (+/- 35)
test compress_hello                  ... bench:       1,021 ns/iter (+/- 50)
test decompress_after_compress_65536 ... bench:   2,463,569 ns/iter (+/- 77,148)
test decompress_after_compress_empty ... bench:       1,720 ns/iter (+/- 146)
test decompress_after_compress_hello ... bench:       2,219 ns/iter (+/- 50)
test decompress_big_file             ... bench:   3,959,745 ns/iter (+/- 3,155,824)
test decompress_huge_dict            ... bench:       2,221 ns/iter (+/- 136)

With const generics SnowflakePowered@ad7d096

running 8 tests
test compress_65536                  ... bench:   1,379,341 ns/iter (+/- 160,163)
test compress_empty                  ... bench:         711 ns/iter (+/- 57)
test compress_hello                  ... bench:       1,016 ns/iter (+/- 30)
test decompress_after_compress_65536 ... bench:   2,442,160 ns/iter (+/- 125,797)
test decompress_after_compress_empty ... bench:         544 ns/iter (+/- 110)
test decompress_after_compress_hello ... bench:       1,003 ns/iter (+/- 47)
test decompress_big_file             ... bench:   3,939,209 ns/iter (+/- 177,650)
test decompress_huge_dict            ... bench:       1,065 ns/iter (+/- 40)

Probably worth pursuing in the future because of that benefit but SnowflakePowered@ad7d096 uses kind of a nasty workaround to get around the lack of #[feature(generic_const_exprs)] by putting 1 << N into a const fn and passing it in at the constructor site.

@chyyran
Copy link
Contributor

chyyran commented Aug 9, 2022

Vec2D from #77 lets us easily experiment with different backing stores for the literal_probs array which is the main allocation.
I used arrayvec which gives us a nice API over a raw array.

Since DecoderState needs to be able to update lclppb at runtime, we can't just parameterize it with const generics unlike BitTree, so we have to account for the theoretical maximum parameters which is sizeof(u16) * 0x300 * { 1 << (4 + 8) } = 3145728 * 2. An array of 6 megabytes is way too large to fit on the stack and I was unable to actually decompress anything without a stack overflow.

For LZMA2 streams, the theoretical maximum parameters for the literal_probs array is sizeof(u16) * 0x300 * { 1 << 4 } = 12288 * 2. While this isn't great, 24KB is way smaller than 6MB so I was able to actually run some benchmarks, which are not great.

These benchmarks are with an allocating BitTree, i.e. without const-generics.

ArrayVec<u16, 0x300 * { 1 << 4 }>

running 8 tests
test compress_65536                  ... bench:   1,323,468 ns/iter (+/- 174,348)
test compress_empty                  ... bench:         695 ns/iter (+/- 20)
test compress_hello                  ... bench:       1,035 ns/iter (+/- 90)
test decompress_after_compress_65536 ... bench:   1,336,165 ns/iter (+/- 57,379)
test decompress_after_compress_empty ... bench:       6,166 ns/iter (+/- 359)
test decompress_after_compress_hello ... bench:       6,531 ns/iter (+/- 224)
test decompress_big_file             ... bench:   3,716,763 ns/iter (+/- 121,865)
test decompress_huge_dict            ... bench:       6,572 ns/iter (+/- 315)

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 0 filtered out; finished in 13.64s

Box<[u16]>

running 8 tests
test compress_65536                  ... bench:   1,334,964 ns/iter (+/- 1,371,984)
test compress_empty                  ... bench:         709 ns/iter (+/- 436)
test compress_hello                  ... bench:       1,033 ns/iter (+/- 64)
test decompress_after_compress_65536 ... bench:   1,145,697 ns/iter (+/- 82,369)
test decompress_after_compress_empty ... bench:       1,890 ns/iter (+/- 92)
test decompress_after_compress_hello ... bench:       2,231 ns/iter (+/- 295)
test decompress_big_file             ... bench:   3,620,380 ns/iter (+/- 152,660)
test decompress_huge_dict            ... bench:       2,271 ns/iter (+/- 75)

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 0 filtered out; finished in 7.76s

The huge array on the stack probably causes a lot of cache misses and substantially reduces performance. It seems the only benefit to using const generics here is with BitTree, where it does show an improvement in the benchmarks as above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked performance Make lzma-rs faster!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants