[huff0] Add x86 specialisation of Decode4X #512

Merged: 1 commit, Mar 8, 2022

WojciechMula
Contributor

Hi, first of all, thank you for such a great library! I have been working on speeding up Zstd decompression, mainly by porting hot loops to assembly. This is the first PR; it's pretty small, and I'd like to use it as an opportunity to discuss the shape of the code and whether it is acceptable.

I'm marking it as a draft because not all Zstd tests pass yet; I branched some time ago, and it seems there were some changes that I have to investigate.

Anyway, below is a comparison of Zstd decompression speed after applying the patch. Benchmarks were run on an Ice Lake machine.

benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            5064292       5055128       -0.18%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        924146        889296        -3.77%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           12552928      12475253      -0.62%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         3884720       3815638       -1.78%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          5320410       5307378       -0.24%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1458895       1419301       -2.71%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       218670        219073        +0.18%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       129945        121948        -6.15%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             14234754      13921832      -2.20%
BenchmarkDecoder_DecoderSmall/html.zst-16                                 1028808       1002782       -2.53%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        83589         77477         -7.31%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               605687        603714        -0.33%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           111881        106144        -5.13%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              1445193       1424825       -1.41%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            481721        470827        -2.26%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             643764        641131        -0.41%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                234945        233164        -0.76%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          23411         23627         +0.92%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          11293         11302         +0.08%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                1644661       1592756       -3.16%
BenchmarkDecoder_DecodeAll/html.zst-16                                    126047        121406        -3.68%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           10396         9758          -6.14%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      1503668       1447556       -3.73%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      1526024       1498882       -1.78%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       1453765       1415595       -2.63%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         1497672       1473028       -1.65%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9182          9186          +0.04%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          355812        359097        +0.92%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           271279        273078        +0.66%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             186720        192951        +3.34%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              3131          3181          +1.60%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              2929          2949          +0.68%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               3387          3451          +1.89%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 9063          9140          +0.85%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 5390          4876          -9.54%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 7683          7746          +0.82%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  7679          7760          +1.05%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    7923          7905          -0.23%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       91261         84536         -7.37%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       94548         89934         -4.88%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        87788         83488         -4.90%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          99753         97126         -2.63%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9213          9190          -0.25%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         360974        364403        +0.95%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          268990        269969        +0.36%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            185887        192209        +3.40%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    27607         27162         -1.61%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    30728         30104         -2.03%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     25116         24681         -1.73%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       30037         29093         -3.14%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9178          9183          +0.05%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9170          9174          +0.04%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9171          9179          +0.09%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9174          9185          +0.12%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     160969        171452        +6.51%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     172487        157895        -8.46%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      159815        145387        -9.03%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        160579        155217        -3.34%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         1077          1038          -3.62%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         42039         43817         +4.23%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          33769         34422         +1.93%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            25179         25295         +0.46%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             493           507           +2.94%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             495           503           +1.68%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              517           550           +6.41%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                888           880           -0.89%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                756           693           -8.25%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                877           892           +1.66%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 834           875           +4.99%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1001          941           -6.01%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      13288         12076         -9.12%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      14990         12745         -14.98%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       12205         11149         -8.65%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         13920         12165         -12.61%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        1039          1025          -1.35%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        43506         42534         -2.23%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         33428         33474         +0.14%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           25168         25283         +0.46%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   3781          3688          -2.46%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   3976          3873          -2.59%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    3204          3178          -0.81%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      3605          3329          -7.66%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    1031          1029          -0.19%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    1028          1081          +5.16%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     1034          1029          -0.48%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       1028          1038          +0.97%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       83577         71934         -13.93%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   16304         14373         -11.84%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      209034        177289        -15.19%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    67474         58313         -13.58%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     90482         78433         -13.32%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        29896         28681         -4.06%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  3488          3332          -4.47%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1261          1246          -1.19%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        210072        182726        -13.02%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            18952         16590         -12.46%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1496          1332          -10.96%

@klauspost
Owner

klauspost commented Mar 3, 2022

@WojciechMula Thanks for the great work. I will take a look and do some tests here as well.

A few pre-review notes (ignoring what is reported by tests, assuming you'll fix that)

Please make a noasm build tag, exclude gcc and appengine. You can grab build tags from s2: https://github.com/klauspost/compress/blob/master/s2/encodeblock_amd64.go#L3-L4

For the Go version, I found that interleaving 2 streams would give better pipelining. You removed that from the non-asm version. I will test that there isn't a regression here.

Please run asmfmt on the assembly.

Are SHRXQ and SHLXQ the only BMI instructions used? They are (in my experience) not really faster, so a generic amd64 version should be just as fast.

Try breaking dependency chains by interleaving operations more. Your assembly is pretty much "serial", so the CPU has to work hard to reorder your code.

Zstd tests can be noisy. Checking here, literal decoding takes up 9.7% of CPU time in DecodeAllParallel, so I doubt you are saving 10-15%. BenchmarkDecompress4XNoTable is the cleanest 4X benchmark for direct comparisons.
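
The reasoning in the last note can be checked with a little arithmetic. The sketch below is not from the thread: it assumes the 9.7% figure is the share of total CPU time spent in literal decoding, and `overallSaving` is a hypothetical helper that applies the usual "speed up one part of the program" estimate. The speedup factors in `main` are made-up inputs, not measurements.

```go
package main

import "fmt"

// overallSaving returns the fraction of total run time saved when a part
// that accounts for `fraction` of the time becomes `speedup` times faster.
// Hypothetical helper for illustration only.
func overallSaving(fraction, speedup float64) float64 {
	return fraction * (1 - 1/speedup)
}

func main() {
	const literalShare = 0.097 // share of CPU time quoted in the note above

	// Hypothetical speedups of the literal decoder alone.
	for _, s := range []float64{1.1, 1.3, 2, 10} {
		fmt.Printf("decoder %.1fx faster -> %.2f%% less total time\n",
			s, 100*overallSaving(literalShare, s))
	}
}
```

However large the speedup of the literal decoder becomes, the saving computed this way stays below the 9.7% share itself, which is why a 10-15% end-to-end gain is more likely to be benchmark noise than a real effect of this change alone.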

@klauspost
Owner

Numbers are looking good 👍🏼

BenchmarkDecompress4XNoTable/gettysburg-32           593.83       641.27       1.08x
BenchmarkDecompress4XNoTable/twain-32                491.42       643.20       1.31x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       829.51       1.15x

(these are also the only ones that have tablelog >8)

Some small regressions without asm:

BenchmarkDecompress4XNoTable/gettysburg-32           593.83       565.72       0.95x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       683.05       0.95x
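
The columns in these short tables are not labeled in the thread. A minimal sketch, assuming the first two numbers are old and new throughput and the last column is their ratio (new divided by old, so values above 1 are improvements); `speedup` is a hypothetical helper name:

```go
package main

import "fmt"

// speedup returns after/before. For throughput numbers, a result above 1
// means the new code is faster. Hypothetical helper for illustration.
func speedup(before, after float64) float64 {
	return after / before
}

func main() {
	// Rows copied from the tables above: name, old value, new value.
	rows := []struct {
		name          string
		before, after float64
	}{
		{"gettysburg (asm)", 593.83, 641.27},
		{"twain (asm)", 491.42, 643.20},
		{"pngdata.001 (asm)", 718.28, 829.51},
		{"gettysburg (no asm)", 593.83, 565.72},
		{"pngdata.001 (no asm)", 718.28, 683.05},
	}
	for _, r := range rows {
		fmt.Printf("%-22s %.2fx\n", r.name, speedup(r.before, r.after))
	}
}
```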

@WojciechMula
Contributor Author

> For the Go version, I found that interleaving 2 streams would give better pipelining. You removed that from the non-asm version. I will test that there isn't a regression here.

Thank you for looking at this. Yeah, I'll restore the freshest version. It got lost while rebasing at some point.

> Please run asmfmt on the assembly.

Sure!

> Are SHRXQ and SHLXQ the only BMI instructions used? They are (in my experience) not really faster, so a generic amd64 version should be just as fast.

Yes. I also checked SHR and SHL and noticed that BMI was faster, though maybe not significantly.

> Try breaking dependency chains by interleaving operations more. Your assembly is pretty much "serial", so the CPU has to work hard to reorder your code.

Sure, will do.

> Zstd tests can be noisy. Checking here, literal decoding takes up 9.7% of CPU time in DecodeAllParallel, so I doubt you are saving 10-15%. BenchmarkDecompress4XNoTable is the cleanest 4X benchmark for direct comparisons.

Thank you for such a quick response! I'll look at the regression for the plain Go version.

@klauspost
Owner

Pure "amd64" speed is extremely similar, so for now bmi doesn't seem worth it:

BenchmarkDecompress4XNoTable/twain-32                491.42       638.31       1.30x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       819.40       1.14x

You could maybe put the entire func (d *Decoder) Decompress4X(dst, src []byte) in an _amd64.go file, so you can just keep the existing one as is, with opposite build tags.

My quickly thrown together code here: https://gist.github.com/klauspost/82d5c9b85c067d06d606f1c12c82615c

@klauspost
Owner

I will do a more detailed review tomorrow. Obviously we need to fix the bugs.

Review threads:
huff0/decompress_amd64.s (resolved)
huff0/decompress.go (outdated, resolved)
huff0/decompress.go (outdated, resolved)
huff0/decompress_amd64.s (outdated, resolved)
huff0/decompress_amd64.s (outdated, resolved)

@klauspost
Owner

klauspost commented Mar 4, 2022

Without the extra "AND":

BenchmarkDecompress4XNoTable/gettysburg-32           593.83       681.84       1.15x
BenchmarkDecompress4XNoTable/twain-32                491.42       680.16       1.38x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       870.23       1.21x

This also makes it "competitive" as a replacement for the tablelog <= 8 specialized versions, though obviously dedicated versions would likely be even better.

Only very small payloads (unlikely) are worse.

@WojciechMula
Contributor Author

> Without the extra "AND":
>
> BenchmarkDecompress4XNoTable/gettysburg-32           593.83       681.84       1.15x
> BenchmarkDecompress4XNoTable/twain-32                491.42       680.16       1.38x
> BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       870.23       1.21x
>
> This also makes it "competitive" as a replacement for the tablelog <= 8 specialized versions, though obviously dedicated versions would likely be even better.
>
> Only very small payloads (unlikely) are worse.

Thank you for the review. I changed almost everything that you asked for. The only big thing is to reshuffle instructions (if it's possible).

Perf results are similar on IceLake.

@WojciechMula changed the title from "[huff0] Add x86 BMI1 specialisation of Decode4X" to "[huff0] Add x86 specialisation of Decode4X" on Mar 4, 2022

@klauspost
Owner

> The only big thing is to reshuffle instructions (if it's possible).

You can just work on that later. The improvement is already significant.

@WojciechMula
Contributor Author

> The only big thing is to reshuffle instructions (if it's possible).

> You can just work on that later. The improvement is already significant.

@klauspost I tried to interleave the operations as the Go code does. However, I didn't notice any significant performance change; I'd say it's just noise. You can check what I did: https://github.com/WojciechMula/compress/tree/experiment. Any hints are highly appreciated. :)
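
For readers unfamiliar with the idea under discussion, here is a minimal Go sketch of what "interleaving" two streams means. It is not the huff0 code: `step` is a made-up stand-in for one decode step whose result depends on the previous state, and both function names are hypothetical. The point is only that the two state variables in the interleaved loop do not depend on each other, so a CPU is free to overlap their work. Whether that pays off in practice has to be measured, as the comment above found.

```go
package main

import "fmt"

// step is a stand-in for one decode step: the next state depends on the
// previous state, so steps within one stream form a serial dependency chain.
func step(state uint32, b byte) uint32 {
	return (state ^ uint32(b)) * 16777619
}

// decodeSerial processes stream a completely, then stream b.
func decodeSerial(a, b []byte) (uint32, uint32) {
	sa, sb := uint32(2166136261), uint32(2166136261)
	for _, v := range a {
		sa = step(sa, v)
	}
	for _, v := range b {
		sb = step(sb, v)
	}
	return sa, sb
}

// decodeInterleaved advances both streams in the same loop body. The two
// chains (sa and sb) are independent of each other.
func decodeInterleaved(a, b []byte) (uint32, uint32) {
	sa, sb := uint32(2166136261), uint32(2166136261)
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		sa = step(sa, a[i])
		sb = step(sb, b[i])
	}
	// Finish whichever stream is longer.
	for _, v := range a[n:] {
		sa = step(sa, v)
	}
	for _, v := range b[n:] {
		sb = step(sb, v)
	}
	return sa, sb
}

func main() {
	a := []byte("four score and seven years ago")
	b := []byte("our fathers brought forth")
	s1, s2 := decodeSerial(a, b)
	i1, i2 := decodeInterleaved(a, b)
	fmt.Println("same results:", s1 == i1 && s2 == i2)
}
```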

@WojciechMula marked this pull request as ready for review on March 4, 2022, 12:28

@klauspost
Owner

Yeah, SSE <-> GPR usually is pretty slow. Don't worry about it, let's get it working with the current improvements, we can tweak later.

@klauspost
Owner

Seems the file for other platforms is missing:

Error: huff0/decompress.go:207:26: s.Decoder().Decompress4X undefined (type *Decoder has no field or method Decompress4X)

@WojciechMula
Contributor Author

> Yeah, SSE <-> GPR usually is pretty slow. Don't worry about it, let's get it working with the current improvements, we can tweak later.

I also checked if interleaving decoding of two streams is profitable. It's not. Yeah, I learned recently that using the stack is way faster (and easier) than trying to keep temp values in SSE or AVX512 kregs.

Hopefully, I fixed the build tags.

Owner

@klauspost klauspost left a comment

This should fix the tags.

Comment on lines 1 to 2
//go:build noasm
// +build noasm
Owner

Suggested change
- //go:build noasm
- // +build noasm
+ //go:build !amd64 || appengine || !gc || noasm
+ // +build !amd64 appengine !gc noasm

Contributor Author

Thank you! Fixed.
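
To see why the suggested constraint fixes the "Decompress4X undefined" error on other platforms, it helps to evaluate it for a few build configurations. The sketch below uses the standard `go/build/constraint` package. The fallback line is the one suggested above; the constraint for the assembly-backed file is an assumption (it is not quoted in this thread), written here as the logical opposite. The helper `matches` is hypothetical.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// matches reports whether a //go:build line is satisfied by the given set
// of build tags. Hypothetical helper for illustration.
func matches(line string, tags map[string]bool) (bool, error) {
	expr, err := constraint.Parse(line)
	if err != nil {
		return false, err
	}
	return expr.Eval(func(tag string) bool { return tags[tag] }), nil
}

func main() {
	// Fallback constraint as suggested in the review.
	fallback := "//go:build !amd64 || appengine || !gc || noasm"
	// Assumed constraint for the assembly-backed file (not from the thread).
	asm := "//go:build amd64 && !appengine && gc && !noasm"

	configs := []map[string]bool{
		{"amd64": true, "gc": true},
		{"amd64": true, "gc": true, "noasm": true},
		{"arm64": true, "gc": true},
		{"amd64": true, "gccgo": true},
	}
	for _, tags := range configs {
		f, err := matches(fallback, tags)
		if err != nil {
			panic(err)
		}
		a, err := matches(asm, tags)
		if err != nil {
			panic(err)
		}
		// Exactly one of the two files should be selected per configuration.
		fmt.Printf("%v fallback=%v asm=%v\n", tags, f, a)
	}
}
```

With a pair of constraints like this, every configuration compiles exactly one definition of the function, so it is neither missing nor declared twice.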

Review thread: huff0/decompress_amd64.s.in (resolved)

@klauspost
Owner

I have fuzz tested the current version, and it looks fine. 👍🏼 Once we get the build tags sorted, I will do a final benchmark and we can move to merging.

@WojciechMula
Contributor Author

> I have fuzz tested the current version, and it looks fine. 👍🏼 Once we get the build tags sorted, I will do a final benchmark and we can move to merging.

Great! Thank you very much for checking this. And sorry for the build tag problems; TBH I'm not familiar with them. I've never used them so far.

Owner

@klauspost klauspost left a comment

👍🏼 Merging when tests complete.

@WojciechMula
Contributor Author

> 👍🏼 Merging when tests complete.

Let me squash the commits first.

BenchmarkDecompress4XNoTable/gettysburg-32           593.83       681.84       1.15x
BenchmarkDecompress4XNoTable/twain-32                491.42       680.16       1.38x
BenchmarkDecompress4XNoTable/pngdata.001-32          718.28       870.23       1.21x
@WojciechMula
Contributor Author

> 👍🏼 Merging when tests complete.

> Let me squash the commits first.

@klauspost OK, squashed the changes and added perf results to the commit message.

@klauspost merged commit 76e0660 into klauspost:master on Mar 8, 2022
@WojciechMula deleted the huff0-amd64 branch on March 10, 2022, 11:47