zstd: Faster decoding memcopy in asm #583

Merged 1 commit on May 9, 2022

Owner

@klauspost klauspost commented May 9, 2022

Use a faster method for `copyMemory` and `copyOverlappedMemory`.

```
λ benchcmp before.txt after.txt
benchmark                                                    old ns/op     new ns/op     delta
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-32          17933         14089         -21.44%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-32      4489          3847          -14.30%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-32       56959         44537         -21.81%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-32         44430         33115         -25.47%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-32       15040         12049         -19.89%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-32        18360         14098         -23.21%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-32           11643         12058         +3.56%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-32     1105          1063          -3.80%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-32     1865          1841          -1.29%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-32           52351         43161         -17.55%
BenchmarkDecoder_DecodeAllParallel/html.zst-32               5230          4134          -20.96%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-32      1276          1279          +0.24%

benchmark                                                    old MB/s     new MB/s     speedup
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-32          10278.54     13082.74     1.27x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-32      26415.30     30828.11     1.17x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-32       8459.72      10819.37     1.28x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-32         9605.02      12887.11     1.34x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-32       8323.26      10388.96     1.25x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-32        8283.62      10787.77     1.30x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-32           35179.58     33968.10     0.97x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-32     92685.55     96354.41     1.04x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-32     65985.69     66854.61     1.01x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-32           13411.18     16266.62     1.21x
BenchmarkDecoder_DecodeAllParallel/html.zst-32               19578.29     24768.39     1.27x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-32      3194.49      3188.05      1.00x
λ go test -short -bench=seqdec_execute >after.txt&&benchcmp before.txt after.txt
benchmark                                                                                       old ns/op     new ns/op     delta
Benchmark_seqdec_execute/n-12286-lits-13914-prev-9869-1990358-3296656-win-4194304.blk-32        129910        130620        +0.55%
Benchmark_seqdec_execute/n-12485-lits-6960-prev-976039-2250252-2463561-win-4194304.blk-32       139487        134032        -3.91%
Benchmark_seqdec_execute/n-14746-lits-14461-prev-209-8-1379909-win-4194304.blk-32               37155         38636         +3.99%
Benchmark_seqdec_execute/n-1525-lits-1498-prev-2009476-797934-2994405-win-4194304.blk-32        16318         15788         -3.25%
Benchmark_seqdec_execute/n-3478-lits-3628-prev-895243-2104056-2119329-win-4194304.blk-32        44386         43959         -0.96%
Benchmark_seqdec_execute/n-8422-lits-5840-prev-168095-2298675-433830-win-4194304.blk-32         100065        96156         -3.91%
Benchmark_seqdec_execute/n-1000-lits-1057-prev-21887-92-217-win-8388608.blk-32                  8119          7373          -9.19%
Benchmark_seqdec_execute/n-15134-lits-20798-prev-4882976-4884216-4474622-win-8388608.blk-32     84669         83034         -1.93%
Benchmark_seqdec_execute/n-2-lits-0-prev-620601-689171-848-win-8388608.blk-32                   2914          2773          -4.84%
Benchmark_seqdec_execute/n-90-lits-67-prev-19498-23-19710-win-8388608.blk-32                    4318          3824          -11.44%
Benchmark_seqdec_execute/n-931-lits-1179-prev-36502-1526-1518-win-8388608.blk-32                7851          7203          -8.25%
Benchmark_seqdec_execute/n-2898-lits-4062-prev-335-386-751-win-8388608.blk-32                   15161         14315         -5.58%
Benchmark_seqdec_execute/n-4056-lits-12419-prev-10792-66-309849-win-8388608.blk-32              23920         20065         -16.12%
Benchmark_seqdec_execute/n-8028-lits-4568-prev-917-65-920-win-8388608.blk-32                    53268         52768         -0.94%
```

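
For readers unfamiliar with why a decoder would need both a plain and an overlapped copy: in LZ-style formats such as zstd, a match may reference bytes that the same copy is still producing (the offset is smaller than the match length), so a single bulk copy is not always equivalent to a forward copy. The following is a minimal pure-Go sketch of that distinction only. It is not the assembly from this PR, and `appendMatch` is a hypothetical name chosen for illustration.

```go
package main

import "fmt"

// appendMatch is a hypothetical helper, not part of the library.
// It appends length bytes to out, reading from offset bytes before
// the current end of out, as an LZ77-style match copy does.
func appendMatch(out []byte, offset, length int) []byte {
	start := len(out) - offset
	if offset >= length {
		// Source and destination ranges do not overlap:
		// one bulk copy is enough.
		return append(out, out[start:start+length]...)
	}
	// Overlapping ranges: bytes written earlier in this match are
	// read again later, so copy forward one byte at a time.
	for i := 0; i < length; i++ {
		out = append(out, out[start+i])
	}
	return out
}

func main() {
	out := []byte("ab")
	out = appendMatch(out, 2, 6) // overlapping: the pattern "ab" repeats
	fmt.Println(string(out))     // abababab
	out = appendMatch(out, 8, 4) // non-overlapping: plain bulk copy
	fmt.Println(string(out))     // abababababab
}
```

The byte-at-a-time loop is the simple reference behavior; a faster implementation has to produce the same output.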
@klauspost
Owner Author

It must really be hitting something it likes. Maybe freeing up hyperthread capacity.

@klauspost klauspost merged commit 2f43739 into master May 9, 2022
@klauspost klauspost deleted the zstd-faster-memcopy branch May 9, 2022 10:17
@WojciechMula
Contributor

Wow, that's a great speedup! I'm checking this on Ice Lake and will be back in a moment. :)

@WojciechMula
Contributor

Results from an Ice Lake:

```
benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       46692         37370         -19.96%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   9762          9402          -3.69%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    148812        121489        -18.36%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      115594        91368         -20.96%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    38204         31465         -17.64%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     48923         39627         -19.00%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        25440         31584         +24.15%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  2762          2662          -3.62%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1257          1265          +0.64%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        134655        110583        -17.88%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            12158         9966          -18.03%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1111          1093          -1.62%

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       3947.60      4932.32      1.25x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   12147.89     12612.80     1.04x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    3238.05      3966.30      1.22x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      3691.85      4670.71      1.27x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    3276.63      3978.37      1.21x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     3108.74      3837.98      1.23x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        16100.85     12968.45     0.81x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  37071.35     38466.50     1.04x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  97944.56     97269.11     0.99x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        5213.97      6348.96      1.22x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            8422.40      10274.65     1.22x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   3669.55      3727.78      1.02x
```

@klauspost
Owner Author

Wow, even across vendors. Nice!

@lizthegrey
Contributor

I'm jealous :( hopefully Avo will support arm64 soon.
