Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid masks when possible in AVX2 logic #104

Merged
merged 1 commit into from
Nov 21, 2023

Conversation

sterrettm2
Copy link
Contributor

This avoids using masked stores when unmasked stores can be used. This gives a reasonable performance improvement for AVX2 quicksort.

Benchmark                                                          Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------
[simdsort vs. simdsort]/smallrandom_128/uint64_t                +0.0016         +0.0018          1495          1498          1499          1502
[simdsort vs. simdsort]/smallrandom_256/uint64_t                -0.0116         -0.0120          2680          2649          2685          2653
[simdsort vs. simdsort]/smallrandom_512/uint64_t                -0.0341         -0.0342          4747          4585          4752          4589
[simdsort vs. simdsort]/smallrandom_1k/uint64_t                 -0.0517         -0.0518          9669          9169          9675          9174
[simdsort vs. simdsort]/random_5k/uint64_t                      -0.0727         -0.0729         53449         49561         53480         49581
[simdsort vs. simdsort]/random_100k/uint64_t                    -0.0775         -0.0774       1590211       1466977       1590054       1466906
[simdsort vs. simdsort]/random_1m/uint64_t                      -0.0749         -0.0749      19080216      17650836      19078325      17649512
[simdsort vs. simdsort]/random_10m/uint64_t                     -0.0876         -0.0876     226984669     207103080     226958778     207083339
[simdsort vs. simdsort]/sorted_10k/uint64_t                     -0.0664         -0.0665        110923        103559        110954        103577
[simdsort vs. simdsort]/constant_10k/uint64_t                   -0.1289         -0.1293          7934          6912          7943          6916
[simdsort vs. simdsort]/reverse_10k/uint64_t                    -0.0735         -0.0735        110001        101916        110026        101934
[simdsort vs. simdsort]/smallrandom_128/int64_t                 -0.0117         -0.0128          1393          1377          1397          1379
[simdsort vs. simdsort]/smallrandom_256/int64_t                 -0.0173         -0.0173          2404          2363          2407          2366
[simdsort vs. simdsort]/smallrandom_512/int64_t                 -0.0428         -0.0423          4205          4025          4208          4030
[simdsort vs. simdsort]/smallrandom_1k/int64_t                  -0.0576         -0.0577          8372          7889          8376          7892
[simdsort vs. simdsort]/random_5k/int64_t                       -0.0860         -0.0860         46142         42173         46164         42195
[simdsort vs. simdsort]/random_100k/int64_t                     -0.1033         -0.1033       1441260       1292380       1441078       1292258
[simdsort vs. simdsort]/random_1m/int64_t                       -0.1059         -0.1059      17524288      15668153      17522129      15666515
[simdsort vs. simdsort]/random_10m/int64_t                      -0.1142         -0.1142     209621622     185672531     209603287     185663681
[simdsort vs. simdsort]/sorted_10k/int64_t                      -0.1011         -0.1010         96866         87078         96882         87097
[simdsort vs. simdsort]/constant_10k/int64_t                    -0.1465         -0.1467          7188          6135          7197          6141
[simdsort vs. simdsort]/reverse_10k/int64_t                     -0.1120         -0.1120         97091         86215         97111         86233
[simdsort vs. simdsort]/smallrandom_128/uint32_t                -0.0029         -0.0024           957           955           959           957
[simdsort vs. simdsort]/smallrandom_256/uint32_t                -0.0145         -0.0134          1236          1218          1239          1223
[simdsort vs. simdsort]/smallrandom_512/uint32_t                -0.0367         -0.0346          2063          1988          2069          1998
[simdsort vs. simdsort]/smallrandom_1k/uint32_t                 -0.0511         -0.0520          3524          3344          3533          3349
[simdsort vs. simdsort]/random_5k/uint32_t                      -0.0751         -0.0754         18423         17039         18435         17045
[simdsort vs. simdsort]/random_100k/uint32_t                    -0.0686         -0.0685        534384        497732        534360        497742
[simdsort vs. simdsort]/random_1m/uint32_t                      -0.0703         -0.0703       6743557       6269561       6742604       6268891
[simdsort vs. simdsort]/random_10m/uint32_t                     -0.0701         -0.0701      81997035      76249868      81984609      76240226
[simdsort vs. simdsort]/sorted_10k/uint32_t                     -0.0739         -0.0738         38253         35425         38261         35437
[simdsort vs. simdsort]/constant_10k/uint32_t                   -0.1264         -0.1269          3565          3114          3569          3116
[simdsort vs. simdsort]/reverse_10k/uint32_t                    -0.0769         -0.0769         36628         33813         36639         33821
[simdsort vs. simdsort]/smallrandom_128/int32_t                 -0.0026         -0.0016           958           955           960           958
[simdsort vs. simdsort]/smallrandom_256/int32_t                 -0.0010         -0.0018          1226          1225          1229          1227
[simdsort vs. simdsort]/smallrandom_512/int32_t                 -0.0240         -0.0249          2050          2001          2059          2008
[simdsort vs. simdsort]/smallrandom_1k/int32_t                  -0.0473         -0.0471          3568          3399          3576          3407
[simdsort vs. simdsort]/random_5k/int32_t                       -0.0742         -0.0744         18741         17351         18753         17357
[simdsort vs. simdsort]/random_100k/int32_t                     -0.0764         -0.0764        551299        509171        551280        509172
[simdsort vs. simdsort]/random_1m/int32_t                       -0.0720         -0.0720       6953377       6452529       6952455       6451780
[simdsort vs. simdsort]/random_10m/int32_t                      -0.0684         -0.0684      84518422      78738969      84510987      78732561
[simdsort vs. simdsort]/sorted_10k/int32_t                      -0.0786         -0.0785         39097         36026         39107         36036
[simdsort vs. simdsort]/constant_10k/int32_t                    -0.1326         -0.1327          3687          3198          3689          3199
[simdsort vs. simdsort]/reverse_10k/int32_t                     -0.0772         -0.0772         37328         34446         37336         34454
[simdsort vs. simdsort]/smallrandom_128/float                   -0.0043         -0.0051           959           955           962           957
[simdsort vs. simdsort]/smallrandom_256/float                   +0.0023         +0.0026          1214          1217          1218          1222
[simdsort vs. simdsort]/smallrandom_512/float                   -0.0193         -0.0202          2217          2174          2224          2179
[simdsort vs. simdsort]/smallrandom_1k/float                    -0.0371         -0.0369          3794          3653          3802          3662
[simdsort vs. simdsort]/random_5k/float                         -0.0757         -0.0759         17745         16402         17756         16409
[simdsort vs. simdsort]/random_100k/float                       -0.0872         -0.0871        535158        488496        535161        488533
[simdsort vs. simdsort]/random_1m/float                         -0.0853         -0.0853       6599370       6036337       6598623       6035796
[simdsort vs. simdsort]/random_10m/float                        -0.0770         -0.0770      80710809      74496962      80705295      74494355
[simdsort vs. simdsort]/sorted_10k/float                        -0.0703         -0.0700         37518         34882         37531         34903
[simdsort vs. simdsort]/constant_10k/float                      -0.1308         -0.1310          3513          3054          3517          3056
[simdsort vs. simdsort]/reverse_10k/float                       -0.0816         -0.0813         36008         33072         36023         33094
[simdsort vs. simdsort]/smallrandom_128/double                  -0.0030         -0.0016          1182          1178          1184          1183
[simdsort vs. simdsort]/smallrandom_256/double                  -0.0257         -0.0254          1996          1945          2000          1949
[simdsort vs. simdsort]/smallrandom_512/double                  -0.0441         -0.0441          3126          2988          3131          2993
[simdsort vs. simdsort]/smallrandom_1k/double                   -0.0683         -0.0681          6194          5771          6198          5776
[simdsort vs. simdsort]/random_5k/double                        -0.1004         -0.1008         33034         29716         33054         29723
[simdsort vs. simdsort]/random_100k/double                      -0.1059         -0.1059       1084996        970050       1084839        969934
[simdsort vs. simdsort]/random_1m/double                        -0.0972         -0.0971      13083520      11811385      13081373      11810638
[simdsort vs. simdsort]/random_10m/double                       -0.0924         -0.0924     159442640     144717523     159429764     144699184
[simdsort vs. simdsort]/sorted_10k/double                       -0.1020         -0.1022         67096         60250         67112         60255
[simdsort vs. simdsort]/constant_10k/double                     -0.2229         -0.2228          5871          4562          5878          4568
[simdsort vs. simdsort]/reverse_10k/double                      -0.1132         -0.1134         66940         59361         66959         59367

@r-devulap
Copy link
Contributor

r-devulap commented Nov 20, 2023

Initially I expected this be wrong because you would be writing incorrect data to the array with the unmasked stores. But then I realized they get overwritten by the right numbers in subsequent stores. You don't need the masked store at all any point of time (even for the last register).

We could probably experiment with this logic in avx512_double_compressstore as well. Split the mask_compressstoreu into vcompress and unmasked store. Not sure how if it would help with performance, but might be worth a try

Never mind, that was surprisingly horrible!

[simdsort/random.*int64 vs. simdsort/random.*int64]_t                +3.4797         +3.4787         22676        101581         22685        101599
[simdsort/random.*int64 vs. simdsort/random.*int64]_t                +4.9618         +4.9614        595535       3550466        595559       3550387
[simdsort/random.*int64 vs. simdsort/random.*int64]_t                +4.4454         +4.4457       7788570      42412147       7787859      42410619
[simdsort/random.*int64 vs. simdsort/random.*int64]_t                +5.2818         +5.2807     100400029     630688598     100398139     630571543


typename vtype::reg_t temp = vtype::permutevar(reg, perm);

vtype::mask_storeu(leftStore, left, temp);
vtype::mask_storeu(rightStore, _mm256_xor_si256(oxff, left), temp);
if constexpr (masked) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the masked store? I think we can get away with just the regular store. Or am I wrong?

@sterrettm2 sterrettm2 force-pushed the avx2_avoidmasks branch 2 times, most recently from 58ab67d to 23b6d32 Compare November 21, 2023 20:28
Copy link
Contributor

@r-devulap r-devulap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @sterrettm2!

@r-devulap r-devulap merged commit fd937a5 into intel:main Nov 21, 2023
5 of 6 checks passed
r-devulap added a commit to r-devulap/numpy that referenced this pull request Nov 27, 2023
Perf improvements to AVX2 sorting: see
intel/x86-simd-sort#104
r-devulap added a commit to r-devulap/numpy that referenced this pull request Nov 30, 2023
Perf improvements to AVX2 sorting: see
intel/x86-simd-sort#104
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants