
compress: check more bytes to reduce ZSTD_count call #3199

Merged (1 commit into facebook:dev) on Sep 19, 2022

Conversation

@JunHe77 (Contributor) commented Jul 18, 2022

Signed-off-by: Jun He jun.he@arm.com
Change-Id: I3449ea423d5c8e8344f75341f19a2d1643c703f6

@JunHe77 (Contributor, Author) commented Jul 18, 2022

This patch widens the comparison in the longest-match loop from 1B to 4B.

The extended comparison is safe: the original implementation starts by comparing the bytes at match[3] and ip[3], and later moves to match[ml] and ip[ml] once ml grows beyond 3, so reading 4 bytes ending at those positions keeps both match and ip within a valid range.

Comparing 4B skips candidates that happen to match at ip[ml] but mismatch somewhere in {ip[ml+1-sizeof(U32)] .. ip[ml-1]}, so the number of ZSTD_count calls is reduced.

Benchmarks on Arm show a good uplift with this change (up to 8% on N1 and 11% on M1, at L8) with both gcc and clang.

@JunHe77 (Contributor, Author) commented Aug 31, 2022

@terrelln , @embg , anything I need to update on this PR? Thanks.

@embg (Contributor) commented Aug 31, 2022

@terrelln , @embg , anything I need to update on this PR? Thanks.

Sorry for the delay, I just need to run benchmarks. Will do it tomorrow.

Edit: have been working through a huge backlog, sorry again for the delay. Will prioritize the benchmarking this week.

@embg (Contributor) left a comment

I think this is good to merge (see measurements below), but there are two small nits that should be fixed first. I'll measure again on Skylake after the nits are resolved (measuring on M1 is harder, this optimization is architecture-independent, and the nits are not functional).


It's definitely a few % faster on Intel Skylake X, especially at levels 8-12 (which use the most ZSTD_count() operations). On clang15, level 12 is 10% faster!

Intel Skylake X, clang15
$ bench run -- ./zstd-dev-clang15 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),   58.8 MB/s,  491.6 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),   41.4 MB/s,  526.3 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),   34.9 MB/s,  527.9 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   27.5 MB/s,  539.3 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   25.6 MB/s,  529.4 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   19.0 MB/s,  535.3 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   13.9 MB/s,  539.7 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   13.0 MB/s,  539.1 MB/s
$ bench run -- ./zstd-opt-clang15 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),   59.3 MB/s,  494.2 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),   42.6 MB/s,  524.4 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),   35.4 MB/s,  522.2 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   28.7 MB/s,  541.3 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   27.1 MB/s,  520.2 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   19.8 MB/s,  531.3 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   15.3 MB/s,  528.4 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   14.3 MB/s,  537.5 MB/s
Intel Skylake X, gcc11
$ bench run -- ./zstd-dev-gcc11 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),   56.1 MB/s,  524.1 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),   39.1 MB/s,  556.4 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),   33.6 MB/s,  558.7 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   26.6 MB/s,  575.3 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   24.7 MB/s,  557.8 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   18.5 MB/s,  565.0 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   13.7 MB/s,  568.4 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   13.0 MB/s,  565.3 MB/s
$ bench run -- ./zstd-opt-gcc11 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),   57.0 MB/s,  524.8 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),   41.4 MB/s,  557.8 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),   35.5 MB/s,  558.4 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   28.1 MB/s,  575.4 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   26.2 MB/s,  559.8 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   19.7 MB/s,  565.3 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   15.2 MB/s,  567.8 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   13.9 MB/s,  565.5 MB/s

It's harder to get accurate numbers for the Apple M1, since I don't have any setup for pinning to performance vs. efficiency cores, disabling thermal throttling, etc. I plugged into AC power to reduce any battery-saving effects.

There are nice wins at every optimized level except levels 6 and 9 (-1% and 0% change, respectively). Since the optimization is architecture-independent, I think that's just measurement noise. Even with a lot of measurement noise, there are no regressions greater than 1%, which itself indicates a nice win.

Apple M1, clang13
embg@embg-mbp zstd % ./zstd-dev-clang13 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),  185.0 MB/s, 1308.4 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),  129.8 MB/s, 1392.3 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),  105.9 MB/s, 1334.8 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   86.0 MB/s, 1466.0 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   82.6 MB/s, 1455.6 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   59.2 MB/s, 1483.8 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   40.7 MB/s, 1505.4 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   34.6 MB/s, 1513.6 MB/s
embg@embg-mbp zstd % ./zstd-opt-clang13 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),  186.9 MB/s, 1305.9 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),  128.0 MB/s, 1332.7 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),  110.8 MB/s, 1312.5 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   86.5 MB/s, 1364.0 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   82.6 MB/s, 1371.2 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   63.7 MB/s, 1506.9 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   46.5 MB/s, 1534.7 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   42.3 MB/s, 1537.3 MB/s

if (match[ml] == ip[ml]) /* potentially better */
currentMl = ZSTD_count(ip, match, iLimit);
if (dictMode == ZSTD_noDict) {
lookup = MEM_read32(match + pr); /* read 4B starting from (match + ml + 1 - sizeof(U32)) */
Why is lookup declared outside the for loop? Can't we declare it as const here, inside the if (dictMode == ZSTD_noDict) {} scope?

if (match[ml] == ip[ml]) /* potentially better */
currentMl = ZSTD_count(ip, match, iLimit);
if (dictMode == ZSTD_noDict) {
lookup = MEM_read32(match + pr); /* read 4B starting from (match + ml + 1 - sizeof(U32)) */

Same comment about the declaration of lookup.

@@ -692,8 +697,14 @@ size_t ZSTD_HcFindBestMatch(
if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
const BYTE* const match = base + matchIndex;
assert(matchIndex >= dictLimit); /* ensures this is true if dictMode != ZSTD_extDict */
if (match[ml] == ip[ml]) /* potentially better */
currentMl = ZSTD_count(ip, match, iLimit);
if (dictMode == ZSTD_noDict) {
Quick question : why is this modification protected by ZSTD_noDict ?
Is there a case where it wouldn't work if dictMode != ZSTD_noDict ?

@Cyan4973 (Contributor)

I see 2 levels of optimization bundled in this PR:

  • Comparing a trailing 4-byte U32 instead of a single byte, as explained in the PR
  • Saving the read of *(ip+ml-3) by caching it in a variable, ref

The first level is very simple and brings most of the benefit (>90%). It can be done with 2 simple line changes: replace if (match[ml] == ip[ml]) with if (MEM_read32(match + ml - 3) == MEM_read32(ip + ml - 3)). It's well contained, trivial to read and maintain, and offers a decent speed benefit at higher compression levels (L8+).

The second level is more complex: it requires maintaining a state variable, ref, which is updated in one place and used in another. Moreover, the speed benefit it offers is questionable: it seems positive over a large enough number of tests, but it's small enough to be within the noise margin. Meanwhile, the maintenance load is higher, making this section of the code more difficult to update in the future. If we were talking about the fast mode (zstd_fast.c), its value could be debated, since speed matters a lot there. But since we are talking about the middle compression levels, everything is a trade-off, and maintenance complexity is part of that trade-off. So I would recommend simply skipping this part.

This would make the patch much leaner, essentially a 2-line change with no interaction at a distance, while delivering >90% of the speed gains. A no-brainer for merge.

@embg (Contributor) commented Sep 14, 2022

This would make this patch much leaner, essentially a 2-lines change, with no distance interaction, while delivering > 90% of the speed gains. A no brainer for merge.

Good point @Cyan4973. I agree that the long-distance variables aren't worth the small additional perf win. @JunHe77 can you please update along the lines suggested by myself and @Cyan4973?

Compare 4B instead of 1B in ZSTD_noDict mode, so that candidates which match at match[ml] but mismatch somewhere in match[ml-3]..match[ml-1] are skipped. This reduces the number of ZSTD_count calls.

Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I3449ea423d5c8e8344f75341f19a2d1643c703f6
@JunHe77 (Contributor, Author) commented Sep 18, 2022

@embg , @Cyan4973 , thanks for the comments. Patch has been updated to reflect these.

@Cyan4973 Cyan4973 merged commit 97c23cf into facebook:dev Sep 19, 2022
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023
@JunHe77 JunHe77 deleted the comp branch March 12, 2023 07:17
5 participants