
compress: check more bytes to reduce ZSTD_count call #3199

Merged (1 commit into facebook:dev) on Sep 19, 2022

Conversation

@JunHe77 (Contributor) commented Jul 18, 2022

Signed-off-by: Jun He jun.he@arm.com
Change-Id: I3449ea423d5c8e8344f75341f19a2d1643c703f6

@JunHe77 (Contributor, Author) commented Jul 18, 2022

This patch widens the comparison in the longest-match loop from 1B to 4B.

The extended comparison is safe: the original implementation starts by comparing the bytes at match[3] and ip[3], and later moves to match[ml] and ip[ml] once ml grows beyond 3, so reading 4 bytes ending at those positions keeps both match and ip within a valid range.

Comparing 4B skips candidates that happen to match at ip[ml] but mismatch somewhere in {ip[ml+1-sizeof(U32)] .. ip[ml-1]}, so the number of ZSTD_count calls is reduced.

Benchmarks on Arm show a good uplift with this change (up to 8% on N1 and 11% on M1, at L8) with both gcc and clang.

@JunHe77 (Contributor, Author) commented Aug 31, 2022

@terrelln , @embg , anything I need to update on this PR? Thanks.

@embg (Contributor) commented Aug 31, 2022

@terrelln , @embg , anything I need to update on this PR? Thanks.

Sorry for the delay, I just need to run benchmarks. Will do it tomorrow.

Edit: have been working through a huge backlog, sorry again for the delay. Will prioritize the benchmarking this week.

@embg (Contributor) left a comment

I think this is good to merge (see measurements below), but there are two small nits that should be fixed first. I'll measure again on Skylake after the nits are resolved (measuring on M1 is harder, this optimization is architecture-independent, and the nits are not functional).


It's definitely a few % faster on Intel Skylake X, especially at levels 8-12 (which use the most ZSTD_count() operations). On clang15, level 12 is 10% faster!

Intel Skylake X, clang15
$ bench run -- ./zstd-dev-clang15 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),   58.8 MB/s,  491.6 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),   41.4 MB/s,  526.3 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),   34.9 MB/s,  527.9 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   27.5 MB/s,  539.3 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   25.6 MB/s,  529.4 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   19.0 MB/s,  535.3 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   13.9 MB/s,  539.7 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   13.0 MB/s,  539.1 MB/s
$ bench run -- ./zstd-opt-clang15 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),   59.3 MB/s,  494.2 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),   42.6 MB/s,  524.4 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),   35.4 MB/s,  522.2 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   28.7 MB/s,  541.3 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   27.1 MB/s,  520.2 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   19.8 MB/s,  531.3 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   15.3 MB/s,  528.4 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   14.3 MB/s,  537.5 MB/s
Intel Skylake X, gcc11
$ bench run -- ./zstd-dev-gcc11 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),   56.1 MB/s,  524.1 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),   39.1 MB/s,  556.4 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),   33.6 MB/s,  558.7 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   26.6 MB/s,  575.3 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   24.7 MB/s,  557.8 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   18.5 MB/s,  565.0 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   13.7 MB/s,  568.4 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   13.0 MB/s,  565.3 MB/s
$ bench run -- ./zstd-opt-gcc11 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),   57.0 MB/s,  524.8 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),   41.4 MB/s,  557.8 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),   35.5 MB/s,  558.4 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   28.1 MB/s,  575.4 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   26.2 MB/s,  559.8 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   19.7 MB/s,  565.3 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   15.2 MB/s,  567.8 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   13.9 MB/s,  565.5 MB/s

It's harder to get accurate numbers for the Apple M1, since I don't have any setup for pinning to performance vs. efficiency cores, disabling thermal throttling, etc. I plugged into AC power to reduce any battery-saving effects.

There are nice wins at every optimized level except levels 6 and 9 (-1% and 0% change, respectively). Since the optimization is architecture-independent, I think that's just measurement noise. Even with a lot of measurement noise, there are no regressions greater than 1%, which itself indicates a nice win.

Apple M1, clang13
embg@embg-mbp zstd % ./zstd-dev-clang13 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),  185.0 MB/s, 1308.4 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),  129.8 MB/s, 1392.3 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),  105.9 MB/s, 1334.8 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   86.0 MB/s, 1466.0 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   82.6 MB/s, 1455.6 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   59.2 MB/s, 1483.8 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   40.7 MB/s, 1505.4 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   34.6 MB/s, 1513.6 MB/s
embg@embg-mbp zstd % ./zstd-opt-clang13 -b5 -e12 ~/silesia.tar
 5#silesia.tar       : 211975168 ->  62967059 (x3.366),  186.9 MB/s, 1305.9 MB/s
 6#silesia.tar       : 211975168 ->  61468353 (x3.449),  128.0 MB/s, 1332.7 MB/s
 7#silesia.tar       : 211975168 ->  60446005 (x3.507),  110.8 MB/s, 1312.5 MB/s
 8#silesia.tar       : 211975168 ->  59919804 (x3.538),   86.5 MB/s, 1364.0 MB/s
 9#silesia.tar       : 211975168 ->  59288102 (x3.575),   82.6 MB/s, 1371.2 MB/s
10#silesia.tar       : 211975168 ->  58635176 (x3.615),   63.7 MB/s, 1506.9 MB/s
11#silesia.tar       : 211975168 ->  58236155 (x3.640),   46.5 MB/s, 1534.7 MB/s
12#silesia.tar       : 211975168 ->  58180337 (x3.643),   42.3 MB/s, 1537.3 MB/s

if (match[ml] == ip[ml]) /* potentially better */
currentMl = ZSTD_count(ip, match, iLimit);
if (dictMode == ZSTD_noDict) {
lookup = MEM_read32(match + pr); /* read 4B starting from (match + ml + 1 - sizeof(U32)) */
Why is lookup declared outside the for loop? Can't we declare it as const here, inside the if (dictMode == ZSTD_noDict) {} scope?

if (match[ml] == ip[ml]) /* potentially better */
currentMl = ZSTD_count(ip, match, iLimit);
if (dictMode == ZSTD_noDict) {
lookup = MEM_read32(match + pr); /* read 4B starting from (match + ml + 1 - sizeof(U32)) */

Same comment about the declaration of lookup.

@@ -692,8 +697,14 @@ size_t ZSTD_HcFindBestMatch(
if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
const BYTE* const match = base + matchIndex;
assert(matchIndex >= dictLimit); /* ensures this is true if dictMode != ZSTD_extDict */
if (match[ml] == ip[ml]) /* potentially better */
currentMl = ZSTD_count(ip, match, iLimit);
if (dictMode == ZSTD_noDict) {
Quick question : why is this modification protected by ZSTD_noDict ?
Is there a case where it wouldn't work if dictMode != ZSTD_noDict ?

@Cyan4973 (Contributor)

I see 2 levels of optimization bundled in this PR:

  • Comparing a trailing 4-byte U32 instead of a single byte, as explained in the PR
  • Saving the read of *(ip+ml-3) by caching it in a variable, ref

The first level is very simple and brings most of the benefit (>90%). It can be done with 2 simple line changes: replace if (match[ml] == ip[ml]) with if (MEM_read32(match + ml - 3) == MEM_read32(ip + ml - 3)). It's well contained, trivial to read and maintain, and offers a decent speed benefit at higher compression levels (L8+).

The second level is more complex: it requires maintaining a state variable, ref, which is updated in one place and used in another. Moreover, the speed benefit it offers is questionable: it seems positive over a large enough number of tests, but it's small enough to be within the noise margin. Meanwhile, the maintenance load is higher, making this section of the code more difficult to update in the future. If we were talking about the fast mode (zstd_fast.c), its value could be debated, since speed matters a lot there. But since we are talking about the middle compression levels, everything is a trade-off, and maintenance complexity is part of that trade-off. So I would recommend simply skipping this part.

This would make the patch much leaner, essentially a 2-line change with no interaction at a distance, while delivering >90% of the speed gains. A no-brainer for merge.

@embg (Contributor) commented Sep 14, 2022

This would make this patch much leaner, essentially a 2-lines change, with no distance interaction, while delivering > 90% of the speed gains. A no brainer for merge.

Good point @Cyan4973. I agree that the long-distance variables aren't worth the small additional perf win. @JunHe77 can you please update along the lines suggested by myself and @Cyan4973?

Compare 4B instead of 1B in ZSTD_noDict mode, so that candidates which match at match[ml] but mismatch somewhere in match[ml-3]..match[ml-1] are skipped. This reduces the number of ZSTD_count calls.

Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I3449ea423d5c8e8344f75341f19a2d1643c703f6
@JunHe77 (Contributor, Author) commented Sep 18, 2022

@embg , @Cyan4973 , thanks for the comments. Patch has been updated to reflect these.

@Cyan4973 Cyan4973 merged commit 97c23cf into facebook:dev Sep 19, 2022
@Cyan4973 Cyan4973 mentioned this pull request Feb 9, 2023
@JunHe77 JunHe77 deleted the comp branch March 12, 2023 07:17
5 participants