compress:check more bytes to reduce ZSTD_count call #3199
This patch widens the comparison in the longest-match loop from 1 byte to 4 bytes. The wider comparison is safe: the original implementation starts by comparing match[3] with ip[3], and only moves on to match[ml] and ip[ml] once ml is bigger than 3. So ml is never below 3, and the 4-byte reads ending at match[ml] and ip[ml] always stay within a valid range. The 4-byte comparison skips candidates that happen to match at ip[ml] but do not match ip[ml+1-sizeof(U32)]..ip[ml-1], so the number of ZSTD_count calls is reduced. Benchmarks on Arm show a good uplift with this change (up to 8% on N1 and 11% on M1, at level 8) with both gcc and clang.
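The difference between the two filters can be sketched in isolation. This is a minimal illustration, not the code from the patch: `read32` is a hypothetical stand-in for zstd's `MEM_read32`, and `worth_counting_1b` / `worth_counting_4b` are names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for MEM_read32: an unaligned 4-byte load. */
static uint32_t read32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* 1-byte filter: only the byte at offset ml is compared. */
static int worth_counting_1b(const uint8_t *ip, const uint8_t *match, size_t ml)
{
    return match[ml] == ip[ml];
}

/* 4-byte filter: compares bytes [ml-3, ml], i.e. the 4 bytes starting at
 * offset ml + 1 - sizeof(uint32_t). Requires ml >= 3 so the read does not
 * start before the candidate. */
static int worth_counting_4b(const uint8_t *ip, const uint8_t *match, size_t ml)
{
    assert(ml >= 3);
    return read32(match + ml - 3) == read32(ip + ml - 3);
}
```

With `ip = "abcdefgh"` and `ml = 4`, a candidate `"aXcdefgh"` passes the 1-byte filter (both have `'e'` at offset 4) but fails the 4-byte filter, because offsets 1..4 differ. Such a candidate cannot be longer than `ml` anyway, so skipping the full count loses nothing.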
I think this is good to merge (see measurements below), but there are two small nits that should be fixed first. I'll measure again on Skylake after the nits are resolved (measuring on M1 is harder and this optimization is architecture-independent, also the nits are not functional).
It's definitely a few % faster on Intel Skylake X, especially at levels 8-12 (which use the most ZSTD_count() operations). On clang15, level 12 is 10% faster!
Intel Skylake X, clang15
$ bench run -- ./zstd-dev-clang15 -b5 -e12 ~/silesia.tar
5#silesia.tar : 211975168 -> 62967059 (x3.366), 58.8 MB/s, 491.6 MB/s
6#silesia.tar : 211975168 -> 61468353 (x3.449), 41.4 MB/s, 526.3 MB/s
7#silesia.tar : 211975168 -> 60446005 (x3.507), 34.9 MB/s, 527.9 MB/s
8#silesia.tar : 211975168 -> 59919804 (x3.538), 27.5 MB/s, 539.3 MB/s
9#silesia.tar : 211975168 -> 59288102 (x3.575), 25.6 MB/s, 529.4 MB/s
10#silesia.tar : 211975168 -> 58635176 (x3.615), 19.0 MB/s, 535.3 MB/s
11#silesia.tar : 211975168 -> 58236155 (x3.640), 13.9 MB/s, 539.7 MB/s
12#silesia.tar : 211975168 -> 58180337 (x3.643), 13.0 MB/s, 539.1 MB/s
$ bench run -- ./zstd-opt-clang15 -b5 -e12 ~/silesia.tar
5#silesia.tar : 211975168 -> 62967059 (x3.366), 59.3 MB/s, 494.2 MB/s
6#silesia.tar : 211975168 -> 61468353 (x3.449), 42.6 MB/s, 524.4 MB/s
7#silesia.tar : 211975168 -> 60446005 (x3.507), 35.4 MB/s, 522.2 MB/s
8#silesia.tar : 211975168 -> 59919804 (x3.538), 28.7 MB/s, 541.3 MB/s
9#silesia.tar : 211975168 -> 59288102 (x3.575), 27.1 MB/s, 520.2 MB/s
10#silesia.tar : 211975168 -> 58635176 (x3.615), 19.8 MB/s, 531.3 MB/s
11#silesia.tar : 211975168 -> 58236155 (x3.640), 15.3 MB/s, 528.4 MB/s
12#silesia.tar : 211975168 -> 58180337 (x3.643), 14.3 MB/s, 537.5 MB/s
Intel Skylake X, gcc11
$ bench run -- ./zstd-dev-gcc11 -b5 -e12 ~/silesia.tar
5#silesia.tar : 211975168 -> 62967059 (x3.366), 56.1 MB/s, 524.1 MB/s
6#silesia.tar : 211975168 -> 61468353 (x3.449), 39.1 MB/s, 556.4 MB/s
7#silesia.tar : 211975168 -> 60446005 (x3.507), 33.6 MB/s, 558.7 MB/s
8#silesia.tar : 211975168 -> 59919804 (x3.538), 26.6 MB/s, 575.3 MB/s
9#silesia.tar : 211975168 -> 59288102 (x3.575), 24.7 MB/s, 557.8 MB/s
10#silesia.tar : 211975168 -> 58635176 (x3.615), 18.5 MB/s, 565.0 MB/s
11#silesia.tar : 211975168 -> 58236155 (x3.640), 13.7 MB/s, 568.4 MB/s
12#silesia.tar : 211975168 -> 58180337 (x3.643), 13.0 MB/s, 565.3 MB/s
$ bench run -- ./zstd-opt-gcc11 -b5 -e12 ~/silesia.tar
5#silesia.tar : 211975168 -> 62967059 (x3.366), 57.0 MB/s, 524.8 MB/s
6#silesia.tar : 211975168 -> 61468353 (x3.449), 41.4 MB/s, 557.8 MB/s
7#silesia.tar : 211975168 -> 60446005 (x3.507), 35.5 MB/s, 558.4 MB/s
8#silesia.tar : 211975168 -> 59919804 (x3.538), 28.1 MB/s, 575.4 MB/s
9#silesia.tar : 211975168 -> 59288102 (x3.575), 26.2 MB/s, 559.8 MB/s
10#silesia.tar : 211975168 -> 58635176 (x3.615), 19.7 MB/s, 565.3 MB/s
11#silesia.tar : 211975168 -> 58236155 (x3.640), 15.2 MB/s, 567.8 MB/s
12#silesia.tar : 211975168 -> 58180337 (x3.643), 13.9 MB/s, 565.5 MB/s
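The percentages quoted for these runs follow from the compression-speed column (the first MB/s figure). A minimal sketch of that arithmetic, where `percent_change` is a helper name introduced here for illustration:

```c
#include <assert.h>

/* Relative change, in percent, from a baseline throughput to a new one.
 * Both values are in the same unit (MB/s in the logs above). */
static double percent_change(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}
```

For level 12, `percent_change(13.0, 14.3)` gives about 10% for clang15 and `percent_change(13.0, 13.9)` gives about 7% for gcc11.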
It's harder to get accurate numbers for the Apple M1, since I don't have any setup for pinning to performance cores vs efficiency cores, disabling thermal throttling, etc. I plugged into power to reduce any battery saving effects.
There are nice wins at every optimized level except levels 6 and 9 (-1% and 0% change, respectively). Since the optimization is architecture-independent, I think that's just measurement noise. There are no regressions > 1%, even with a lot of measurement noise, which itself indicates a nice win.
Apple M1, clang13
embg@embg-mbp zstd % ./zstd-dev-clang13 -b5 -e12 ~/silesia.tar
5#silesia.tar : 211975168 -> 62967059 (x3.366), 185.0 MB/s, 1308.4 MB/s
6#silesia.tar : 211975168 -> 61468353 (x3.449), 129.8 MB/s, 1392.3 MB/s
7#silesia.tar : 211975168 -> 60446005 (x3.507), 105.9 MB/s, 1334.8 MB/s
8#silesia.tar : 211975168 -> 59919804 (x3.538), 86.0 MB/s, 1466.0 MB/s
9#silesia.tar : 211975168 -> 59288102 (x3.575), 82.6 MB/s, 1455.6 MB/s
10#silesia.tar : 211975168 -> 58635176 (x3.615), 59.2 MB/s, 1483.8 MB/s
11#silesia.tar : 211975168 -> 58236155 (x3.640), 40.7 MB/s, 1505.4 MB/s
12#silesia.tar : 211975168 -> 58180337 (x3.643), 34.6 MB/s, 1513.6 MB/s
embg@embg-mbp zstd % ./zstd-opt-clang13 -b5 -e12 ~/silesia.tar
5#silesia.tar : 211975168 -> 62967059 (x3.366), 186.9 MB/s, 1305.9 MB/s
6#silesia.tar : 211975168 -> 61468353 (x3.449), 128.0 MB/s, 1332.7 MB/s
7#silesia.tar : 211975168 -> 60446005 (x3.507), 110.8 MB/s, 1312.5 MB/s
8#silesia.tar : 211975168 -> 59919804 (x3.538), 86.5 MB/s, 1364.0 MB/s
9#silesia.tar : 211975168 -> 59288102 (x3.575), 82.6 MB/s, 1371.2 MB/s
10#silesia.tar : 211975168 -> 58635176 (x3.615), 63.7 MB/s, 1506.9 MB/s
11#silesia.tar : 211975168 -> 58236155 (x3.640), 46.5 MB/s, 1534.7 MB/s
12#silesia.tar : 211975168 -> 58180337 (x3.643), 42.3 MB/s, 1537.3 MB/s
lib/compress/zstd_lazy.c
if (match[ml] == ip[ml]) /* potentially better */
    currentMl = ZSTD_count(ip, match, iLimit);
if (dictMode == ZSTD_noDict) {
    lookup = MEM_read32(match + pr); /* read 4B starting from (match + ml + 1 - sizeof(U32)) */
Why is lookup declared outside the for loop? Can't we declare it as const here, inside the if (dictMode == ZSTD_noDict) {} scope?
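A sketch of what the suggested scoping could look like in a simplified search loop. This is not the zstd source: `dict_mode_e`, `read32`, `count_match`, and `best_match_length` are hypothetical names, candidates are assumed to lie before `ip` in the same buffer, and at least 4 bytes are assumed to remain at `ip`. A call counter is added only to make the effect on the number of full counts visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { MODE_NO_DICT, MODE_OTHER } dict_mode_e; /* hypothetical */

/* Hypothetical stand-in for MEM_read32: an unaligned 4-byte load. */
static uint32_t read32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Simplified stand-in for ZSTD_count: length of the common prefix. */
static size_t count_match(const uint8_t *ip, const uint8_t *match,
                          const uint8_t *iend)
{
    size_t n = 0;
    while (ip + n < iend && ip[n] == match[n]) n++;
    return n;
}

/* Returns the best match length found, or 3 if no candidate beats 3. */
static size_t best_match_length(const uint8_t *base, const size_t *cand,
                                size_t ncand, const uint8_t *ip,
                                const uint8_t *iend, dict_mode_e mode,
                                size_t *count_calls)
{
    size_t ml = 3; /* starting at 3 keeps the window [ml-3, ml] in range */
    size_t i;
    assert(iend - ip > 3);
    *count_calls = 0;
    for (i = 0; i < ncand; i++) {
        const uint8_t *const match = base + cand[i];
        size_t currentMl = 0;
        assert(match < ip);
        if (mode == MODE_NO_DICT) {
            /* const and block-scoped: no state outlives this iteration */
            const uint32_t lookup = read32(match + ml - 3);
            if (lookup == read32(ip + ml - 3)) {
                currentMl = count_match(ip, match, iend);
                (*count_calls)++;
            }
        } else if (match[ml] == ip[ml]) { /* original 1-byte check */
            currentMl = count_match(ip, match, iend);
            (*count_calls)++;
        }
        if (currentMl > ml) {
            ml = currentMl;
            if (ip + ml == iend) break; /* best possible; avoids overread */
        }
    }
    return ml;
}
```

On a buffer such as `"abcdef--xbcdef--abcdefgh"` with `ip` at offset 16 and candidates at offsets 8 and 0, both modes find a length of 6, but the 4-byte path performs one full count instead of two, because the candidate at offset 8 matches only at offset `ml`.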
lib/compress/zstd_lazy.c
if (match[ml] == ip[ml]) /* potentially better */
    currentMl = ZSTD_count(ip, match, iLimit);
if (dictMode == ZSTD_noDict) {
    lookup = MEM_read32(match + pr); /* read 4B starting from (match + ml + 1 - sizeof(U32)) */
Same comment about the declaration of lookup.
lib/compress/zstd_lazy.c
@@ -692,8 +697,14 @@ size_t ZSTD_HcFindBestMatch(
if ((dictMode != ZSTD_extDict) || matchIndex >= dictLimit) {
    const BYTE* const match = base + matchIndex;
    assert(matchIndex >= dictLimit); /* ensures this is true if dictMode != ZSTD_extDict */
    if (match[ml] == ip[ml]) /* potentially better */
        currentMl = ZSTD_count(ip, match, iLimit);
    if (dictMode == ZSTD_noDict) {
Quick question: why is this modification protected by ZSTD_noDict? Is there a case where it wouldn't work if dictMode != ZSTD_noDict?
I see 2 levels of optimization bundled in this PR:
The first level is very simple, and brings most of the benefits (> 90%). The second level is more complex: it requires maintaining state. Keeping only the first level would make this patch much leaner, essentially a 2-line change, with no distance interaction, while delivering > 90% of the speed gains. A no-brainer for merge.
Good point @Cyan4973. I agree that the long-distance variables aren't worth the small additional perf win. @JunHe77 can you please update along the lines suggested by myself and @Cyan4973?
Compare 4B instead of 1B in ZSTD_noDict mode. This avoids cases that match at match[ml] but mismatch in match[ml-3]..match[ml-1], so the number of ZSTD_count calls is reduced.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I3449ea423d5c8e8344f75341f19a2d1643c703f6