Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update linux kernel to latest zstd (from 1.4.10) #3275

Closed
kdave opened this issue Sep 27, 2022 · 22 comments
Closed

Update linux kernel to latest zstd (from 1.4.10) #3275

kdave opened this issue Sep 27, 2022 · 22 comments
Assignees

Comments

@kdave
Copy link

kdave commented Sep 27, 2022

Hi, people of reddit asked (https://old.reddit.com/r/kernel/comments/xp2o53/why_is_the_version_of_zstd_in_the_kernel_outdated/) about updating the linux port of zstd to newer version. This has been promised back then when 1.4.10 was synced a year ago and I'd be glad to see an update (or even a regular update in each kernel release) as btrfs is using zstd and any performance improvement is most welcome.

There's automation support (contrib/linux-kernel) so generating the patch itself should not take too much time. As there are more things to do like review and benchmarking it's not the last step but at least the preliminary version can be added to linux-next tree. The final pull request sent to Linus may or may not happen depending on the testing results.

IMHO neglecting the regular updates is more "expensive", like it was with the 1.4.10 update that took about a year and a lot of convincing. The linux-next tree is really convenient as it does not pose a huge risk for users and helps to catch bugs early.

@Natrox
Copy link

Natrox commented Sep 28, 2022

Reddit OP here, I have a version of Linux 5.19 with upstream Zstd. If you need help with testing, integration, anything at all - feel free to ask. I can create a preliminary patch on linux-next if it helps you.

@felixhandte
Copy link
Contributor

@kdave thanks for flagging this.

@Natrox, awesome!

@terrelln is the relevant authority for this stuff, so we should defer to him. He's out on vacation at the moment, but I'll make sure he sees this next week.

@terrelln
Copy link
Contributor

terrelln commented Oct 4, 2022

We are planning the 1.5.4 zstd release later this year. I will plan on updating the kernel with that release.

However, we don't expect too much more development before the release. So I will put up a patch to linux-next for testing with the current dev branch in the next 2 weeks. That will allow us to work out any kinks before the 1.5.4 release.

@terrelln terrelln self-assigned this Oct 4, 2022
@Natrox
Copy link

Natrox commented Oct 5, 2022

@terrelln Thank you for your confirmation! Is there any way I can assist with pre-release testing when the time comes?
Please let me know, I would be happy to run tests whenever necessary.

@nolange
Copy link
Contributor

nolange commented Oct 6, 2022

We are planning the 1.5.4 zstd release later this year. I will plan on updating the kernel with that release.

Did 1.5.3 happen, or do you have the convention that uneven means unstable?

@Natrox
Copy link

Natrox commented Oct 6, 2022

We are planning the 1.5.4 zstd release later this year. I will plan on updating the kernel with that release.

Did 1.5.3 happen, or do you have the convention that uneven means unstable?

Sorry to chime in, but I saw it so I figured I would give you a little info in case you did not know.
Far as I can tell it was elected not to release this version due to failed checks, however, there is a changelog you can find here:
https://github.com/facebook/zstd/blob/dev/CHANGELOG

The version notes show quite a few optimizations, however I do not know what will make it into 1.5.4 nor what caused the check fails.

I do not know the rest of the history behind it, so I think someone else will correct me or add information. I could speculate but I think you know better in this case.

@terrelln
Copy link
Contributor

terrelln commented Oct 6, 2022

Did 1.5.3 happen, or do you have the convention that uneven means unstable?

We have not released 1.5.3 publicly yet. We were experimenting with a different release process where we release internally first, and publicly second. But other priorities took over and we haven't made the public release yet.

I don't know if we will call it 1.5.3 or 1.5.4, but we will make a public release later this year.

@terrelln
Copy link
Contributor

terrelln commented Oct 6, 2022

@terrelln Thank you for your confirmation! Is there any way I can assist with pre-release testing when the time comes?
Please let me know, I would be happy to run tests whenever necessary.

Thanks! I will update here when I've put the patch up. At that point any testing would be greatly appreciated!

@terrelln
Copy link
Contributor

FYI I've started working on this. I expect to have a patch up next week. And I am going to be aiming for the v6.2 merge window.

@terrelln
Copy link
Contributor

Quick update:

I'm preparing a patch to update the kernel to v1.5.2 now, aiming for the v6.2 merge window. If the next zstd release is out before then, I will update to that release instead.

I've completed the build, boot, and light btrfs/squashfs/zram testing. I've started performance measurements, however I'm seeing a ~10-20% performance regression for compression levels 5-15. I've bisected it to 13cad3a, which was ~neutral in user-space, but seems to not do well in kernel-space.

I'm debugging the performance issues, but I'm not sure how long it will take me. I expect to spend a bit more time on it, but the efforts might stall out before I find a satisfactory solution.

So quick question for interested parties: Are you still interested in an update patch that would regression performance for levels 5-15? These probably aren't super heavily used in the kernel, since I'd expect most usage to be levels 1-3, and decompression. If I don't find a solution soon, I could still put the patch up & merge into my linux-next branch to get test coverage. That would allow us to get a jump start on testing the majority of the code, and if we do find a solution, we can add that later.

@Natrox
Copy link

Natrox commented Oct 19, 2022

@terrelln Thank you for the update! And thank you for your work thus far.

Personally speaking, I do not mind the compression regression but it is hard to say whether that is a shared sentiment. I'll leave that for others. I do use 3+ compression levels quite a bit, but my situation does not care for write speed.

Speaking on where to go with it, I think your plan thus far sounds sane, particularly the merge into linux-next.
I'll stay tuned, happy to help you test the branch, my CPUs have some room today. Thanks once again.

@John-Gee
Copy link

John-Gee commented Oct 19, 2022

@terrelln Could you maybe switch compression levels to balance the regression or would that drop compression ratio too much? (ie make 15 in latest be more like current 13 or whatever other level with a similar performance and ratio). Or maybe you could revert the bad commit while you figure it out?

Personally I use 2-5 but I'd be happy to use higher if performance was better (which I believe was supposed to be the case with newer zstd).

@kdave
Copy link
Author

kdave commented Oct 19, 2022

10-20% is not negligible perf regression, I'd rather not push such update. We got speedups in previous updates so not making it worse should be one of the criteria. I'd understand a minor perf drop (say <3%) in case of some necessary code updates like added safety checks but the linked commit is about improving compilation speed. That's trading time on developer side for all user's run time which is IMHO wrong.

Also you're measuring -O3 builds, you're asking the compiler to try its best to produce fast code, that's expected to be slow. A quicker development build can do -O2 without much difference to -O3 in compression speed but with reduced compilation time. The -O3 is more like a production build, that could be tested less frequently, eg. daily or weekly to check if there are regressions but not necessarily for each PR/commit.

In the commit there are more indirections done due to separating functions and adding pointer calls, that's always going to hurt in kernel due to all the spectre mitigations. The fix is to avoid indirection if possible and/or replace it by switches, but the patch does the opposite. It may be better from the code organization and design patterns point of view but for a high cost.

@X6B
Copy link

X6B commented Oct 19, 2022

@terrelln I am interested. Share your patch please.

@terrelln
Copy link
Contributor

In the commit there are more indirections done due to separating functions and adding pointer calls, that's always going to hurt in kernel due to all the spectre mitigations. The fix is to avoid indirection if possible and/or replace it by switches, but the patch does the opposite. It may be better from the code organization and design patterns point of view but for a high cost.

Yeah, I've verified that the indirect function calls are definitely causing the issue. We went this route to reduce code size. But I can try to re-structure the code to keep the code size wins, but without the indirect function calls.

terrelln added a commit to terrelln/zstd that referenced this issue Oct 20, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR facebook#2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR facebook#2828 was merged.

This PR is necessary for Issue facebook#3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
@terrelln
Copy link
Contributor

I've been able to reduce the regression with PR #3295. There is still a bit of work to do, but we should be able to at least get pretty close to neutral for the compression levels where we saw a regression.

terrelln added a commit to terrelln/zstd that referenced this issue Oct 20, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR facebook#2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR facebook#2828 was merged.

This PR is necessary for Issue facebook#3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
terrelln added a commit to terrelln/zstd that referenced this issue Oct 21, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR facebook#2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR facebook#2828 was merged.

This PR is necessary for Issue facebook#3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
terrelln added a commit that referenced this issue Oct 21, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR #2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR #2828 was merged.

This PR is necessary for Issue #3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
terrelln added a commit that referenced this issue Oct 22, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR #2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR #2828 was merged.

This PR is necessary for Issue #3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
@John-Gee
Copy link

Will the update to .4 be easier to do or is still complicated?

@X6B
Copy link

X6B commented Feb 20, 2023

Will the update to .4 be easier to do or is still complicated?

I'm using this patch from Oleksandr Natalenko for 6.2 kernel: https://codeberg.org/pf-kernel/linux/commit/d8907dff4871c104a5a67e43157cc0d8012408e1

No problems here.

@ptr1337
Copy link

ptr1337 commented Feb 20, 2023

Will the update to .4 be easier to do or is still complicated?

I'm using this patch from Oleksandr Natalenko for 6.2 kernel: https://codeberg.org/pf-kernel/linux/commit/d8907dff4871c104a5a67e43157cc0d8012408e1

No problems here.

Using it also in linux-cachyos. Is working fine.

@terrelln
Copy link
Contributor

Closing this issue because Linux has been updated to v1.5.2.

Will the update to .4 be easier to do or is still complicated?

It will be easy, but I'm going to wait until the v6.4 merge window, since I haven't been baking the patch in linux-next yet.

I'm using this patch from Oleksandr Natalenko for 6.2 kernel: https://codeberg.org/pf-kernel/linux/commit/d8907dff4871c104a5a67e43157cc0d8012408e1
No problems here.

Glad to hear it!

@zhmars
Copy link

zhmars commented May 1, 2023

Closing this issue because Linux has been updated to v1.5.2

Hi, I notice the update of xxhash lib (required by zstd and ksm) for linux is missing.

zstd v1.5.1 had updated it to latest v0.8.1. But zstd v1.5.2 in linux kernel still using a old version. There are no major update since it was added: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/lib/xxhash.c

Could you consider updating it to latest version? This may avoid potential compatibility issues and bring some performance improvements.

@terrelln
Copy link
Contributor

terrelln commented May 3, 2023

This may avoid potential compatibility issues and bring some performance improvements.

There will be no compatibility issues, because zstd only uses the stable public API, and the hash function is fixed. AFAIK there haven't been any performance improvements to XXH64, but @Cyan4973 can verify.

I'd recommend opening an issue on the XXHash project to track this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants