
Implement abstraction over mul_add #22

Merged
merged 2 commits into rust-av:main on Feb 29, 2024

Conversation

FreezyLemon
Contributor

The codegen for mul_add is a bit nonsensical when FMA is disabled: Compiler Explorer

I think fmaf is a libc function? Anyway, this is actually the main cause of the non-FMA slowdowns:

Tested with RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-fma".

Before:

yuv 8-bit 4:4:4 full to xyb
                        time:   [5.0675 ms 5.0925 ms 5.1187 ms]

yuv 8-bit 4:2:0 full to xyb
                        time:   [4.9403 ms 4.9446 ms 4.9489 ms]

yuv 8-bit 4:2:0 limited to xyb
                        time:   [4.9657 ms 4.9698 ms 4.9748 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

yuv 10-bit 4:4:4 full to xyb
                        time:   [15.780 ms 15.842 ms 15.908 ms]
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe

yuv 10-bit 4:2:0 full to xyb
                        time:   [15.594 ms 15.612 ms 15.632 ms]
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

yuv 10-bit 4:2:0 limited to xyb
                        time:   [15.482 ms 15.569 ms 15.663 ms]
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

rgb to lrgb via hybrid log-gamma system
                        time:   [2.6201 ms 2.6238 ms 2.6277 ms]
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe

After:

yuv 8-bit 4:4:4 full to xyb
                        time:   [1.6948 ms 1.7007 ms 1.7064 ms]
                        change: [-66.733% -66.546% -66.355%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking yuv 8-bit 4:2:0 full to xyb: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
yuv 8-bit 4:2:0 full to xyb
                        time:   [1.6762 ms 1.6768 ms 1.6774 ms]
                        change: [-66.124% -66.092% -66.061%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking yuv 8-bit 4:2:0 limited to xyb: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
yuv 8-bit 4:2:0 limited to xyb
                        time:   [1.7152 ms 1.7178 ms 1.7197 ms]
                        change: [-65.841% -65.744% -65.660%] (p = 0.00 < 0.05)
                        Performance has improved.

yuv 10-bit 4:4:4 full to xyb
                        time:   [3.5269 ms 3.5337 ms 3.5414 ms]
                        change: [-77.797% -77.694% -77.597%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

yuv 10-bit 4:2:0 full to xyb
                        time:   [3.4447 ms 3.4582 ms 3.4715 ms]
                        change: [-77.940% -77.849% -77.754%] (p = 0.00 < 0.05)
                        Performance has improved.

yuv 10-bit 4:2:0 limited to xyb
                        time:   [3.5220 ms 3.5237 ms 3.5254 ms]
                        change: [-77.505% -77.368% -77.240%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

rgb to lrgb via hybrid log-gamma system
                        time:   [275.83 µs 275.95 µs 276.09 µs]
                        change: [-89.500% -89.484% -89.469%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Apparently, the codegen for mul_add is really bad
when FMA is disabled. Let's see if this is true
@FreezyLemon
Contributor Author

I'm actually not sure whether it'd be worth using multiply_add over mul_add everywhere in the project, but the polyX functions seem to be the most important.
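A minimal sketch of what such an abstraction could look like (the name `multiply_add` follows the comment above; the crate's actual implementation may differ):

```rust
/// Hypothetical wrapper illustrating the idea discussed here: use the
/// fused `mul_add` only when the target actually has FMA, and fall back
/// to a plain multiply-plus-add otherwise, accepting a second rounding
/// step in exchange for fast scalar codegen on non-FMA targets.
#[inline(always)]
pub fn multiply_add(a: f32, b: f32, c: f32) -> f32 {
    #[cfg(target_feature = "fma")]
    {
        a.mul_add(b, c)
    }
    #[cfg(not(target_feature = "fma"))]
    {
        a * b + c
    }
}

fn main() {
    // Both paths agree whenever no rounding is lost: 2 * 3 + 4 = 10.
    assert_eq!(multiply_add(2.0, 3.0, 4.0), 10.0);
}
```

Since `target_feature` is a compile-time cfg, the branch is selected at build time and the wrapper inlines to a single expression either way.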

@FreezyLemon
Contributor Author

I read up on libc::fmaf, and it seems like this change will result in reduced accuracy. fmaf guarantees only one rounding error, but uses very slow code to achieve that on non-FMA platforms. This probably needs a bit more consideration.
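The accuracy difference is easy to demonstrate with plain std Rust (nothing project-specific): with x one ulp above 1.0, the exact product x*x has a 2^-46 tail that a plain f32 multiply rounds away, while the single-rounding mul_add preserves it:

```rust
fn main() {
    let x = 1.0f32 + f32::EPSILON; // x = 1 + 2^-23
    let rounded = x * x;           // exact tail 2^-46 is rounded off here

    // Residual of the rounded product, computed two ways:
    let fused = x.mul_add(x, -rounded); // single rounding: exactly 2^-46
    let plain = x * x - rounded;        // double rounding: cancels to 0

    assert_eq!(fused, 2.0f32.powi(-46));
    assert_eq!(plain, 0.0);
}
```

So swapping mul_add for a * b + c silently changes results in the last bits, which is exactly the trade-off being weighed here.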

@shssoichiro
Collaborator

shssoichiro commented Feb 29, 2024

Interesting. I'll accept this for now, but I think I actually want to upstream this to the Rust stdlib or at least have them consider it. It seems like a common footgun.

@shssoichiro shssoichiro merged commit f51dd7f into rust-av:main Feb 29, 2024
1 check passed
@FreezyLemon
Contributor Author

FreezyLemon commented Feb 29, 2024

It's a somewhat common misconception; I can't find the link now, but people have discussed the advantages and disadvantages of mul_add in a GitHub issue before (Clippy's, I think? They have a lint that recommends mul_add over a * b + c).

The thing is that you shouldn't use mul_add for performance gains, because the only thing it guarantees is higher accuracy (one rounding error instead of two). The speed increase is incidental, although it is reliable on any x86-64-v3 (or later) target.

So I guess it was our fault for using mul_add and expecting speed increases across the board?

@shssoichiro
Collaborator

shssoichiro commented Feb 29, 2024

It should give improvements when built with FMA in most cases, because without mul_add the compiler will not generate FMA instructions, particularly when vectorization is possible as well. The issue only arises when compiling without FMA support.
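The kind of hot loop this applies to is polynomial evaluation via Horner's scheme, where every step is one multiply-add (an illustrative sketch, not the crate's actual polyX code). With FMA enabled, rustc can lower each mul_add step to a single fused instruction and vectorize around it; without FMA, each mul_add becomes a slow fmaf library call, which is the slowdown measured above:

```rust
// Horner's scheme: ((c0 * x + c1) * x + c2) ... — each fold step is one
// multiply-add, which maps to a single FMA instruction when available.
fn poly(coeffs: &[f32], x: f32) -> f32 {
    coeffs.iter().fold(0.0f32, |acc, &c| acc.mul_add(x, c))
}

fn main() {
    // 2x^2 + 3x + 4 at x = 10 -> 234 (exact in f32)
    assert_eq!(poly(&[2.0, 3.0, 4.0], 10.0), 234.0);
}
```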

@FreezyLemon FreezyLemon deleted the non-fma-fix branch February 29, 2024 18:09
@FreezyLemon
Contributor Author

FreezyLemon commented Feb 29, 2024

Yeah, that's what I meant. It's a bit weird: on the one hand, rustc doesn't seem to generate FMA instructions without mul_add. On the other hand, mul_add's codegen implies that the primary concern is the improved precision of the floating-point operation, since it favors high precision over fast codegen (mul_add could just be one mul and one add instruction on non-FMA targets if precision didn't matter).

@shssoichiro
Collaborator

Not surprisingly, Rust won't fix this as it alters the result: rust-lang/rust#112192

There is some more in-depth talk about a -ffast-math mode here: rust-lang/rust#21690 but it doesn't sound like we're likely to get it any time soon, so it seems like we're stuck doing a manual workaround. Unfortunate.
