
Implement abstraction over mul_add #22

Merged
merged 2 commits into rust-av:main on Feb 29, 2024

Conversation

FreezyLemon
Contributor

The codegen for mul_add is a bit nonsensical when FMA is disabled: Compiler Explorer

I think fmaf is a libc function? Anyway, this is actually the main cause of the non-FMA slowdowns:

Tested with RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-fma".

Before:

yuv 8-bit 4:4:4 full to xyb
                        time:   [5.0675 ms 5.0925 ms 5.1187 ms]

yuv 8-bit 4:2:0 full to xyb
                        time:   [4.9403 ms 4.9446 ms 4.9489 ms]

yuv 8-bit 4:2:0 limited to xyb
                        time:   [4.9657 ms 4.9698 ms 4.9748 ms]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

yuv 10-bit 4:4:4 full to xyb
                        time:   [15.780 ms 15.842 ms 15.908 ms]
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe

yuv 10-bit 4:2:0 full to xyb
                        time:   [15.594 ms 15.612 ms 15.632 ms]
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

yuv 10-bit 4:2:0 limited to xyb
                        time:   [15.482 ms 15.569 ms 15.663 ms]
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

rgb to lrgb via hybrid log-gamma system
                        time:   [2.6201 ms 2.6238 ms 2.6277 ms]
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe

After:

yuv 8-bit 4:4:4 full to xyb
                        time:   [1.6948 ms 1.7007 ms 1.7064 ms]
                        change: [-66.733% -66.546% -66.355%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking yuv 8-bit 4:2:0 full to xyb: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
yuv 8-bit 4:2:0 full to xyb
                        time:   [1.6762 ms 1.6768 ms 1.6774 ms]
                        change: [-66.124% -66.092% -66.061%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking yuv 8-bit 4:2:0 limited to xyb: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
yuv 8-bit 4:2:0 limited to xyb
                        time:   [1.7152 ms 1.7178 ms 1.7197 ms]
                        change: [-65.841% -65.744% -65.660%] (p = 0.00 < 0.05)
                        Performance has improved.

yuv 10-bit 4:4:4 full to xyb
                        time:   [3.5269 ms 3.5337 ms 3.5414 ms]
                        change: [-77.797% -77.694% -77.597%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

yuv 10-bit 4:2:0 full to xyb
                        time:   [3.4447 ms 3.4582 ms 3.4715 ms]
                        change: [-77.940% -77.849% -77.754%] (p = 0.00 < 0.05)
                        Performance has improved.

yuv 10-bit 4:2:0 limited to xyb
                        time:   [3.5220 ms 3.5237 ms 3.5254 ms]
                        change: [-77.505% -77.368% -77.240%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

rgb to lrgb via hybrid log-gamma system
                        time:   [275.83 µs 275.95 µs 276.09 µs]
                        change: [-89.500% -89.484% -89.469%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Apparently, the codegen for mul_add is really bad
when FMA is disabled. Let's see if this is true
@FreezyLemon
Contributor Author

I'm actually not sure whether it'd be worth using multiply_add over mul_add everywhere in the project, but the polyX functions seem to be the most important.
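A minimal sketch of what such an abstraction could look like (the name `multiply_add` follows the comment above; the crate's actual implementation may differ):

```rust
/// Hypothetical wrapper illustrating the idea discussed here: use the
/// fused `mul_add` only when the target actually has FMA, and fall back
/// to a plain multiply-plus-add otherwise, accepting a second rounding
/// step in exchange for fast scalar codegen on non-FMA targets.
#[inline(always)]
pub fn multiply_add(a: f32, b: f32, c: f32) -> f32 {
    #[cfg(target_feature = "fma")]
    {
        a.mul_add(b, c)
    }
    #[cfg(not(target_feature = "fma"))]
    {
        a * b + c
    }
}

fn main() {
    // Both paths agree whenever no rounding is lost: 2 * 3 + 4 = 10.
    assert_eq!(multiply_add(2.0, 3.0, 4.0), 10.0);
}
```

Since `target_feature` is a compile-time cfg, the branch is selected at build time and the wrapper inlines to a single expression either way.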

@FreezyLemon
Contributor Author

I read up on libc::fmaf, and it seems like this change will result in reduced accuracy. fmaf guarantees only one rounding error, but uses very slow code to achieve that on non-FMA platforms. This probably needs a bit more consideration.
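The accuracy difference is easy to demonstrate with plain std Rust (nothing project-specific): with x one ulp above 1.0, the exact product x*x has a 2^-46 tail that a plain f32 multiply rounds away, while the single-rounding mul_add preserves it:

```rust
fn main() {
    let x = 1.0f32 + f32::EPSILON; // x = 1 + 2^-23
    let rounded = x * x;           // exact tail 2^-46 is rounded off here

    // Residual of the rounded product, computed two ways:
    let fused = x.mul_add(x, -rounded); // single rounding: exactly 2^-46
    let plain = x * x - rounded;        // double rounding: cancels to 0

    assert_eq!(fused, 2.0f32.powi(-46));
    assert_eq!(plain, 0.0);
}
```

So swapping mul_add for a * b + c silently changes results in the last bits, which is exactly the trade-off being weighed here.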

@shssoichiro
Collaborator

shssoichiro commented Feb 29, 2024

Interesting. I'll accept this for now, but I think I actually want to upstream this to the Rust stdlib or at least have them consider it. It seems like a common footgun.

@shssoichiro shssoichiro merged commit f51dd7f into rust-av:main Feb 29, 2024
1 check passed
@FreezyLemon
Contributor Author

FreezyLemon commented Feb 29, 2024

It's a somewhat common misconception; I can't find the link now, but people have discussed the advantages and disadvantages of mul_add in a GitHub issue before (Clippy's, I think? They have a lint that recommends mul_add over a * b + c).

The thing is that you shouldn't use mul_add for performance gains, because the only thing it guarantees is higher accuracy (one rounding error instead of two). The speed increase is incidental, although it is reliable on any x86-64-v3 (or later) target.

So I guess it was our fault for using mul_add and expecting speed increases across the board?

@shssoichiro
Collaborator

shssoichiro commented Feb 29, 2024

It should give improvements when built with FMA in most cases, because without mul_add the compiler will not generate FMA instructions, particularly when vectorization is possible as well. The issue only arises when compiling without FMA support.
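The kind of hot loop this applies to is polynomial evaluation via Horner's scheme, where every step is one multiply-add (an illustrative sketch, not the crate's actual polyX code). With FMA enabled, rustc can lower each mul_add step to a single fused instruction and vectorize around it; without FMA, each mul_add becomes a slow fmaf library call, which is the slowdown measured above:

```rust
// Horner's scheme: ((c0 * x + c1) * x + c2) ... — each fold step is one
// multiply-add, which maps to a single FMA instruction when available.
fn poly(coeffs: &[f32], x: f32) -> f32 {
    coeffs.iter().fold(0.0f32, |acc, &c| acc.mul_add(x, c))
}

fn main() {
    // 2x^2 + 3x + 4 at x = 10 -> 234 (exact in f32)
    assert_eq!(poly(&[2.0, 3.0, 4.0], 10.0), 234.0);
}
```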

@FreezyLemon FreezyLemon deleted the non-fma-fix branch February 29, 2024 18:09
@FreezyLemon
Contributor Author

FreezyLemon commented Feb 29, 2024

Yeah, that's what I meant. It's a bit weird: on the one hand, rustc doesn't seem to generate FMA instructions without mul_add. On the other hand, mul_add's codegen implies that the primary concern is the improved precision of the floating-point operation, since it favors high precision over fast codegen (mul_add could just be one mul and one add instruction on non-FMA targets if precision didn't matter).

@shssoichiro
Collaborator

Not surprisingly, Rust won't fix this as it alters the result: rust-lang/rust#112192

There is some more in-depth talk about a -ffast-math mode here: rust-lang/rust#21690 but it doesn't sound like we're likely to get it any time soon, so it seems like we're stuck doing a manual workaround. Unfortunate.
