Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aarch64: Revert "Use CASP instead of LDXP/STXP for store/RMW if available" on Apple hardware #89

Merged
merged 1 commit into from
Mar 25, 2023

Conversation

taiki-e
Copy link
Owner

@taiki-e taiki-e commented Mar 25, 2023

As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much slower than LL/SC loop-based RMW.
So, revert 93e6ec5 (#70) on Apple hardware for now.

Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50.
Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Collecting 100
bench_portable_atomic_arch/u128_concurrent_swap
                        time:   [1.4365 ms 1.5230 ms 1.5948 ms]
                        change: [+378.49% +402.44% +424.27%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) low severe
  4 (4.00%) low mild
Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Warming u
Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Collectin
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [286.21 µs 292.15 µs 296.65 µs]
                        change: [+96.504% +102.53% +108.63%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Collecting
bench_portable_atomic_arch/u128_concurrent_fetch_add
                        time:   [1.6351 ms 1.6787 ms 1.7170 ms]
                        change: [+279.29% +294.80% +309.89%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild

@taiki-e taiki-e added the O-arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state label Mar 25, 2023
…able" on Apple hardware

As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much
slower than LL/SC loop-based RMW.
So, revert 93e6ec5 on Apple hardware
for now.

```
Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50.
Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Collecting 100
bench_portable_atomic_arch/u128_concurrent_swap
                        time:   [1.4365 ms 1.5230 ms 1.5948 ms]
                        change: [+378.49% +402.44% +424.27%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) low severe
  4 (4.00%) low mild
Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Warming u
Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Collectin
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [286.21 µs 292.15 µs 296.65 µs]
                        change: [+96.504% +102.53% +108.63%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Collecting
bench_portable_atomic_arch/u128_concurrent_fetch_add
                        time:   [1.6351 ms 1.6787 ms 1.7170 ms]
                        change: [+279.29% +294.80% +309.89%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild
```
@taiki-e taiki-e merged commit 90e2b0e into main Mar 25, 2023
@taiki-e taiki-e deleted the aarch64-apple branch March 25, 2023 18:56

// As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much slower than LL/SC
// loop-based RMW: https://github.com/taiki-e/portable-atomic/pull/89
if is_macos || target_os == "ios" || target_os == "tvos" || target_os == "watchos" {
Copy link

@teohhanhui teohhanhui Oct 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This misses Asahi Linux...

And is still the case:

let is_apple = is_macos || target_os == "ios" || target_os == "tvos" || target_os == "watchos";

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to know that the code is being compiled against Asahi Linux at build time?

Otherwise, it would be reasonable to set --cfg portable_atomic_ll_sc_rmw on the user's side.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an option to revert the whole 93e6ec5, but we want to keep the same codegen as LLVM if possible (LLVM uses CASP regardless of whether it is Apple hardware, though), and we also need to make sure that it does not lose performance on hardware that CASP is fast.

Copy link
Owner Author

@taiki-e taiki-e Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to know that the code is being compiled against Asahi Linux at build time?

I can think of is:

  • target_vendor. If custom targets are used.
  • -C target-cpu=apple-*. If the user uses it.

Both cases can be detected at build time, but I would like to know which one Asahi Linux users actually use.

Copy link
Owner Author

@taiki-e taiki-e Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented the latter approach in 5c3a43b.

$ RUSTFLAGS='-C target-cpu=apple-m1' cargo build -vv --target aarch64-unknown-linux-gnu | grep portable_atomic_ll_sc_rmw
[portable-atomic 1.4.3] cargo:rustc-cfg=portable_atomic_ll_sc_rmw(build)                                                                                                    

EDIT: Published in 1.5.0.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@teohhanhui Was the above approach sufficient for you? Or was it not?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not using any RUSTFLAGS, so...

Copy link
Owner Author

@taiki-e taiki-e Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear:

  • Those specified in .cargo/config (or .cargo/config.toml) are also taken into account.
  • With the builtin aarch64-unknown-linux-{gnu,musl} target, if you specify neither target-cpu nor target-feature, the default one (LL/SC) will be used and you should get good performance on apple hardwere.

@taiki-e taiki-e added O-aarch64 Target: Armv8-A, Armv8-R, or later processors in AArch64 mode and removed O-arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state labels Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-aarch64 Target: Armv8-A, Armv8-R, or later processors in AArch64 mode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants