
Add memory barrier to Mutex#unlock on aarch64 #14272

Merged (2 commits) on Feb 10, 2024

Conversation

@jgaskins (Contributor) commented Jan 30, 2024

This solution is the same as the one used in #13050.

The following code is expected to output `1000000`, preceded by the time it took to perform it:

```
mutex = Mutex.new
numbers = Array(Int32).new(initial_capacity: 1_000_000)
done = Channel(Nil).new
concurrency = 20
iterations = 1_000_000 // concurrency
concurrency.times do
  spawn do
    iterations.times { mutex.synchronize { numbers << 0 } }
  ensure
    done.send nil
  end
end

start = Time.monotonic
concurrency.times { done.receive }
print Time.monotonic - start
print ' '
sleep 100.milliseconds # Wait just a bit longer to be sure the discrepancy isn't due to a *different* race condition
pp numbers.size
```
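
For reference, building and running the snippet with the flags used throughout this PR might look like the following (the `mutex_check.cr` filename and the `-o` output name are assumptions; the flags themselves come from the description below):

```
$ crystal build --release -Dpreview_mt -o mutex_check mutex_check.cr
$ CRYSTAL_WORKERS=10 ./mutex_check
```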

Before this commit, on an Apple M1 CPU, the array size would be anywhere from 880k to 970k, but I never observed it reach 1M. Here is a sample:

```
$ repeat 20 (CRYSTAL_WORKERS=10 ./mutex_check)
00:00:00.119271625 881352
00:00:00.111249083 936709
00:00:00.102355208 946428
00:00:00.116415166 926724
00:00:00.127152583 899899
00:00:00.097160792 964577
00:00:00.120564958 930859
00:00:00.122803000 917583
00:00:00.093986834 954112
00:00:00.079212333 967772
00:00:00.093168208 953491
00:00:00.102553834 962147
00:00:00.091601625 967304
00:00:00.108157208 954855
00:00:00.080879666 944870
00:00:00.114638042 930429
00:00:00.093617083 956496
00:00:00.112108959 940205
00:00:00.092837875 944993
00:00:00.097882625 916220
```

This indicates that some of the mutex locks were getting through when they should not have been. With this commit, using the exact same parameters (built with `--release -Dpreview_mt` and run with `CRYSTAL_WORKERS=10` to spread the work across all 10 cores), these are the results I'm seeing:

```
00:00:00.078898166 1000000
00:00:00.072308084 1000000
00:00:00.047157000 1000000
00:00:00.088043834 1000000
00:00:00.060784625 1000000
00:00:00.067710250 1000000
00:00:00.081070750 1000000
00:00:00.065572208 1000000
00:00:00.065006958 1000000
00:00:00.061041541 1000000
00:00:00.059648291 1000000
00:00:00.078100125 1000000
00:00:00.050676250 1000000
00:00:00.049395875 1000000
00:00:00.069352334 1000000
00:00:00.063897833 1000000
00:00:00.067534333 1000000
00:00:00.070290833 1000000
00:00:00.067361500 1000000
00:00:00.078021833 1000000
```

Note that it's not only correct, but also significantly faster.

Fixes #13055

@Blacksmoke16 added the kind:bug, topic:stdlib:concurrency, and platform:aarch64 labels on Jan 30, 2024
@straight-shoota (Member)

Are you sure this resolves #13055 entirely and there are no other places that may need barriers?

@beta-ziliani (Member)

👀 @ysbaddaden

@ysbaddaden (Contributor)

@jgaskins

What if you replace the lazy set (`@state.lazy_set(0)`) with an explicit one (`@state.set(0)`)? Do you still need the memory barrier?

Here is, for example, what the Linux kernel source code (v4.4) has to say:

About ARM32:

A memory barrier is required after we get a lock, and before we release it, because V6 CPUs are assumed to have weakly ordered memory.

I assume this holds for V7 CPUs too.

About ARM64:

The memory barriers are implicit with the load-acquire and store-release instructions.

We use sequential consistency instead of acquire/release, but that should only impact performance, and seq-cst is stronger than acquire/release anyway.

My understanding is that the atomic is enough as long as we don't break the contract (without a barrier, the CPU may reorder the lazy set before we increment the counter).
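
To make the two options concrete, here is a minimal Crystal sketch of the unlock shapes being discussed. This is not the actual stdlib source; the `ToyLock` type, its method names, and the exact `Atomic::Ops.fence` call are assumptions used purely for illustration:

```
# Illustrative only: a stripped-down lock showing the two unlock variants.
class ToyLock
  def initialize
    @state = Atomic(Int32).new(0) # 0 = unlocked, 1 = locked
  end

  def lock
    # Spin until we atomically flip the state from 0 to 1.
    until @state.swap(1) == 0
      Fiber.yield
    end
  end

  # Variant A: an explicit atomic store. The sequentially consistent store
  # itself orders the critical-section writes before the release.
  def unlock_with_set
    @state.set(0)
  end

  # Variant B (a sketch of the barrier-plus-lazy-set approach this PR takes):
  # keep the cheap non-atomic store, but emit a barrier first so the CPU
  # cannot reorder the store of 0 ahead of the writes performed while the
  # lock was held. NOTE: the exact fence call here is an assumption.
  def unlock_with_lazy_set
    Atomic::Ops.fence(:sequentially_consistent, false)
    @state.lazy_set(0)
  end
end
```

Whether the plain set or the lazy set plus barrier wins in practice is exactly what the benchmarks further down explore.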

@jgaskins (Contributor, Author)

Are you sure this resolves #13055 entirely and there are no other places that may need barriers?

@straight-shoota I’m sure that it fixes the issues I’ve observed with thread-safety on aarch64 in load tests I’ve performed on my software. I don’t know of a way to prove that it’s fixed in all scenarios.

If you’re referring to the wording in the title of the PR, I can change it to “add memory barriers” as in #13050.

What if you replace the lazy set (@state.lazy_set(0)) with an explicit one set (@state.set(0))? Do you still need the memory barrier?

In my tests last night, that did give me the expected values, but was slower. I don’t know how much that matters since correctness > speed (up to a point), but this implementation gave us both.

@ysbaddaden (Contributor)

@jgaskins nice, at least it proves that it's working. The speed improvement with a barrier is weird 🤔

I'd be interested to see the performance impact when using acquire/release semantics on the atomics (without the barrier) instead of sequential consistency 👀

@ysbaddaden (Contributor) commented Jan 30, 2024

We might get better performance by using LSE atomics from ARMv8.1 (e.g. `ldadda`) instead of the legacy LL/SC (e.g. `ldaxr` + `stxr`). Disassembling cross-compiled objects, I noticed that LLVM generates LL/SC atomics by default. Apparently we can use `--mattr=+lse` to generate LSE atomics 👀

EDIT: confirmed, by default LLVM will generate LL/SC atomics, but specifying `--mattr=+lse` will use the LSE atomics instead. I'm looking into the fix & performance issue.
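
For anyone reproducing the numbers below, the LSE build is presumably just the same invocation with the extra target attribute; a sketch (source filename assumed):

```
$ crystal build --release -Dpreview_mt --mattr=+lse -o mutex_check mutex_check.cr
$ CRYSTAL_WORKERS=16 ./mutex_check
```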

@ysbaddaden (Contributor)

I ran the example code from the PR description on a Neoverse-N1 server 🤩 with 16 worker threads.

  • Crystal 1.11.2 is the stock release (no patches)
  • jgaskins patch is this PR (add a memory barrier to Mutex)
  • ysbaddaden patch is replacing `#lazy_set` with `#set(0)` and removing the barriers in both Mutex and Crystal::SpinLock.

With LL/SC atomics (-Dpreview_mt --release):

| Crystal 1.11.2 (LL/SC) | jgaskins patch (LL/SC) | ysbaddaden patch (LL/SC) |
| --- | --- | --- |
| 00:00:00.385133632 999870 | 00:00:00.499416919 1000000 | 00:00:00.550988029 1000000 |
| 00:00:00.371160988 999891 | 00:00:00.482909860 1000000 | 00:00:00.490515706 1000000 |
| 00:00:00.452127314 999990 | 00:00:00.343948665 1000000 | 00:00:00.439060716 1000000 |
| 00:00:00.347059963 999991 | 00:00:00.434351488 1000000 | 00:00:00.434563770 1000000 |
| 00:00:00.455184212 999994 | 00:00:00.440151883 1000000 | 00:00:00.452553157 1000000 |
| 00:00:00.484056906 999895 | 00:00:00.526242680 1000000 | 00:00:00.390584986 1000000 |
| 00:00:00.516859382 999990 | 00:00:00.376327340 1000000 | 00:00:00.446540081 1000000 |
| 00:00:00.536798222 999931 | 00:00:00.475414134 1000000 | 00:00:00.468770775 1000000 |
| 00:00:00.451565270 999997 | 00:00:00.426166719 1000000 | 00:00:00.440710567 1000000 |
| 00:00:00.449864220 999828 | 00:00:00.400186963 1000000 | 00:00:00.423722185 1000000 |
| avg: 0.444981010 | avg: 0.440511665 | avg: 0.453800997 |

With LSE atomics (-Dpreview_mt --release --mattr=+lse):

| Crystal 1.11.2 (LSE) | jgaskins patch (LSE) | ysbaddaden patch (LSE) |
| --- | --- | --- |
| 00:00:00.216061856 999332 | 00:00:00.249694139 1000000 | 00:00:00.240949127 1000000 |
| 00:00:00.219081074 993259 | 00:00:00.239127756 1000000 | 00:00:00.226352440 1000000 |
| 00:00:00.215822334 992114 | 00:00:00.248496972 1000000 | 00:00:00.246995643 1000000 |
| 00:00:00.221506808 989608 | 00:00:00.239560918 1000000 | 00:00:00.211989393 1000000 |
| 00:00:00.220899165 994043 | 00:00:00.235796576 1000000 | 00:00:00.234099366 1000000 |
| 00:00:00.217702506 992565 | 00:00:00.231499750 1000000 | 00:00:00.236619821 1000000 |
| 00:00:00.213177758 995030 | 00:00:00.242951419 1000000 | 00:00:00.261796132 1000000 |
| 00:00:00.231702350 990755 | 00:00:00.243229460 1000000 | 00:00:00.234265326 1000000 |
| 00:00:00.217356223 994786 | 00:00:00.256804222 1000000 | 00:00:00.248510852 1000000 |
| 00:00:00.219489556 997060 | 00:00:00.243536342 1000000 | 00:00:00.240165962 1000000 |
| avg: 0.219279962 | avg: 0.243069755 | avg: 0.238174406 |

Takeaways:

  • Locks on Crystal 1.11.2 are completely off 😱
  • LSE atomics are dramatically faster than LL/SC with 16 threads (they were similar with 4 threads) 🚀
  • Memory barriers are indeed implicit on AArch64 👍
  • I don't see a noticeable performance difference between lazy set + memory barriers and a proper set with no barriers (on either acquire or release) on the Neoverse N1;
  • I'm eager to compare acquire/release vs seq-cst memory orders 👀

NOTE: we might consider enabling LSE by default for AArch64, having a `-Dwithout_lse_atomics` flag, and/or checking the CPU flags to see if the feature is available at compile time.
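
On Linux/AArch64, one way to check for the feature at build time is to look for the `atomics` entry in the `Features` line of `/proc/cpuinfo`, which is assumed here to correspond to the ARMv8.1 LSE extension:

```
$ grep -m1 -o atomics /proc/cpuinfo && echo "LSE (ARMv8.1 atomics) available"
```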

@jgaskins (Contributor, Author)

Weird. With LSE it was slower on my M1 Mac, but ~18% faster than this PR on an Ampere Arm server on Google Cloud (T2A VM, 8 cores), which is fascinating.

| jgaskins | ysbaddaden |
| --- | --- |
| 00:00:00.408768519 1000000 | 00:00:00.329878465 1000000 |
| 00:00:00.335315272 1000000 | 00:00:00.268396861 1000000 |
| 00:00:00.360981995 1000000 | 00:00:00.272967381 1000000 |
| 00:00:00.324802272 1000000 | 00:00:00.330465706 1000000 |
| 00:00:00.321439072 1000000 | 00:00:00.275313622 1000000 |
| 00:00:00.349576434 1000000 | 00:00:00.364220269 1000000 |
| 00:00:00.439792642 1000000 | 00:00:00.365180429 1000000 |
| 00:00:00.312492590 1000000 | 00:00:00.339958506 1000000 |
| 00:00:00.363270475 1000000 | 00:00:00.338208266 1000000 |
| 00:00:00.523387650 1000000 | 00:00:00.285177182 1000000 |
| avg: 00:00:00.37398269209999996 | avg: 00:00:00.3169766687 |

@HertzDevil (Contributor)

The other part of #13055 is `Crystal::RWLock`, which is only used for garbage collection.

@straight-shoota (Member) left a comment

LGTM then!

@Sija (Contributor) commented Feb 1, 2024

Would be nice to have some spec coverage.

@jgaskins (Contributor, Author) commented Feb 2, 2024

There's a spec for it, but CI doesn't use `-Dpreview_mt`, so it doesn't catch this.

There is one CI entry that uses `-Dpreview_mt`, but it's Linux x86_64-only.

@jgaskins changed the title from "Fix Mutex on aarch64" to "Add memory barrier to Mutex#unlock on aarch64" on Feb 8, 2024
@straight-shoota added this to the 1.12.0 milestone on Feb 8, 2024
@straight-shoota merged commit db67d71 into crystal-lang:master on Feb 10, 2024
57 checks passed
@jgaskins deleted the fix-mutex-on-aarch64 branch on February 12, 2024
Labels: kind:bug, platform:aarch64, topic:multithreading, topic:stdlib:concurrency
Linked issue: #13055 (More places might need memory barriers on AArch64)