Add memory barrier to Mutex#unlock on aarch64 #14272
Conversation
This solution is the same as the one used in crystal-lang#13050. The following code is expected to output `1000000` preceded by the time it took to perform it:

```crystal
mutex = Mutex.new
numbers = Array(Int32).new(initial_capacity: 1_000_000)
done = Channel(Nil).new
concurrency = 20
iterations = 1_000_000 // concurrency

concurrency.times do
  spawn do
    iterations.times { mutex.synchronize { numbers << 0 } }
  ensure
    done.send nil
  end
end

start = Time.monotonic
concurrency.times { done.receive }
print Time.monotonic - start
print ' '
sleep 100.milliseconds # Wait just a bit longer to be sure the discrepancy isn't due to a *different* race condition
pp numbers.size
```

Before this commit, on an Apple M1 CPU, the array size would be anywhere from 880k-970k, but I never observed it reach 1M. Here is a sample:

```
$ repeat 20 (CRYSTAL_WORKERS=10 ./mutex_check)
00:00:00.119271625 881352
00:00:00.111249083 936709
00:00:00.102355208 946428
00:00:00.116415166 926724
00:00:00.127152583 899899
00:00:00.097160792 964577
00:00:00.120564958 930859
00:00:00.122803000 917583
00:00:00.093986834 954112
00:00:00.079212333 967772
00:00:00.093168208 953491
00:00:00.102553834 962147
00:00:00.091601625 967304
00:00:00.108157208 954855
00:00:00.080879666 944870
00:00:00.114638042 930429
00:00:00.093617083 956496
00:00:00.112108959 940205
00:00:00.092837875 944993
00:00:00.097882625 916220
```

This indicates that some of the mutex locks were getting through when they should not have been. With this commit, using the exact same parameters (built with `--release -Dpreview_mt` and run with `CRYSTAL_WORKERS=10` to spread out across all 10 cores), these are the results I'm seeing:

```
00:00:00.078898166 1000000
00:00:00.072308084 1000000
00:00:00.047157000 1000000
00:00:00.088043834 1000000
00:00:00.060784625 1000000
00:00:00.067710250 1000000
00:00:00.081070750 1000000
00:00:00.065572208 1000000
00:00:00.065006958 1000000
00:00:00.061041541 1000000
00:00:00.059648291 1000000
00:00:00.078100125 1000000
00:00:00.050676250 1000000
00:00:00.049395875 1000000
00:00:00.069352334 1000000
00:00:00.063897833 1000000
00:00:00.067534333 1000000
00:00:00.070290833 1000000
00:00:00.067361500 1000000
00:00:00.078021833 1000000
```

Note that it's not only correct, but also significantly faster.

Fixes #13055
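As a rough sketch of the approach named in the title (not the actual diff in this PR), borrowing the kind of `Atomic::Ops.fence` call used in #13050: the toy `BarrierLock` below and its `@state` flag are stand-ins for the real `Mutex` internals.

```crystal
# Toy lock, not Crystal's real Mutex. The explicit fence keeps writes made
# inside the critical section from being reordered past the plain (lazy)
# store that releases the lock on weakly ordered CPUs such as aarch64.
class BarrierLock
  def initialize
    @state = Atomic(Int32).new(0)
  end

  def lock
    while @state.swap(1) != 0
      Fiber.yield # spin until the flag reads 0
    end
  end

  def unlock
    {% if flag?(:aarch64) %}
      Atomic::Ops.fence(:sequentially_consistent, false)
    {% end %}
    @state.lazy_set(0)
  end
end
```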
Are you sure this resolves #13055 entirely and there are no other places that may need barriers?
What if you replace the lazy set (…) with an actual atomic operation? Here is for example what the Linux kernel source code (v4.4) has to say:

I assume this holds for ARMv7 CPUs too.

We use sequential consistency instead of acquire/release, but that should only impact performance, and seq-cst is stronger than acquire/release anyway. My understanding is that the atomic is enough as long as we don't break the contract (without a barrier the CPU may reorder the lazy set before we increment the counter).
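A sketch of the alternative raised above, reusing the toy `BarrierLock` from the earlier sketch: drop the explicit fence and the lazy set, and let an ordered atomic store publish the unlock (`Atomic#set` defaults to sequential consistency in Crystal).

```crystal
# Reopens the toy BarrierLock from the earlier sketch with the unlock body
# swapped: the ordering now comes from the atomic store itself.
class BarrierLock
  def unlock
    @state.set(0) # seq-cst store; cannot be reordered before earlier writes
  end
end
```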
@straight-shoota I’m sure that it fixes the issues I’ve observed with thread-safety. If you’re referring to the wording in the title of the PR, I can change it to “add memory barriers” as in #13050.

In my tests last night, that did give me the expected values, but it was slower. I don’t know how much that matters, since correctness > speed (up to a point), but this implementation gave us both.
@jgaskins nice, at least it proves that it's working. The speed improvement with a barrier is weird 🤔 I'd be interested to see the performance impact of using acquire/release semantics on the atomics (without the barrier) instead of sequential consistency 👀
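A sketch of that experiment, assuming the memory-ordering arguments on `Atomic` (`:acquire`, `:release`) available in recent Crystal versions; whether this ends up cheaper than a full barrier on any given core is exactly the open question.

```crystal
# Toy lock using acquire/release orderings instead of the defaults.
class AcqRelLock
  def initialize
    @state = Atomic(Int32).new(0)
  end

  def lock
    # Acquire: accesses inside the critical section cannot be hoisted
    # above the point where the lock is taken.
    while @state.swap(1, :acquire) != 0
      Fiber.yield
    end
  end

  def unlock
    # Release: critical-section writes cannot sink below this store.
    @state.set(0, :release)
  end
end
```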
We might get better performance by using LSE atomics from ARMv8.1. EDIT: confirmed, by default LLVM will generate LL/SC atomics, but enabling the LSE target feature makes it generate LSE atomics instead.
I ran the example code from the PR description on a Neoverse-N1 server 🤩 with 16 worker threads.
With LL/SC atomics (the default):

With LSE atomics:

Takeaways:
NOTE: we might consider enabling LSE by default for AArch64, and having a way to opt out.
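For anyone wanting to reproduce the comparison, a hedged example of the invocation: it assumes Crystal's `--mattr` passthrough for LLVM target features and reuses the `mutex_check` benchmark from the PR description.

```
# LSE build (drop --mattr=+lse for the LL/SC baseline)
crystal build mutex_check.cr --release -Dpreview_mt --mattr=+lse
CRYSTAL_WORKERS=16 ./mutex_check
```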
Weird. With LSE it was slower on my M1 Mac, but ~18% faster than this PR on an Ampere Arm server on Google Cloud (T2A VM, 8 cores), which is fascinating.
The other part of #13055 is …
LGTM then!
Would be nice to have some spec coverage.
There's a spec for it, but CI doesn't use …. There is one CI entry that uses …