Add memory barrier to Mutex#unlock on aarch64 #14272
Conversation
This solution is the same as the one used in crystal-lang#13050. The following code is expected to output `1000000` preceded by the time it took to perform it:

```crystal
mutex = Mutex.new
numbers = Array(Int32).new(initial_capacity: 1_000_000)
done = Channel(Nil).new
concurrency = 20
iterations = 1_000_000 // concurrency

concurrency.times do
  spawn do
    iterations.times { mutex.synchronize { numbers << 0 } }
  ensure
    done.send nil
  end
end

start = Time.monotonic
concurrency.times { done.receive }
print Time.monotonic - start
print ' '
sleep 100.milliseconds # Wait just a bit longer to be sure the discrepancy isn't due to a *different* race condition
pp numbers.size
```

Before this commit, on an Apple M1 CPU, the array size would be anywhere from 880k-970k, but I never observed it reach 1M. Here is a sample:

```
$ repeat 20 (CRYSTAL_WORKERS=10 ./mutex_check)
00:00:00.119271625 881352
00:00:00.111249083 936709
00:00:00.102355208 946428
00:00:00.116415166 926724
00:00:00.127152583 899899
00:00:00.097160792 964577
00:00:00.120564958 930859
00:00:00.122803000 917583
00:00:00.093986834 954112
00:00:00.079212333 967772
00:00:00.093168208 953491
00:00:00.102553834 962147
00:00:00.091601625 967304
00:00:00.108157208 954855
00:00:00.080879666 944870
00:00:00.114638042 930429
00:00:00.093617083 956496
00:00:00.112108959 940205
00:00:00.092837875 944993
00:00:00.097882625 916220
```

This indicates that some of the mutex locks were getting through when they should not have been. With this commit, using the exact same parameters (built with `--release -Dpreview_mt` and run with `CRYSTAL_WORKERS=10` to spread out across all 10 cores), these are the results I'm seeing:

```
00:00:00.078898166 1000000
00:00:00.072308084 1000000
00:00:00.047157000 1000000
00:00:00.088043834 1000000
00:00:00.060784625 1000000
00:00:00.067710250 1000000
00:00:00.081070750 1000000
00:00:00.065572208 1000000
00:00:00.065006958 1000000
00:00:00.061041541 1000000
00:00:00.059648291 1000000
00:00:00.078100125 1000000
00:00:00.050676250 1000000
00:00:00.049395875 1000000
00:00:00.069352334 1000000
00:00:00.063897833 1000000
00:00:00.067534333 1000000
00:00:00.070290833 1000000
00:00:00.067361500 1000000
00:00:00.078021833 1000000
```

Note that it's not only correct, but also significantly faster.

Fixes #13055
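As a rough sketch of the approach named in the title (not the actual diff in this PR), borrowing the kind of `Atomic::Ops.fence` call used in #13050: the toy `BarrierLock` below and its `@state` flag are stand-ins for the real `Mutex` internals.

```crystal
# Toy lock, not Crystal's real Mutex. The explicit fence keeps writes made
# inside the critical section from being reordered past the plain (lazy)
# store that releases the lock on weakly ordered CPUs such as aarch64.
class BarrierLock
  def initialize
    @state = Atomic(Int32).new(0)
  end

  def lock
    while @state.swap(1) != 0
      Fiber.yield # spin until the flag reads 0
    end
  end

  def unlock
    {% if flag?(:aarch64) %}
      Atomic::Ops.fence(:sequentially_consistent, false)
    {% end %}
    @state.lazy_set(0)
  end
end
```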
Are you sure this resolves #13055 entirely and there are no other places that may need barriers?
What if you replace the lazy set (…) with an actual atomic operation? Here is for example what the Linux kernel source code (v4.4) has to say:

I assume this holds for ARMv7 CPUs too.

We use sequential consistency instead of acquire/release, but that should only impact performance, and seq-cst is stronger than acquire/release anyway. My understanding is that the atomic is enough as long as we don't break the contract (without a barrier the CPU may reorder the lazy set before we increment the counter).
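A sketch of the alternative raised above, reusing the toy `BarrierLock` from the earlier sketch: drop the explicit fence and the lazy set, and let an ordered atomic store publish the unlock (`Atomic#set` defaults to sequential consistency in Crystal).

```crystal
# Reopens the toy BarrierLock from the earlier sketch with the unlock body
# swapped: the ordering now comes from the atomic store itself.
class BarrierLock
  def unlock
    @state.set(0) # seq-cst store; cannot be reordered before earlier writes
  end
end
```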
@straight-shoota I’m sure that it fixes the issues I’ve observed with thread-safety. If you’re referring to the wording in the title of the PR, I can change it to “add memory barriers” as in #13050.

In my tests last night, that did give me the expected values, but it was slower. I don’t know how much that matters, since correctness > speed (up to a point), but this implementation gave us both.
@jgaskins nice, at least it proves that it's working. The speed improvement with a barrier is weird 🤔 I'd be interested to see the performance impact of using acquire/release semantics on the atomics (without the barrier) instead of sequential consistency 👀
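A sketch of that experiment, assuming the memory-ordering arguments on `Atomic` (`:acquire`, `:release`) available in recent Crystal versions; whether this ends up cheaper than a full barrier on any given core is exactly the open question.

```crystal
# Toy lock using acquire/release orderings instead of the defaults.
class AcqRelLock
  def initialize
    @state = Atomic(Int32).new(0)
  end

  def lock
    # Acquire: accesses inside the critical section cannot be hoisted
    # above the point where the lock is taken.
    while @state.swap(1, :acquire) != 0
      Fiber.yield
    end
  end

  def unlock
    # Release: critical-section writes cannot sink below this store.
    @state.set(0, :release)
  end
end
```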
We might get better performance by using LSE atomics from ARMv8.1. EDIT: confirmed, by default LLVM will generate LL/SC atomics, but enabling the LSE target feature makes it generate LSE atomics instead.
I ran the example code from the PR description on a Neoverse-N1 server 🤩 with 16 worker threads.
With LL/SC atomics (the default):

With LSE atomics:

Takeaways:
NOTE: we might consider enabling LSE by default for AArch64, and having a way to opt out.
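For anyone wanting to reproduce the comparison, a hedged example of the invocation: it assumes Crystal's `--mattr` passthrough for LLVM target features and reuses the `mutex_check` benchmark from the PR description.

```
# LSE build (drop --mattr=+lse for the LL/SC baseline)
crystal build mutex_check.cr --release -Dpreview_mt --mattr=+lse
CRYSTAL_WORKERS=16 ./mutex_check
```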
Weird. With LSE it was slower on my M1 Mac, but ~18% faster than this PR on an Ampere Arm server on Google Cloud (T2A VM, 8 cores), which is fascinating.
The other part of #13055 is …
LGTM then!
Would be nice to have some spec coverage.
There's a spec for it, but CI doesn't use …. There is one CI entry that uses …