
Add memory barriers on lock/unlock of SpinLock #13050

Merged

Conversation

@bcardiff (Member) commented on Feb 7, 2023

Fixes #13010. Credit goes to @ggiraldez

The examples provided in #13010 (comment) and #13010 (comment) work on aarch64-apple-darwin without problems so far.

Something I was thinking: since we now emit explicit memory barriers, maybe we can use a more relaxed ordering instead of Atomic, which enforces sequentially_consistent in the #swap and #get used by the SpinLock. That can come later if needed; I am not sure whether it would improve performance at all.
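
As a rough sketch of that idea (not the current implementation; the class name and field are illustrative, and it assumes Atomic#swap/#get/#set accept an ordering argument as in recent Crystal versions), the spinning itself could use relaxed atomics and rely on the explicit fences for ordering:

```
# Sketch only: test-and-test-and-set spin lock with relaxed atomic accesses,
# relying on explicit full fences for the acquire/release semantics.
class RelaxedSpinLock
  def initialize
    @m = Atomic(Int32).new(0)
  end

  def lock
    # Spin with relaxed ordering; ordering is provided by the fence below.
    while @m.swap(1, :relaxed) != 0
      while @m.get(:relaxed) != 0
      end
    end
    # Barrier after acquiring: protected accesses cannot move before the lock.
    ::Atomic::Ops.fence :sequentially_consistent, false
  end

  def unlock
    # Barrier before releasing: protected writes are visible to the next owner.
    ::Atomic::Ops.fence :sequentially_consistent, false
    @m.set(0, :relaxed)
  end
end
```

Whether that actually beats the current sequentially consistent atomics would need benchmarking.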

@straight-shoota added this to the 1.8.0 milestone on Feb 7, 2023
@bcardiff (Member, Author) commented on Feb 7, 2023

Until a release including this fix is out, it can be applied as a monkey-patch:

class Crystal::SpinLock
  def lock
    previous_def
    ::Atomic::Ops.fence :sequentially_consistent, false 
  end

  def unlock
    ::Atomic::Ops.fence :sequentially_consistent, false 
    previous_def
  end
end
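
To apply it, the reopened class can live in any file that is part of the application's sources (for example a `spin_lock_patch.cr` required from the main file; the file name is just an example). `previous_def` keeps the original lock/unlock behaviour and only adds the fences around it.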

@straight-shoota merged commit 2d49b07 into crystal-lang:master on Feb 8, 2023
jgaskins added a commit to jgaskins/crystal that referenced this pull request on Jan 30, 2024:
This solution is the same as the one used in crystal-lang#13050.
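
For illustration only (this is not the commit's actual diff; it merely mirrors the barrier placement from the SpinLock fix by reopening Mutex and wrapping #lock/#unlock via previous_def):

```
# Illustrative sketch, not the real patch: the same full fences that
# crystal-lang#13050 added around SpinLock's lock/unlock, applied to Mutex.
class Mutex
  def lock
    previous_def
    # Barrier after acquiring: protected accesses cannot be reordered
    # before the lock is taken.
    ::Atomic::Ops.fence :sequentially_consistent, false
  end

  def unlock
    # Barrier before releasing: protected writes become visible to the
    # next fiber/thread that acquires the mutex.
    ::Atomic::Ops.fence :sequentially_consistent, false
    previous_def
  end
end
```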

The following code is expected to output `1000000` preceded by the time
it took to perform it:

```
mutex = Mutex.new
numbers = Array(Int32).new(initial_capacity: 1_000_000)
done = Channel(Nil).new
concurrency = 20
iterations = 1_000_000 // concurrency
concurrency.times do
  spawn do
    iterations.times { mutex.synchronize { numbers << 0 } }
  ensure
    done.send nil
  end
end

start = Time.monotonic
concurrency.times { done.receive }
print Time.monotonic - start
print ' '
sleep 100.milliseconds # Wait just a bit longer to be sure the discrepancy isn't due to a *different* race condition
pp numbers.size
```

Before this commit, on an Apple M1 CPU, the array size would be anywhere
from 880k-970k, but I never observed it reach 1M. Here is a sample:

```
$ repeat 20 (CRYSTAL_WORKERS=10 ./mutex_check)
00:00:00.119271625 881352
00:00:00.111249083 936709
00:00:00.102355208 946428
00:00:00.116415166 926724
00:00:00.127152583 899899
00:00:00.097160792 964577
00:00:00.120564958 930859
00:00:00.122803000 917583
00:00:00.093986834 954112
00:00:00.079212333 967772
00:00:00.093168208 953491
00:00:00.102553834 962147
00:00:00.091601625 967304
00:00:00.108157208 954855
00:00:00.080879666 944870
00:00:00.114638042 930429
00:00:00.093617083 956496
00:00:00.112108959 940205
00:00:00.092837875 944993
00:00:00.097882625 916220
```

This indicates that some lock acquisitions were succeeding when they
should have been blocked. With this commit, using the exact same
parameters (built with `--release -Dpreview_mt` and run with
`CRYSTAL_WORKERS=10` to spread the work across all 10 cores), these are
the results I'm seeing:

```
00:00:00.078898166 1000000
00:00:00.072308084 1000000
00:00:00.047157000 1000000
00:00:00.088043834 1000000
00:00:00.060784625 1000000
00:00:00.067710250 1000000
00:00:00.081070750 1000000
00:00:00.065572208 1000000
00:00:00.065006958 1000000
00:00:00.061041541 1000000
00:00:00.059648291 1000000
00:00:00.078100125 1000000
00:00:00.050676250 1000000
00:00:00.049395875 1000000
00:00:00.069352334 1000000
00:00:00.063897833 1000000
00:00:00.067534333 1000000
00:00:00.070290833 1000000
00:00:00.067361500 1000000
00:00:00.078021833 1000000
```

Note that it's not only correct, but also significantly faster.
Successfully merging this pull request may close these issues.

Bug in atomic operations on aarch64 with multi-threading