Implement faster thread local rng for scheduler #55501

gbaraldi · 2024-08-15T17:01:54Z

Implement an optimal uniform random number generator using the method proposed in swiftlang/swift#39143, based on OpenSSL's implementation of it in https://github.com/openssl/openssl/blob/1d2cbd9b5a126189d5e9bc78a3bdb9709427d02b/crypto/rand/rand_uniform.c#L13-L99

This PR also fixes some bugs found while developing it. This is a replacement for #50203 and fixes the issues found by @IanButterworth with both rngs

[Image: C rng]

[Image: New scheduler rng]

~~On my benchmarks the julia implementation seems to be almost 50% faster than the current implementation.~~
With Oscar's suggestion of removing the debiasing, this is now almost 5x faster than the original implementation, and almost fully branchless.

We might want to backport the two previous commits since they technically fix bugs.

oscardssmith · 2024-08-15T17:15:57Z

base/partr.jl

+# this process, so each each word beyond the first has a probability
+# of 2^-32 of not terminating the process.  That is, we're extremely
+# likely to stop very rapidly.
+    for _ in 1:10


I feel like we should be able to delete this loop entirely. Without it, we have a bias of 2^-32 which seems like it should be plenty low for the purposes of guaranteeing uniform scheduling

That's a fair point

This seems uniform enough.
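
For readers following along, here is a minimal sketch of the loop-free reduction being discussed, under stated assumptions. `bounded_from_word` and `bounded_rand` are hypothetical names for illustration and are not the code in base/partr.jl; the sketch also assumes the random word comes from `rand(UInt32)` rather than the scheduler's thread-local state. Each result in `0:(n - 1)` is hit by either `floor(2^32 / n)` or `ceil(2^32 / n)` input words, so any two outcomes differ in probability by at most 2^-32.

```julia
# Hypothetical helper for illustration; not the code in base/partr.jl.
# Maps a uniformly random 32-bit word `r` onto 0:(n - 1) using one widening
# multiply and a shift, with no rejection or retry loop.
function bounded_from_word(r::UInt32, n::UInt32)
    product = UInt64(r) * UInt64(n)   # cannot overflow: both factors < 2^32
    return (product >> 32) % UInt32   # high word == floor(r * n / 2^32)
end

# Assumed usage: draw a fresh word from the default RNG and reduce it.
bounded_rand(n::UInt32) = bounded_from_word(rand(UInt32), n)
```

Dropping the debiasing loop leaves only the multiply and the shift, which is why the reduction itself has no data-dependent branches.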

gbaraldi · 2024-08-15T20:37:28Z

Hmm, this seems to have made i686 very unhappy. OK, the issue is that pointer load: 32-bit has different alignment, which is what is blowing this up.

giordano · 2024-08-16T00:13:47Z

> We might want to backport the two previous commits since they technically fix bugs.

Sounds like that should be a separate PR

IanButterworth · 2024-08-16T14:11:08Z

Potentially naive question, but do we have a good benchmark to demonstrate that a faster scheduler rng actually speeds up scheduling?

I ask because in my testing of #50203 I saw the surprising behavior that the threaded fib got slower as the rng got faster (with no change in allocations):

function fib(n::Int)
    n < 2 && return n
    t = Threads.@spawn fib(n - 2)
    return fib(n - 1) + fetch(t)
end
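
One rough way to time the threaded fib, assuming only Base. The warm-up call and `n = 20` are arbitrary choices for illustration and are not taken from the thread; a single `@elapsed` run is noisy, so treat the number as indicative only.

```julia
# Same recursive definition as in the comment above.
function fib(n::Int)
    n < 2 && return n
    t = Threads.@spawn fib(n - 2)   # one task per recursive call
    return fib(n - 1) + fetch(t)
end

fib(10)                        # warm-up so compilation is not timed
elapsed = @elapsed fib(20)     # wall-clock seconds for one run
println("fib(20) on ", Threads.nthreads(), " thread(s): ", elapsed, " s")
```

Starting Julia with a different `--threads` value and rerunning gives a crude view of how task spawning scales.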

gbaraldi · 2024-08-16T14:12:18Z

I don't think it will affect it too too much. Because if the rand call is super hot then things aren't going so well elsewhere :)

giordano · 2024-08-27T18:12:18Z

There are already merge conflicts

vchuravy · 2024-08-29T20:32:44Z

@nanosoldier runtests(ALL, vs = ":master")

nanosoldier · 2024-09-01T14:53:35Z

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

giordano · 2024-09-01T15:21:09Z

Do we need to run benchmarks, too?

gbaraldi · 2024-09-03T14:14:31Z

I don't think any benchmark is gonna show this, at least no current one. Calling the function is 5x faster; that's all I can say.

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>

oscardssmith reviewed Aug 15, 2024

gbaraldi force-pushed the gb/fast-tls-rng branch from 7223ddf to 261ec6e Compare August 15, 2024 17:23

gbaraldi requested review from vchuravy and IanButterworth August 15, 2024 17:46

nsajko added the domain:randomness label Aug 15, 2024

gbaraldi mentioned this pull request Aug 16, 2024

Fix fast getptls ccall lowering. #55507

Merged

Implement faster thread local rng for the scheduler.

6aa9a35

gbaraldi force-pushed the gb/fast-tls-rng branch from 9fe0fb2 to 6aa9a35 Compare August 16, 2024 13:41

Fix implementation in 32 bit platforms

cd16a49

giordano reviewed Aug 16, 2024

giordano changed the title ~~Implement faster thread local rng for scheduler. Also related fixes~~ Implement faster thread local rng for scheduler Aug 17, 2024

Merge branch 'master' into gb/fast-tls-rng

1d4d210

gbaraldi added and removed the status:merge me label Aug 26, 2024

Update code to use ccalls instead of unsafe_load with fast ccall handling

1039ef4