Implement cong in Julia #50203

vchuravy · 2023-06-17T01:38:04Z

Moving cong to Julia safes a tiny bit in overhead,
but accessing the PTLS rngseed for the scheduler is a bit... ugly.

gbaraldi · 2023-06-19T17:05:56Z

Can we do something that doesn't need the PTLS?

vchuravy · 2023-06-19T17:23:16Z

I think the challenge is that we want randomness that is independent of the task state, since we don't have "scheduler state" outside of the ptls that's the only option.

I am more annoyed how slow this is due to the mod operation.

oscardssmith · 2023-06-19T17:51:00Z

Can we guarantee that the size of the heaps are powers of 2? If so, you can just use an &.

vchuravy · 2023-06-19T18:18:06Z

It's the number of heaps and not the size of them, and that is equal to the number of threads. Which often is a power of two, but not guaranteed

oscardssmith · 2023-06-19T20:32:35Z

What about using Base.MultiplicativeInverses to precompute the multiplicativeinverse of the number of threads?

base/partr.jl

IanButterworth · 2024-07-15T17:25:30Z

Is this a valid benchmark?

Update: ** not a fair test because the base commit is different **

this PR

% ./julia -t6 -q
julia> function fib(n::Int)
                  if n < 2
                      return n
                  end
                  t = Threads.@spawn fib(n - 2)
                  return fib(n - 1) + fetch(t)
              end
fib (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime fib(30)
  2.098 s (6929222 allocations: 1.13 GiB)
832040

julia> @btime fib(30)
  2.029 s (6896970 allocations: 1.05 GiB)
832040

master `b88f64f` (via juliaup)

% julia +nightly -t6
...
julia> @btime fib(30)
  748.326 ms (6772487 allocations: 734.01 MiB)
832040

julia> @btime fib(30)
  700.210 ms (6743099 allocations: 659.97 MiB)
832040

vchuravy · 2024-07-15T17:27:48Z

Probably. I haven't checked the type-stability of this code after the recent rewrite. Since the allocation increased I assume we have one.

@IanButterworth do you want to take this over?

vchuravy · 2024-07-15T17:29:25Z

base/partr.jl

+cong(max::UInt32) = iszero(max) ? UInt32(0) : jl_rand_ptls(max) + UInt32(1)
+
+function jl_rand_ptls(max::UInt32)
+    rngseed = Base.unsafe_load(Base.unsafe_convert(Ptr{UInt64}, Core.getptls()), 2) # TODO, less horrid


@vtjnash any ideas on how to do this nicer?

I made it so we only unsafe_convert plts once. Is that nice enough?

IanButterworth · 2024-07-15T18:14:25Z

Doesn't appear to be an allocation within cong. And it's faster.
This PR b5d065a

julia> @btime Base.Partr.cong(UInt32(100))
  2.333 ns (0 allocations: 0 bytes)

master

julia> @btime Base.Partr.cong(UInt32(100))
  5.333 ns (0 allocations: 0 bytes)
0x00000036

However I just locally updated this branch from master and it regressed to

julia> @btime Base.Partr.cong(UInt32(100))
  6.958 ns (0 allocations: 0 bytes)
0x00000061

Co-authored-by: Ian Butterworth <i.r.butterworth@gmail.com>

gbaraldi · 2024-07-24T02:11:52Z

So looking into this together with Ian we discovered that our rng is just very bad. I also discovered that get_ptls is not getting optimized as it should because the declaration is behind an ifdef so the ccall transform doesn't do it.
Finally we also discovered that rand(1:range) performs quite on par with the current cong and better than the corrected one.
The corrected one being
Following the origin in https://opensource.apple.com/source/Libc/Libc-1439.141.1/gen/FreeBSD/arc4random.3.auto.html
instead of the wrong implementation in https://www.pcg-random.org/posts/bounded-rands.html.
Basically the current thing only uses n bits where n is 2^n for minimal value that's largest than the max. Which isn't a lot of bits leading to basically cycling over the numbers.

function _cong(max::UInt64, seed::UInt64)
    if max < 2
        return UInt64(0), seed
    end
    mask = typemax(UInt64)
    zeros = leading_zeros(max)
    mask >>= zeros
    bits = 8 * sizeof(UInt64) - zeros
    while true
        value = UInt64(69069) * seed + UInt64(362437)
        seed = value
        x = seed & mask
        if (x < max)
            return x, seed
        end
        bits_left = zeros
        while bits_left >= bits
            value >>= bits
            x = value & mask
            if x < max
                return x, seed
            end
            bits_left -= bits
        end
    end
    return x, seed
end

IanButterworth · 2024-07-24T02:30:37Z

To illustrate the periodicity of cong on master.

julia> for i in 2:20
           @info i
           x = [Base.Partr.cong(UInt32(i)) for _ in 1:100];
           display(lineplot(x, width=100, height = 8))
       end

2-6	7-11	12-16

@IanButterworth

Implement optimal uniform random number generator using the method proposed in swiftlang/swift#39143 based on OpenSSL's implementation of it in https://github.com/openssl/openssl/blob/1d2cbd9b5a126189d5e9bc78a3bdb9709427d02b/crypto/rand/rand_uniform.c#L13-L99 This PR also fixes some bugs found while developing it. This is a replacement for #50203 and fixes the issues found by @IanButterworth with both rngs C rng <img width="1011" alt="image" src="https://github.com/user-attachments/assets/0dd9d5f2-17ef-4a70-b275-1d12692be060"> New scheduler rng <img width="985" alt="image" src="https://github.com/user-attachments/assets/4abd0a57-a1d9-46ec-99a5-535f366ecafa"> ~On my benchmarks the julia implementation seems to be almost 50% faster than the current implementation.~ With oscars suggestion of removing the debiasing this is now almost 5x faster than the original implementation. And almost fully branchless We might want to backport the two previous commits since they technically fix bugs. --------- Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>

@IanButterworth

Implement optimal uniform random number generator using the method proposed in swiftlang/swift#39143 based on OpenSSL's implementation of it in https://github.com/openssl/openssl/blob/1d2cbd9b5a126189d5e9bc78a3bdb9709427d02b/crypto/rand/rand_uniform.c#L13-L99 This PR also fixes some bugs found while developing it. This is a replacement for #50203 and fixes the issues found by @IanButterworth with both rngs C rng <img width="1011" alt="image" src="https://github.com/user-attachments/assets/0dd9d5f2-17ef-4a70-b275-1d12692be060"> New scheduler rng <img width="985" alt="image" src="https://github.com/user-attachments/assets/4abd0a57-a1d9-46ec-99a5-535f366ecafa"> ~On my benchmarks the julia implementation seems to be almost 50% faster than the current implementation.~ With oscars suggestion of removing the debiasing this is now almost 5x faster than the original implementation. And almost fully branchless We might want to backport the two previous commits since they technically fix bugs. --------- Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>

Base automatically changed from vc/stable_wait to master July 5, 2023 10:11

IanButterworth reviewed Jul 15, 2024

View reviewed changes

base/partr.jl Outdated Show resolved Hide resolved

base/partr.jl Outdated Show resolved Hide resolved

vchuravy commented Jul 15, 2024

View reviewed changes

vchuravy and others added 3 commits July 23, 2024 14:35

Implement cong in Julia

8ce8082

Update impl

16f1be0

Apply suggestions from code review

9e27434

Co-authored-by: Ian Butterworth <i.r.butterworth@gmail.com>

IanButterworth force-pushed the vc/julia_cong branch from b5d065a to f1ea54e Compare July 23, 2024 18:40

only unsafe_convert ptls once

5176f22

IanButterworth force-pushed the vc/julia_cong branch from f1ea54e to 5176f22 Compare July 23, 2024 18:40

gbaraldi mentioned this pull request Aug 15, 2024

Implement faster thread local rng for scheduler #55501

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement cong in Julia #50203

Implement cong in Julia #50203

vchuravy commented Jun 17, 2023

gbaraldi commented Jun 19, 2023

vchuravy commented Jun 19, 2023

oscardssmith commented Jun 19, 2023

vchuravy commented Jun 19, 2023

oscardssmith commented Jun 19, 2023

IanButterworth commented Jul 15, 2024 •

edited

Loading

vchuravy commented Jul 15, 2024

vchuravy Jul 15, 2024

IanButterworth Jul 23, 2024

IanButterworth commented Jul 15, 2024

gbaraldi commented Jul 24, 2024 •

edited

Loading

IanButterworth commented Jul 24, 2024 •

edited

Loading

Implement cong in Julia #50203

Are you sure you want to change the base?

Implement cong in Julia #50203

Conversation

vchuravy commented Jun 17, 2023

gbaraldi commented Jun 19, 2023

vchuravy commented Jun 19, 2023

oscardssmith commented Jun 19, 2023

vchuravy commented Jun 19, 2023

oscardssmith commented Jun 19, 2023

IanButterworth commented Jul 15, 2024 • edited Loading

this PR

master b88f64f (via juliaup)

vchuravy commented Jul 15, 2024

vchuravy Jul 15, 2024

Choose a reason for hiding this comment

IanButterworth Jul 23, 2024

Choose a reason for hiding this comment

IanButterworth commented Jul 15, 2024

gbaraldi commented Jul 24, 2024 • edited Loading

IanButterworth commented Jul 24, 2024 • edited Loading

IanButterworth commented Jul 15, 2024 •

edited

Loading

master `b88f64f` (via juliaup)

gbaraldi commented Jul 24, 2024 •

edited

Loading

IanButterworth commented Jul 24, 2024 •

edited

Loading