Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement cong in Julia #50203

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Implement cong in Julia #50203

wants to merge 4 commits into from

Conversation

vchuravy
Copy link
Member

Moving cong to Julia safes a tiny bit in overhead,
but accessing the PTLS rngseed for the scheduler is a bit... ugly.

@gbaraldi
Copy link
Member

Can we do something that doesn't need the PTLS?

@vchuravy
Copy link
Member Author

I think the challenge is that we want randomness that is independent of the task state, since we don't have "scheduler state" outside of the ptls that's the only option.

I am more annoyed how slow this is due to the mod operation.

@oscardssmith
Copy link
Member

Can we guarantee that the size of the heaps are powers of 2? If so, you can just use an &.

@vchuravy
Copy link
Member Author

It's the number of heaps and not the size of them, and that is equal to the number of threads. Which often is a power of two, but not guaranteed

@oscardssmith
Copy link
Member

What about using Base.MultiplicativeInverses to precompute the multiplicativeinverse of the number of threads?

Base automatically changed from vc/stable_wait to master July 5, 2023 10:11
base/partr.jl Outdated Show resolved Hide resolved
base/partr.jl Outdated Show resolved Hide resolved
@IanButterworth
Copy link
Sponsor Member

IanButterworth commented Jul 15, 2024

Is this a valid benchmark?

Update: ** not a fair test because the base commit is different **

this PR

% ./julia -t6 -q
julia> function fib(n::Int)
                  if n < 2
                      return n
                  end
                  t = Threads.@spawn fib(n - 2)
                  return fib(n - 1) + fetch(t)
              end
fib (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime fib(30)
  2.098 s (6929222 allocations: 1.13 GiB)
832040

julia> @btime fib(30)
  2.029 s (6896970 allocations: 1.05 GiB)
832040

master b88f64f (via juliaup)

% julia +nightly -t6
...
julia> @btime fib(30)
  748.326 ms (6772487 allocations: 734.01 MiB)
832040

julia> @btime fib(30)
  700.210 ms (6743099 allocations: 659.97 MiB)
832040

@vchuravy
Copy link
Member Author

Probably. I haven't checked the type-stability of this code after the recent rewrite. Since the allocation increased I assume we have one.

@IanButterworth do you want to take this over?

base/partr.jl Outdated
cong(max::UInt32) = iszero(max) ? UInt32(0) : jl_rand_ptls(max) + UInt32(1)

function jl_rand_ptls(max::UInt32)
rngseed = Base.unsafe_load(Base.unsafe_convert(Ptr{UInt64}, Core.getptls()), 2) # TODO, less horrid
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vtjnash any ideas on how to do this nicer?

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it so we only unsafe_convert plts once. Is that nice enough?

@IanButterworth
Copy link
Sponsor Member

Doesn't appear to be an allocation within cong. And it's faster.
This PR b5d065a

julia> @btime Base.Partr.cong(UInt32(100))
  2.333 ns (0 allocations: 0 bytes)

master

julia> @btime Base.Partr.cong(UInt32(100))
  5.333 ns (0 allocations: 0 bytes)
0x00000036

However I just locally updated this branch from master and it regressed to

julia> @btime Base.Partr.cong(UInt32(100))
  6.958 ns (0 allocations: 0 bytes)
0x00000061

vchuravy and others added 3 commits July 23, 2024 14:35
Co-authored-by: Ian Butterworth <i.r.butterworth@gmail.com>
@gbaraldi
Copy link
Member

gbaraldi commented Jul 24, 2024

So looking into this together with Ian we discovered that our rng is just very bad. I also discovered that get_ptls is not getting optimized as it should because the declaration is behind an ifdef so the ccall transform doesn't do it.
Finally we also discovered that rand(1:range) performs quite on par with the current cong and better than the corrected one.
The corrected one being
Following the origin in https://opensource.apple.com/source/Libc/Libc-1439.141.1/gen/FreeBSD/arc4random.3.auto.html
instead of the wrong implementation in https://www.pcg-random.org/posts/bounded-rands.html.
Basically the current thing only uses n bits where n is 2^n for minimal value that's largest than the max. Which isn't a lot of bits leading to basically cycling over the numbers.

function _cong(max::UInt64, seed::UInt64)
    if max < 2
        return UInt64(0), seed
    end
    mask = typemax(UInt64)
    zeros = leading_zeros(max)
    mask >>= zeros
    bits = 8 * sizeof(UInt64) - zeros
    while true
        value = UInt64(69069) * seed + UInt64(362437)
        seed = value
        x = seed & mask
        if (x < max)
            return x, seed
        end
        bits_left = zeros
        while bits_left >= bits
            value >>= bits
            x = value & mask
            if x < max
                return x, seed
            end
            bits_left -= bits
        end
    end
    return x, seed
end

@IanButterworth
Copy link
Sponsor Member

IanButterworth commented Jul 24, 2024

To illustrate the periodicity of cong on master.

julia> for i in 2:20
           @info i
           x = [Base.Partr.cong(UInt32(i)) for _ in 1:100];
           display(lineplot(x, width=100, height = 8))
       end
2-6 7-11 12-16
Screenshot 2024-07-23 at 10 26 47 PM Screenshot 2024-07-23 at 10 26 59 PM Screenshot 2024-07-23 at 10 27 15 PM

gbaraldi added a commit that referenced this pull request Sep 9, 2024
Implement optimal uniform random number generator using the method
proposed in swiftlang/swift#39143 based on
OpenSSL's implementation of it in
https://github.com/openssl/openssl/blob/1d2cbd9b5a126189d5e9bc78a3bdb9709427d02b/crypto/rand/rand_uniform.c#L13-L99

This PR also fixes some bugs found while developing it. This is a
replacement for #50203 and fixes
the issues found by @IanButterworth with both rngs

C rng
<img width="1011" alt="image"
src="https://github.com/user-attachments/assets/0dd9d5f2-17ef-4a70-b275-1d12692be060">

New scheduler rng
<img width="985" alt="image"
src="https://github.com/user-attachments/assets/4abd0a57-a1d9-46ec-99a5-535f366ecafa">

~On my benchmarks the julia implementation seems to be almost 50% faster
than the current implementation.~
With oscars suggestion of removing the debiasing this is now almost 5x
faster than the original implementation. And almost fully branchless

We might want to backport the two previous commits since they
technically fix bugs.

---------

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
kshyatt pushed a commit that referenced this pull request Sep 12, 2024
Implement optimal uniform random number generator using the method
proposed in swiftlang/swift#39143 based on
OpenSSL's implementation of it in
https://github.com/openssl/openssl/blob/1d2cbd9b5a126189d5e9bc78a3bdb9709427d02b/crypto/rand/rand_uniform.c#L13-L99

This PR also fixes some bugs found while developing it. This is a
replacement for #50203 and fixes
the issues found by @IanButterworth with both rngs

C rng
<img width="1011" alt="image"
src="https://github.com/user-attachments/assets/0dd9d5f2-17ef-4a70-b275-1d12692be060">

New scheduler rng
<img width="985" alt="image"
src="https://github.com/user-attachments/assets/4abd0a57-a1d9-46ec-99a5-535f366ecafa">

~On my benchmarks the julia implementation seems to be almost 50% faster
than the current implementation.~
With oscars suggestion of removing the debiasing this is now almost 5x
faster than the original implementation. And almost fully branchless

We might want to backport the two previous commits since they
technically fix bugs.

---------

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants