random: introduce `Sampler` to formalize hooking into rand machinery #23964

rfourquet · 2017-10-02T14:20:00Z

NOTE: this is WIP as currently this incurs a performance regression, unfortunately, so this may have to sleep here for some time.

This is an attempt at making it easier for user code to define rand on custom types. This was also motivated by the tedious code repetition in Base. For each new type for which rand is defined, we had to define all different incantations, in particular array versions.

To define rand on a new type T, it would now be enough to define:

State(rng, x::T), which creates an object (say of type StateT<:State) possibly storing some state which can help speed up generation of multiple values (typically what we do for rand(1:n))
rand(rng, st::StateT).

As in most cases, the state is trivial, there are default fall-back methods (State(rng, x) = StateTrivial(x) where StateTrivial is a predefined type with 1 field which wraps x) which handle point 1 automatically, and allow to simply define rand(rng, st::StateTrivial{T}) (point 2).

The benefit is also that it exposes to the user a way to generate more efficiently (when the state is not trivial) multiple random values, e.g.
st = State(rng, 1:10); while cond; ... y = rand(rng, st)... end.

rng has to be an argument to State as the optimal state could depend on the type of rng.
Note that a subsequent work could be to combine rng and the state into a single object for an easier API: g = Generator(rng, 1:10); ... y = rand(g).

As said, this incurs a penalty for generation from an array: rand(rng, [1, 2, 3]). The reason is that (AFAIU) State(rng, [1, 2, 3]) creates an object StateSimple([1, 2, 3], some_state) which holds a reference to the initial array; it doesn't seem to be a problem of heap allocation, but rather that the creation of the StateSimple object isn't currently optimized out. This question is clearly not in my comfort zone, and I would love if someone can help me solve this performance problem.
Note also that in the current state, nanosoldier would report other performance regressions, but which are not real, cf. JuliaCI/BaseBenchmarks.jl#124.

mschauer · 2017-10-02T17:28:06Z

Maybe "Sampler" instead of "State" would be more descriptive?

mschauer · 2017-10-03T07:01:48Z

base/random/RNGs.jl

@@ -317,65 +325,63 @@ function rand!(r::MersenneTwister, A::Array{Float64}, n::Int=length(A),
    A
 end

+rand!(r::MersenneTwister, A::Array{Float64}, st::StateTrivial{<:FloatInterval_64}) =
+    _rand!(r, A, length(A), st[])


this is the location of the performance regression? you could try noinline here

(here or in _rand!)

Thanks for looking! Actually the performance hit (by about 1x or 2x) doesn't occur here. I will comment where it happens.

rfourquet · 2017-10-03T11:25:39Z

Maybe "Sampler" instead of "State" would be more descriptive?

Of course the name is up for debate! I'm not familiar with the Distributions package (I don't manage to install it), but would Sampler be compatible with its meaning therein?
At least State carries the notion that some state has to be precomputed.

rfourquet · 2017-10-03T11:44:09Z

base/random/generation.jl

-end
-
-rand!(A::AbstractArray, r::AbstractArray) = rand!(GLOBAL_RNG, A, r)
+State(rng::AbstractRNG, r::AbstractArray) = StateSimple(r, State(rng, 1:length(r)))


@mschauer The performace hit happens between here and below in rand(rng::AbstractRNG, st::StateSimple{<:AbstractArray,<:State}). StateSimple introduces a kind of indirection compared to how this was implemented before (in a call like rand(rng, [1, 2, 3])), which currently is not eliminated/optimized-out (but could be in theory, I think).

it reminded me of https://github.com/JuliaGraphs/LightGraphs.jl/blob/master/src/graphtypes/simplegraphs/simpleedgeiter.jl#L85 where the wrapper SimpleEdgeIter is completely optimized away, but only if has_edge is not inlined.

Oh... I don't understand how this works, but will then try to put some @noinline here or there!

Cargo cult programming :-D

mschauer · 2017-10-03T11:56:21Z

Distributions claimed Sampler in the past, but not anymore, as now all XXXSampler <: Sampleable
We could also take AbstractSampler as abstract type, then there is no issue.

rfourquet · 2017-10-03T12:05:27Z

But is a sampler in the Distributions terminlogy similar to the State objects here? what about what I called Generator in the OP? I also had in mind to eventually define Normal in the Random module, in order to be able to write rand(rng, Normal(), ...); maybe Normal and Exponential would have to be subtypes of a Distribution supertype (the idea is that it's not scalable to define randn/randexp for all possible incantations). I'm saying this because I'm not good with choosing names, but this may have to be taken into account for chosing the new name for State.

vchuravy · 2017-10-03T12:11:25Z

I think State is a good name and as datapoint is what is used by the curand API (on yhe GPU you need to pass the state explicitly).

mschauer · 2017-10-03T12:35:44Z

A state in the context of random number generation usually changes...

mschauer · 2017-10-04T09:12:36Z

I would like to have Distribution - types for distributions sampled in .Random, e.g. Normal(). Similar as types give the Julia/programmers means to talk about variables, Distributions give means to talk about random variables. I would be hesitant to buy into Distributions "univariate", "multivariate", "discrete", "continuous" Distribution trait hierarchy, which is too specific and kind of overkill for the package and more so for .Random. Not easy to resolve.

rfourquet · 2017-10-05T15:00:19Z

So, given that there seemed to be interest and that this work is difficult to rebase (I wrote a version 6 months ago and had to basically rewrite it from scratch), I decided to write a band-aid for the perf regression... and discovered that actually it was caused by calling a less efficient algorithm (it's always easier to accuse the compiler ;-) ).

Basically the problem is that in different places, we use different algorithms depending on whether we generate one or indefinitely many random values. It turned out that the amount of hack necessary to make these optimizations compatible with the State framework was bigger than evolving it so as to formalize this practice. So, now State takes three arguments: State(::AbstractRNG, ::MyNewType, ::Repetition), where Repetition = Union{Val{1}, Val{Inf}} (Int could later on be added to this Union, to allow manage threshold beyond which an algorithm is more performant). To make it easier for the general case, there is the fall-back State(rng, x) = State(rng, x, Val(Inf)). Overall I'm pretty happy with this, and it offers nice flexibility. Demo:

to define rand(::String), it would be enough to write this one definition: State(rng::AbstractRNG, s::AbstractString, n::Repetetion) = State(rng, collect(s), n).
cf. the diff of this PR to see how a version is defined for optimizing the case of generating only one Char: State(::AbstractRNG, str::AbstractString, ::Val{1}) = StateTrivial(str) and then rand(rng, st::StateTrivial{AbstractString})= ....
note that in the case when one wants to generate optimally a very small number of characters: st = State(rng, mystring, Val(1)); while ... y = rand(rng, st) ... end. Currently this is equivalent to calling directly multiple times rand(rng, mystring), but I will add soon a second commit where this becomes more efficient).

Of course the name Repetition and its values can be bikeshed; also I will rename later State to Sampler according to the upvoted suggestion. Also, I wanted to have a fallback State(rng, x, n) = State(x) so that in most cases it's a bit simpler to define the state for a new type, e.g. State(str::AbstractString)=... instead of State(::AbstractRNG, str::AbstractString, ::Repetition)=... when the first and last arguments are not used, but this is the source of too many method ambiguities currently.

The performance seem now to mostly not show regression! But in some cases, there can be like 10% regression, e.g. for IntSet (or 30% for Set but I have a fix), apparently because of some allocation for the creation of a wrapper). I think this is acceptable, in particular for a "marginal" use case. I will add soon a new commit to try optimize few cases using the new tools (the current state is as close as possible to master in terms of the algorithms). Please don't run nanosoldier yet, I have to issue a PR for BaseBenchmarks first.

I think another nice consequence of this framework is to help solving the open problem of making it easier for RNG implementors to re-use the implementation for some basic types (e.g defining rand(Int) in terms of rand(Float64)), using some traits.

So please feel free to review, I would like to not wait too long before merging to avoid too many rebases.

rfourquet · 2017-11-18T13:10:00Z

I rebased and renamed State to Sample as discussed. I'm not so fond of this name, but I would like to move forward with this, as it's blocking other changes. So I prefer smaller and faster iterations; so I postpone the few improvements which would mitigate the very few small regressions (which are within about 10%, and considered noise by BaseBenchmarks), because it can be done later and it changes slighly the algorithms (whereas currently this PR tried to be as much algorithmically-equivalent as master). Also I would tend to think that this internal API should not be guaranted to be stable in version 1.0. My PR to BaseBenchmarks.jl will be soon merged, then we can run nanosoldier here, and then it would be nice to merge. Thanks!

ViralBShah · 2017-11-19T01:57:49Z

cc @andreasnoack

mschauer · 2017-11-19T08:38:12Z

Any particular reason to take Sample instead of Sampler or was that an oversight? Otherwise of course 💯 to move forward.

rfourquet · 2017-11-19T09:17:24Z

Any particular reason to take Sample instead of Sampler or was that an oversight?

Absolutely an oversight, thanks! (so there was a simple reason why I was annoyed with the name in my previous comment ;-) )

rfourquet · 2017-11-19T09:41:09Z

Oh, and I realize that the State variables where named with the obvious abbreviation st. What is a good name for Sampler variables? (sa doesn't sound so good). Not that it matters so much, and I could just keep st.

rfourquet · 2017-11-27T08:27:52Z

@nanosoldier runbenchmarks("random", vs=":master")

nanosoldier · 2017-11-27T09:30:15Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

rfourquet · 2017-11-27T11:13:33Z

Interesting, I can't reproduce locally the last three results from nanosoldier's report (1 improvement and 2 regressions), so re-running this part:
@nanosoldier runbenchmarks("random" && "types" && ("Complex{Int128}" || "Complex{UInt128}" || "Complex{Float64}")), vs=":master").

The Inf regression in terms of memory is because before there was zero allocations.

I will merge within one or two days if no objection.

rfourquet · 2017-11-27T12:04:20Z

@nanosoldier runbenchmarks("random" && "types" && ("Complex{Int128}" || "Complex{UInt128}" || "Complex{Float64}")), vs=":master")

ararslan · 2017-11-27T18:56:27Z

The syntax of that request is off (one too many )s) and it made Nanosoldier error then stop responding to things.

@nanosoldier runbenchmarks("random" && "types" && ("Complex{Int128}" || "Complex{UInt128}" || "Complex{Float64}"), vs=":master")

rfourquet · 2017-11-27T19:06:00Z

Oups sorry if I confused and broke Nanosoldier!

ararslan · 2017-11-27T19:11:04Z

Nah, no worries. Usually Nanosoldier is better at recovering from things like that, so I'm not sure what the deal is.

nanosoldier · 2017-11-27T19:50:51Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

rfourquet · 2017-11-28T12:15:36Z

There is something strange, which would make me believe Nanosoldier inverted the results in the last run: for three of the results appearing in the second benchmark, which correspond to the last three results of the first benchmark, the numbers for time ration are inverted. Which would show some benchmarking consistency if indeed Nanosoldier confused master vs PR. @ararslan could this be possible that this happened?

Still I would be puzzled, as I can't reproduce that locally. Assuming the first report is more correct, I would then be inclined to merge as is, as the two regressions, for rand(Complex{[U]Int128}) seem marginal enough in their usefulness to not warrant further delay (plus I couldn't even work on fixing them as I don't get the same numbers...).

rfourquet · 2017-11-28T12:19:39Z

Even more interesting (or puzzling): @nalimilan reports exactly the same 3 rand! regressions as last Nanosoldier report above, when his PR is not rand-related...

KristofferC · 2017-11-28T13:14:20Z

I remember seeing something similar related to the #20621 PR. After that was merged, in another, unrelated PR there were huge differences in the readuntil benchmarks. That shouldn't have been possible since the merge commit would run with the same version of readuntil. Can't find the comment where I mentioned it now though.

[ci skip] CI already ran.

rfourquet · 2017-11-30T09:48:58Z

CI was all green except for this one which has a too long log, but that I assume unrelated: https://travis-ci.org/JuliaLang/julia/jobs/309063786.
I just re-pushed an updated commit using Sampler instead of State in its message to reflect the change in the content, but used [ci skip] to not re-run CI before merging (except appveyor insists to run).

rfourquet · 2017-11-30T09:51:28Z

And thanks all for feed-back!

KristofferC · 2018-01-17T11:03:46Z

julia> struct F64 <: AbstractFloat
           x::Float64
       end

julia> Base.rand(rng::AbstractRNG, ::Type{F64}) = F64(rand())

julia> rand(F64)
F64(0.4977101346616646)

julia> rand(F64, 2)
ERROR: ArgumentError: Sampler for this object is not defined
Stacktrace:
 [1] Type at ./random/random.jl:106 [inlined]
 [2] rand at ./random/random.jl:193 [inlined]

What should I do? Is the fallback not working properly?

cc @rfourquet

KristofferC · 2018-01-17T11:16:23Z

The error message is worse than a method error. With a method error I would at least know what function was attempted to get called and what the types were.

rfourquet · 2018-01-19T10:07:57Z

What should I do?

By default, generating random floating point values of type T falls-back to generating from a Random.CloseOpen01{T} value; so you have to define e.g. Random.rand(rng::AbstractRNG, ::Random.SamplerTrivial{Random.CloseOpen01{F64}}) = F64(rand(rng)). If you don't want to generate values in [0, 1), you have to define also a Sampler specialization for F64 (if you need to do so, I will add an example here).

The error message is worse than a method error

You seem to assume that this PR intentionally by-passed what would naturally be a default MethodError, but this is not the case: to make rand work on custom values as simple as possible in most cases, a not-so-trivial machinery involves defining default Sampler fallbacks for types and values, on which rand is called; if such rand is not defined for this Sampler object sp, the system tries to get a Sampler out of sp, which would turn into an infinite loop if the method throwing the error message you see was not defined. If there was a way in Julia to define a function on any arguments excluding a certain type (Sampler in this case), then the default method error would do the job and the custom message here would be un-necessary. I should certainly improve the error message though.

KristofferC · 2018-01-19T10:19:06Z

(if you need to do so, I will add an example here).

Getting personal assistance from a dev is not something we can typically expect people to get. If the whole Sampler / CloseOpen01 etc is a public API then it needs docs + examples.

so you have to define e.g. Random.rand(rng::AbstractRNG, ::Random.SamplerTrivial{Random.CloseOpen01{F64}}) = F64(rand(rng))

This seems like a severe regression in the API for the simple case. Is there any way to have fallbacks so previous definitions keep working?

rfourquet · 2018-01-19T10:44:50Z

If the whole Sampler / CloseOpen01 etc is a public API then it needs docs + examples.

Sure; but as I said in the issue you opened, nothing was documented before, and the system was fragile and not designed for extension (though it worked somewhat in some cases).

This seems like a severe regression in the API for the simple case.

having to write ::Random.SamplerTrivial{Random.CloseOpen01{F64}}) instead of ::Random.CloseOpen{F64}, or even ::Type{F64} is not that severe, when you consider the benefit of the new system. If F64 was not a floating point type, you would have to define rand for ::Random.SamplerType{F64}, which is quite concise, and not so much more verbose than ::Type{F64}.

Is there any way to have fallbacks so previous definitions keep working?

I tried to have that of course, but this ends up in mutually recursive function calls which will stack overflow if no definition is provided for the custom type. I.e. (with some simplification, but it's the idea) : rand(::Type{F64})), if not defined directly, will call rand(::SamplerType{F64}), which would have to call, if not defined, rand(F64) etc. Why should we call rand(::SamplerType{F64}) in the first place? because the system is based on this, and for example rand!(a::Array{F64}) will fill up a with a loop calling rand(SamplerType{F64}).

rfourquet · 2018-01-19T10:49:43Z

Getting personal assistance from a dev is not something we can typically expect people to get

I was just offering you my assistance in case you must update a package or something to 0.7 before I have a chance to document the new system. I didn't mean this is my solution to the documentation issue :)

KristofferC · 2018-01-19T10:51:41Z

I was just offering you my assistance in case you must update a package or something to 0.7 before I have a chance to document the new system.

I see, thank you!

mschauer · 2018-01-19T11:04:01Z

Can Random.rand(rng::AbstractRNG, ::Random.SamplerTrivial{Random.CloseOpen01{T}}) where {T<:AbstractFloat} = rand(rng, T)
be defined as general fallback? The documentation says rand samples from "[0, 1) for floating point numbers".

rfourquet · 2018-01-20T09:38:07Z

(I edited your post, the Random.rand(rng::AbstractRNG, :: was missing)

Can Random.rand(rng::AbstractRNG, ::Random.SamplerTrivial{Random.CloseOpen01{T}}) where {T<:AbstractFloat} = rand(rng, T)
be defined as general fallback?

Not in the current state, as this will also lead to a stackoverflow problem (if rand(rng, T) is not defined). I will check if this could be changed to only having to define Random.rand(rng::AbstractRNG, ::Random.SamplerTrivial{T}), i.e. not having to write CloseOpen01{T}, when T <: AbstractFloat. But even if it stays like this, the current way is not such a big deal in my opinion: I think very few people will end up defining rand for a custom floating point type, it's not something people do on a regular basis.

rfourquet added the randomness Random number generation and the Random stdlib label Oct 2, 2017

rfourquet changed the title ~~random: introduce State to formalize hooking into rand machinery~~ WIP: random: introduce State to formalize hooking into rand machinery Oct 2, 2017

mschauer reviewed Oct 3, 2017

View reviewed changes

rfourquet commented Oct 3, 2017

View reviewed changes

rfourquet force-pushed the rf/rand/state branch from f1296de to 004b12c Compare October 5, 2017 15:00

rfourquet force-pushed the rf/rand/state branch from 004b12c to 04c75b1 Compare October 5, 2017 15:48

rfourquet force-pushed the rf/rand/state branch from 04c75b1 to 1a6807a Compare November 18, 2017 13:01

ViralBShah requested a review from andreasnoack November 19, 2017 01:57

rfourquet force-pushed the rf/rand/state branch from 9dbc720 to e7afa5f Compare November 27, 2017 10:57

rfourquet changed the title ~~WIP: random: introduce State to formalize hooking into rand machinery~~ random: introduce State to formalize hooking into rand machinery Nov 29, 2017

rfourquet force-pushed the rf/rand/state branch 4 times, most recently from 04b519b to 3b9db7c Compare November 30, 2017 09:47

random: introduce Sampler to formalize hooking into rand machinery

7497379

[ci skip] CI already ran.

rfourquet force-pushed the rf/rand/state branch from 3b9db7c to 7497379 Compare November 30, 2017 09:48

rfourquet changed the title ~~random: introduce State to formalize hooking into rand machinery~~ random: introduce Sampler to formalize hooking into rand machinery Nov 30, 2017

rfourquet merged commit 20d30d9 into master Nov 30, 2017

rfourquet deleted the rf/rand/state branch November 30, 2017 09:50

This was referenced Nov 30, 2017

faster rand! for Dict, Set, BitSet #24866

Merged

WIP/RFC: random unleashed #24912

Closed

rfourquet mentioned this pull request Jan 19, 2018

How to define random functions on your own types need to be better documented. #25605

Closed

KristofferC mentioned this pull request Mar 20, 2018

bitrand seems to be much slower on 0.7 #26505

Closed

random: introduce Sampler to formalize hooking into rand machinery #23964

random: introduce Sampler to formalize hooking into rand machinery #23964

Conversation

rfourquet commented Oct 2, 2017 • edited Loading

mschauer commented Oct 2, 2017

mschauer Oct 3, 2017

Choose a reason for hiding this comment

mschauer Oct 3, 2017

Choose a reason for hiding this comment

rfourquet Oct 3, 2017

Choose a reason for hiding this comment

rfourquet commented Oct 3, 2017

rfourquet Oct 3, 2017

Choose a reason for hiding this comment

mschauer Oct 3, 2017

Choose a reason for hiding this comment

rfourquet Oct 3, 2017

Choose a reason for hiding this comment

mschauer Oct 3, 2017

Choose a reason for hiding this comment

mschauer commented Oct 3, 2017

rfourquet commented Oct 3, 2017

vchuravy commented Oct 3, 2017

mschauer commented Oct 3, 2017

mschauer commented Oct 4, 2017 • edited Loading

rfourquet commented Oct 5, 2017

rfourquet commented Nov 18, 2017

ViralBShah commented Nov 19, 2017

mschauer commented Nov 19, 2017

rfourquet commented Nov 19, 2017

rfourquet commented Nov 19, 2017

rfourquet commented Nov 27, 2017

nanosoldier commented Nov 27, 2017

rfourquet commented Nov 27, 2017

rfourquet commented Nov 27, 2017

ararslan commented Nov 27, 2017

rfourquet commented Nov 27, 2017

ararslan commented Nov 27, 2017

nanosoldier commented Nov 27, 2017

rfourquet commented Nov 28, 2017

rfourquet commented Nov 28, 2017

KristofferC commented Nov 28, 2017

rfourquet commented Nov 30, 2017

rfourquet commented Nov 30, 2017

KristofferC commented Jan 17, 2018 • edited Loading

KristofferC commented Jan 17, 2018

rfourquet commented Jan 19, 2018

KristofferC commented Jan 19, 2018

rfourquet commented Jan 19, 2018

rfourquet commented Jan 19, 2018 • edited Loading

KristofferC commented Jan 19, 2018

mschauer commented Jan 19, 2018 • edited by rfourquet Loading

rfourquet commented Jan 20, 2018

random: introduce `Sampler` to formalize hooking into rand machinery #23964

random: introduce `Sampler` to formalize hooking into rand machinery #23964

rfourquet commented Oct 2, 2017 •

edited

Loading

mschauer commented Oct 4, 2017 •

edited

Loading

KristofferC commented Jan 17, 2018 •

edited

Loading

rfourquet commented Jan 19, 2018 •

edited

Loading

mschauer commented Jan 19, 2018 •

edited by rfourquet

Loading