Performance regression of scalar randn() between Julia 1.4 and 1.5 #37234
Conversation
Should probably at least add a comment to both places about this duplication.
Good point, I added some comments.
```julia
#=
When defining
`@inline randn(rng::AbstractRNG=default_rng()) = _randn(rng, rand(rng, UInt52Raw()))`
the function call to `_randn` is currently not inlined, resulting in slightly worse
```
I was wondering whether it's really `_randn` which fails to be inlined... Looking rapidly at `@code_typed randn(MersenneTwister())` doesn't show a call to `_randn`, if I'm not mistaken. How did you conclude that `_randn` is not inlined?
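For readers who want to reproduce this check, a minimal sketch (the `using InteractiveUtils` is only needed outside the REPL, where `@code_typed` is already available):

```julia
using Random, InteractiveUtils

# Show the optimized, typed IR for the scalar method. If `_randn` survives
# as a separate call, it shows up here as an explicit invoke/call.
@code_typed randn(MersenneTwister())
```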
I deduced it from the `@code_llvm` output (but I might be wrong here; I do not have much experience with these topics). In the Julia 1.5 version, there is the following call:

```llvm
%30 = call double @j__randn_3665(%jl_value_t* nonnull %0, i64 %29) #0
```

In the Julia 1.4 code, this call is missing, but there are about 60 additional lines of LLVM code. I suppose this is the inlined function, but I am not sure about it.
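A sketch of how to inspect this yourself (the numeric suffix in `j__randn_3665` is session-specific and will differ):

```julia
using Random, InteractiveUtils

# Print the LLVM IR for the scalar method. If `_randn` was inlined, its body
# appears directly in this output; if not, an explicit
# `call double @j__randn_...` line shows up instead, as quoted above.
@code_llvm debuginfo=:none randn(MersenneTwister())
```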
Since `_randn` is defined as calling `randn`, the inliner decides to stop. This is reflected in `code_llvm`, but due to the particular way that `code_typed` works and is cached, it won't be represented there.
To fix this, I would suggest moving this code duplication into `randn_unlikely` instead, so that it's the only recursive function, and not all these others.
`randn_unlikely` is the `@noinline` slow path, though. If we moved the code there, we'd have to inline everything?
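For readers following along, a toy sketch of the shape being discussed (hypothetical functions, not the actual Random internals): an `@inline` fast path whose rare branch calls a `@noinline` slow path, which can recurse back into the fast path, creating the cycle the inliner has to break somewhere.

```julia
# Hypothetical illustration of the fast-/slow-path split (not Random's code).
@noinline function slowpath(x::Float64)
    # Rare branch; may call back into the fast path, closing the cycle.
    x <= 0.0 ? fastpath(x + 1.0) : sqrt(x)
end

@inline function fastpath(x::Float64)
    # Common case returns immediately; the unlikely case is kept out of line.
    x > 1.0 ? 0.5 * x : slowpath(x)
end

fastpath(4.0)   # fast path: 2.0
fastpath(0.25)  # slow path: 0.5
```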
This patch looks fine as a last resort, but I would rather postpone merging it until later, before the 1.6 release, to give us a chance to understand better what happens exactly. In case this is considered for backport to 1.5, this commit could be applied directly for the next 1.5.x release, and merged for 1.6 later only if we don't find anything better.
So by looking again at the output on master: […] whereas with your patch, it looks like: […]

I don't really know what it means, but it looks like inference is able to infer "constness" of these arrays, leading to better performance. Would someone know what might cause the difference, and whether it's possible to help inference on the master version without duplicating the code? (maybe @Keno?) If someone wants to try this patch, just do: […]
What happens if you make those arrays tuples? Tuples are immutable and might be handled better.
Good idea, this seems to improve performance a bit on master. But doing the same trick atop this PR also improves things, again thanks to a […]
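To illustrate the tuple suggestion, a minimal sketch with made-up data (the real tables in Random are the ziggurat constants): indexed loads from a `const` global tuple can be constant-folded, whereas an `Array`'s contents stay mutable, so the compiler cannot assume they are constant.

```julia
# Hypothetical lookup tables, not the actual ziggurat constants.
const W_ARRAY = [1.0, 2.0, 4.0]
const W_TUPLE = (1.0, 2.0, 4.0)

# Array contents may be mutated later, so each call reloads from memory.
lookup_array(i) = @inbounds W_ARRAY[i]

# Tuple contents are immutable, so e.g. `lookup_tuple(2)` can fold to `2.0`.
lookup_tuple(i) = @inbounds W_TUPLE[i]
```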
Perhaps interesting. Julia 1.4:

```julia
julia> using Random, BenchmarkTools

julia> const DEFAULTRNG = Random.default_rng();

julia> @benchmark Random.default_rng()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.850 ns (0.00% GC)
  median time:      2.857 ns (0.00% GC)
  mean time:        2.995 ns (0.00% GC)
  maximum time:     5.182 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn(DEFAULTRNG)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.592 ns (0.00% GC)
  median time:      4.061 ns (0.00% GC)
  mean time:        4.077 ns (0.00% GC)
  maximum time:     6.759 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.309 ns (0.00% GC)
  median time:      5.769 ns (0.00% GC)
  mean time:        5.792 ns (0.00% GC)
  maximum time:     19.508 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```

Julia master:

```julia
julia> using Random, BenchmarkTools

julia> const DEFAULTRNG = Random.default_rng();

julia> @benchmark Random.default_rng()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.844 ns (0.00% GC)
  median time:      2.848 ns (0.00% GC)
  mean time:        2.856 ns (0.00% GC)
  maximum time:     12.756 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn(DEFAULTRNG)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.864 ns (0.00% GC)
  median time:      4.233 ns (0.00% GC)
  mean time:        4.255 ns (0.00% GC)
  maximum time:     13.951 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark randn()
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.842 ns (0.00% GC)
  median time:      7.298 ns (0.00% GC)
  mean time:        7.322 ns (0.00% GC)
  maximum time:     11.818 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```
When inference detects a call cycle, one of two things can happen:

1. It determines that, in order for inference to converge, it needs to limit the signatures of some methods to something more general, or
2. the cycle is determined to be harmless at the inference level (because although there is a cycle in the CFG, there is no dependency cycle of type information).

In the first case, we simply disable optimizations, which is sensible: we are likely to have to recompute some information anyway when we actually get there dynamically.

In the second case, however, we do run optimizations, but it's a bit of an unusual case. In particular, inference usually delivers methods to inlining in postorder (meaning callees get optimized before their callers), so that a caller can always inline a callee. However, if there is a cycle, there is of course no unique postorder of functions, since by definition we are looking at a local strongly connected component. In this case, we would essentially just pick an arbitrary order (which moreover depends on the point at which we enter the cycle and is subsequently cached, leading to potential performance instabilities depending on function order). However, the arbitrary order is quite possibly suboptimal. For example, in #36414 we have the cycle randn -> _randn -> randn_unlikely -> rand. In this cycle, the author of the code expected both `randn` and `_randn` to inline and annotated the functions as such. However, in 1.5+ the order we happened to pick would have inlined randn_unlikely into _randn (had it not been marked noinline), with a hard call break between randn and _randn, which is problematic from a performance standpoint.

This PR aims to address this by introducing a heuristic: if some functions in a cycle are marked `@noinline`, we want to make sure to infer these last (since they won't ever be inlined anyway). To ensure this happens, while restoring postorder if this happens to break the cycle, we perform a DFS traversal rooted at any `@noinline` functions and then optimize the functions in the cycle in DFS-postorder. This may of course still not be a true postorder in the inlining graph (if the `@noinline` functions don't break the cycle), but even in that case it should be no worse than the default order.

Fixes #36414
Closes #37234
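A toy sketch of the ordering heuristic described above (a hypothetical helper, not the compiler's actual implementation; the back edge from `rand` to `randn` is assumed here just to close the cycle for illustration):

```julia
# DFS rooted at the `@noinline` members of a call cycle; emitting functions
# in postorder puts callees before callers and the non-inlinable roots last.
function dfs_postorder(callees::Dict{Symbol,Vector{Symbol}}, roots::Vector{Symbol})
    order = Symbol[]
    seen = Set{Symbol}()
    function visit(f::Symbol)
        f in seen && return
        push!(seen, f)
        foreach(visit, get(callees, f, Symbol[]))
        push!(order, f)  # postorder: record a function after its callees
    end
    foreach(visit, roots)
    return order
end

# The cycle from the issue, with an assumed closing edge rand -> randn:
cycle = Dict(:randn => [:_randn], :_randn => [:randn_unlikely],
             :randn_unlikely => [:rand], :rand => [:randn])

dfs_postorder(cycle, [:randn_unlikely])
# => [:_randn, :randn, :rand, :randn_unlikely]  (the @noinline root comes last)
```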
```diff
-@inline randn(rng::AbstractRNG=default_rng()) = _randn(rng, rand(rng, UInt52Raw()))
+@inline function randn(rng::AbstractRNG=default_rng())
+    #=
+    When defining
```
Trailing whitespace here.
```diff
+    =#
+    @inbounds begin
+        r = rand(rng, UInt52Raw())
+
```
and here
Rebased in #39319
The scalar `randn()` function is significantly slower in Julia 1.5 (and master) than it was in Julia 1.4. This was a side effect of the drastic speed-up implemented for randn vectors; see #36414. The underlying issue is that a function call is not inlined.

In this PR, the issue is avoided by redefining the scalar version of `randn()` using the Julia 1.4 code. This brings back the 1.4 performance for `randn()`, but it is not the nicest solution. Unfortunately, I have not found a better one yet.
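For anyone wanting to verify the regression (or this fix) locally, a quick check along the lines of the benchmarks above:

```julia
using Random, BenchmarkTools

# Compare the implicit-RNG and explicit-RNG scalar paths; on an affected
# build, randn() is noticeably slower than it was on Julia 1.4.
rng = Random.default_rng()
@btime randn()
@btime randn($rng)
```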