Add bit to the GC tag to turn GC in image search O(1) and also increase gc interval when encountering many pointers #49185
Conversation
LGTM
That was way less work than I expected. In the future we should convert the bit to a "permanent" generation. That might be beneficial for things like Symbols? |
I churned quite a bit on trying to add a bit to the gc field, and then Jameson and I decided it wasn't worth it. |
@nanosoldier |
Would be nice to do some cleanup to delete the Eytzinger tree functions and also benchmark it on an OmniPackage-like workload.
Overall, SGTM.
The eytzinger tree functions are still used in staticdata, but I guess I can move them there. |
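For context on the structure being discussed: an Eytzinger layout stores a sorted array in breadth-first heap order, so a binary search walks indices 1, 2k, 2k+1 and the hot nodes near the root share a few cache lines. This is a minimal sketch of the idea, not the actual code in staticdata; the helper names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Build a 1-indexed Eytzinger (BFS heap-order) layout from a sorted array.
   In-order traversal of the implicit tree visits the sorted elements. */
static size_t eytzinger_build(const uintptr_t *sorted, uintptr_t *out,
                              size_t i, size_t k, size_t n)
{
    if (k <= n) {
        i = eytzinger_build(sorted, out, i, 2 * k, n);     /* left subtree  */
        out[k] = sorted[i++];                              /* this node     */
        i = eytzinger_build(sorted, out, i, 2 * k + 1, n); /* right subtree */
    }
    return i;
}

/* Return the index of the smallest element >= x, or 0 if none exists.
   Each step descends to child 2k or 2k+1; the first few levels stay in
   the same few cache lines regardless of the query. */
static size_t eytzinger_lower_bound(const uintptr_t *t, size_t n, uintptr_t x)
{
    size_t k = 1, best = 0;
    while (k <= n) {
        if (t[k] >= x) { best = k; k = 2 * k; }
        else           { k = 2 * k + 1; }
    }
    return best;
}
```

Compared to plain binary search over the sorted array, the layout trades in-order addressing for locality at the top of the tree, which is what makes the side table dcache-friendly.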
On a slightly tweaked version of OmniPackage:
(These packages were causing some version incompatibilities on my machine, so I'm just commenting them out for now.) |
That's...disappointing? |
I think it means that the binary tree search recovered most of the performance, but I like this approach better :) |
It might be interesting to see where we are spending the time now, because it seems GC is now just a component instead of the main thing. |
I think part of that might be that the modifications removed the GPU/ML part of the stack which is a lot of packages. |
So I suspect (with no evidence) that the reason for this is cache misses. Using a bit inside the object requires a load through the pointer to check the bit, which can thrash the cache when many objects are scattered throughout memory. The eytzinger tree, on the other hand, uses a relatively small side table with an icache- and dcache-friendly layout and compares only the pointer's integer representation; it never loads the pointed-to value and therefore won't thrash the cache in the same way.

If this is the case, a bits-friendly benchmark that makes the eytzinger tree look worse would have many super-tiny packages with one object each (since the eytzinger tree stores 2 pointers per package, this makes the side table less cache friendly). Conversely, a tree-friendly benchmark would have one giant sysimg with very many objects (the eytzinger tree won't kill the cache, while the bits need an extra load from every object). |
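The two strategies being contrasted can be sketched as follows. This is an illustration only: the bit position and helper names are assumptions, not the actual runtime code, and the real range lookup binary-searches rather than scanning linearly.

```c
#include <stddef.h>
#include <stdint.h>

#define GC_IN_IMAGE_BIT ((uintptr_t)0x4) /* assumed bit position, for illustration */

typedef struct {
    uintptr_t header; /* type tag plus GC bits, packed in one word */
} tagged_t;

/* Bit strategy: O(1), but must load the header the pointer points to,
   potentially missing cache when objects are scattered through memory. */
static int in_image_by_bit(const tagged_t *o)
{
    return (o->header & GC_IN_IMAGE_BIT) != 0;
}

/* Range strategy: compares only the pointer's integer value against a
   small side table of [start, end) image ranges; never dereferences p. */
static int in_image_by_range(uintptr_t p, const uintptr_t *starts,
                             const uintptr_t *ends, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p >= starts[i] && p < ends[i])
            return 1;
    return 0;
}
```

The cache-behavior argument above falls out of the signatures: `in_image_by_bit` takes a pointer it must dereference, while `in_image_by_range` takes the address as a plain integer.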
With full OmniPackage:

```
# 1.8.5
julia> @time include("src/OmniPackage.jl")
 97.392319 seconds (405.39 M allocations: 23.826 GiB, 4.93% gc time, 8.59% compilation time: 58% of which was recompilation)

# 1.9.0-rc1 with binary search
julia> @time include("src/OmniPackage.jl")
 50.408919 seconds (67.12 M allocations: 4.045 GiB, 3.31% gc time, 3.76% compilation time: 64% of which was recompilation)

# 1.9.0-rc1 with GC tag
julia> @time include("src/OmniPackage.jl")
 49.604748 seconds (67.59 M allocations: 4.092 GiB, 3.22% gc time, 3.76% compilation time: 64% of which was recompilation)
```

It's not really fair to compare the GC time on 1.8 vs 1.9 because the allocations on 1.9 are so much better, but looking only at the total time, 1.9 is doing pretty well. |
@pchintalapudi we do check the same pointer a couple of lines above, so if it's the case that we are messing up the caches, maybe we can be a little more cache friendly here. But since we don't check too much in between, I kind of doubt that the object isn't in the cache. What might happen is that it messes with branch prediction. I will play with VTune a bit and see if it says something. |
It wouldn't be surprising if we had a lot of cache misses coming from the lines:
the |
Yeah, not sure if there's something we could do here? You played a bit with prefetching and other things, so maybe? Also, regarding those numbers I was showing there: I was able to get better performance by tinkering a bit, but we then started doing very large collections, which might increase the memory footprint as a whole. The current thing gets us to a similar place to 1.8. |
@d-netto could I bother you to run the GC benchmarks on this vs master? |
@nanosoldier |
The only question I have for this is that the second GC.gc() call after running the test takes way too long, and it's all in marking. Not sure what could cause that behaviour. It's all in the

```julia
using Random: seed!
seed!(1)

abstract type Cell end

struct CellA <: Cell
    a::Int
end

struct CellB <: Cell
    b::String
end

function fillcells!(mc::Array{Cell})
    for ind in eachindex(mc)
        mc[ind] = ifelse(rand() > 0.5, CellA(ind), CellB(string(ind)))
    end
    return mc
end

mcells = Array{Cell}(undef, 4000, 4000)
t1 = @elapsed fillcells!(mcells)
t2 = @elapsed fillcells!(mcells)
println("filling: $t1 s\nfilling again: $t2 s")
@time GC.gc()
@time GC.gc()
``` |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. |
Just for reference, this also fixes the bad behaviour on #49120 |
Did the second commit in this PR actually get reviewed, or was that tacked on at the end? |
It was confirmed to fix #49120 by making sure that the behavior in #49120 (comment) was fixed. Why, is something going wrong somewhere? |
Nope, I was just curious whether it was included in the benchmarks that had already been run or not. oscar-system/Oscar.jl#2187 is a bit strange, but @gbaraldi's already investigating and the root cause isn't clear yet. |
OmniPackage is currently broken, so I can't really compare to current master.