
Add bit to the GC tag to turn GC in image search O(1) and also increase gc interval when encountering many pointers #49185

Merged (2 commits) Mar 31, 2023

Conversation

@gbaraldi (Member) commented Mar 29, 2023

OmniPackage is currently broken, so I can't really compare to current master.

@vtjnash (Sponsor Member) left a comment

LGTM

@d-netto added the GC (Garbage collector) and pkgimage labels Mar 29, 2023
@vchuravy (Member)

That was way less work than I expected. In the future we should convert the bit to a "permanent" generation. That might be beneficial for things like Symbols?

@gbaraldi (Member, Author)

I churned quite a bit on trying to add a bit to the gc field, and then Jameson and I decided it wasn't worth it.
But yeah, maybe we could set that bit for other permalloc'd things and try to do smarter things.

@vchuravy (Member)

@nanosoldier runtests(configuration = (julia_flags=["--pkgimages=yes"], buildflags=["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"],), vs_configuration = (julia_flags=["--pkgimages=yes"], buildflags = ["LLVM_ASSERTIONS=1", "FORCE_ASSERTIONS=1"],))

@vchuravy requested a review from d-netto March 29, 2023 22:15
@vchuravy added the backport 1.9 label Mar 29, 2023
@d-netto (Member) left a comment

Would be nice to do some cleanup to delete the Eytzinger tree functions, and also to benchmark this on an OmniPackage-like workload.

Overall, SGTM.

@gbaraldi (Member, Author)

The Eytzinger tree functions are still used in staticdata, but I guess I can move them there.

@d-netto (Member) commented Mar 29, 2023

On a slightly tweaked version of OmniPackage:

julia> @time include("src/OmniPackage.jl")
 33.007272 seconds (54.86 M allocations: 2.926 GiB, 16.88% gc time, 13.39% compilation time: 81% of which was recompilation)
Main.OmniPackage
julia> @time include("src/OmniPackage.jl")
 32.287114 seconds (54.80 M allocations: 2.935 GiB, 14.86% gc time, 13.74% compilation time: 80% of which was recompilation)
Main.OmniPackage

@d-netto (Member) commented Mar 29, 2023

diff --git a/src/OmniPackage.jl b/src/OmniPackage.jl
index f671cb6..9b4f8fd 100644
--- a/src/OmniPackage.jl
+++ b/src/OmniPackage.jl
@@ -20,7 +20,7 @@ using AbstractMCMC,
   FFTW,
   FillArrays,
   FiniteDiff,
-  Flux,
+  # Flux,
   ForwardDiff,
   GLM,
   GlobalSensitivity,
@@ -36,13 +36,13 @@ using AbstractMCMC,
   Loess,
   MacroTools,
   Markdown,
-  Makie,
+  # Makie,
   MCMCChains,
   MCMCDiagnosticTools,
   MLJ,
   MLJBase,
-  MLJFlux,
-  MLJXGBoostInterface,
+  # MLJFlux,
+  # MLJXGBoostInterface,
   MacroTools,
   Missings,
   ModelingToolkit,

@d-netto (Member) commented Mar 29, 2023

(These packages were causing some version incompatibilities on my machine, so I'm just commenting them out for now.)

@JeffBezanson (Sponsor Member)

That's...disappointing?

@vchuravy (Member)

> That's...disappointing?

I think it means that the binary tree search recovered most of the performance, but I like this approach better :)

@gbaraldi (Member, Author)

It might be interesting to see where we are spending the time now, because it seems GC is now just a component instead of the main thing.

@oscardssmith (Member)

I think part of that might be that the modifications removed the GPU/ML part of the stack, which is a lot of packages.

@pchintalapudi (Member)

> That's...disappointing?

So I suspect (with no evidence) that the reason for this is cache misses. Using a bit inside the object requires a load through the pointer to check the bit, which can thrash the cache when there are lots of objects scattered throughout memory. On the other hand, the Eytzinger tree uses a relatively small side table with an icache- and dcache-friendly layout, and compares only on the pointer's integer representation, which doesn't load the object and therefore won't thrash the cache in the same way.

If this is the case, a more bits-friendly benchmark that would make the Eytzinger tree look worse is one with many super-tiny packages of one object each (since the Eytzinger tree stores 2 pointers per package, this makes the side table less cache friendly). Conversely, a more tree-friendly benchmark would be a single giant sysimage with many, many objects (since the Eytzinger tree won't kill the cache, while the bit check needs an extra load from every object).
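For concreteness, here is a minimal sketch of the two checks being contrasted above. This is an editor's illustration, not the actual Julia code: the names GC_IN_IMAGE, in_image_bit, img_range_t, and in_image_search are hypothetical, and the real implementation lays out its side table in Eytzinger order rather than as the plain sorted array searched here.

/* Editor's sketch -- hypothetical names, not the actual Julia implementation. */
#include <stddef.h>
#include <stdint.h>

#define GC_IN_IMAGE ((uintptr_t)0x4)  /* hypothetical "allocated in a package image" header bit */

/* Bit check: one load of the object's header word, then a mask test.
   This has to touch the object's own cache line. */
static inline int in_image_bit(const uintptr_t *header)
{
    return (*header & GC_IN_IMAGE) != 0;
}

/* Range search: look up the pointer's integer value in a small sorted
   side table of [start, end) image address ranges. The object itself is
   never dereferenced. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} img_range_t;

static int in_image_search(uintptr_t p, const img_range_t *ranges, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (p < ranges[mid].start)
            hi = mid;
        else if (p >= ranges[mid].end)
            lo = mid + 1;
        else
            return 1;  /* p falls inside an image range */
    }
    return 0;
}

The first check pays for a potentially cold load of the object's header; the second only touches a compact table keyed by the pointer value and never dereferences the object.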

@KristofferC mentioned this pull request Mar 30, 2023
@nanosoldier (Collaborator)

Your job failed. Consult the server logs for more details (cc @maleadt).

@KristofferC (Sponsor Member)

With full Omnipackage:

# 1.8.5
julia> @time include("src/OmniPackage.jl")
 97.392319 seconds (405.39 M allocations: 23.826 GiB, 4.93% gc time, 8.59% compilation time: 58% of which was recompilation)

# 1.9.0-rc1 with binary search
julia> @time include("src/OmniPackage.jl")
 50.408919 seconds (67.12 M allocations: 4.045 GiB, 3.31% gc time, 3.76% compilation time: 64% of which was recompilation)

# 1.9.0-rc1 with GC tag
julia> @time include("src/OmniPackage.jl")
 49.604748 seconds (67.59 M allocations: 4.092 GiB, 3.22% gc time, 3.76% compilation time: 64% of which was recompilation)

It's not really fair to compare GC time on 1.8 vs 1.9 because the allocation counts on 1.9 are so much better, but looking only at the total time, 1.9 is doing pretty well.

@gbaraldi (Member, Author)

@pchintalapudi we do check the same pointer a couple of lines above, so if we are messing up the caches maybe we can be a little more cache friendly here; but since we don't do too much in between, I kind of doubt the object isn't in the cache. What might happen is that it messes with the branch prediction. I will play with VTune a bit and see if it says something.

@gbaraldi (Member, Author) commented Mar 30, 2023

While playing with this I saw a mark phase in the profiler that looked quite odd.

[profiler screenshot of the mark phase]
That big slab is gc_try_claim_and_push.
In particular the lines

    *nptr |= 1;
    if (gc_try_setmark_tag(o, GC_MARKED))

with the big one being *nptr |= 1;.
The call above them in the stack is gc_mark_objarray, which kind of makes sense; this is probably the mfills array from the code in the issue, but I wonder if all those dereferences aren't destroying the cache and whether there's some way of doing this better.
It's kind of the worst case, being about 25 million boxes.

@d-netto (Member) commented Mar 30, 2023

It wouldn't be surprising if we had a lot of cache misses coming from gc_try_claim_and_push, since we need to fetch the object tag when enqueuing it.

In the lines:

if (!gc_old(o->header) && nptr)
    *nptr |= 1;
if (gc_try_setmark_tag(o, GC_MARKED))
    gc_markqueue_push(mq, obj);

the load of o->header introduces a data dependency, so I think the profile could be showing that we are simply stalled in the subsequent instructions, waiting for that cache line to be read from memory.
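As a purely illustrative sketch of the prefetch-ahead idea touched on in the next comment: this is not the actual gc_mark_objarray; mark_one, PF_DIST, and mark_objarray_sketch are hypothetical names, and a plain walk over an array of boxed objects is assumed.

/* Editor's sketch, not the actual gc_mark_objarray: a generic prefetch-ahead
   pattern for walking an array of boxed objects. mark_one() and PF_DIST are
   hypothetical. */
#include <stddef.h>

void mark_one(void *obj);  /* stand-in for "load header, set mark bit, enqueue" */

void mark_objarray_sketch(void **objs, size_t len)
{
    const size_t PF_DIST = 8;  /* how many elements ahead to prefetch */
    for (size_t i = 0; i < len; i++) {
        /* Request the header we will need a few iterations from now, so the
           dependent load of its tag is less likely to stall the loop. */
        if (i + PF_DIST < len && objs[i + PF_DIST] != NULL)
            __builtin_prefetch(objs[i + PF_DIST], 0, 3);
        if (objs[i] != NULL)
            mark_one(objs[i]);
    }
}

The idea is to issue the header load several iterations early so the dependent tag check is less likely to stall; whether this helps in practice depends on how much other work hides the latency.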

@gbaraldi (Member, Author)

Yeah, not sure if there's something we could do here? You played a bit with prefetching and other things, so maybe. Also, about those numbers I was showing: I was able to get better performance by tinkering a bit, but we then started doing very large collections, which might increase the memory footprint as a whole. The current version gets us to a place similar to 1.8.

@gbaraldi (Member, Author)

@d-netto could I bother you to run the GC benchmarks on this vs master?

@gbaraldi (Member, Author)

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@gbaraldi (Member, Author)

The only remaining question I have is that the second GC.gc() call after running the test below takes way too long, and it's all in marking. Not sure what could cause that behaviour. It's all in the mark_objarray call, but I'm not sure why other versions don't show it.

using Random: seed!
seed!(1)

abstract type Cell end

struct CellA<:Cell
    a::Int
end

struct CellB<:Cell
    b::String
end

function fillcells!(mc::Array{Cell})
    for ind in eachindex(mc)
        mc[ind] = ifelse(rand() > 0.5, CellA(ind), CellB(string(ind)))
    end
    return mc
end

mcells = Array{Cell}(undef, 4000, 4000)
t1 = @elapsed fillcells!(mcells)
t2 = @elapsed fillcells!(mcells)

println("filling: $t1 s\nfilling again: $t2 s")

@time GC.gc()
@time GC.gc()

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@gbaraldi (Member, Author)

Just for reference, this also fixes the bad behaviour in #49120.

@KristofferC added the merge me label Mar 31, 2023
@gbaraldi changed the title from "Add bit to the GC tag to turn GC in image search O(1)" to "Add bit to the GC tag to turn GC in image search O(1) and also increase gc interval when encountering many pointers" Mar 31, 2023
@KristofferC merged commit 8b19f5f into JuliaLang:master Mar 31, 2023
@oscardssmith removed the merge me label Mar 31, 2023
@KristofferC removed the backport 1.9 label Mar 31, 2023
@topolarity (Member)

Did the second commit in this PR actually get reviewed, or was that tacked on at the end?

@KristofferC (Sponsor Member)

> Did the second commit in this PR actually get reviewed, or was that tacked on at the end?

It was confirmed to fix #49120 by making sure that the behavior in #49120 (comment) was fixed.

Why, is something going wrong somewhere?

@topolarity (Member)

> Why, is something going wrong somewhere?

Nope, I was just curious whether it was included in the benchmarks that had already been run or not.

oscar-system/Oscar.jl#2187 is a bit strange, but @gbaraldi's already investigating and the root cause isn't clear yet.

Labels: don't squash (Don't squash merge), GC (Garbage collector), pkgimage