optimizer: support early finalization #55990

aviatesk · 2024-10-04T17:51:21Z

Currently, in the finalizer inlining pass, if not all the code between the finalizer registration and the end of the object’s lifetime (i.e., where the finalizer would be inlined) is marked as :nothrow, it simply bails out. However, even in such cases, we can insert a finalize call at the end of the object’s lifetime, allowing us to call the finalizer early if no exceptions occur.

This commit implements this optimization. To do so, it also moves finalize to Core, so the compiler can handle it directly.

vtjnash · 2024-10-04T19:05:32Z

base/compiler/ssair/passes.jl

+            # where the object’s lifetime ends (`insert_idx`): In such cases, we can’t
+            # remove the finalizer registration, but we can still insert a `Core.finalize`
+            # call at `insert_idx` while leaving the registration intact.
+            newinst = add_flag(NewInstruction(Expr(:call, GlobalRef(Core, :finalize), finalizer_stmt.args[3]), Nothing), flag)


This can have a surprising performance cost, since running finalizers are normally O(1), but this call is O(n). We would need to reimplement the way finalizers are registered.

To make this particular case legal, we also need to prove that we are only going to run the expected finalizer that it originally detected being registered (no other finalizers can be registered along side it, though those may be considered lifetime escapes already?) as other finalizers might not be allowed to run on this Task. There is also an issue that inference did the analysis in the tls world, but this runs them in the unknown future world.

To make this particular case legal, we also need to prove that we are only going to run the expected finalizer that it originally detected being registered (no other finalizers can be registered along side it, though those may be considered lifetime escapes already?

Regarding this, the escape check in sroa_mutables! ensures that the object in question only has a single finalizer.

There is also an issue that inference did the analysis in the tls world, but this runs them in the unknown future world.

The docstring for finalizer states:

Note that there is no guaranteed world age for the execution of f. It may be called in the world age in which the finalizer was registered or any later world age.

So, it should be acceptable for the world in which the finalizer is inferred and the world in which the finalizer is executed to be different?

Regarding this, the escape check in sroa_mutables! ensures that the object in question only has a single finalizer.

Okay good. I wasn't sure if the analysis was total (guaranteed that there was no other finalizer), or simply that it only uses one of the finalizers, depending on what it can detect locally

So, it should be acceptable for the world in which the finalizer is inferred and the world in which the finalizer is executed to be different?

It is acceptable to be different, but therefore not acceptable for inference to make decisions based on its guaranteed behavior, and then to not enforce the world of the runtime. We could probably change finalize, but the main concern is that finalize is already slow and undesirable compared to letting the GC do its work to sweep them as a batch, and we don't necessarily want to make finalizer slower just to support that function. So we can try to forward the function in finalizer directly here, but we need some (fast) way to also undo that finalizer registration (or delay it until either an error happens or some other branch which won't execute this finalize call insertion location)

Yeah, it looks like finalize is pretty slow. I tried to optimize it to some extent in 754a8bd, but it’s still about 2.5 times slower compared to master in the following benchmark:

mutable struct AtomicCounter @atomic count::Int end const counter = AtomicCounter(0) function withfinalizer(x) xs = finalizer(Ref(x)) do obj Base.@assume_effects :nothrow :notaskstate @atomic counter.count += obj[] end println(devnull, xs[]) return xs[] end @benchmark withfinalizer(1)

(even though I confirmed that withfinalizer(1) certainly hits the fast path added in the commit)

So, as you mentioned, we’ll probably need to implement a way to cancel the finalizer registration or delay the registration itself. The former sounds a bit simpler, so I’ll explore that approach.

I tried out the approach of canceling the finalizer registration (27adeb4). What do you think?

aviatesk · 2024-10-10T07:43:12Z

Instead of simply inserting finalize, I changed the approach to inline the finalizer body and cancel the finalizer registration (by preparing new builtin Core._cancel_finalizer).
It looks like this PR is now achieving pretty good performance.

mutable struct AtomicCounter
    @atomic count::Int
end
const counter = AtomicCounter(0)
const _throw_or_not = Ref(false)
@noinline throw_or_noop() = _throw_or_not[] ? error("") : nothing
function withfinalizer(x)
    xs = finalizer(Ref(x)) do obj
        Base.@assume_effects :nothrow :notaskstate
        @atomic counter.count += obj[]
    end
    throw_or_noop()
    return xs[]+=1
end
@benchmark withfinalizer(0)

master

julia> @benchmark withfinalizer(1)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  11.177 ns … 124.200 μs  ┊ GC (min … max):  0.00% … 39.95%
 Time  (median):     11.928 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   46.081 ns ±   1.765 μs  ┊ GC (mean ± σ):  30.60% ±  0.84%

   ▃▄▅▇▇█▆▆▄▅▆▆▅▅▃▃▂▁ ▁                 ▁▁ ▁▂▃▁▁▁ ▁            ▃
  ▇██████████████████████▇█▇█▇█▇▇▆▇▇█▇█▇███████████▇▆▇▆▄▆▃▄▁▄▄ █
  11.2 ns       Histogram: log(frequency) by time      17.4 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

this PR

julia> @benchmark withfinalizer(1)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  17.327 ns …  1.416 μs  ┊ GC (min … max): 0.00% … 97.78%
 Time  (median):     18.329 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.802 ns ± 16.372 ns  ┊ GC (mean ± σ):  1.35% ±  1.65%

           ▂█▇▂                                                
  ▂▂▂▂▂▂▃▅▆████▇▅▃▂▂▃▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  17.3 ns         Histogram: frequency by time        22.6 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

vtjnash · 2024-10-10T17:37:15Z

src/gc-common.c

+}
+
+// Remove the finalizer for `o` from `ct->ptls->finalizers` and cancel that finalizer.
+// `lookup_idx` is the index at which the finalizer was registered in `ct->ptls->finalizers`


This will just be a random pointer though. After the next GC we will have no idea where the finalizer index will be without scanning for it.

Yes. If a GC occurs between the finalizer registration and the call to Core._cancel_finalizer, this index becomes meaningless, but in that case, the slow path of jl_cancel_finalizer will handle it by performing a full scan. lookup_idx is only used for the fast path.

But doesn't that means we expect this to normally hit the slow case, except when microbenchmarking?

no. There are plenty of benchmarks (e.g. BigInt) where we expect many objects to die before GC has run.

vtjnash

I am still very unsure that we should do this. It could completely change the order in which finalizers are run, since currently they can be assumed to run in LIFO dominator order (e.g. DFS post-order), but afterwards, we may lose that assumed ordering?

aviatesk · 2024-10-11T08:35:14Z

This PR implements an optimization that inlines the target finalizer call and, instead of removing the finalizer registration like before, cancels the registration by calling Core._cancel_finalizer. With finalizer cancellation, only the target finalizer (and its corresponding object) is removed from the finalizers list, so I believe it leaves the order of the other finalizers unchanged, but am I misunderstanding something here?

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible (#55990).

src/gc-common.c

vtjnash

Comments from offline discussion:

we could allow this for any finalizer function, not just ones with favorable effects
we should usually use the shim in gc-common.c to call the finalizers, so that they also may respect the finalizers-disabled flag. This could even be integrated with the call to cancel it
when canceling a finalizer, we can use the idx as a hint for the index to find it at and scan backwards from there in the list until we find the right one to cancel
after cancelling a finalizer, we can trim any trailing null values from the ptls list, to avoid the index from getting too far off in some typical cases
even with this, we are still guaranteeing that inner objects with finalizers registered before outer objects with finalizers will run in LIFO order, since finalizer itself is an escape and must be treated as such (even if it gets removed by optimizations later) and so inner objects cannot die before their containers

Currently, in the finalizer inlining pass, if not all the code between the finalizer registration and the end of the object’s lifetime (i.e., where the finalizer would be inlined) is marked as `:nothrow`, it simply bails out. However, even in such cases, we can insert a `finalize` call at the end of the object’s lifetime, allowing us to call the finalizer early if no exceptions occur. This commit implements this optimization. To do so, it also moves `finalize` to `Core`, so the compiler can handle it directly.

@atomic

…ize` This seems to perform well. ```julia mutable struct AtomicCounter @atomic count::Int end const counter = AtomicCounter(0) function withfinalizer(x) xs = finalizer(Ref(x)) do obj Base.@assume_effects :nothrow :notaskstate @atomic counter.count += obj[] end println(devnull, xs[]) xs[] = xs[] println(devnull, xs[]) return xs[] end @benchmark withfinalizer(100) ``` > master ``` BenchmarkTools.Trial: 10000 samples with 867 evaluations. Range (min … max): 135.717 ns … 18.617 μs ┊ GC (min … max): 0.00% … 47.51% Time (median): 144.992 ns ┊ GC (median): 0.00% Time (mean ± σ): 192.949 ns ± 687.513 ns ┊ GC (mean ± σ): 11.57% ± 3.23% ▇ ▃█▅▂ ▁▁▂█▇▂████▇▆▅▄▄▄▅▄▄▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂ 136 ns Histogram: frequency by time 201 ns < Memory estimate: 208 bytes, allocs estimate: 7. ``` > this commit ``` BenchmarkTools.Trial: 10000 samples with 964 evaluations. Range (min … max): 83.506 ns … 3.408 μs ┊ GC (min … max): 0.00% … 96.85% Time (median): 88.780 ns ┊ GC (median): 0.00% Time (mean ± σ): 101.797 ns ± 137.067 ns ┊ GC (mean ± σ): 10.26% ± 7.31% ▄█▃▅ ▄▃████▅▅▆▅▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▁▁▁▂▁▂▂▁▂▂ ▃ 83.5 ns Histogram: frequency by time 145 ns < Memory estimate: 208 bytes, allocs estimate: 7. ```

aviatesk requested review from vtjnash and Keno October 4, 2024 17:51

aviatesk force-pushed the avi/early-finalize-insertion branch from 241dc56 to 522c754 Compare October 4, 2024 17:52

vtjnash reviewed Oct 4, 2024

View reviewed changes

aviatesk force-pushed the avi/early-finalize-insertion branch 9 times, most recently from 54411f1 to bf08f1e Compare October 10, 2024 07:26

aviatesk force-pushed the avi/early-finalize-insertion branch 4 times, most recently from 27adeb4 to 4824376 Compare October 10, 2024 16:49

vtjnash reviewed Oct 10, 2024

View reviewed changes

aviatesk force-pushed the avi/early-finalize-insertion branch from 4824376 to 83ba537 Compare October 11, 2024 08:26

aviatesk mentioned this pull request Oct 16, 2024

optimizer: allow EA-powered finalizer inlining #55954

Merged

aviatesk force-pushed the avi/early-finalize-insertion branch from 83ba537 to 0001fd2 Compare October 16, 2024 12:16

vtjnash reviewed Oct 16, 2024

View reviewed changes

src/gc-common.c Outdated Show resolved Hide resolved

aviatesk force-pushed the avi/early-finalize-insertion branch 3 times, most recently from 6c182a6 to d3cab63 Compare October 21, 2024 08:12

aviatesk changed the title ~~optimizer: early finalize insertion~~ optimizer: support early finalization Oct 21, 2024

aviatesk force-pushed the avi/early-finalize-insertion branch 3 times, most recently from dac914c to 5dab57d Compare October 23, 2024 12:42

vtjnash reviewed Oct 23, 2024

View reviewed changes

aviatesk added 5 commits October 29, 2024 19:09

call _cancel_finalizer a bit earlier

084d5bf

ignore ThrownEscape

4d0a5cc

just null out cancelled finalizers

4790db8

aviatesk force-pushed the avi/early-finalize-insertion branch from 5dab57d to 4790db8 Compare October 29, 2024 10:09

pxl-th mentioned this pull request Nov 18, 2024

cuda gpu memory usage increasing in time FluxML/Flux.jl#2523

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimizer: support early finalization #55990

optimizer: support early finalization #55990

aviatesk commented Oct 4, 2024

vtjnash Oct 4, 2024

aviatesk Oct 7, 2024

vtjnash Oct 8, 2024

aviatesk Oct 9, 2024

aviatesk Oct 10, 2024 •

edited

Loading

aviatesk commented Oct 10, 2024

vtjnash Oct 10, 2024

aviatesk Oct 11, 2024

vtjnash Oct 16, 2024

oscardssmith Oct 16, 2024

vtjnash left a comment

aviatesk commented Oct 11, 2024 •

edited

Loading

vtjnash left a comment

optimizer: support early finalization #55990

Are you sure you want to change the base?

optimizer: support early finalization #55990

Conversation

aviatesk commented Oct 4, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aviatesk Oct 10, 2024 • edited Loading

Choose a reason for hiding this comment

aviatesk commented Oct 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vtjnash left a comment

Choose a reason for hiding this comment

aviatesk commented Oct 11, 2024 • edited Loading

vtjnash left a comment

Choose a reason for hiding this comment

aviatesk Oct 10, 2024 •

edited

Loading

aviatesk commented Oct 11, 2024 •

edited

Loading