-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimizer: support early finalization #55990
base: master
Are you sure you want to change the base?
Conversation
241dc56
to
522c754
Compare
base/compiler/ssair/passes.jl
Outdated
# where the object’s lifetime ends (`insert_idx`): In such cases, we can’t | ||
# remove the finalizer registration, but we can still insert a `Core.finalize` | ||
# call at `insert_idx` while leaving the registration intact. | ||
newinst = add_flag(NewInstruction(Expr(:call, GlobalRef(Core, :finalize), finalizer_stmt.args[3]), Nothing), flag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can have a surprising performance cost, since running finalizers are normally O(1), but this call is O(n). We would need to reimplement the way finalizers are registered.
To make this particular case legal, we also need to prove that we are only going to run the expected finalizer that it originally detected being registered (no other finalizers can be registered along side it, though those may be considered lifetime escapes already?) as other finalizers might not be allowed to run on this Task. There is also an issue that inference did the analysis in the tls world, but this runs them in the unknown future world.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make this particular case legal, we also need to prove that we are only going to run the expected finalizer that it originally detected being registered (no other finalizers can be registered along side it, though those may be considered lifetime escapes already?
Regarding this, the escape check in sroa_mutables!
ensures that the object in question only has a single finalizer.
There is also an issue that inference did the analysis in the tls world, but this runs them in the unknown future world.
The docstring for finalizer
states:
Note that there is no guaranteed world age for the execution of
f
. It may be called in the world age in which the finalizer was registered or any later world age.
So, it should be acceptable for the world in which the finalizer is inferred and the world in which the finalizer is executed to be different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding this, the escape check in sroa_mutables! ensures that the object in question only has a single finalizer.
Okay good. I wasn't sure if the analysis was total (guaranteed that there was no other finalizer), or simply that it only uses one of the finalizers, depending on what it can detect locally
So, it should be acceptable for the world in which the finalizer is inferred and the world in which the finalizer is executed to be different?
It is acceptable to be different, but therefore not acceptable for inference to make decisions based on its guaranteed behavior, and then to not enforce the world of the runtime. We could probably change finalize
, but the main concern is that finalize
is already slow and undesirable compared to letting the GC do its work to sweep them as a batch, and we don't necessarily want to make finalizer
slower just to support that function. So we can try to forward the function in finalizer
directly here, but we need some (fast) way to also undo that finalizer registration (or delay it until either an error happens or some other branch which won't execute this finalize call insertion location)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it looks like finalize
is pretty slow. I tried to optimize it to some extent in 754a8bd, but it’s still about 2.5 times slower compared to master
in the following benchmark:
mutable struct AtomicCounter
@atomic count::Int
end
const counter = AtomicCounter(0)
function withfinalizer(x)
xs = finalizer(Ref(x)) do obj
Base.@assume_effects :nothrow :notaskstate
@atomic counter.count += obj[]
end
println(devnull, xs[])
return xs[]
end
@benchmark withfinalizer(1)
(even though I confirmed that withfinalizer(1)
certainly hits the fast path added in the commit)
So, as you mentioned, we’ll probably need to implement a way to cancel the finalizer registration or delay the registration itself. The former sounds a bit simpler, so I’ll explore that approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried out the approach of canceling the finalizer registration (27adeb4). What do you think?
54411f1
to
bf08f1e
Compare
Instead of simply inserting mutable struct AtomicCounter
@atomic count::Int
end
const counter = AtomicCounter(0)
const _throw_or_not = Ref(false)
@noinline throw_or_noop() = _throw_or_not[] ? error("") : nothing
function withfinalizer(x)
xs = finalizer(Ref(x)) do obj
Base.@assume_effects :nothrow :notaskstate
@atomic counter.count += obj[]
end
throw_or_noop()
return xs[]+=1
end
@benchmark withfinalizer(0)
|
27adeb4
to
4824376
Compare
} | ||
|
||
// Remove the finalizer for `o` from `ct->ptls->finalizers` and cancel that finalizer. | ||
// `lookup_idx` is the index at which the finalizer was registered in `ct->ptls->finalizers` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will just be a random pointer though. After the next GC we will have no idea where the finalizer index will be without scanning for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. If a GC occurs between the finalizer registration and the call to Core._cancel_finalizer
, this index becomes meaningless, but in that case, the slow path of jl_cancel_finalizer
will handle it by performing a full scan. lookup_idx
is only used for the fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But doesn't that means we expect this to normally hit the slow case, except when microbenchmarking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no. There are plenty of benchmarks (e.g. BigInt) where we expect many objects to die before GC has run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still very unsure that we should do this. It could completely change the order in which finalizers are run, since currently they can be assumed to run in LIFO dominator order (e.g. DFS post-order), but afterwards, we may lose that assumed ordering?
4824376
to
83ba537
Compare
This PR implements an optimization that inlines the target finalizer call and, instead of removing the finalizer registration like before, cancels the registration by calling |
E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible (#55990).
83ba537
to
0001fd2
Compare
6c182a6
to
d3cab63
Compare
finalize
insertiondac914c
to
5dab57d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments from offline discussion:
- we could allow this for any finalizer function, not just ones with favorable effects
- we should usually use the shim in gc-common.c to call the finalizers, so that they also may respect the finalizers-disabled flag. This could even be integrated with the call to cancel it
- when canceling a finalizer, we can use the idx as a hint for the index to find it at and scan backwards from there in the list until we find the right one to cancel
- after cancelling a finalizer, we can trim any trailing null values from the ptls list, to avoid the index from getting too far off in some typical cases
- even with this, we are still guaranteeing that inner objects with finalizers registered before outer objects with finalizers will run in LIFO order, since
finalizer
itself is an escape and must be treated as such (even if it gets removed by optimizations later) and so inner objects cannot die before their containers
Currently, in the finalizer inlining pass, if not all the code between the finalizer registration and the end of the object’s lifetime (i.e., where the finalizer would be inlined) is marked as `:nothrow`, it simply bails out. However, even in such cases, we can insert a `finalize` call at the end of the object’s lifetime, allowing us to call the finalizer early if no exceptions occur. This commit implements this optimization. To do so, it also moves `finalize` to `Core`, so the compiler can handle it directly.
…ize` This seems to perform well. ```julia mutable struct AtomicCounter @atomic count::Int end const counter = AtomicCounter(0) function withfinalizer(x) xs = finalizer(Ref(x)) do obj Base.@assume_effects :nothrow :notaskstate @atomic counter.count += obj[] end println(devnull, xs[]) xs[] = xs[] println(devnull, xs[]) return xs[] end @benchmark withfinalizer(100) ``` > master ``` BenchmarkTools.Trial: 10000 samples with 867 evaluations. Range (min … max): 135.717 ns … 18.617 μs ┊ GC (min … max): 0.00% … 47.51% Time (median): 144.992 ns ┊ GC (median): 0.00% Time (mean ± σ): 192.949 ns ± 687.513 ns ┊ GC (mean ± σ): 11.57% ± 3.23% ▇ ▃█▅▂ ▁▁▂█▇▂████▇▆▅▄▄▄▅▄▄▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂ 136 ns Histogram: frequency by time 201 ns < Memory estimate: 208 bytes, allocs estimate: 7. ``` > this commit ``` BenchmarkTools.Trial: 10000 samples with 964 evaluations. Range (min … max): 83.506 ns … 3.408 μs ┊ GC (min … max): 0.00% … 96.85% Time (median): 88.780 ns ┊ GC (median): 0.00% Time (mean ± σ): 101.797 ns ± 137.067 ns ┊ GC (mean ± σ): 10.26% ± 7.31% ▄█▃▅ ▄▃████▅▅▆▅▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▁▁▁▂▁▂▂▁▂▂ ▃ 83.5 ns Histogram: frequency by time 145 ns < Memory estimate: 208 bytes, allocs estimate: 7. ```
5dab57d
to
4790db8
Compare
Currently, in the finalizer inlining pass, if not all the code between the finalizer registration and the end of the object’s lifetime (i.e., where the finalizer would be inlined) is marked as
:nothrow
, it simply bails out. However, even in such cases, we can insert afinalize
call at the end of the object’s lifetime, allowing us to call the finalizer early if no exceptions occur.This commit implements this optimization. To do so, it also moves
finalize
toCore
, so the compiler can handle it directly.