Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimizer: support early finalization #55990

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

aviatesk
Copy link
Member

@aviatesk aviatesk commented Oct 4, 2024

Currently, in the finalizer inlining pass, if not all the code between the finalizer registration and the end of the object’s lifetime (i.e., where the finalizer would be inlined) is marked as :nothrow, it simply bails out. However, even in such cases, we can insert a finalize call at the end of the object’s lifetime, allowing us to call the finalizer early if no exceptions occur.

This commit implements this optimization. To do so, it also moves finalize to Core, so the compiler can handle it directly.

@aviatesk aviatesk requested review from vtjnash and Keno October 4, 2024 17:51
@aviatesk aviatesk force-pushed the avi/early-finalize-insertion branch from 241dc56 to 522c754 Compare October 4, 2024 17:52
# where the object’s lifetime ends (`insert_idx`): In such cases, we can’t
# remove the finalizer registration, but we can still insert a `Core.finalize`
# call at `insert_idx` while leaving the registration intact.
newinst = add_flag(NewInstruction(Expr(:call, GlobalRef(Core, :finalize), finalizer_stmt.args[3]), Nothing), flag)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can have a surprising performance cost, since running finalizers are normally O(1), but this call is O(n). We would need to reimplement the way finalizers are registered.

To make this particular case legal, we also need to prove that we are only going to run the expected finalizer that it originally detected being registered (no other finalizers can be registered along side it, though those may be considered lifetime escapes already?) as other finalizers might not be allowed to run on this Task. There is also an issue that inference did the analysis in the tls world, but this runs them in the unknown future world.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this particular case legal, we also need to prove that we are only going to run the expected finalizer that it originally detected being registered (no other finalizers can be registered along side it, though those may be considered lifetime escapes already?

Regarding this, the escape check in sroa_mutables! ensures that the object in question only has a single finalizer.

There is also an issue that inference did the analysis in the tls world, but this runs them in the unknown future world.

The docstring for finalizer states:

Note that there is no guaranteed world age for the execution of f. It may be called in the world age in which the finalizer was registered or any later world age.

So, it should be acceptable for the world in which the finalizer is inferred and the world in which the finalizer is executed to be different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this, the escape check in sroa_mutables! ensures that the object in question only has a single finalizer.

Okay good. I wasn't sure if the analysis was total (guaranteed that there was no other finalizer), or simply that it only uses one of the finalizers, depending on what it can detect locally

So, it should be acceptable for the world in which the finalizer is inferred and the world in which the finalizer is executed to be different?

It is acceptable to be different, but therefore not acceptable for inference to make decisions based on its guaranteed behavior, and then to not enforce the world of the runtime. We could probably change finalize, but the main concern is that finalize is already slow and undesirable compared to letting the GC do its work to sweep them as a batch, and we don't necessarily want to make finalizer slower just to support that function. So we can try to forward the function in finalizer directly here, but we need some (fast) way to also undo that finalizer registration (or delay it until either an error happens or some other branch which won't execute this finalize call insertion location)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it looks like finalize is pretty slow. I tried to optimize it to some extent in 754a8bd, but it’s still about 2.5 times slower compared to master in the following benchmark:

mutable struct AtomicCounter
    @atomic count::Int
end

const counter = AtomicCounter(0)

function withfinalizer(x)
    xs = finalizer(Ref(x)) do obj
        Base.@assume_effects :nothrow :notaskstate
        @atomic counter.count += obj[]
    end
    println(devnull, xs[])
    return xs[]
end

@benchmark withfinalizer(1)

(even though I confirmed that withfinalizer(1) certainly hits the fast path added in the commit)

So, as you mentioned, we’ll probably need to implement a way to cancel the finalizer registration or delay the registration itself. The former sounds a bit simpler, so I’ll explore that approach.

Copy link
Member Author

@aviatesk aviatesk Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried out the approach of canceling the finalizer registration (27adeb4). What do you think?

@aviatesk aviatesk force-pushed the avi/early-finalize-insertion branch 9 times, most recently from 54411f1 to bf08f1e Compare October 10, 2024 07:26
@aviatesk
Copy link
Member Author

Instead of simply inserting finalize, I changed the approach to inline the finalizer body and cancel the finalizer registration (by preparing new builtin Core._cancel_finalizer).
It looks like this PR is now achieving pretty good performance.

mutable struct AtomicCounter
    @atomic count::Int
end
const counter = AtomicCounter(0)
const _throw_or_not = Ref(false)
@noinline throw_or_noop() = _throw_or_not[] ? error("") : nothing
function withfinalizer(x)
    xs = finalizer(Ref(x)) do obj
        Base.@assume_effects :nothrow :notaskstate
        @atomic counter.count += obj[]
    end
    throw_or_noop()
    return xs[]+=1
end
@benchmark withfinalizer(0)

master

julia> @benchmark withfinalizer(1)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  11.177 ns … 124.200 μs  ┊ GC (min … max):  0.00% … 39.95%
 Time  (median):     11.928 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   46.081 ns ±   1.765 μs  ┊ GC (mean ± σ):  30.60% ±  0.84%

   ▃▄▅▇▇█▆▆▄▅▆▆▅▅▃▃▂▁ ▁                 ▁▁ ▁▂▃▁▁▁ ▁            ▃
  ▇██████████████████████▇█▇█▇█▇▇▆▇▇█▇█▇███████████▇▆▇▆▄▆▃▄▁▄▄ █
  11.2 ns       Histogram: log(frequency) by time      17.4 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

this PR

julia> @benchmark withfinalizer(1)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  17.327 ns …  1.416 μs  ┊ GC (min … max): 0.00% … 97.78%
 Time  (median):     18.329 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.802 ns ± 16.372 ns  ┊ GC (mean ± σ):  1.35% ±  1.65%

           ▂█▇▂                                                
  ▂▂▂▂▂▂▃▅▆████▇▅▃▂▂▃▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  17.3 ns         Histogram: frequency by time        22.6 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

@aviatesk aviatesk force-pushed the avi/early-finalize-insertion branch 4 times, most recently from 27adeb4 to 4824376 Compare October 10, 2024 16:49
}

// Remove the finalizer for `o` from `ct->ptls->finalizers` and cancel that finalizer.
// `lookup_idx` is the index at which the finalizer was registered in `ct->ptls->finalizers`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just be a random pointer though. After the next GC we will have no idea where the finalizer index will be without scanning for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If a GC occurs between the finalizer registration and the call to Core._cancel_finalizer, this index becomes meaningless, but in that case, the slow path of jl_cancel_finalizer will handle it by performing a full scan. lookup_idx is only used for the fast path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But doesn't that means we expect this to normally hit the slow case, except when microbenchmarking?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no. There are plenty of benchmarks (e.g. BigInt) where we expect many objects to die before GC has run.

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still very unsure that we should do this. It could completely change the order in which finalizers are run, since currently they can be assumed to run in LIFO dominator order (e.g. DFS post-order), but afterwards, we may lose that assumed ordering?

@aviatesk
Copy link
Member Author

aviatesk commented Oct 11, 2024

This PR implements an optimization that inlines the target finalizer call and, instead of removing the finalizer registration like before, cancels the registration by calling Core._cancel_finalizer. With finalizer cancellation, only the target finalizer (and its corresponding object) is removed from the finalizers list, so I believe it leaves the order of the other finalizers unchanged, but am I misunderstanding something here?

aviatesk added a commit that referenced this pull request Oct 16, 2024
E.g. this allows `finalizer` inlining in the following case:
```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end
const foreign_buffer_finalized = Ref(false)
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    return finalizer(obj) do obj
        Base.@assume_effects :notaskstate :nothrow
        foreign_buffer_finalized[] = true
        Libc.free(obj.ptr)
    end
end
function f_EA_finalizer(N::Int)
    workspace = foreign_alloc(Float64, N)
    GC.@preserve workspace begin
        (;ptr) = workspace
        Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr)
    end
end
```
```julia
julia> @code_typed f_EA_finalizer(42)
CodeInfo(
1 ── %1  = Base.mul_int(8, N)::Int64
│    %2  = Core.lshr_int(%1, 63)::Int64
│    %3  = Core.trunc_int(Core.UInt8, %2)::UInt8
│    %4  = Core.eq_int(%3, 0x01)::Bool
└───       goto #3 if not %4
2 ──       invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{}
└───       unreachable
3 ──       goto #4
4 ── %9  = Core.bitcast(Core.UInt64, %1)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing}
└───       goto #9
9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64}
│    %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64}
└───       goto #10
10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17)))
│    %20 = Base.getfield(%17, :ptr)::Ptr{Float64}
│          invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing
│          $(Expr(:gc_preserve_end, :(%19)))
│    %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool}
│          Base.setfield!(%23, :x, true)::Bool
│    %25 = Base.getfield(%17, :ptr)::Ptr{Float64}
│    %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing
└───       return nothing
) => Nothing
```

However, this is still a WIP. Before merging, I want to improve EA's
precision a bit and at least fix the test case that is currently marked
as `broken`. I also need to check its impact on compiler performance.

Additionally, I believe this feature is not yet practical. In
particular, there is still significant room for improvement in the
following areas:
- EA's interprocedural capabilities: currently EA is performed ad-hoc
for limited frames because of latency reasons, which significantly
reduces its precision in the presence of interprocedural calls.
- Relaxing the `:nothrow` check for finalizer inlining: the current
algorithm requires `:nothrow`-ness on all paths from the allocation of
the mutable struct to its last use, which is not practical for
real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary
optimizations such as inserting a `finalize` call after the last use
might still be possible (#55990).
src/gc-common.c Outdated Show resolved Hide resolved
@aviatesk aviatesk force-pushed the avi/early-finalize-insertion branch 3 times, most recently from 6c182a6 to d3cab63 Compare October 21, 2024 08:12
@aviatesk aviatesk changed the title optimizer: early finalize insertion optimizer: support early finalization Oct 21, 2024
@aviatesk aviatesk force-pushed the avi/early-finalize-insertion branch 3 times, most recently from dac914c to 5dab57d Compare October 23, 2024 12:42
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from offline discussion:

  • we could allow this for any finalizer function, not just ones with favorable effects
  • we should usually use the shim in gc-common.c to call the finalizers, so that they also may respect the finalizers-disabled flag. This could even be integrated with the call to cancel it
  • when canceling a finalizer, we can use the idx as a hint for the index to find it at and scan backwards from there in the list until we find the right one to cancel
  • after cancelling a finalizer, we can trim any trailing null values from the ptls list, to avoid the index from getting too far off in some typical cases
  • even with this, we are still guaranteeing that inner objects with finalizers registered before outer objects with finalizers will run in LIFO order, since finalizer itself is an escape and must be treated as such (even if it gets removed by optimizations later) and so inner objects cannot die before their containers

Currently, in the finalizer inlining pass, if not all the code between
the finalizer registration and the end of the object’s lifetime (i.e.,
where the finalizer would be inlined) is marked as `:nothrow`, it simply
bails out. However, even in such cases, we can insert a `finalize` call
at the end of the object’s lifetime, allowing us to call the finalizer
early if no exceptions occur.

This commit implements this optimization. To do so, it also moves
`finalize` to `Core`, so the compiler can handle it directly.
…ize`

This seems to perform well.

```julia
mutable struct AtomicCounter
    @atomic count::Int
end
const counter = AtomicCounter(0)

function withfinalizer(x)
    xs = finalizer(Ref(x)) do obj
        Base.@assume_effects :nothrow :notaskstate
        @atomic counter.count += obj[]
    end
    println(devnull, xs[])
    xs[] = xs[]
    println(devnull, xs[])
    return xs[]
end

@benchmark withfinalizer(100)
```

> master
```
BenchmarkTools.Trial: 10000 samples with 867 evaluations.
 Range (min … max):  135.717 ns …  18.617 μs  ┊ GC (min … max):  0.00% … 47.51%
 Time  (median):     144.992 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   192.949 ns ± 687.513 ns  ┊ GC (mean ± σ):  11.57% ±  3.23%

     ▇  ▃█▅▂
  ▁▁▂█▇▂████▇▆▅▄▄▄▅▄▄▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  136 ns           Histogram: frequency by time          201 ns <

 Memory estimate: 208 bytes, allocs estimate: 7.
```

> this commit
```
BenchmarkTools.Trial: 10000 samples with 964 evaluations.
 Range (min … max):   83.506 ns …   3.408 μs  ┊ GC (min … max):  0.00% … 96.85%
 Time  (median):      88.780 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   101.797 ns ± 137.067 ns  ┊ GC (mean ± σ):  10.26% ±  7.31%

    ▄█▃▅
  ▄▃████▅▅▆▅▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▁▁▁▂▁▂▂▁▂▂ ▃
  83.5 ns          Histogram: frequency by time          145 ns <

 Memory estimate: 208 bytes, allocs estimate: 7.
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants