Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load path #7

Closed
StefanKarpinski opened this issue Apr 27, 2011 · 4 comments
Closed

load path #7

StefanKarpinski opened this issue Apr 27, 2011 · 4 comments
Assignees

Comments

@StefanKarpinski
Copy link
Sponsor Member

Currently we have a half-baked, implicit load path for .j files loaded via the builtin load() function. It should be possible to add a list of directories to look in for .j files. The rule should probably be that names without / in them are looked up via the load path, while names beginning with / are considered to be absolute and names with / anywhere else are relative to the current directory.

@ghost ghost assigned StefanKarpinski Apr 27, 2011
@JeffBezanson
Copy link
Sponsor Member

Can be done as part of the redesign discussed in issue #65.

@StefanKarpinski
Copy link
Sponsor Member Author

Yeah, I can start looking at how to do these two together. Would be a big combined win.

JeffBezanson added a commit that referenced this issue Feb 6, 2012
old load() is now called include()
this also makes it easy to work on #7
array constructor tweaks
@JeffBezanson
Copy link
Sponsor Member

This is already in util.j, but there is no user interface to it. What should it be?

@StefanKarpinski
Copy link
Sponsor Member Author

Ah, ok. This is pretty easy now. I think should just have

const LOAD_PATH = String["", "$JULIA_HOME/", "$JULIA_HOME/j/"]

and always use that to look for files that get loaded. People can just manipulate the contents of that array like any other array. It's pretty straightforward. Does seem reasonable to you?

vtjnash pushed a commit to vtjnash/julia that referenced this issue Jun 1, 2012
aviatesk added a commit that referenced this issue Oct 8, 2022
This commit tries to fix and improve performance for calling keyword
funcs whose arguments types are not fully known but `@nospecialize`-d.

The final result would look like (this particular example is taken from
our Julia-level compiler implementation):
```julia
abstract type CallInfo end
struct NoCallInfo <: CallInfo end
struct NewInstruction
    stmt::Any
    type::Any
    info::CallInfo
    line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location
    flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion
    function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo),
                            line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing})
        return new(stmt, type, info, line, flag)
    end
end
@nospecialize
function NewInstruction(newinst::NewInstruction;
    stmt=newinst.stmt,
    type=newinst.type,
    info::CallInfo=newinst.info,
    line::Union{Int32,Nothing}=newinst.line,
    flag::Union{UInt8,Nothing}=newinst.flag)
    return NewInstruction(stmt, type, info, line, flag)
end
@Specialize

using BenchmarkTools
struct VirtualKwargs
    stmt::Any
    type::Any
    info::CallInfo
end
vkws = VirtualKwargs(nothing, Any, NoCallInfo())
newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing)
runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info)
@benchmark runner($newinst, $vkws)
```

> on master
```
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
 Range (min … max):  559.898 ns …   4.173 μs  ┊ GC (min … max): 0.00% … 85.29%
 Time  (median):     605.608 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   638.170 ns ± 125.080 ns  ┊ GC (mean ± σ):  0.06% ±  0.85%

  █▇▂▆▄  ▁█▇▄▂                                                  ▂
  ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █
  560 ns        Histogram: log(frequency) by time       1.23 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.
```

> on this commit
```julia
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.080 ns … 83.177 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.098 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.118 ns ±  0.885 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▅▇█▆▅▄▂
  ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃
  3.08 ns        Histogram: frequency by time        3.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

So for this particular case it achieves roughly 200x speed up.
This is because this commit allows inlining of a call to keyword sorter
as well as removal of `NamedTuple` call.

Especially this commit is composed of the following improvements:
- Add early return case for `structdiff`:
  This change improves the return type inference for a case when
  compared `NamedTuple`s are type unstable but there is no difference
  in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s.
  And in such case the optimizer will remove `structdiff` and succeeding
  `pairs` calls, letting the keyword sorter to be inlined.
- Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it
  directly forms `:splatnew` allocation rather than redirects to the
  general `NamedTuple` constructor, that could be confused for abstract
  input tuple type.
- Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types.
  This improvement lets `inline_splatnew` to handle more abstract
  `NamedTuple`s, especially whose names are fully known but its fields
  tuple type is abstract.

Those improvements are combined to allow our SROA pass to optimize away
`NamedTuple` and `tuple` calls generated for keyword argument handling.
E.g. the IR for the example `NewInstruction` constructor is now fairly
optimized, like:
```julia
julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info
           NewInstruction(newinst; stmt, type, info)
       end |> only
2 1 ── %1  = Base.getfield(_2, :line)::Union{Nothing, Int32}                    │╻╷  Type##kw
  │    %2  = Base.getfield(_2, :flag)::Union{Nothing, UInt8}                    ││┃   getproperty
  │    %3  = (isa)(%1, Nothing)::Bool                                           ││
  │    %4  = (isa)(%2, Nothing)::Bool                                           ││
  │    %5  = (Core.Intrinsics.and_int)(%3, %4)::Bool                            ││
  └───       goto #3 if not %5                                                  ││
  2 ── %7  = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction   NewInstruction
  └───       goto #10                                                           ││
  3 ── %9  = (isa)(%1, Int32)::Bool                                             ││
  │    %10 = (isa)(%2, Nothing)::Bool                                           ││
  │    %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool                           ││
  └───       goto #5 if not %11                                                 ││
  4 ── %13 = π (%1, Int32)                                                      ││
  │    %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  5 ── %16 = (isa)(%1, Nothing)::Bool                                           ││
  │    %17 = (isa)(%2, UInt8)::Bool                                             ││
  │    %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool                          ││
  └───       goto #7 if not %18                                                 ││
  6 ── %20 = π (%2, UInt8)                                                      ││
  │    %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  7 ── %23 = (isa)(%1, Int32)::Bool                                             ││
  │    %24 = (isa)(%2, UInt8)::Bool                                             ││
  │    %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool                          ││
  └───       goto #9 if not %25                                                 ││
  8 ── %27 = π (%1, Int32)                                                      ││
  │    %28 = π (%2, UInt8)                                                      ││
  │    %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction    │││╻   NewInstruction
  └───       goto #10                                                           ││
  9 ──       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
  └───       unreachable                                                        ││
  10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction      ││
  └───       goto #11                                                           ││
  11 ─       return %33                                                         │
   => NewInstruction
```
aviatesk added a commit that referenced this issue Oct 8, 2022
This commit tries to fix and improve performance for calling keyword
funcs whose arguments types are not fully known but `@nospecialize`-d.

The final result would look like (this particular example is taken from
our Julia-level compiler implementation):
```julia
abstract type CallInfo end
struct NoCallInfo <: CallInfo end
struct NewInstruction
    stmt::Any
    type::Any
    info::CallInfo
    line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location
    flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion
    function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo),
                            line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing})
        return new(stmt, type, info, line, flag)
    end
end
@nospecialize
function NewInstruction(newinst::NewInstruction;
    stmt=newinst.stmt,
    type=newinst.type,
    info::CallInfo=newinst.info,
    line::Union{Int32,Nothing}=newinst.line,
    flag::Union{UInt8,Nothing}=newinst.flag)
    return NewInstruction(stmt, type, info, line, flag)
end
@Specialize

using BenchmarkTools
struct VirtualKwargs
    stmt::Any
    type::Any
    info::CallInfo
end
vkws = VirtualKwargs(nothing, Any, NoCallInfo())
newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing)
runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info)
@benchmark runner($newinst, $vkws)
```

> on master
```
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
 Range (min … max):  559.898 ns …   4.173 μs  ┊ GC (min … max): 0.00% … 85.29%
 Time  (median):     605.608 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   638.170 ns ± 125.080 ns  ┊ GC (mean ± σ):  0.06% ±  0.85%

  █▇▂▆▄  ▁█▇▄▂                                                  ▂
  ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █
  560 ns        Histogram: log(frequency) by time       1.23 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.
```

> on this commit
```julia
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.080 ns … 83.177 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.098 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.118 ns ±  0.885 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▅▇█▆▅▄▂
  ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃
  3.08 ns        Histogram: frequency by time        3.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

So for this particular case it achieves roughly 200x speed up.
This is because this commit allows inlining of a call to keyword sorter
as well as removal of `NamedTuple` call.

Especially this commit is composed of the following improvements:
- Add early return case for `structdiff`:
  This change improves the return type inference for a case when
  compared `NamedTuple`s are type unstable but there is no difference
  in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s.
  And in such case the optimizer will remove `structdiff` and succeeding
  `pairs` calls, letting the keyword sorter to be inlined.
- Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it
  directly forms `:splatnew` allocation rather than redirects to the
  general `NamedTuple` constructor, that could be confused for abstract
  input tuple type.
- Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types.
  This improvement lets `inline_splatnew` to handle more abstract
  `NamedTuple`s, especially whose names are fully known but its fields
  tuple type is abstract.

Those improvements are combined to allow our SROA pass to optimize away
`NamedTuple` and `tuple` calls generated for keyword argument handling.
E.g. the IR for the example `NewInstruction` constructor is now fairly
optimized, like:
```julia
julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info
           NewInstruction(newinst; stmt, type, info)
       end |> only
2 1 ── %1  = Base.getfield(_2, :line)::Union{Nothing, Int32}                    │╻╷  Type##kw
  │    %2  = Base.getfield(_2, :flag)::Union{Nothing, UInt8}                    ││┃   getproperty
  │    %3  = (isa)(%1, Nothing)::Bool                                           ││
  │    %4  = (isa)(%2, Nothing)::Bool                                           ││
  │    %5  = (Core.Intrinsics.and_int)(%3, %4)::Bool                            ││
  └───       goto #3 if not %5                                                  ││
  2 ── %7  = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction   NewInstruction
  └───       goto #10                                                           ││
  3 ── %9  = (isa)(%1, Int32)::Bool                                             ││
  │    %10 = (isa)(%2, Nothing)::Bool                                           ││
  │    %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool                           ││
  └───       goto #5 if not %11                                                 ││
  4 ── %13 = π (%1, Int32)                                                      ││
  │    %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  5 ── %16 = (isa)(%1, Nothing)::Bool                                           ││
  │    %17 = (isa)(%2, UInt8)::Bool                                             ││
  │    %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool                          ││
  └───       goto #7 if not %18                                                 ││
  6 ── %20 = π (%2, UInt8)                                                      ││
  │    %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  7 ── %23 = (isa)(%1, Int32)::Bool                                             ││
  │    %24 = (isa)(%2, UInt8)::Bool                                             ││
  │    %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool                          ││
  └───       goto #9 if not %25                                                 ││
  8 ── %27 = π (%1, Int32)                                                      ││
  │    %28 = π (%2, UInt8)                                                      ││
  │    %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction    │││╻   NewInstruction
  └───       goto #10                                                           ││
  9 ──       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
  └───       unreachable                                                        ││
  10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction      ││
  └───       goto #11                                                           ││
  11 ─       return %33                                                         │
   => NewInstruction
```
aviatesk added a commit that referenced this issue Oct 8, 2022
This commit tries to fix and improve performance for calling keyword
funcs whose arguments types are not fully known but `@nospecialize`-d.

The final result would look like (this particular example is taken from
our Julia-level compiler implementation):
```julia
abstract type CallInfo end
struct NoCallInfo <: CallInfo end
struct NewInstruction
    stmt::Any
    type::Any
    info::CallInfo
    line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location
    flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion
    function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo),
                            line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing})
        return new(stmt, type, info, line, flag)
    end
end
@nospecialize
function NewInstruction(newinst::NewInstruction;
    stmt=newinst.stmt,
    type=newinst.type,
    info::CallInfo=newinst.info,
    line::Union{Int32,Nothing}=newinst.line,
    flag::Union{UInt8,Nothing}=newinst.flag)
    return NewInstruction(stmt, type, info, line, flag)
end
@Specialize

using BenchmarkTools
struct VirtualKwargs
    stmt::Any
    type::Any
    info::CallInfo
end
vkws = VirtualKwargs(nothing, Any, NoCallInfo())
newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing)
runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info)
@benchmark runner($newinst, $vkws)
```

> on master
```
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
 Range (min … max):  559.898 ns …   4.173 μs  ┊ GC (min … max): 0.00% … 85.29%
 Time  (median):     605.608 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   638.170 ns ± 125.080 ns  ┊ GC (mean ± σ):  0.06% ±  0.85%

  █▇▂▆▄  ▁█▇▄▂                                                  ▂
  ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █
  560 ns        Histogram: log(frequency) by time       1.23 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.
```

> on this commit
```julia
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.080 ns … 83.177 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.098 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.118 ns ±  0.885 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▅▇█▆▅▄▂
  ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃
  3.08 ns        Histogram: frequency by time        3.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

So for this particular case it achieves roughly 200x speed up.
This is because this commit allows inlining of a call to keyword sorter
as well as removal of `NamedTuple` call.

Especially this commit is composed of the following improvements:
- Add early return case for `structdiff`:
  This change improves the return type inference for a case when
  compared `NamedTuple`s are type unstable but there is no difference
  in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s.
  And in such case the optimizer will remove `structdiff` and succeeding
  `pairs` calls, letting the keyword sorter to be inlined.
- Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it
  directly forms `:splatnew` allocation rather than redirects to the
  general `NamedTuple` constructor, that could be confused for abstract
  input tuple type.
- Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types.
  This improvement lets `inline_splatnew` to handle more abstract
  `NamedTuple`s, especially whose names are fully known but its fields
  tuple type is abstract.

Those improvements are combined to allow our SROA pass to optimize away
`NamedTuple` and `tuple` calls generated for keyword argument handling.
E.g. the IR for the example `NewInstruction` constructor is now fairly
optimized, like:
```julia
julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info
           NewInstruction(newinst; stmt, type, info)
       end |> only
2 1 ── %1  = Base.getfield(_2, :line)::Union{Nothing, Int32}                    │╻╷  Type##kw
  │    %2  = Base.getfield(_2, :flag)::Union{Nothing, UInt8}                    ││┃   getproperty
  │    %3  = (isa)(%1, Nothing)::Bool                                           ││
  │    %4  = (isa)(%2, Nothing)::Bool                                           ││
  │    %5  = (Core.Intrinsics.and_int)(%3, %4)::Bool                            ││
  └───       goto #3 if not %5                                                  ││
  2 ── %7  = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction   NewInstruction
  └───       goto #10                                                           ││
  3 ── %9  = (isa)(%1, Int32)::Bool                                             ││
  │    %10 = (isa)(%2, Nothing)::Bool                                           ││
  │    %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool                           ││
  └───       goto #5 if not %11                                                 ││
  4 ── %13 = π (%1, Int32)                                                      ││
  │    %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  5 ── %16 = (isa)(%1, Nothing)::Bool                                           ││
  │    %17 = (isa)(%2, UInt8)::Bool                                             ││
  │    %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool                          ││
  └───       goto #7 if not %18                                                 ││
  6 ── %20 = π (%2, UInt8)                                                      ││
  │    %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  7 ── %23 = (isa)(%1, Int32)::Bool                                             ││
  │    %24 = (isa)(%2, UInt8)::Bool                                             ││
  │    %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool                          ││
  └───       goto #9 if not %25                                                 ││
  8 ── %27 = π (%1, Int32)                                                      ││
  │    %28 = π (%2, UInt8)                                                      ││
  │    %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction    │││╻   NewInstruction
  └───       goto #10                                                           ││
  9 ──       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
  └───       unreachable                                                        ││
  10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction      ││
  └───       goto #11                                                           ││
  11 ─       return %33                                                         │
   => NewInstruction
```
aviatesk added a commit that referenced this issue Oct 8, 2022
This commit tries to fix and improve performance for calling keyword
funcs whose arguments types are not fully known but `@nospecialize`-d.

The final result would look like (this particular example is taken from
our Julia-level compiler implementation):
```julia
abstract type CallInfo end
struct NoCallInfo <: CallInfo end
struct NewInstruction
    stmt::Any
    type::Any
    info::CallInfo
    line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location
    flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion
    function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo),
                            line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing})
        return new(stmt, type, info, line, flag)
    end
end
@nospecialize
function NewInstruction(newinst::NewInstruction;
    stmt=newinst.stmt,
    type=newinst.type,
    info::CallInfo=newinst.info,
    line::Union{Int32,Nothing}=newinst.line,
    flag::Union{UInt8,Nothing}=newinst.flag)
    return NewInstruction(stmt, type, info, line, flag)
end
@Specialize

using BenchmarkTools
struct VirtualKwargs
    stmt::Any
    type::Any
    info::CallInfo
end
vkws = VirtualKwargs(nothing, Any, NoCallInfo())
newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing)
runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info)
@benchmark runner($newinst, $vkws)
```

> on master
```
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
 Range (min … max):  559.898 ns …   4.173 μs  ┊ GC (min … max): 0.00% … 85.29%
 Time  (median):     605.608 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   638.170 ns ± 125.080 ns  ┊ GC (mean ± σ):  0.06% ±  0.85%

  █▇▂▆▄  ▁█▇▄▂                                                  ▂
  ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █
  560 ns        Histogram: log(frequency) by time       1.23 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.
```

> on this commit
```julia
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.080 ns … 83.177 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.098 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.118 ns ±  0.885 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▅▇█▆▅▄▂
  ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃
  3.08 ns        Histogram: frequency by time        3.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

So for this particular case it achieves roughly 200x speed up.
This is because this commit allows inlining of a call to keyword sorter
as well as removal of `NamedTuple` call.

Especially this commit is composed of the following improvements:
- Add early return case for `structdiff`:
  This change improves the return type inference for a case when
  compared `NamedTuple`s are type unstable but there is no difference
  in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s.
  And in such case the optimizer will remove `structdiff` and succeeding
  `pairs` calls, letting the keyword sorter to be inlined.
- Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it
  directly forms `:splatnew` allocation rather than redirects to the
  general `NamedTuple` constructor, that could be confused for abstract
  input tuple type.
- Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types.
  This improvement lets `inline_splatnew` to handle more abstract
  `NamedTuple`s, especially whose names are fully known but its fields
  tuple type is abstract.

Those improvements are combined to allow our SROA pass to optimize away
`NamedTuple` and `tuple` calls generated for keyword argument handling.
E.g. the IR for the example `NewInstruction` constructor is now fairly
optimized, like:
```julia
julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info
           NewInstruction(newinst; stmt, type, info)
       end |> only
2 1 ── %1  = Base.getfield(_2, :line)::Union{Nothing, Int32}                    │╻╷  Type##kw
  │    %2  = Base.getfield(_2, :flag)::Union{Nothing, UInt8}                    ││┃   getproperty
  │    %3  = (isa)(%1, Nothing)::Bool                                           ││
  │    %4  = (isa)(%2, Nothing)::Bool                                           ││
  │    %5  = (Core.Intrinsics.and_int)(%3, %4)::Bool                            ││
  └───       goto #3 if not %5                                                  ││
  2 ── %7  = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction   NewInstruction
  └───       goto #10                                                           ││
  3 ── %9  = (isa)(%1, Int32)::Bool                                             ││
  │    %10 = (isa)(%2, Nothing)::Bool                                           ││
  │    %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool                           ││
  └───       goto #5 if not %11                                                 ││
  4 ── %13 = π (%1, Int32)                                                      ││
  │    %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  5 ── %16 = (isa)(%1, Nothing)::Bool                                           ││
  │    %17 = (isa)(%2, UInt8)::Bool                                             ││
  │    %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool                          ││
  └───       goto #7 if not %18                                                 ││
  6 ── %20 = π (%2, UInt8)                                                      ││
  │    %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  7 ── %23 = (isa)(%1, Int32)::Bool                                             ││
  │    %24 = (isa)(%2, UInt8)::Bool                                             ││
  │    %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool                          ││
  └───       goto #9 if not %25                                                 ││
  8 ── %27 = π (%1, Int32)                                                      ││
  │    %28 = π (%2, UInt8)                                                      ││
  │    %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction    │││╻   NewInstruction
  └───       goto #10                                                           ││
  9 ──       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
  └───       unreachable                                                        ││
  10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction      ││
  └───       goto #11                                                           ││
  11 ─       return %33                                                         │
   => NewInstruction
```
aviatesk added a commit that referenced this issue Oct 8, 2022
This commit tries to fix and improve performance for calling keyword
funcs whose arguments types are not fully known but `@nospecialize`-d.

The final result would look like (this particular example is taken from
our Julia-level compiler implementation):
```julia
abstract type CallInfo end
struct NoCallInfo <: CallInfo end
struct NewInstruction
    stmt::Any
    type::Any
    info::CallInfo
    line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location
    flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion
    function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo),
                            line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing})
        return new(stmt, type, info, line, flag)
    end
end
@nospecialize
function NewInstruction(newinst::NewInstruction;
    stmt=newinst.stmt,
    type=newinst.type,
    info::CallInfo=newinst.info,
    line::Union{Int32,Nothing}=newinst.line,
    flag::Union{UInt8,Nothing}=newinst.flag)
    return NewInstruction(stmt, type, info, line, flag)
end
@Specialize

using BenchmarkTools
struct VirtualKwargs
    stmt::Any
    type::Any
    info::CallInfo
end
vkws = VirtualKwargs(nothing, Any, NoCallInfo())
newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing)
runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info)
@benchmark runner($newinst, $vkws)
```

> on master
```
BenchmarkTools.Trial: 10000 samples with 186 evaluations.
 Range (min … max):  559.898 ns …   4.173 μs  ┊ GC (min … max): 0.00% … 85.29%
 Time  (median):     605.608 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   638.170 ns ± 125.080 ns  ┊ GC (mean ± σ):  0.06% ±  0.85%

  █▇▂▆▄  ▁█▇▄▂                                                  ▂
  ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █
  560 ns        Histogram: log(frequency) by time       1.23 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.
```

> on this commit
```julia
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.080 ns … 83.177 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.098 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.118 ns ±  0.885 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▅▇█▆▅▄▂
  ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃
  3.08 ns        Histogram: frequency by time        3.19 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

So for this particular case it achieves roughly 200x speed up.
This is because this commit allows inlining of a call to keyword sorter
as well as removal of `NamedTuple` call.

Especially this commit is composed of the following improvements:
- Add early return case for `structdiff`:
  This change improves the return type inference for a case when
  compared `NamedTuple`s are type unstable but there is no difference
  in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s.
  And in such case the optimizer will remove `structdiff` and succeeding
  `pairs` calls, letting the keyword sorter to be inlined.
- Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it
  directly forms `:splatnew` allocation rather than redirects to the
  general `NamedTuple` constructor, that could be confused for abstract
  input tuple type.
- Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types.
  This improvement lets `inline_splatnew` to handle more abstract
  `NamedTuple`s, especially whose names are fully known but its fields
  tuple type is abstract.

Those improvements are combined to allow our SROA pass to optimize away
`NamedTuple` and `tuple` calls generated for keyword argument handling.
E.g. the IR for the example `NewInstruction` constructor is now fairly
optimized, like:
```julia
julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info
           NewInstruction(newinst; stmt, type, info)
       end |> only
2 1 ── %1  = Base.getfield(_2, :line)::Union{Nothing, Int32}                    │╻╷  Type##kw
  │    %2  = Base.getfield(_2, :flag)::Union{Nothing, UInt8}                    ││┃   getproperty
  │    %3  = (isa)(%1, Nothing)::Bool                                           ││
  │    %4  = (isa)(%2, Nothing)::Bool                                           ││
  │    %5  = (Core.Intrinsics.and_int)(%3, %4)::Bool                            ││
  └───       goto #3 if not %5                                                  ││
  2 ── %7  = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction   NewInstruction
  └───       goto #10                                                           ││
  3 ── %9  = (isa)(%1, Int32)::Bool                                             ││
  │    %10 = (isa)(%2, Nothing)::Bool                                           ││
  │    %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool                           ││
  └───       goto #5 if not %11                                                 ││
  4 ── %13 = π (%1, Int32)                                                      ││
  │    %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  5 ── %16 = (isa)(%1, Nothing)::Bool                                           ││
  │    %17 = (isa)(%2, UInt8)::Bool                                             ││
  │    %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool                          ││
  └───       goto #7 if not %18                                                 ││
  6 ── %20 = π (%2, UInt8)                                                      ││
  │    %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻   NewInstruction
  └───       goto #10                                                           ││
  7 ── %23 = (isa)(%1, Int32)::Bool                                             ││
  │    %24 = (isa)(%2, UInt8)::Bool                                             ││
  │    %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool                          ││
  └───       goto #9 if not %25                                                 ││
  8 ── %27 = π (%1, Int32)                                                      ││
  │    %28 = π (%2, UInt8)                                                      ││
  │    %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction    │││╻   NewInstruction
  └───       goto #10                                                           ││
  9 ──       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
  └───       unreachable                                                        ││
  10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction      ││
  └───       goto #11                                                           ││
  11 ─       return %33                                                         │
   => NewInstruction
```
github-merge-queue bot pushed a commit that referenced this issue Jul 15, 2023
This makes it easier to correlate LLVM IR with the originating source
code by including both argument name and argument type in the LLVM
argument variable.

<details>
<summary>Example 1</summary>

```julia
julia> function f(a, b, c, d, g...)
           e = a + b + c + d
           f = does_not_exist(e) + e
           f
       end
f (generic function with 1 method)

julia> @code_llvm f(0,0,0,0,0)
```
```llvm
;  @ REPL[1]:1 within `f`
define nonnull {}* @julia_f_141(i64 signext %"a::Int64", i64 signext %"b::Int64", i64 signext %"c::Int64", i64 signext %"d::Int64", i64 signext %"g[0]::Int64") #0 {
top:
  %0 = alloca [2 x {}*], align 8
  %gcframe3 = alloca [4 x {}*], align 16
  %gcframe3.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 0
  %1 = bitcast [4 x {}*]* %gcframe3 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 16 %1, i8 0, i64 32, i1 true)
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #7
  %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8
  %2 = bitcast i8* %tls_ppgcstack to {}****
  %tls_pgcstack = load {}***, {}**** %2, align 8
;  @ REPL[1]:3 within `f`
  %3 = bitcast [4 x {}*]* %gcframe3 to i64*
  store i64 8, i64* %3, align 16
  %4 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 1
  %5 = bitcast {}** %4 to {}***
  %6 = load {}**, {}*** %tls_pgcstack, align 8
  store {}** %6, {}*** %5, align 8
  %7 = bitcast {}*** %tls_pgcstack to {}***
  store {}** %gcframe3.sub, {}*** %7, align 8
  %Main.does_not_exist.cached = load atomic {}*, {}** @0 unordered, align 8
  %iscached.not = icmp eq {}* %Main.does_not_exist.cached, null
  br i1 %iscached.not, label %notfound, label %found

notfound:                                         ; preds = %top
  %Main.does_not_exist.found = call {}* @ijl_get_binding_or_error({}* nonnull inttoptr (i64 139831437630272 to {}*), {}* nonnull inttoptr (i64 139831600565400 to {}*))
  store atomic {}* %Main.does_not_exist.found, {}** @0 release, align 8
  br label %found

found:                                            ; preds = %notfound, %top
  %Main.does_not_exist = phi {}* [ %Main.does_not_exist.cached, %top ], [ %Main.does_not_exist.found, %notfound ]
  %8 = bitcast {}* %Main.does_not_exist to {}**
  %does_not_exist.checked = load atomic {}*, {}** %8 unordered, align 8
  %.not = icmp eq {}* %does_not_exist.checked, null
  br i1 %.not, label %err, label %ok

err:                                              ; preds = %found
  call void @ijl_undefined_var_error({}* inttoptr (i64 139831600565400 to {}*))
  unreachable

ok:                                               ; preds = %found
  %.sub = getelementptr inbounds [2 x {}*], [2 x {}*]* %0, i64 0, i64 0
;  @ REPL[1]:2 within `f`
; ┌ @ operators.jl:587 within `+` @ int.jl:87
   %9 = add i64 %"b::Int64", %"a::Int64"
   %10 = add i64 %9, %"c::Int64"
; │ @ operators.jl:587 within `+`
; │┌ @ operators.jl:544 within `afoldl`
; ││┌ @ int.jl:87 within `+`
     %11 = add i64 %10, %"d::Int64"
     %12 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 3
     store {}* %does_not_exist.checked, {}** %12, align 8
; └└└
;  @ REPL[1]:3 within `f`
  %13 = call nonnull {}* @ijl_box_int64(i64 signext %11)
  %14 = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe3, i64 0, i64 2
  store {}* %13, {}** %14, align 16
  store {}* %13, {}** %.sub, align 8
  %15 = call nonnull {}* @ijl_apply_generic({}* nonnull %does_not_exist.checked, {}** nonnull %.sub, i32 1)
  store {}* %15, {}** %12, align 8
  %16 = call nonnull {}* @ijl_box_int64(i64 signext %11)
  store {}* %16, {}** %14, align 16
  store {}* %15, {}** %.sub, align 8
  %17 = getelementptr inbounds [2 x {}*], [2 x {}*]* %0, i64 0, i64 1
  store {}* %16, {}** %17, align 8
  %18 = call nonnull {}* @ijl_apply_generic({}* inttoptr (i64 139831370516384 to {}*), {}** nonnull %.sub, i32 2)
  %19 = load {}*, {}** %4, align 8
  %20 = bitcast {}*** %tls_pgcstack to {}**
  store {}* %19, {}** %20, align 8
;  @ REPL[1]:4 within `f`
  ret {}* %18
}
```
</details>

<details>
<summary>Example 2</summary>

```julia
julia> function g(a, b, c, d; kwarg=0)
           a + b + c + d + kwarg
       end
g (generic function with 1 method)

julia> @code_llvm g(0,0,0,0,kwarg=0)
```
```llvm
;  @ REPL[3]:1 within `g`
define i64 @julia_g_160([1 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(8) %"#1::NamedTuple", i64 signext %"a::Int64", i64 signext %"b::Int64", i64 signext %"c::Int64", i64 signext %"d::Int64") #0 {
top:
  %0 = getelementptr inbounds [1 x i64], [1 x i64]* %"#1::NamedTuple", i64 0, i64 0
; ┌ @ REPL[3]:2 within `#g#1`
; │┌ @ operators.jl:587 within `+` @ int.jl:87
    %1 = add i64 %"b::Int64", %"a::Int64"
    %2 = add i64 %1, %"c::Int64"
; ││ @ operators.jl:587 within `+`
; ││┌ @ operators.jl:544 within `afoldl`
; │││┌ @ int.jl:87 within `+`
      %3 = add i64 %2, %"d::Int64"
; │││└
; │││ @ operators.jl:545 within `afoldl`
; │││┌ @ int.jl:87 within `+`
      %unbox = load i64, i64* %0, align 8
      %4 = add i64 %3, %unbox
; └└└└
  ret i64 %4
}
```
</details>
vchuravy added a commit that referenced this issue Jul 31, 2023
Pass the types to the allocator functions.

-------

Before this PR, we were missing the types for allocations in two cases:

1. allocations from codegen
2. allocations in `gc_managed_realloc_`

The second one is easy: those are always used for buffers, right?

For the first one: we extend the allocation functions called from
codegen, to take the type as a parameter, and set the tag there.

I kept the old interfaces around, since I think that they cannot be
removed due to supporting legacy code?

------

An example of the generated code:
```julia
  %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2
  %13 = bitcast {}*** %ptls_field6 to i8**
  %ptls_load78 = load i8*, i8** %13, align 8
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7
```

Fixes #43688.
Fixes #45268.

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
NHDaly added a commit that referenced this issue Aug 28, 2023
Pass the types to the allocator functions.

-------

Before this PR, we were missing the types for allocations in two cases:

1. allocations from codegen
2. allocations in `gc_managed_realloc_`

The second one is easy: those are always used for buffers, right?

For the first one: we extend the allocation functions called from
codegen, to take the type as a parameter, and set the tag there.

I kept the old interfaces around, since I think that they cannot be
removed due to supporting legacy code?

------

An example of the generated code:
```julia
  %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2
  %13 = bitcast {}*** %ptls_field6 to i8**
  %ptls_load78 = load i8*, i8** %13, align 8
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7
```

Fixes #43688.
Fixes #45268.

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
NHDaly added a commit that referenced this issue Sep 19, 2023
Backports PR #50337 for RAI julia v1.9.2

Original description:

===================

Pass the types to the allocator functions.

-------

Before this PR, we were missing the types for allocations in two cases:

1. allocations from codegen
2. allocations in `gc_managed_realloc_`

The second one is easy: those are always used for buffers, right?

For the first one: we extend the allocation functions called from
codegen, to take the type as a parameter, and set the tag there.

I kept the old interfaces around, since I think that they cannot be
removed due to supporting legacy code?

------

An example of the generated code:
```julia
  %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2
  %13 = bitcast {}*** %ptls_field6 to i8**
  %ptls_load78 = load i8*, i8** %13, align 8
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7
```

Fixes #43688.
Fixes #45268.

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
Keno added a commit that referenced this issue Oct 9, 2023
Keno pushed a commit that referenced this issue Oct 9, 2023
 Fix `getproperty` calls and handle Core.Typeof(Vararg)
Keno added a commit that referenced this issue Nov 2, 2023
This is part of the work to address #51352 by attempting to allow
the compiler to perform SRAO on persistent data structures like
`PersistentDict` as if they were regular immutable data structures.
These sorts of data structures have very complicated internals
(with lots of mutation, memory sharing, etc.), but a relatively
simple interface. As such, it is unlikely that our compiler will
have sufficient power to optimize this interface by analyzing
the implementation.

We thus need to come up with some other mechanism that gives the
compiler license to perform the requisite optimization. One way
would be to just hardcode `PersistentDict` into the compiler,
optimizing it like any of the other builtin datatypes. However,
this is of course very unsatisfying. At the other end of the
spectrum would be something like a generic rewrite rule system
(e-graphs anyone?) that would let the PersistentDict
implementation declare its interface to the compiler and the
compiler would use this for optimization (in a perfect world,
the actual rewrite would then be checked using some sort of
formal methods). I think that would be interesting, but we're
very far from even being able to design something like that
(at least in Base - experiments with external AbstractInterpreters
in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where
the compiler gets some knowledge of the protocol hardcoded without
having to know about the implementation details of the data structure.

The basic ideas is that `Core` provides some magic generic functions
that implementations can extend. Semantically, they are not special.
They dispatch as usual, and implementations are expected to work
properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural
optimization using these magic generic functions. In the concrete
case, this PR introduces the `KeyValue` interface which consists
of two generic functions, `get` and `set`. The core optimization
is that the compiler is allowed to rewrite any occurrence of
`get(set(x, k, v), k)` into `v` without additional legality checks.
In particular, the compiler performs no type checks, conversions, etc.
The higher level implementation code is expected to do all that.

This approach closely matches the general direction we've been taking
in external AbstractInterpreters for embedding additional semantics
and optimization opportunities into Julia code (although we generally
use methods there, rather than full generic functions), so I think
we have some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is
explicitly declared unstable.

## Current Status

This is fully working and implemented, but the optimization currently
bails on anything but the simplest cases. Filling all those cases in
is not particularly hard, but should be done along with a more invasive
refactoring of SROA, so we should figure out the general direction
here first and then we can finish all that up in a follow-up cleanup.

## Obligatory benchmark
Before:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto #5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto #4 if not %13
3 ─       goto #5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto #6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto #7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto #8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64
└──       return %29
```

After:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1
```
aviatesk pushed a commit that referenced this issue Nov 16, 2023
This is part of the work to address #51352 by attempting to allow
the compiler to perform SRAO on persistent data structures like
`PersistentDict` as if they were regular immutable data structures.
These sorts of data structures have very complicated internals
(with lots of mutation, memory sharing, etc.), but a relatively
simple interface. As such, it is unlikely that our compiler will
have sufficient power to optimize this interface by analyzing
the implementation.

We thus need to come up with some other mechanism that gives the
compiler license to perform the requisite optimization. One way
would be to just hardcode `PersistentDict` into the compiler,
optimizing it like any of the other builtin datatypes. However,
this is of course very unsatisfying. At the other end of the
spectrum would be something like a generic rewrite rule system
(e-graphs anyone?) that would let the PersistentDict
implementation declare its interface to the compiler and the
compiler would use this for optimization (in a perfect world,
the actual rewrite would then be checked using some sort of
formal methods). I think that would be interesting, but we're
very far from even being able to design something like that
(at least in Base - experiments with external AbstractInterpreters
in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where
the compiler gets some knowledge of the protocol hardcoded without
having to know about the implementation details of the data structure.

The basic ideas is that `Core` provides some magic generic functions
that implementations can extend. Semantically, they are not special.
They dispatch as usual, and implementations are expected to work
properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural
optimization using these magic generic functions. In the concrete
case, this PR introduces the `KeyValue` interface which consists
of two generic functions, `get` and `set`. The core optimization
is that the compiler is allowed to rewrite any occurrence of
`get(set(x, k, v), k)` into `v` without additional legality checks.
In particular, the compiler performs no type checks, conversions, etc.
The higher level implementation code is expected to do all that.

This approach closely matches the general direction we've been taking
in external AbstractInterpreters for embedding additional semantics
and optimization opportunities into Julia code (although we generally
use methods there, rather than full generic functions), so I think
we have some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is
explicitly declared unstable.

This is fully working and implemented, but the optimization currently
bails on anything but the simplest cases. Filling all those cases in
is not particularly hard, but should be done along with a more invasive
refactoring of SROA, so we should figure out the general direction
here first and then we can finish all that up in a follow-up cleanup.

Before:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto #5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto #4 if not %13
3 ─       goto #5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto #6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto #7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto #8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64
└──       return %29
```

After:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1
```
Keno added a commit that referenced this issue Nov 20, 2023
This is part of the work to address #51352 by attempting to allow
the compiler to perform SRAO on persistent data structures like
`PersistentDict` as if they were regular immutable data structures.
These sorts of data structures have very complicated internals
(with lots of mutation, memory sharing, etc.), but a relatively
simple interface. As such, it is unlikely that our compiler will
have sufficient power to optimize this interface by analyzing
the implementation.

We thus need to come up with some other mechanism that gives the
compiler license to perform the requisite optimization. One way
would be to just hardcode `PersistentDict` into the compiler,
optimizing it like any of the other builtin datatypes. However,
this is of course very unsatisfying. At the other end of the
spectrum would be something like a generic rewrite rule system
(e-graphs anyone?) that would let the PersistentDict
implementation declare its interface to the compiler and the
compiler would use this for optimization (in a perfect world,
the actual rewrite would then be checked using some sort of
formal methods). I think that would be interesting, but we're
very far from even being able to design something like that
(at least in Base - experiments with external AbstractInterpreters
in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where
the compiler gets some knowledge of the protocol hardcoded without
having to know about the implementation details of the data structure.

The basic ideas is that `Core` provides some magic generic functions
that implementations can extend. Semantically, they are not special.
They dispatch as usual, and implementations are expected to work
properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural
optimization using these magic generic functions. In the concrete
case, this PR introduces the `KeyValue` interface which consists
of two generic functions, `get` and `set`. The core optimization
is that the compiler is allowed to rewrite any occurrence of
`get(set(x, k, v), k)` into `v` without additional legality checks.
In particular, the compiler performs no type checks, conversions, etc.
The higher level implementation code is expected to do all that.

This approach closely matches the general direction we've been taking
in external AbstractInterpreters for embedding additional semantics
and optimization opportunities into Julia code (although we generally
use methods there, rather than full generic functions), so I think
we have some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is
explicitly declared unstable.

This is fully working and implemented, but the optimization currently
bails on anything but the simplest cases. Filling all those cases in
is not particularly hard, but should be done along with a more invasive
refactoring of SROA, so we should figure out the general direction
here first and then we can finish all that up in a follow-up cleanup.

Before:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto #5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto #4 if not %13
3 ─       goto #5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto #6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto #7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto #8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64
└──       return %29
```

After:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1
```
Keno added a commit that referenced this issue Nov 26, 2023
This is part of the work to address #51352 by attempting to allow
the compiler to perform SRAO on persistent data structures like
`PersistentDict` as if they were regular immutable data structures.
These sorts of data structures have very complicated internals
(with lots of mutation, memory sharing, etc.), but a relatively
simple interface. As such, it is unlikely that our compiler will
have sufficient power to optimize this interface by analyzing
the implementation.

We thus need to come up with some other mechanism that gives the
compiler license to perform the requisite optimization. One way
would be to just hardcode `PersistentDict` into the compiler,
optimizing it like any of the other builtin datatypes. However,
this is of course very unsatisfying. At the other end of the
spectrum would be something like a generic rewrite rule system
(e-graphs anyone?) that would let the PersistentDict
implementation declare its interface to the compiler and the
compiler would use this for optimization (in a perfect world,
the actual rewrite would then be checked using some sort of
formal methods). I think that would be interesting, but we're
very far from even being able to design something like that
(at least in Base - experiments with external AbstractInterpreters
in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where
the compiler gets some knowledge of the protocol hardcoded without
having to know about the implementation details of the data structure.

The basic ideas is that `Core` provides some magic generic functions
that implementations can extend. Semantically, they are not special.
They dispatch as usual, and implementations are expected to work
properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural
optimization using these magic generic functions. In the concrete
case, this PR introduces the `KeyValue` interface which consists
of two generic functions, `get` and `set`. The core optimization
is that the compiler is allowed to rewrite any occurrence of
`get(set(x, k, v), k)` into `v` without additional legality checks.
In particular, the compiler performs no type checks, conversions, etc.
The higher level implementation code is expected to do all that.

This approach closely matches the general direction we've been taking
in external AbstractInterpreters for embedding additional semantics
and optimization opportunities into Julia code (although we generally
use methods there, rather than full generic functions), so I think
we have some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is
explicitly declared unstable.

This is fully working and implemented, but the optimization currently
bails on anything but the simplest cases. Filling all those cases in
is not particularly hard, but should be done along with a more invasive
refactoring of SROA, so we should figure out the general direction
here first and then we can finish all that up in a follow-up cleanup.

Before:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto #5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto #4 if not %13
3 ─       goto #5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto #6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto #7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto #8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64
└──       return %29
```

After:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1
```
Keno added a commit that referenced this issue Nov 27, 2023
This is part of the work to address #51352 by attempting to allow the
compiler to perform SRAO on persistent data structures like
`PersistentDict` as if they were regular immutable data structures.
These sorts of data structures have very complicated internals (with
lots of mutation, memory sharing, etc.), but a relatively simple
interface. As such, it is unlikely that our compiler will have
sufficient power to optimize this interface by analyzing the
implementation.

We thus need to come up with some other mechanism that gives the
compiler license to perform the requisite optimization. One way would be
to just hardcode `PersistentDict` into the compiler, optimizing it like
any of the other builtin datatypes. However, this is of course very
unsatisfying. At the other end of the spectrum would be something like a
generic rewrite rule system (e-graphs anyone?) that would let the
PersistentDict implementation declare its interface to the compiler and
the compiler would use this for optimization (in a perfect world, the
actual rewrite would then be checked using some sort of formal methods).
I think that would be interesting, but we're very far from even being
able to design something like that (at least in Base - experiments with
external AbstractInterpreters in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where the
compiler gets some knowledge of the protocol hardcoded without having to
know about the implementation details of the data structure.

The basic ideas is that `Core` provides some magic generic functions
that implementations can extend. Semantically, they are not special.
They dispatch as usual, and implementations are expected to work
properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural
optimization using these magic generic functions. In the concrete case,
this PR introduces the `KeyValue` interface which consists of two
generic functions, `get` and `set`. The core optimization is that the
compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)`
into `v` without additional legality checks. In particular, the compiler
performs no type checks, conversions, etc. The higher level
implementation code is expected to do all that.

This approach closely matches the general direction we've been taking in
external AbstractInterpreters for embedding additional semantics and
optimization opportunities into Julia code (although we generally use
methods there, rather than full generic functions), so I think we have
some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is
explicitly declared unstable.

## Current Status

This is fully working and implemented, but the optimization currently
bails on anything but the simplest cases. Filling all those cases in is
not particularly hard, but should be done along with a more invasive
refactoring of SROA, so we should figure out the general direction here
first and then we can finish all that up in a follow-up cleanup.

## Obligatory benchmark
Before:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto #5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto #4 if not %13
3 ─       goto #5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto #6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto #7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto #8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64
└──       return %29
```

After:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1
```
mkitti pushed a commit to mkitti/julia that referenced this issue Dec 9, 2023
This is part of the work to address JuliaLang#51352 by attempting to allow the
compiler to perform SRAO on persistent data structures like
`PersistentDict` as if they were regular immutable data structures.
These sorts of data structures have very complicated internals (with
lots of mutation, memory sharing, etc.), but a relatively simple
interface. As such, it is unlikely that our compiler will have
sufficient power to optimize this interface by analyzing the
implementation.

We thus need to come up with some other mechanism that gives the
compiler license to perform the requisite optimization. One way would be
to just hardcode `PersistentDict` into the compiler, optimizing it like
any of the other builtin datatypes. However, this is of course very
unsatisfying. At the other end of the spectrum would be something like a
generic rewrite rule system (e-graphs anyone?) that would let the
PersistentDict implementation declare its interface to the compiler and
the compiler would use this for optimization (in a perfect world, the
actual rewrite would then be checked using some sort of formal methods).
I think that would be interesting, but we're very far from even being
able to design something like that (at least in Base - experiments with
external AbstractInterpreters in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where the
compiler gets some knowledge of the protocol hardcoded without having to
know about the implementation details of the data structure.

The basic ideas is that `Core` provides some magic generic functions
that implementations can extend. Semantically, they are not special.
They dispatch as usual, and implementations are expected to work
properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural
optimization using these magic generic functions. In the concrete case,
this PR introduces the `KeyValue` interface which consists of two
generic functions, `get` and `set`. The core optimization is that the
compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)`
into `v` without additional legality checks. In particular, the compiler
performs no type checks, conversions, etc. The higher level
implementation code is expected to do all that.

This approach closely matches the general direction we've been taking in
external AbstractInterpreters for embedding additional semantics and
optimization opportunities into Julia code (although we generally use
methods there, rather than full generic functions), so I think we have
some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is
explicitly declared unstable.

## Current Status

This is fully working and implemented, but the optimization currently
bails on anything but the simplest cases. Filling all those cases in is
not particularly hard, but should be done along with a more invasive
refactoring of SROA, so we should figure out the general direction here
first and then we can finish all that up in a follow-up cleanup.

## Obligatory benchmark
Before:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto JuliaLang#5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto JuliaLang#4 if not %13
3 ─       goto JuliaLang#5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto JuliaLang#6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto JuliaLang#7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto JuliaLang#8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64
└──       return %29
```

After:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1
```
NHDaly added a commit that referenced this issue May 22, 2024
Pass the types to the allocator functions.

-------

Before this PR, we were missing the types for allocations in two cases:

1. allocations from codegen
2. allocations in `gc_managed_realloc_`

The second one is easy: those are always used for buffers, right?

For the first one: we extend the allocation functions called from
codegen, to take the type as a parameter, and set the tag there.

I kept the old interfaces around, since I think that they cannot be
removed due to supporting legacy code?

------

An example of the generated code:
```julia
  %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2
  %13 = bitcast {}*** %ptls_field6 to i8**
  %ptls_load78 = load i8*, i8** %13, align 8
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7
```

Fixes #43688.
Fixes #45268.

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
Keno pushed a commit that referenced this issue Jun 5, 2024
aviatesk added a commit that referenced this issue Oct 1, 2024
E.g. this allows `finalizer` inlining in the following case:
```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end
const foreign_buffer_finalized = Ref(false)
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    return finalizer(obj) do obj
        Base.@assume_effects :notaskstate :nothrow
        foreign_buffer_finalized[] = true
        Libc.free(obj.ptr)
    end
end
function f_EA_finalizer(N::Int)
    workspace = foreign_alloc(Float64, N)
    GC.@preserve workspace begin
        (;ptr) = workspace
        Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr)
    end
end
```
```julia
julia> @code_typed f_EA_finalizer(42)
CodeInfo(
1 ── %1  = Base.mul_int(8, N)::Int64
│    %2  = Core.lshr_int(%1, 63)::Int64
│    %3  = Core.trunc_int(Core.UInt8, %2)::UInt8
│    %4  = Core.eq_int(%3, 0x01)::Bool
└───       goto #3 if not %4
2 ──       invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{}
└───       unreachable
3 ──       goto #4
4 ── %9  = Core.bitcast(Core.UInt64, %1)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing}
└───       goto #9
9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64}
│    %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64}
└───       goto #10
10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17)))
│    %20 = Base.getfield(%17, :ptr)::Ptr{Float64}
│          invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing
│          $(Expr(:gc_preserve_end, :(%19)))
│    %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool}
│          Base.setfield!(%23, :x, true)::Bool
│    %25 = Base.getfield(%17, :ptr)::Ptr{Float64}
│    %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing
└───       return nothing
) => Nothing
```

However, this is still a WIP. Before merging, I want to improve EA's
precision a bit and at least fix the test case that is currently marked as
`broken`. I also need to check its impact on compiler performance.

Additionally, I believe this feature is not yet practical.
In particular, there is still significant room for improvement in the
following areas:
- EA's interprocedural capabilities: currently EA is performed ad-hoc
  for limited frames because of latency reasons, which significantly
  reduces its precision in the presence of interprocedural calls.
- Relaxing the `:nothrow` check for finalizer inlining: the current
  algorithm requires `:nothrow`-ness on all paths from the allocation of
  the mutable struct to its last use, which is not practical for
  real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary
  optimizations such as inserting a `finalize` call after the last use
  might still be possible.
aviatesk added a commit that referenced this issue Oct 1, 2024
E.g. this allows `finalizer` inlining in the following case:
```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end
const foreign_buffer_finalized = Ref(false)
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    return finalizer(obj) do obj
        Base.@assume_effects :notaskstate :nothrow
        foreign_buffer_finalized[] = true
        Libc.free(obj.ptr)
    end
end
function f_EA_finalizer(N::Int)
    workspace = foreign_alloc(Float64, N)
    GC.@preserve workspace begin
        (;ptr) = workspace
        Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr)
    end
end
```
```julia
julia> @code_typed f_EA_finalizer(42)
CodeInfo(
1 ── %1  = Base.mul_int(8, N)::Int64
│    %2  = Core.lshr_int(%1, 63)::Int64
│    %3  = Core.trunc_int(Core.UInt8, %2)::UInt8
│    %4  = Core.eq_int(%3, 0x01)::Bool
└───       goto #3 if not %4
2 ──       invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{}
└───       unreachable
3 ──       goto #4
4 ── %9  = Core.bitcast(Core.UInt64, %1)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing}
└───       goto #9
9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64}
│    %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64}
└───       goto #10
10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17)))
│    %20 = Base.getfield(%17, :ptr)::Ptr{Float64}
│          invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing
│          $(Expr(:gc_preserve_end, :(%19)))
│    %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool}
│          Base.setfield!(%23, :x, true)::Bool
│    %25 = Base.getfield(%17, :ptr)::Ptr{Float64}
│    %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing
└───       return nothing
) => Nothing
```

However, this is still a WIP. Before merging, I want to improve EA's
precision a bit and at least fix the test case that is currently marked as
`broken`. I also need to check its impact on compiler performance.

Additionally, I believe this feature is not yet practical.
In particular, there is still significant room for improvement in the
following areas:
- EA's interprocedural capabilities: currently EA is performed ad-hoc
  for limited frames because of latency reasons, which significantly
  reduces its precision in the presence of interprocedural calls.
- Relaxing the `:nothrow` check for finalizer inlining: the current
  algorithm requires `:nothrow`-ness on all paths from the allocation of
  the mutable struct to its last use, which is not practical for
  real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary
  optimizations such as inserting a `finalize` call after the last use
  might still be possible.
aviatesk added a commit that referenced this issue Oct 2, 2024
E.g. this allows `finalizer` inlining in the following case:
```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end
const foreign_buffer_finalized = Ref(false)
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    return finalizer(obj) do obj
        Base.@assume_effects :notaskstate :nothrow
        foreign_buffer_finalized[] = true
        Libc.free(obj.ptr)
    end
end
function f_EA_finalizer(N::Int)
    workspace = foreign_alloc(Float64, N)
    GC.@preserve workspace begin
        (;ptr) = workspace
        Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr)
    end
end
```
```julia
julia> @code_typed f_EA_finalizer(42)
CodeInfo(
1 ── %1  = Base.mul_int(8, N)::Int64
│    %2  = Core.lshr_int(%1, 63)::Int64
│    %3  = Core.trunc_int(Core.UInt8, %2)::UInt8
│    %4  = Core.eq_int(%3, 0x01)::Bool
└───       goto #3 if not %4
2 ──       invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{}
└───       unreachable
3 ──       goto #4
4 ── %9  = Core.bitcast(Core.UInt64, %1)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing}
└───       goto #9
9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64}
│    %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64}
└───       goto #10
10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17)))
│    %20 = Base.getfield(%17, :ptr)::Ptr{Float64}
│          invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing
│          $(Expr(:gc_preserve_end, :(%19)))
│    %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool}
│          Base.setfield!(%23, :x, true)::Bool
│    %25 = Base.getfield(%17, :ptr)::Ptr{Float64}
│    %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing
└───       return nothing
) => Nothing
```

However, this is still a WIP. Before merging, I want to improve EA's
precision a bit and at least fix the test case that is currently marked as
`broken`. I also need to check its impact on compiler performance.

Additionally, I believe this feature is not yet practical.
In particular, there is still significant room for improvement in the
following areas:
- EA's interprocedural capabilities: currently EA is performed ad-hoc
  for limited frames because of latency reasons, which significantly
  reduces its precision in the presence of interprocedural calls.
- Relaxing the `:nothrow` check for finalizer inlining: the current
  algorithm requires `:nothrow`-ness on all paths from the allocation of
  the mutable struct to its last use, which is not practical for
  real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary
  optimizations such as inserting a `finalize` call after the last use
  might still be possible.
aviatesk added a commit that referenced this issue Oct 2, 2024
E.g. this allows `finalizer` inlining in the following case:
```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end
const foreign_buffer_finalized = Ref(false)
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    return finalizer(obj) do obj
        Base.@assume_effects :notaskstate :nothrow
        foreign_buffer_finalized[] = true
        Libc.free(obj.ptr)
    end
end
function f_EA_finalizer(N::Int)
    workspace = foreign_alloc(Float64, N)
    GC.@preserve workspace begin
        (;ptr) = workspace
        Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr)
    end
end
```
```julia
julia> @code_typed f_EA_finalizer(42)
CodeInfo(
1 ── %1  = Base.mul_int(8, N)::Int64
│    %2  = Core.lshr_int(%1, 63)::Int64
│    %3  = Core.trunc_int(Core.UInt8, %2)::UInt8
│    %4  = Core.eq_int(%3, 0x01)::Bool
└───       goto #3 if not %4
2 ──       invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{}
└───       unreachable
3 ──       goto #4
4 ── %9  = Core.bitcast(Core.UInt64, %1)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing}
└───       goto #9
9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64}
│    %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64}
└───       goto #10
10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17)))
│    %20 = Base.getfield(%17, :ptr)::Ptr{Float64}
│          invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing
│          $(Expr(:gc_preserve_end, :(%19)))
│    %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool}
│          Base.setfield!(%23, :x, true)::Bool
│    %25 = Base.getfield(%17, :ptr)::Ptr{Float64}
│    %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing
└───       return nothing
) => Nothing
```

However, this is still a WIP. Before merging, I want to improve EA's
precision a bit and at least fix the test case that is currently marked as
`broken`. I also need to check its impact on compiler performance.

Additionally, I believe this feature is not yet practical.
In particular, there is still significant room for improvement in the
following areas:
- EA's interprocedural capabilities: currently EA is performed ad-hoc
  for limited frames because of latency reasons, which significantly
  reduces its precision in the presence of interprocedural calls.
- Relaxing the `:nothrow` check for finalizer inlining: the current
  algorithm requires `:nothrow`-ness on all paths from the allocation of
  the mutable struct to its last use, which is not practical for
  real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary
  optimizations such as inserting a `finalize` call after the last use
  might still be possible.
aviatesk added a commit that referenced this issue Oct 2, 2024
E.g. this allows `finalizer` inlining in the following case:
```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end
const foreign_buffer_finalized = Ref(false)
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    return finalizer(obj) do obj
        Base.@assume_effects :notaskstate :nothrow
        foreign_buffer_finalized[] = true
        Libc.free(obj.ptr)
    end
end
function f_EA_finalizer(N::Int)
    workspace = foreign_alloc(Float64, N)
    GC.@preserve workspace begin
        (;ptr) = workspace
        Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr)
    end
end
```
```julia
julia> @code_typed f_EA_finalizer(42)
CodeInfo(
1 ── %1  = Base.mul_int(8, N)::Int64
│    %2  = Core.lshr_int(%1, 63)::Int64
│    %3  = Core.trunc_int(Core.UInt8, %2)::UInt8
│    %4  = Core.eq_int(%3, 0x01)::Bool
└───       goto #3 if not %4
2 ──       invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{}
└───       unreachable
3 ──       goto #4
4 ── %9  = Core.bitcast(Core.UInt64, %1)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing}
└───       goto #9
9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64}
│    %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64}
└───       goto #10
10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17)))
│    %20 = Base.getfield(%17, :ptr)::Ptr{Float64}
│          invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing
│          $(Expr(:gc_preserve_end, :(%19)))
│    %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool}
│          Base.setfield!(%23, :x, true)::Bool
│    %25 = Base.getfield(%17, :ptr)::Ptr{Float64}
│    %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing
└───       return nothing
) => Nothing
```

However, this is still a WIP. Before merging, I want to improve EA's
precision a bit and at least fix the test case that is currently marked as
`broken`. I also need to check its impact on compiler performance.

Additionally, I believe this feature is not yet practical.
In particular, there is still significant room for improvement in the
following areas:
- EA's interprocedural capabilities: currently EA is performed ad-hoc
  for limited frames because of latency reasons, which significantly
  reduces its precision in the presence of interprocedural calls.
- Relaxing the `:nothrow` check for finalizer inlining: the current
  algorithm requires `:nothrow`-ness on all paths from the allocation of
  the mutable struct to its last use, which is not practical for
  real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary
  optimizations such as inserting a `finalize` call after the last use
  might still be possible.
aviatesk added a commit that referenced this issue Oct 4, 2024
E.g. this allows `finalizer` inlining in the following case:
```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end
const foreign_buffer_finalized = Ref(false)
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    return finalizer(obj) do obj
        Base.@assume_effects :notaskstate :nothrow
        foreign_buffer_finalized[] = true
        Libc.free(obj.ptr)
    end
end
function f_EA_finalizer(N::Int)
    workspace = foreign_alloc(Float64, N)
    GC.@preserve workspace begin
        (;ptr) = workspace
        Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr)
    end
end
```
```julia
julia> @code_typed f_EA_finalizer(42)
CodeInfo(
1 ── %1  = Base.mul_int(8, N)::Int64
│    %2  = Core.lshr_int(%1, 63)::Int64
│    %3  = Core.trunc_int(Core.UInt8, %2)::UInt8
│    %4  = Core.eq_int(%3, 0x01)::Bool
└───       goto #3 if not %4
2 ──       invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{}
└───       unreachable
3 ──       goto #4
4 ── %9  = Core.bitcast(Core.UInt64, %1)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing}
└───       goto #9
9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64}
│    %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64}
└───       goto #10
10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17)))
│    %20 = Base.getfield(%17, :ptr)::Ptr{Float64}
│          invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing
│          $(Expr(:gc_preserve_end, :(%19)))
│    %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool}
│          Base.setfield!(%23, :x, true)::Bool
│    %25 = Base.getfield(%17, :ptr)::Ptr{Float64}
│    %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing
└───       return nothing
) => Nothing
```

However, this is still a WIP. Before merging, I want to improve EA's
precision a bit and at least fix the test case that is currently marked as
`broken`. I also need to check its impact on compiler performance.

Additionally, I believe this feature is not yet practical.
In particular, there is still significant room for improvement in the
following areas:
- EA's interprocedural capabilities: currently EA is performed ad-hoc
  for limited frames because of latency reasons, which significantly
  reduces its precision in the presence of interprocedural calls.
- Relaxing the `:nothrow` check for finalizer inlining: the current
  algorithm requires `:nothrow`-ness on all paths from the allocation of
  the mutable struct to its last use, which is not practical for
  real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary
  optimizations such as inserting a `finalize` call after the last use
  might still be possible.
aviatesk added a commit that referenced this issue Oct 4, 2024
E.g. this allows `finalizer` inlining in the following case:
```julia
mutable struct ForeignBuffer{T}
    const ptr::Ptr{T}
end
const foreign_buffer_finalized = Ref(false)
function foreign_alloc(::Type{T}, length) where T
    ptr = Libc.malloc(sizeof(T) * length)
    ptr = Base.unsafe_convert(Ptr{T}, ptr)
    obj = ForeignBuffer{T}(ptr)
    return finalizer(obj) do obj
        Base.@assume_effects :notaskstate :nothrow
        foreign_buffer_finalized[] = true
        Libc.free(obj.ptr)
    end
end
function f_EA_finalizer(N::Int)
    workspace = foreign_alloc(Float64, N)
    GC.@preserve workspace begin
        (;ptr) = workspace
        Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr)
    end
end
```
```julia
julia> @code_typed f_EA_finalizer(42)
CodeInfo(
1 ── %1  = Base.mul_int(8, N)::Int64
│    %2  = Core.lshr_int(%1, 63)::Int64
│    %3  = Core.trunc_int(Core.UInt8, %2)::UInt8
│    %4  = Core.eq_int(%3, 0x01)::Bool
└───       goto #3 if not %4
2 ──       invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{}
└───       unreachable
3 ──       goto #4
4 ── %9  = Core.bitcast(Core.UInt64, %1)::UInt64
└───       goto #5
5 ──       goto #6
6 ──       goto #7
7 ──       goto #8
8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing}
└───       goto #9
9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64}
│    %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64}
└───       goto #10
10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17)))
│    %20 = Base.getfield(%17, :ptr)::Ptr{Float64}
│          invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing
│          $(Expr(:gc_preserve_end, :(%19)))
│    %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool}
│          Base.setfield!(%23, :x, true)::Bool
│    %25 = Base.getfield(%17, :ptr)::Ptr{Float64}
│    %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing}
│          $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing
└───       return nothing
) => Nothing
```

However, this is still a WIP. Before merging, I want to improve EA's
precision a bit and at least fix the test case that is currently marked as
`broken`. I also need to check its impact on compiler performance.

Additionally, I believe this feature is not yet practical.
In particular, there is still significant room for improvement in the
following areas:
- EA's interprocedural capabilities: currently EA is performed ad-hoc
  for limited frames because of latency reasons, which significantly
  reduces its precision in the presence of interprocedural calls.
- Relaxing the `:nothrow` check for finalizer inlining: the current
  algorithm requires `:nothrow`-ness on all paths from the allocation of
  the mutable struct to its last use, which is not practical for
  real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary
  optimizations such as inserting a `finalize` call after the last use
  might still be possible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants