Skip to content

Commit

Permalink
Allocation Profiler: Types for all allocations (#50337) (#34)
Browse files Browse the repository at this point in the history
Backports PR #50337 for RAI julia v1.9.2

Original description:

===================

Pass the types to the allocator functions.

-------

Before this PR, we were missing the types for allocations in two cases:

1. allocations from codegen
2. allocations in `gc_managed_realloc_`

The second one is easy: those are always used for buffers, right?

For the first one: we extend the allocation functions called from
codegen, to take the type as a parameter, and set the tag there.

I kept the old interfaces around, since I think that they cannot be
removed due to supporting legacy code?

------

An example of the generated code:
```julia
  %ptls_field6 = getelementptr inbounds {}**, {}*** %4, i64 2
  %13 = bitcast {}*** %ptls_field6 to i8**
  %ptls_load78 = load i8*, i8** %13, align 8
  %box = call noalias nonnull dereferenceable(32) {}* @ijl_gc_pool_alloc_typed(i8* %ptls_load78, i32 1184, i32 32, i64 4366152144) #7
```

Fixes #43688.
Fixes #45268.

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
  • Loading branch information
NHDaly and vchuravy authored Aug 31, 2023
1 parent 97251d3 commit cbb076d
Show file tree
Hide file tree
Showing 13 changed files with 242 additions and 53 deletions.
117 changes: 116 additions & 1 deletion doc/src/manual/profile.md
Original file line number Diff line number Diff line change
Expand Up @@ -310,7 +310,122 @@ the amount of memory allocated by each line of code.

### Line-by-Line Allocation Tracking

To measure allocation line-by-line, start Julia with the `--track-allocation=<setting>` command-line
While [`@time`](@ref) logs high-level stats about memory usage and garbage collection over the course
of evaluating an expression, it can be useful to log each garbage collection event, to get an
intuitive sense of how often the garbage collector is running, how long it's running each time,
and how much garbage it collects each time. This can be enabled with
[`GC.enable_logging(true)`](@ref), which causes Julia to log to stderr every time
a garbage collection happens.

### [Allocation Profiler](@id allocation-profiler)

!!! compat "Julia 1.8"
This functionality requires at least Julia 1.8.

The allocation profiler records the stack trace, type, and size of each
allocation while it is running. It can be invoked with
[`Profile.Allocs.@profile`](@ref).

This information about the allocations is returned as an array of `Alloc`
objects, wrapped in an `AllocResults` object. The best way to visualize these is
currently with the [PProf.jl](https://github.com/JuliaPerf/PProf.jl) and
[ProfileCanvas.jl](https://github.com/pfitzseb/ProfileCanvas.jl) packages, which
can visualize the call stacks which are making the most allocations.

The allocation profiler does have significant overhead, so a `sample_rate`
argument can be passed to speed it up by making it skip some allocations.
Passing `sample_rate=1.0` will make it record everything (which is slow);
`sample_rate=0.1` will record only 10% of the allocations (faster), etc.

!!! compat "Julia 1.11"

Older versions of Julia could not capture types in all cases. In older versions of
Julia, if you see an allocation of type `Profile.Allocs.UnknownType`, it means that
the profiler doesn't know what type of object was allocated. This mainly happened when
the allocation was coming from generated code produced by the compiler. See
[issue #43688](https://github.com/JuliaLang/julia/issues/43688) for more info.

Since Julia 1.11, all allocations should have a type reported.

For more details on how to use this tool, please see the following talk from JuliaCon 2022:
https://www.youtube.com/watch?v=BFvpwC8hEWQ

##### Allocation Profiler Example

In this simple example, we use PProf to visualize the alloc profile. You could use another
visualization tool instead. We collect the profile (specifying a sample rate), then we visualize it.
```julia
using Profile, PProf
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.0001 my_function()
PProf.Allocs.pprof()
```

Here is a more in-depth example, showing how we can tune the sample rate. A
good number of samples to aim for is around 1 - 10 thousand. Too many, and the
profile visualizer can get overwhelmed, and profiling will be slow. Too few,
and you don't have a representative sample.


```julia-repl
julia> import Profile
julia> @time my_function() # Estimate allocations from a (second-run) of the function
0.110018 seconds (1.50 M allocations: 58.725 MiB, 17.17% gc time)
500000
julia> Profile.Allocs.clear()
julia> Profile.Allocs.@profile sample_rate=0.001 begin # 1.5 M * 0.001 = ~1.5K allocs.
my_function()
end
500000
julia> prof = Profile.Allocs.fetch(); # If you want, you can also manually inspect the results.
julia> length(prof.allocs) # Confirm we have expected number of allocations.
1515
julia> using PProf # Now, visualize with an external tool, like PProf or ProfileCanvas.
julia> PProf.Allocs.pprof(prof; from_c=false) # You can optionally pass in a previously fetched profile result.
Analyzing 1515 allocation samples... 100%|████████████████████████████████| Time: 0:00:00
Main binary filename not available.
Serving web UI on http://localhost:62261
"alloc-profile.pb.gz"
```
Then you can view the profile by navigating to http://localhost:62261, and the profile is saved to disk.
See PProf package for more options.

##### Allocation Profiling Tips

As stated above, aim for around 1-10 thousand samples in your profile.

Note that we are uniformly sampling in the space of _all allocations_, and are not weighting
our samples by the size of the allocation. So a given allocation profile may not give a
representative profile of where most bytes are allocated in your program, unless you had set
`sample_rate=1`.

Allocations can come from users directly constructing objects, but can also come from inside
the runtime or be inserted into compiled code to handle type instability. Looking at the
"source code" view can be helpful to isolate them, and then other external tools such as
[`Cthulhu.jl`](https://github.com/JuliaDebug/Cthulhu.jl) can be useful for identifying the
cause of the allocation.

##### Allocation Profile Visualization Tools

There are several profiling visualization tools now that can all display Allocation
Profiles. Here is a small list of some of the main ones we know about:
- [PProf.jl](https://github.com/JuliaPerf/PProf.jl)
- [ProfileCanvas.jl](https://github.com/pfitzseb/ProfileCanvas.jl)
- VSCode's built-in profile visualizer (`@profview_allocs`) [docs needed]
- Viewing the results directly in the REPL
- You can inspect the results in the REPL via [`Profile.Allocs.fetch()`](@ref), to view
the stacktrace and type of each allocation.

#### Line-by-Line Allocation Tracking

An alternative way to measure allocations is to start Julia with the `--track-allocation=<setting>` command-line
option, for which you can choose `none` (the default, do not measure allocation), `user` (measure
memory allocation everywhere except Julia's core code), or `all` (measure memory allocation at
each line of Julia code). Allocation gets measured for each line of compiled code. When you quit
Expand Down
1 change: 1 addition & 0 deletions src/gc-alloc-profiler.h
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ void _maybe_record_alloc_to_profile(jl_value_t *val, size_t size, jl_datatype_t

extern int g_alloc_profile_enabled;

// This should only be used from _deprecated_ code paths. We shouldn't see UNKNOWN anymore.
#define jl_gc_unknown_type_tag ((jl_datatype_t*)0xdeadaa03)

static inline void maybe_record_alloc_to_profile(jl_value_t *val, size_t size, jl_datatype_t *typ) JL_NOTSAFEPOINT {
Expand Down
24 changes: 21 additions & 3 deletions src/gc.c
Original file line number Diff line number Diff line change
Expand Up @@ -1177,14 +1177,21 @@ static inline jl_value_t *jl_gc_big_alloc_inner(jl_ptls_t ptls, size_t sz)
return jl_valueof(&v->header);
}

// Instrumented version of jl_gc_big_alloc_inner, called into by LLVM-generated code.
// Deprecated version, supported for legacy code.
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc(jl_ptls_t ptls, size_t sz)
{
jl_value_t *val = jl_gc_big_alloc_inner(ptls, sz);

maybe_record_alloc_to_profile(val, sz, jl_gc_unknown_type_tag);
return val;
}
// Instrumented version of jl_gc_big_alloc_inner, called into by LLVM-generated code.
JL_DLLEXPORT jl_value_t *jl_gc_big_alloc_instrumented(jl_ptls_t ptls, size_t sz, jl_value_t *type)
{
jl_value_t *val = jl_gc_big_alloc_inner(ptls, sz);
maybe_record_alloc_to_profile(val, sz, (jl_datatype_t*)type);
return val;
}

// This wrapper exists only to prevent `jl_gc_big_alloc_inner` from being inlined into
// its callers. We provide an external-facing interface for callers, and inline `jl_gc_big_alloc_inner`
Expand Down Expand Up @@ -1492,7 +1499,7 @@ static inline jl_value_t *jl_gc_pool_alloc_inner(jl_ptls_t ptls, int pool_offset
return jl_valueof(v);
}

// Instrumented version of jl_gc_pool_alloc_inner, called into by LLVM-generated code.
// Deprecated version, supported for legacy code.
JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc(jl_ptls_t ptls, int pool_offset,
int osize)
{
Expand All @@ -1501,6 +1508,14 @@ JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc(jl_ptls_t ptls, int pool_offset,
maybe_record_alloc_to_profile(val, osize, jl_gc_unknown_type_tag);
return val;
}
// Instrumented version of jl_gc_pool_alloc_inner, called into by LLVM-generated code.
JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc_instrumented(jl_ptls_t ptls, int pool_offset,
int osize, jl_value_t* type)
{
jl_value_t *val = jl_gc_pool_alloc_inner(ptls, pool_offset, osize);
maybe_record_alloc_to_profile(val, osize, (jl_datatype_t*)type);
return val;
}

// This wrapper exists only to prevent `jl_gc_pool_alloc_inner` from being inlined into
// its callers. We provide an external-facing interface for callers, and inline `jl_gc_pool_alloc_inner`
Expand Down Expand Up @@ -4052,7 +4067,10 @@ static void *gc_managed_realloc_(jl_ptls_t ptls, void *d, size_t sz, size_t olds
SetLastError(last_error);
#endif
errno = last_errno;
maybe_record_alloc_to_profile((jl_value_t*)b, sz, jl_gc_unknown_type_tag);
// gc_managed_realloc_ is currently used exclusively for resizing array buffers.
if (allocsz > oldsz) {
maybe_record_alloc_to_profile((jl_value_t*)b, allocsz - oldsz, (jl_datatype_t*)jl_buff_tag);
}
return b;
}

Expand Down
2 changes: 2 additions & 0 deletions src/jl_exported_funcs.inc
Original file line number Diff line number Diff line change
Expand Up @@ -160,6 +160,7 @@
XX(jl_gc_alloc_3w) \
XX(jl_gc_alloc_typed) \
XX(jl_gc_big_alloc) \
XX(jl_gc_big_alloc_instrumented) \
XX(jl_gc_collect) \
XX(jl_gc_conservative_gc_support_enabled) \
XX(jl_gc_counted_calloc) \
Expand All @@ -186,6 +187,7 @@
XX(jl_gc_new_weakref_th) \
XX(jl_gc_num) \
XX(jl_gc_pool_alloc) \
XX(jl_gc_pool_alloc_instrumented) \
XX(jl_gc_queue_multiroot) \
XX(jl_gc_queue_root) \
XX(jl_gc_safepoint) \
Expand Down
9 changes: 5 additions & 4 deletions src/llvm-final-gc-lowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -205,12 +205,13 @@ Value *FinalLowerGC::lowerQueueGCBinding(CallInst *target, Function &F)
Value *FinalLowerGC::lowerGCAllocBytes(CallInst *target, Function &F)
{
++GCAllocBytesCount;
assert(target->arg_size() == 2);
assert(target->arg_size() == 3);
CallInst *newI;

IRBuilder<> builder(target);
builder.SetCurrentDebugLocation(target->getDebugLoc());
auto ptls = target->getArgOperand(0);
auto type = target->getArgOperand(2);
Attribute derefAttr;

if (auto CI = dyn_cast<ConstantInt>(target->getArgOperand(1))) {
Expand All @@ -221,19 +222,19 @@ Value *FinalLowerGC::lowerGCAllocBytes(CallInst *target, Function &F)
if (offset < 0) {
newI = builder.CreateCall(
bigAllocFunc,
{ ptls, ConstantInt::get(getSizeTy(F.getContext()), sz + sizeof(void*)) });
{ ptls, ConstantInt::get(getSizeTy(F.getContext()), sz + sizeof(void*)), type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), sz + sizeof(void*));
}
else {
auto pool_offs = ConstantInt::get(Type::getInt32Ty(F.getContext()), offset);
auto pool_osize = ConstantInt::get(Type::getInt32Ty(F.getContext()), osize);
newI = builder.CreateCall(poolAllocFunc, { ptls, pool_offs, pool_osize });
newI = builder.CreateCall(poolAllocFunc, { ptls, pool_offs, pool_osize, type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), osize);
}
} else {
auto size = builder.CreateZExtOrTrunc(target->getArgOperand(1), getSizeTy(F.getContext()));
size = builder.CreateAdd(size, ConstantInt::get(getSizeTy(F.getContext()), sizeof(void*)));
newI = builder.CreateCall(allocTypedFunc, { ptls, size, ConstantPointerNull::get(Type::getInt8PtrTy(F.getContext())) });
newI = builder.CreateCall(allocTypedFunc, { ptls, size, type });
derefAttr = Attribute::getWithDereferenceableBytes(F.getContext(), sizeof(void*));
}
newI->setAttributes(newI->getCalledFunction()->getAttributes());
Expand Down
47 changes: 30 additions & 17 deletions src/llvm-late-gc-lowering.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2324,22 +2324,6 @@ bool LateLowerGCFrame::CleanupIR(Function &F, State *S, bool *CFGModified) {
IRBuilder<> builder(CI);
builder.SetCurrentDebugLocation(CI->getDebugLoc());

// Create a call to the `julia.gc_alloc_bytes` intrinsic, which is like
// `julia.gc_alloc_obj` except it doesn't set the tag.
auto allocBytesIntrinsic = getOrDeclare(jl_intrinsics::GCAllocBytes);
auto ptlsLoad = get_current_ptls_from_task(builder, CI->getArgOperand(0), tbaa_gcframe);
auto ptls = builder.CreateBitCast(ptlsLoad, Type::getInt8PtrTy(builder.getContext()));
auto newI = builder.CreateCall(
allocBytesIntrinsic,
{
ptls,
builder.CreateIntCast(
CI->getArgOperand(1),
allocBytesIntrinsic->getFunctionType()->getParamType(1),
false)
});
newI->takeName(CI);

// LLVM alignment/bit check is not happy about addrspacecast and refuse
// to remove write barrier because of it.
// We pretty much only load using `T_size` so try our best to strip
Expand Down Expand Up @@ -2378,7 +2362,36 @@ bool LateLowerGCFrame::CleanupIR(Function &F, State *S, bool *CFGModified) {
builder.CreateAlignmentAssumption(DL, tag, 16);
}
}
// Set the tag.

// Create a call to the `julia.gc_alloc_bytes` intrinsic, which is like
// `julia.gc_alloc_obj` except it specializes the call based on the constant
// size of the object to allocate, to save one indirection, and doesn't set
// the type tag. (Note that if the size is not a constant, it will call
// gc_alloc_obj, and will redundantly set the tag.)
auto allocBytesIntrinsic = getOrDeclare(jl_intrinsics::GCAllocBytes);
auto ptlsLoad = get_current_ptls_from_task(builder, CI->getArgOperand(0), tbaa_gcframe);
auto ptls = builder.CreateBitCast(ptlsLoad, Type::getInt8PtrTy(builder.getContext()));
auto newI = builder.CreateCall(
allocBytesIntrinsic,
{
ptls,
builder.CreateIntCast(
CI->getArgOperand(1),
allocBytesIntrinsic->getFunctionType()->getParamType(1),
false),
builder.CreatePtrToInt(tag, T_size),
});
newI->takeName(CI);

// Now, finally, set the tag. We do this in IR instead of in the C alloc
// function, to provide possible optimization opportunities. (I think? TBH
// the most recent editor of this code is not entirely clear on why we
// prefer to set the tag in the generated code. Providing optimziation
// opportunities is the most likely reason; the tradeoff is slightly
// larger code size and increased compilation time, compiling this
// instruction at every allocation site, rather than once in the C alloc
// function.)
auto &M = *builder.GetInsertBlock()->getModule();
StoreInst *store = builder.CreateAlignedStore(
tag, EmitTagPtr(builder, tag_type, newI), Align(sizeof(size_t)));
store->setOrdering(AtomicOrdering::Unordered);
Expand Down
34 changes: 21 additions & 13 deletions src/llvm-pass-helpers.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,12 @@ namespace jl_intrinsics {
static const char *QUEUE_GC_ROOT_NAME = "julia.queue_gc_root";
static const char *QUEUE_GC_BINDING_NAME = "julia.queue_gc_binding";

static auto T_size_t(const JuliaPassContext &context) {
return sizeof(size_t) == sizeof(uint32_t) ?
Type::getInt32Ty(context.getLLVMContext()) :
Type::getInt64Ty(context.getLLVMContext());
}

// Annotates a function with attributes suitable for GC allocation
// functions. Specifically, the return value is marked noalias and nonnull.
// The allocation size is set to the first argument.
Expand Down Expand Up @@ -150,9 +156,8 @@ namespace jl_intrinsics {
FunctionType::get(
context.T_prjlvalue,
{ Type::getInt8PtrTy(context.getLLVMContext()),
sizeof(size_t) == sizeof(uint32_t) ?
Type::getInt32Ty(context.getLLVMContext()) :
Type::getInt64Ty(context.getLLVMContext()) },
T_size_t(context),
T_size_t(context) }, // type
false),
Function::ExternalLinkage,
GC_ALLOC_BYTES_NAME);
Expand Down Expand Up @@ -227,12 +232,18 @@ namespace jl_intrinsics {
}

namespace jl_well_known {
static const char *GC_BIG_ALLOC_NAME = XSTR(jl_gc_big_alloc);
static const char *GC_POOL_ALLOC_NAME = XSTR(jl_gc_pool_alloc);
static const char *GC_BIG_ALLOC_NAME = XSTR(jl_gc_big_alloc_instrumented);
static const char *GC_POOL_ALLOC_NAME = XSTR(jl_gc_pool_alloc_instrumented);
static const char *GC_QUEUE_ROOT_NAME = XSTR(jl_gc_queue_root);
static const char *GC_QUEUE_BINDING_NAME = XSTR(jl_gc_queue_binding);
static const char *GC_ALLOC_TYPED_NAME = XSTR(jl_gc_alloc_typed);

static auto T_size_t(const JuliaPassContext &context) {
return sizeof(size_t) == sizeof(uint32_t) ?
Type::getInt32Ty(context.getLLVMContext()) :
Type::getInt64Ty(context.getLLVMContext());
}

using jl_intrinsics::addGCAllocAttributes;

const WellKnownFunctionDescription GCBigAlloc(
Expand All @@ -242,9 +253,8 @@ namespace jl_well_known {
FunctionType::get(
context.T_prjlvalue,
{ Type::getInt8PtrTy(context.getLLVMContext()),
sizeof(size_t) == sizeof(uint32_t) ?
Type::getInt32Ty(context.getLLVMContext()) :
Type::getInt64Ty(context.getLLVMContext()) },
T_size_t(context),
T_size_t(context) },
false),
Function::ExternalLinkage,
GC_BIG_ALLOC_NAME);
Expand All @@ -258,7 +268,7 @@ namespace jl_well_known {
auto poolAllocFunc = Function::Create(
FunctionType::get(
context.T_prjlvalue,
{ Type::getInt8PtrTy(context.getLLVMContext()), Type::getInt32Ty(context.getLLVMContext()), Type::getInt32Ty(context.getLLVMContext()) },
{ Type::getInt8PtrTy(context.getLLVMContext()), Type::getInt32Ty(context.getLLVMContext()), Type::getInt32Ty(context.getLLVMContext()), T_size_t(context) },
false),
Function::ExternalLinkage,
GC_POOL_ALLOC_NAME);
Expand Down Expand Up @@ -301,10 +311,8 @@ namespace jl_well_known {
FunctionType::get(
context.T_prjlvalue,
{ Type::getInt8PtrTy(context.getLLVMContext()),
sizeof(size_t) == sizeof(uint32_t) ?
Type::getInt32Ty(context.getLLVMContext()) :
Type::getInt64Ty(context.getLLVMContext()),
Type::getInt8PtrTy(context.getLLVMContext()) },
T_size_t(context),
T_size_t(context) }, // type
false),
Function::ExternalLinkage,
GC_ALLOC_TYPED_NAME);
Expand Down
Loading

0 comments on commit cbb076d

Please sign in to comment.