-
Notifications
You must be signed in to change notification settings - Fork 12.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[compiler-rt][ctx_profile] Add the instrumented contextual profiling APIs #89838
Conversation
…APIs APIs for contextual profiling. (Tracking Issue: llvm#89287, RFC referenced there)
✅ With the latest revision this PR passed the C/C++ code formatter. |
} | ||
|
||
uint64_t *counters() { | ||
ContextNode *addr_after = &(this[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here and in the subContexts method and maybe in the class header about the expected layout of the counters and subContexts, since they aren't data members?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
}; | ||
|
||
/// ContextRoots are allocated by LLVM for entrypoints. LLVM is only concerned | ||
/// with allocating and zero-initializing the global value for it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "the global value for it"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in GlobalValue. fixed.
// | ||
/// TLS where LLVM stores the pointer of the called value, as part of lowering a | ||
/// llvm.instrprof.callsite | ||
extern __thread void *volatile __llvm_ctx_profile_expected_callee[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are there 2 values for the expected callee? Ditto for the __llvm_ctx_profile_callsite below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It says in the first comment under extern "C"
- position 1 is for scratch. It's so we don't pollute the values for normal collection when signal handlers happen. Normally we'd implement a stack - but we don't need to because we don't care about anything that happens when we go to sctach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only see index [0] being accessed for both __llvm_ctx_profile_expected_callee and __llvm_ctx_profile_callsite, other than in the tests. When does index [1] get used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the llvm side. See PR #90821. The runtime never looks at index 1, but having the 2 indices simplifies the lowering logic: LLVM determines the index upon getting the value from here (so in the function entry basic block) and then just writes at that index for each callsite without needing to test every time if it's scratch.
/// Used to obtain the profile. The Writer is called for each root ContextNode, | ||
/// with the ContextRoot::Taken taken. The Writer is responsible for traversing | ||
/// the structure underneath. | ||
/// The Writer's first parameter is an object it knows about, which is what the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean by "an object it knows about"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it's the closure... rephrased.
TEST_F(ContextTest, ScratchNoCollection) { | ||
EXPECT_EQ(__llvm_ctx_profile_current_context_root, nullptr); | ||
int OpaqueValue = 0; | ||
// this would be the very first function executing this. the TLS is empty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify - I assume it is scratch because start context hasn't been called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. clarified.
__llvm_ctx_profile_expected_callee[0] = &OpaqueValue; | ||
__llvm_ctx_profile_callsite[0] = &Ctx->subContexts()[2]; | ||
auto *Subctx = __llvm_ctx_profile_get_context(&OtherOpaqueValue, 2, 3, 1); | ||
EXPECT_TRUE(isScratch(Subctx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment about why
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess because you passed a different pointer to get_context, but it would be good to add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. ptal.
/// called for any other function than entry points, in the entry BB of such | ||
/// function. Same consideration about LSB of returned value as .._start_context | ||
__ctx_profile::ContextNode * | ||
__llvm_ctx_profile_get_context(void *Callee, __ctx_profile::GUID Guid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What type is Callee in practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a function address. So void *
is appropriate.
int OpaqueValue = 0; | ||
const bool IsScratch = isScratch(Ctx); | ||
EXPECT_FALSE(IsScratch); | ||
__llvm_ctx_profile_expected_callee[0] = &OpaqueValue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OpaqueValues in these tests are stand ins for callee values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. should I rename - maybe "FakeCallee"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah FakeCallee would be more self explanatory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
__llvm_ctx_profile_callsite[1] = &Subctx->subContexts()[0]; | ||
|
||
auto *Subctx2 = __llvm_ctx_profile_get_context(&ThirdOpaqueValue, 3, 0, 0); | ||
EXPECT_TRUE(isScratch(Subctx2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto about adding comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
/// | ||
/// The layout is as follows: | ||
/// | ||
/// [[statically-declared fields][counters vector][subcontexts vector]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think interpreting a subcontext ContextNode as part of the same block of memory immediately following counters of width uint64_t is undefined behaviour. In this case I think we get lucky because the first element of the ContextNode struct is a GUID (uint64_t) and thus the alignment of the ContextNode struct and the alignment of each individual counter matches.
I am far from an expert on undefined behaviour and my understanding is based on reading (5) in https://en.cppreference.com/w/cpp/language/reinterpret_cast and https://en.cppreference.com/w/c/language/object. Please feel free to correct me if you think otherwise.
I think the motivation is that the single chunk based layout simplifies the instrumentation GEP computations. A few questions:
- Can we test the ContextNode runtime implementation with sanitizers to expose any alignment issues?
- Have you considered an alternative design which does not make it easy to run into such issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow. The subcontext vector is of pointers to subcontexts. Maybe I should clarify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said - the rest isn't by accident, but we can statically check that the alignment checks out (added). Basically:
- the Arenas are 8-aligned because that's how __sanitizer::InternalAlloc returns by default, but added the explicit flag for clarity
- the memory we start bump-allocating in the Arena is 8-aligned by construction - added a static_assert for maintainability
- the choice of types for ContextNode's header is explicit (to maintain the invariant) - and added a static assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, then this does not apply since we are reinterpreting 64-bit values as pointers. Yes, enhancing the documentation to clarify would help. I still think it's a good idea to run (some of) the unit tests through ubsan.
// The counters vector starts right after the static header. | ||
uint64_t *counters() { | ||
ContextNode *addr_after = &(this[1]); | ||
return reinterpret_cast<uint64_t *>(reinterpret_cast<char *>(addr_after)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the additional reinterpret_cast to char* required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not. Fixed.
|
||
// utility to taint a pointer by setting the LSB. There is an assumption | ||
// throughout that the addresses of contexts are even (really, they should be | ||
// align(8), but "even"-ness is the minimum assumption) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the align(8)
nature is load bearing, see my other comment about reinterpreting the address immediately following the counters array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and __sanitizer::InternalAlloc align-8s if you pass 0 alignment (the default). We can force that though, for clarity.
/// | ||
/// The layout is as follows: | ||
/// | ||
/// [[statically-declared fields][counters vector][subcontexts vector]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, then this does not apply since we are reinterpreting 64-bit values as pointers. Yes, enhancing the documentation to clarify would help. I still think it's a good idea to run (some of) the unit tests through ubsan.
for (uint32_t I = 0; I < NrCounters; ++I) | ||
counters()[I] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just use memset since you know the begin and end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__sanitizer::internal_memset is the best we can do. I'm not super-sure we really want that though (would be happy with std::memset) - because see FIXME here about removing more of the deps to sanitizer_common - not that this would be too hard to fix; but also don't we have a pass in LLVM that would effectivelly recognize this as memset and replace with the intrinsic? I.e. may not achieve much.
Left a FIXME to do std::memset though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I don't think the compiler will be able to reason through the counters() implementation to be able to do this automatically though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm but wait for snehasish to approve as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with some minor comments.
for (uint32_t I = 0; I < NrCounters; ++I) | ||
counters()[I] = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I don't think the compiler will be able to reason through the counters() implementation to be able to do this automatically though.
|
||
// Verify maintenance to ContextNode doesn't change this invariant, which makes | ||
// sure the inlined vectors are appropriately aligned. | ||
static_assert(sizeof(ContextNode) % Alignment == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be alignof(ContextNode)
. See the example in https://en.cppreference.com/w/c/language/object for structures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, fixed here and in a few more places.
// Note that this is node-by-node aggregation, i.e. summing counters of nodes | ||
// at the same position in the graph, not flattening. | ||
// Threads that cannot lock Taken (fail TryLock) are given a "scratch context" | ||
// - a buffer they can clobber, safely from a memory access perspective. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then the data written to the scratch context is just dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - and expanded in the comment.
The design allows for that not be the case - because "scratch"-ness is first and foremost about not trying to build subcontexts, and is captured by tainting the pointer value (pointer to the memory treated as context), but right now, we drop that info.
// regardless. | ||
::__sanitizer::StaticSpinMutex Taken; | ||
|
||
// Avoid surprises due to (unlikely) StaticSpinMutex changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more descriptive message of why we care about the size here is preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
EXPECT_NE(Ctx, nullptr); | ||
EXPECT_NE(Root.CurrentMem, nullptr); | ||
EXPECT_EQ(Root.FirstMemBlock, Root.CurrentMem); | ||
EXPECT_EQ(Ctx->size(), sizeof(ContextNode) + 10 * sizeof(uint64_t) + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ctx->size()
will segfault even if the expectation on L35 fails since that does not fail the test. L35 should be an ASSERT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
auto *Root = AllContextRoots[I]; | ||
__sanitizer::GenericScopedLock<__sanitizer::StaticSpinMutex> TakenLock( | ||
&Root->Taken); | ||
if (!validate(Root)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to validate on all builds? It might be problematic for server apps which expect profile dumping to be quick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's quick right now, we can relax later, but this wasn't a problem so far.
Follow-up from PR ##89838. Some build bots warn-as-error about signed/unsigned comparison in CtxInstrProfilingTest. Example: https://lab.llvm.org/buildbot/#/builders/37/builds/34610
APIs for contextual profiling.
ContextNode
is the call context-specific counter buffer.ContextRoot
is associated to those functions that constitute roots into interesting call graphs, and is the object on which we hang offArena
s for allocatingContextNode
s, as well as theContextNode
corresponding to such functions. Graphs ofContextNode
s are accessible by one thread at a time.(Tracking Issue: #89287, more details in the RFC referenced there)