Add randomized allocation sampling #100356

Closed
wants to merge 95 commits

Conversation

chrisnas (Contributor) commented Mar 27, 2024

This PR adds a new feature, randomized allocation sampling, that can be used for low-overhead allocation sampling with good probabilistic error bounds. We expect APM tools and other scenarios that do memory profiling in development or production to be interested in this. See docs/design/features/RandomizedAllocationSampling.md for more details about what this is and how it works.

This PR is a continuation from work originally in #98167. @Maoni0 and @jkotas had some discussion at that point to agree on the overall development direction.

There are a variety of test results showing some performance comparisons and statistical sampling distributions in src/tests/tracing/eventpipe/randomizedallocationsampling/manual/testing_results/. These are available for reviewers to see but will be deleted prior to merging the PR.

Current status:

  • The CoreCLR support is mostly complete, but we are still tracking down some CI failures now that the code is being built and tested in a wider variety of configs
  • NativeAOT support is in progress: the new event is emitted (currently looking for a better randomizer and measuring the statistical distribution, which is more complicated since the name of the type is not known)

Note:
One option would be to merge the CoreCLR part now to secure the feature in .NET 9 and continue the NativeAOT work, which might not be finished before the cut. Note that the fix to emit AllocationTick in NativeAOT is now merged, so it could help if the new randomized sampling is not available there.

Guide to the changes:

  • The majority of changed files in the product code are simple refactoring to accommodate the new ee_alloc_context struct. This struct wraps a gc_alloc_context and adds one additional field, combined_limit. All of the fast allocation helpers that previously referred to alloc_limit now refer to combined_limit (a rough sketch follows this list).
  • A few more interesting changes are:
    • src/coreclr/vm/gcheaputilities.h - Adds the definition of ee_alloc_context which tracks the state of randomized sampling and generates new random numbers for the sampler.
    • src/coreclr/vm/gchelpers.cpp - Modifications to the Alloc() method check if sampling is enabled, check if a new allocation should be sampled, and fire the sampling event if so. There is also a separate change to AllocateSzArray() that simplifies aligning double arrays and removes a potentially unnecessary min_obj padding allocation there.
    • src/coreclr/vm/gcenv.ee.cpp - Modifies GCToEEInterface::GcEnumAllocContexts() to ensure that the combined_limit field is kept up-to-date with potential alloc_context changes made by the GC.
    • src/coreclr/gc/gc.cpp - GC_ALLOC_ALIGN8 is now always supported. Previously it was only supported when FEATURE_64BIT_ALIGNMENT was enabled, but we wanted to use it more broadly.
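
For orientation, here is a rough sketch of the shape described above: a simplified gc_alloc_context plus the new combined_limit field. Everything other than the alloc_ptr, alloc_limit, and combined_limit names (in particular the UpdateCombinedLimit helper and its parameter) is an illustrative assumption rather than the actual CoreCLR declaration, and the random-number machinery mentioned above is omitted:

```cpp
#include <cstdint>

// Simplified stand-in for the real gc_alloc_context (the real one has more fields).
struct gc_alloc_context
{
    uint8_t* alloc_ptr;
    uint8_t* alloc_limit;
};

// Sketch of the EE-side wrapper: combined_limit is what the fast allocation
// helpers now compare against. It never exceeds alloc_limit, and with
// sampling enabled it may stop short of it so the helpers drop into the slow
// path exactly at the chosen sampling point.
struct ee_alloc_context
{
    uint8_t*         combined_limit;
    gc_alloc_context gc_allocation_context;

    // Hypothetical helper: recompute combined_limit after the GC changes the
    // allocation context or a new sampling point is chosen. With sampling
    // off, nextSamplePoint is null and the two limits coincide.
    void UpdateCombinedLimit(uint8_t* nextSamplePoint)
    {
        uint8_t* limit = gc_allocation_context.alloc_limit;
        if (nextSamplePoint != nullptr && nextSamplePoint < limit)
        {
            limit = nextSamplePoint;
        }
        combined_limit = limit;
    }
};
```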

noahfalk and others added 17 commits February 8, 2024 07:52
…ampling feature. This commit is only to gather feedback on the direction and is not intended to be checked in.

This is a resumption of some of the original work that occurred in dotnet#85750,
although this is a different implementation approach. The overall goal is to do lightweight allocation
sampling that can produce unbiased approximations of what was actually allocated. Our current ETW AllocationTick
sampling is periodic but not randomized. Extrapolating from the AllocationTick samples can lead to unbounded
estimation errors in theory, and substantial estimation error has been observed in practice.

This commit is primarily to get feedback on adding a new adjustable limit pointer on
the coreclr EE side of allocation contexts. This allows breaking out of the fast path allocation helpers at an
arbitrarily selected sampling point that need not align with the standard GC allocation
quantum for SOH. With sampling off the fast_helper_alloc_limit would always be
identical to alloc_limit, but when sampling is on we would occasionally have them
diverge.

The key changes are:
gcheaputilities.h - defines a new ee_alloc_context which augments the existing gc_alloc_context with the extra fast helper limit field
gcheaputilities.h and threads.h - replace the global and per-thread storage of gc_alloc_context with the expanded ee_alloc_context
  to store the new field.
jithelpers.cpp, jitinterfacex86.cpp, and amd64 asm stuff - refactors fast path alloc helpers to use the new limit field instead
gchelpers.cpp - Updates the Alloc() function to recognize when the fast helper limit was exceeded, perform some kind of sampling callback,
  and update the fast helper limit when needed.

This commit doesn't contain any logic that actually turns the sampling on, determines the limit values needed for sampling,
or issues callbacks for sampled objects. That would come later if folks think this is a reasonable approach to intercept
allocations.
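
To make the intended control flow concrete, here is a minimal standalone sketch of how a fast allocation helper and its slow path could interact with the extra limit, under the assumptions above. TryFastAlloc, AllocSlow, OnSamplingLimitReached, and GetNewAllocContextAndAlloc are placeholder names for illustration only (not the actual CoreCLR helpers), and the structs are repeated in simplified form so the snippet compiles on its own:

```cpp
#include <cstddef>
#include <cstdint>

struct gc_alloc_context { uint8_t* alloc_ptr; uint8_t* alloc_limit; };
struct ee_alloc_context { uint8_t* combined_limit; gc_alloc_context gc; };

// Placeholders for machinery this commit intentionally leaves out; they are
// not implemented in this sketch.
static void  OnSamplingLimitReached(ee_alloc_context*, size_t) {}
static void* GetNewAllocContextAndAlloc(ee_alloc_context*, size_t) { return nullptr; }

// Fast path: bump-pointer allocate only while we stay below combined_limit.
// With sampling off this is the same comparison as before (combined_limit
// equals alloc_limit); with sampling on we may bail out earlier.
static void* TryFastAlloc(ee_alloc_context* ctx, size_t size)
{
    uint8_t* ptr = ctx->gc.alloc_ptr;
    if (size <= static_cast<size_t>(ctx->combined_limit - ptr))
    {
        ctx->gc.alloc_ptr = ptr + size;
        return ptr;
    }
    return nullptr; // fall through to the slow path
}

// Slow path: distinguish "sampling point crossed" from "GC quantum exhausted".
static void* AllocSlow(ee_alloc_context* ctx, size_t size)
{
    if (ctx->combined_limit < ctx->gc.alloc_limit)
    {
        // The GC quantum still has room; we only stopped because a sampling
        // point was crossed. A real implementation would record the sample,
        // advance combined_limit, and then retry the fast path.
        OnSamplingLimitReached(ctx, size);
        void* retried = TryFastAlloc(ctx, size);
        if (retried != nullptr)
        {
            return retried;
        }
    }
    // The real alloc_limit was hit: get a fresh allocation context from the
    // GC and recompute combined_limit (not shown here).
    return GetNewAllocContextAndAlloc(ctx, size);
}
```
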
@dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Mar 27, 2024
jkotas (Member) commented Jul 9, 2024

You may consider peeling off some of the changes into separate PRs to make this easier to land. For example, the changes in the GC proper around the 64-bit aligned allocs can be peeled off into a separate PR.

chrisnas (Contributor, Author) commented Jul 9, 2024

> You may consider peeling off some of the changes into separate PRs to make this easier to land. For example, the changes in the GC proper around the 64-bit aligned allocs can be peeled off into a separate PR.

Split into how many pieces @jkotas?

  • 64-aligned allocs
  • Core implementation
  • NativeAOT implementation
  • ???

jkotas (Member) commented Jul 10, 2024

I think the following split may work better:

  • 64-aligned allocs
  • Introduce the secondary allocation limit (for both CoreCLR and NAOT)
  • Implement the actual sampling event

Smaller PRs are easier to stabilize, review, and merge, which locks in forward progress.

chrisnas (Contributor, Author) commented:

> I think the following split may work better:
>
>   • 64-aligned allocs
>   • Introduce the secondary allocation limit (for both CoreCLR and NAOT)
>   • Implement the actual sampling event
>
> Smaller PRs are easier to stabilize, review, and merge, which locks in forward progress.

I'm not sure I understand the difference between the last two items: emitting the event is a tiny part of the work; the second item is the big one. Also, without the sampling itself, the code will be different (mostly what I did at the beginning of NAOT when I did not want to change the behaviour: combined_limit = alloc_limit).

jkotas (Member) commented Jul 12, 2024

> I'm not sure I understand the difference between the last two items: emitting the event is a tiny part of the work; the second item is the big one.

The second item (Introduce the secondary allocation limit (for both CoreCLR and NAOT)) would be a mechanical change. It does not need any tests. It should not introduce any observable behavior changes. It is just touching a lot of different places and it is easy to miss a few of those somewhere.

I do not have strong opinions about the best way to split this, but I think this will need to be split into multiple PRs to land it, given how long it takes to stabilize.
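
The intermediate state described above would presumably keep the two limits in lock-step everywhere the allocation context is updated, roughly as in this sketch (simplified names, not the actual CoreCLR code):

```cpp
#include <cstdint>

struct gc_alloc_context { uint8_t* alloc_ptr; uint8_t* alloc_limit; };
struct ee_alloc_context { uint8_t* combined_limit; gc_alloc_context gc; };

// Purely mechanical step: wherever the GC hands back an updated allocation
// context, mirror alloc_limit into combined_limit. The fast helpers then
// behave exactly as before, so there is no observable behavior change.
static void SyncCombinedLimit(ee_alloc_context* ctx)
{
    ctx->combined_limit = ctx->gc.alloc_limit;
}
```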

A reviewer (Member) commented on a new file in the diff (@@ -0,0 +1,131 @@, starting with #pragma once):

Should this be moved to src\native\minipal and the CoreCLR impl switched to it as well?

jkotas (Member) commented Jul 12, 2024

In any case, if you agree that the 64-aligned allocs cleanup would be a good independent PR, we can start with that.

A reviewer (Member) commented on this change in the allocation path:

-    _ASSERTE(allocPtr <= allocContext->alloc_limit);
-    if (size > static_cast<SIZE_T>(allocContext->alloc_limit - allocPtr))
+    _ASSERTE(allocPtr <= eeAllocContext->combined_limit);
+    if ((allocPtr == nullptr) || (size > static_cast<SIZE_T>(eeAllocContext->combined_limit - allocPtr)))

We do not want to add extra checks to the allocation helper fast paths. It would cause undesirable performance regressions.


Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

tommcdon (Member) commented:

@noahfalk I believe this PR is superseded by:

  • #104849
  • #104851
  • #104955

If yes, @chrisnas @noahfalk any objections if we close this PR?

noahfalk (Member) commented:

> @noahfalk I believe this PR is superseded by:
>
> #104849
> #104851
> #104955

Yep, that's correct. Closing this PR.

@noahfalk noahfalk closed this Aug 27, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Sep 28, 2024
Labels: area-Diagnostics-coreclr, community-contribution