This repository has been archived by the owner on Nov 1, 2020. It is now read-only.

Update GC from dotnet/runtime #8059

Merged
merged 40 commits into dotnet:master from updateGc on Apr 1, 2020

Conversation

MichalStrehovsky
Member

Skipping commits:

dotnet/runtime@631cb1b - I think we need to pull the Perl script to generate ETW boilerplate into the open.
dotnet/runtime@fcd862e - not going to mess with that right now
dotnet/runtime@6c1f8ad - I think this is for another commit that we skipped
dotnet/runtime@f99b4fe - more ETW

@MichalStrehovsky
Member Author

Oh, great. Going to learn about "cgroups".

@MichalStrehovsky
Member Author

@janvorli I'm porting your change in dotnet/runtime#980 and hitting an assert:

/__w/1/s/src/Native/gc/unix/cgroup.cpp:197: static char *CGroup::FindCgroupPath(bool (*)(const char *)): Assertion `cgroup_path_relative_to_mount[common_path_prefix_len] == '/'' failed.

I added some instrumentation in 12b821d and I'm getting:

hierarchy_mount: /sys/fs/cgroup/memory
hierarchy_root: /docker/4cd7e277fcf32d39de066076dec1c716d2f1037483207344b1147130b5998e17
cgroup_path_relative_to_mount: /docker/4cd7e277fcf32d39de066076dec1c716d2f1037483207344b1147130b5998e17
common_path_prefix_len: 72

In particular the root and the path relative to mount are the same (in your examples in cgroup.cpp, there should be an extra directory in the latter).

Do you know if I'm just holding it wrong, or if the GC code doesn't handle the way Docker is configured for this particular build pool?

@janvorli
Member

The assert was wrong; it didn't take into account the unnamed cgroup case. And it was never detected because the equivalent _ASSERTE in the coreclr PAL was not firing: debugging support was not yet initialized at the time the cgroups were being initialized.

bdebaere and others added 26 commits March 31, 2020 10:36
* Add parenthesis to parameters.

* Revert changes to IS_UNWINDING.

Commit migrated from dotnet/coreclr@ce29457
+ When a hardlimit is specified, we should only retry when we didn't fail due to a commit failure - if commit failed, it means we simply didn't have as much memory as the hardlimit specified, and we should throw OOM in this case.

+ Added some diag info around OOM history to help with future diagnostics.

Commit migrated from dotnet/coreclr@7dca41f
This is the 1st part of BGC tuning using the FL (free list) with a PID loop. Historically the most significant factor for triggering a BGC has been the allocation budget. This experimental feature instead triggers based on the FL in gen2/gen3 with a PID loop (by default we only use PI, no D), so we use an allocation amount calculated from the FL to determine when to trigger the next BGC.

    The goal of the PI feedback loop

The end goal of the tuning is to keep a constant physical ML (memory load). The ML goal is specified as a percentage (meaning the percent of physical memory in use), and we aim to do as few BGCs as possible to achieve this memory load.

This is most useful when you see that a large percent of the free space in gen2 isn't getting used by the time a BGC is triggered, or when you have plenty of memory but the GC heap only takes up a small percentage of it and you want to delay BGCs to reduce the CPU consumed by BGC.

    Enable the FL tuning PI loop

Since this is an experimental feature, it's disabled by default. To enable it, set this env var:

set COMPlus_BGCFLTuningEnabled=1

When FL tuning is enabled, by default we set the ML goal to 75%. You can change it with this env var:

set COMPlus_BGCMemGoal=X

Note that as with any COMPlus var, the value is interpreted as a hex number, not decimal (for example, a goal of 75 must be specified as 4B).

    Perf consideration of the current PI loop

Of course there's always perturbation. From BGC's POV there are 2 categories of perturbation:

1) from GC's own perf characteristics changing, for example, suddenly we see a lot of pins from gen1 that get promoted into gen2.
2) non-GC factors - this could be due to a sudden increase of native memory usage in the process, or other processes on the same machine simply increasing/decreasing their memory usage.

And generally we don't want to aim for something very high like 90%, because it's hard to react when memory is tight - GC would need to compact, and currently BGC does not compact. So for now we have to assume that "retracting the heap is difficult", which means we want our PI loop to be fairly conservative.

So we actually have another PI loop (the inner loop) to make sure the "sweep flr" is at a reasonable value. "Sweep flr" is the FLR (Free List Ratio) before BGC rebuilds the free list, so you can think of it as the smallest FLR during a BGC. The inner loop has a "sweep flr" goal of 20% by default, which is pretty conservative; when we can incrementally compact, I would expect to reduce this by a fair amount. Another possibility is to not set this as a fixed number and instead calculate a reasonable one dynamically based on observing how the free list is used.

Of course just because BGC does not compact it doesn’t mean that the total gen2 size cannot get smaller. It could get smaller just by objects at the end of gen2 naturally dying.
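
As a rough illustration of the loop mechanics described above, here is a minimal, hypothetical sketch of a PI step on the memory load error; all names and gains are illustrative, not the actual gc.cpp fields:

    // Hypothetical sketch of the PI feedback loop described above; the real
    // tuning code in gc.cpp uses different names and carries more state.
    struct pi_loop
    {
        double kp;          // proportional gain
        double ki;          // integral gain
        double accum_error; // running integral of the error

        // error = (current memory load %) - (memory load goal %).
        // The output adjusts the FL-based allocation budget that decides when
        // the next BGC triggers: a positive error (over the goal) shrinks the
        // budget so the next BGC comes sooner, a negative error delays it.
        double step(double error)
        {
            accum_error += error;
            return kp * error + ki * accum_error;
        }
    };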

    Initialization of the PI loops

We have to have some way to get this whole thing started, so I usually do a few BGCs to reach 2/3 of the memory load goal and then start using the PI loops to decide when to trigger the next BGC.

    Panic mode

I use a very simple rule to decide if I should panic, i.e., do an NGC2. If we observe that the memory load is at (goal + N%), where N is just a number we determine, we do an NGC2. This actually turned out to give decent results, because it gives ample opportunity for some oscillation around the goal (instead of panicking prematurely).

    Implementation notes

When FL tuning is not enabled there should be no effect.

Record state when BGC starts, when BGC sweep ends, and when BGC ends.

I have other mechanisms like the D term, FF (feed forward), and smoothing that I have experimented with in the past. Currently they are not enabled by default, but they can be enabled with COMPlus env vars.

Currently this doesn't work great with LOH, because we have a fundamental limitation: if we give free space to gen2, it's difficult to give it to LOH. One thing the user could do is adjust the LOH threshold so most of the large object allocations happen in gen2.

Commit migrated from dotnet/coreclr@30bb5b5
Implement card marking stealing for better work balance in Server GC.

One of the last stages in the mark_phase is to mark objects referenced from older generations. This stage is often slow compared to the other stages, and it is also often somewhat unbalanced, i.e. some GC threads finish their work significantly sooner than others. The change also applies to the relocate_phase, but that phase usually takes significantly less time.

This change implements thread-safe enumeration of the older generations by dividing them into chunks (2 MB on 64-bit, 1 MB on 32-bit) and arranging it so that threads finishing their work early will help on other heaps. Each thread grabs a chunk and then looks through the card table section corresponding to that chunk. When it's done with a chunk, it grabs the next one, and so on.
There are changes at multiple levels:

- at the top level, mark_phase and relocate_phase contain changes to check for work already done, both for the heap associated with the thread and for other heaps.
- these routines call mark_through_cards_for_segments and mark_through_cards_for_large_objects, which contain the code to walk through the older generations in chunks.
- ultimately, card_marking_enumerator::move_next implements the thread-safe enumeration supplying the chunks, and gc_heap::find_next_chunk supplies a chunk where all card bits are set (a minimal sketch of the stealing idea follows below).
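
A minimal sketch of the chunk-stealing idea, assuming a single contiguous range and an atomic cursor; the names are illustrative, not the actual card_marking_enumerator interface:

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    class chunk_enumerator
    {
        std::atomic<size_t> next_chunk_{0};
        uint8_t* base_;
        size_t total_size_;
        static const size_t chunk_size = 2 * 1024 * 1024; // 2 MB on 64-bit

    public:
        chunk_enumerator(uint8_t* base, size_t total_size)
            : base_(base), total_size_(total_size) {}

        // Each GC thread calls this in a loop: fetch_add hands out every chunk
        // exactly once, so a thread that finishes early simply keeps grabbing
        // chunks, including ones corresponding to other heaps.
        bool move_next(uint8_t** start, uint8_t** end)
        {
            size_t offset = next_chunk_.fetch_add(1, std::memory_order_relaxed) * chunk_size;
            if (offset >= total_size_)
                return false;
            *start = base_ + offset;
            *end = base_ + std::min(offset + chunk_size, total_size_);
            return true;
        }
    };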

Commit migrated from dotnet/coreclr@5ca444c
…reclr#27384)

* Use half-fences for volatile load/stores on Windows ARM64

* Updated Volatile.h in gc/env as well; unified type names and warning suppression.
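
For context, a minimal sketch of what half-fenced volatile accesses look like, using standard C++11 atomics rather than the actual Volatile.h implementation:

    #include <atomic>

    // Load-acquire: no later memory operation can be reordered before it.
    // On ARM64 this can compile to a single ldar instruction, instead of a
    // plain load followed by a full dmb barrier.
    template <typename T>
    T VolatileLoadSketch(const std::atomic<T>* ptr)
    {
        return ptr->load(std::memory_order_acquire);
    }

    // Store-release: no earlier memory operation can be reordered after it.
    // On ARM64 this can compile to a single stlr instead of dmb + store.
    template <typename T>
    void VolatileStoreSketch(std::atomic<T>* ptr, T value)
    {
        ptr->store(value, std::memory_order_release);
    }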

Commit migrated from dotnet/coreclr@c128dba
Remove gcc nonnull-compare suppression and impossible conditions.

> error: nonnull argument 'this' compared to NULL
> [-Werror=nonnull-compare]

Commit migrated from dotnet/coreclr@4afbe3a
* Fix available memory extraction on Linux

GlobalMemoryStatusEx in the PAL returns the number of free physical pages in
the ullAvailPhys member. But there are additional pages allocated as buffers
and caches that get released when there is memory pressure, so they are
effectively available too.

This change extracts the available memory on Linux from the MemAvailable field
of /proc/meminfo, which the kernel reports as the most precise estimate of
available memory.
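
A minimal sketch of that extraction (not the actual PAL code), assuming a kernel new enough (3.14+) to report MemAvailable:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool GetAvailableMemory(uint64_t* available)
    {
        FILE* f = fopen("/proc/meminfo", "r");
        if (f == nullptr)
            return false;

        char line[256];
        bool found = false;
        while (fgets(line, sizeof(line), f) != nullptr)
        {
            uint64_t kb;
            if (sscanf(line, "MemAvailable: %" SCNu64 " kB", &kb) == 1)
            {
                *available = kb * 1024; // the kernel reports the value in kB
                found = true;
                break;
            }
        }
        fclose(f);
        return found;
    }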

Commit migrated from dotnet/coreclr@859f464
Normalize trailing whitespaces in frequently changing files (docs and sources)

Commit migrated from dotnet/coreclr@ed5dc83
* Two simple fixes to suspension issues seen in GCPerfSim:

- Insert an allow_fgc() call in background_mark_simple - this fixes the cases where there are a ton of GC handles referencing simple objects not containing pointers.

- Change the PING_JIT_TIMEOUT constant from 10 milliseconds to 1 millisecond. This fixes the case where return address hijacking doesn't work quickly because the hijacked thread doesn't return (e.g. because it's in a loop doing further calls). In this case we have to retry the hijack, and changing the timeout constant makes that happen more quickly.

Commit migrated from dotnet/coreclr@fab7aa2
* Fixing an assert in card stealing.

* use ifdef to be consistent with other cases in the file

* PR feedback

Commit migrated from dotnet/coreclr@7aa67a9
* Changes to set gen0 bricks always. This reduces the time spent in find_first_object when finding the start of objects for marking interior pointers.

* Revert "Changes to set gen0 bricks always. This reduces the time spent in find_first_object when finding the start of objects for marking interior pointers."

This reverts commit dotnet/coreclr@9d53ff9.

* Two fixes to speed up suspension for foreground GCs while background GC is in progress:

- In background_mark_simple1, check g_fSuspensionPending and if it is set, save the state of the work and restart the loop - this will call allow_fgc() and thus allow a foreground GC to take place.

- In revisit_written_page, call allow_fgc() at the end - this allows a foreground GC to happen whenever we are done revisiting a written page.

* Addressed code review feedback - use counter instead of testing g_fSuspensionPending directly.

Commit migrated from dotnet/coreclr@a7678f5
* Suppress on clang only

* Fix integer conversion

* Extra qualifier

* Suppress warning

* Extra qualifier

* Signedness issues

* Correct offsetof

* Offsetof doesn't support non-constant values

* Conversion errors

* Move the comment too

* Fix assembly warning

* size is not constant

* Fix comment type

* Fix endmacro name

* Use OFFSET_NONE constant

Commit migrated from dotnet/coreclr@e8bbcf1
…clr#27846)

Unifying the type used for the number of heaps/locks/threads; it is the same small number and should be just int.

NOTE: The initial number of allocated blocks per generation is also the same as the number of heaps.

Commit migrated from dotnet/coreclr@7405e43
* Fix getting affinity set on MUSL on Jetson TX2

The code in PAL_GetCurrentThreadAffinitySet relied on the number of processors
reported as configured in the system always being larger than the maximum CPU
index. However, it turns out that this is not true on some devices/distros.
The Jetson TX2 reports CPUs 0, 3, 4 and 5 in the affinity mask; CPUs 1 and 2
are never reported. GLIBC reports 6 as the number of configured CPUs, but MUSL
reports just 4. PAL_GetCurrentThreadAffinitySet was using the number of CPUs
reported as configured as the upper bound for scanning the affinity set, so on
the Jetson TX2 the affinity mask returned had just two bits set while there
were 4 CPUs. That triggered an assert in GCToOSInterface::Initialize.

This change fixes that by looping over all CPU indices in the affinity set (see the sketch below). A similar fix went into GetProcessorForHeap and related code in gcenv.unix.cpp.
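
A minimal sketch of the looping fix, assuming Linux's sched_getaffinity and a 64-bit mask for simplicity; the names are illustrative, not the actual PAL code:

    #include <sched.h>
    #include <stdint.h>

    static uint64_t GetThreadAffinityMask()
    {
        cpu_set_t cpuSet;
        CPU_ZERO(&cpuSet);
        if (sched_getaffinity(0 /* current thread */, sizeof(cpuSet), &cpuSet) != 0)
            return 0;

        // Scan every possible CPU index (capped at 64 for this sketch's mask)
        // rather than stopping at the configured-CPU count, which MUSL may
        // under-report; this handles sparse masks like {0, 3, 4, 5}.
        uint64_t mask = 0;
        for (int cpu = 0; cpu < CPU_SETSIZE && cpu < 64; cpu++)
        {
            if (CPU_ISSET(cpu, &cpuSet))
                mask |= (uint64_t)1 << cpu;
        }
        return mask;
    }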
While named cgroups work fine outside of a Docker container, they weren't
working when created and used inside one. The problem was caused by the fact
that the hierarchy root extracted from /proc/self/mountinfo and the cgroup
path extracted from /proc/self/cgroup are not equal for named cgroups; they
just share a common prefix. The cgroup handling code was not expecting this
case and ended up building the final cgroup path incorrectly (including the
common part of the path). This change fixes it by checking for a matching
prefix of the paths instead of comparing the whole paths.
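
A minimal, hypothetical sketch of the prefix matching (not the actual CGroup::FindCgroupPath code); it also shows the corrected assert that tolerates the unnamed-cgroup case where the two paths are equal:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // hierarchy_mount: e.g. /sys/fs/cgroup/memory       (from /proc/self/mountinfo)
    // hierarchy_root:  e.g. /docker/<id>                (from /proc/self/mountinfo)
    // cgroup_path:     e.g. /docker/<id>/my_named_group (from /proc/self/cgroup)
    static char* BuildCgroupPath(const char* hierarchy_mount,
                                 const char* hierarchy_root,
                                 const char* cgroup_path)
    {
        // Find how much of the two paths matches instead of requiring them to
        // be equal; for named cgroups inside a container they only share a
        // common prefix.
        size_t prefix_len = 0;
        while (hierarchy_root[prefix_len] != '\0' &&
               hierarchy_root[prefix_len] == cgroup_path[prefix_len])
            prefix_len++;

        // For an unnamed cgroup the paths are identical, so the next char is
        // the terminator rather than '/'; the original assert missed this case.
        assert(cgroup_path[prefix_len] == '/' || cgroup_path[prefix_len] == '\0');

        // Append only the part of the cgroup path past the common prefix.
        size_t len = strlen(hierarchy_mount) + strlen(cgroup_path + prefix_len) + 1;
        char* result = (char*)malloc(len);
        if (result != nullptr)
            snprintf(result, len, "%s%s", hierarchy_mount, cgroup_path + prefix_len);
        return result;
    }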
The amount of strong name support that CoreCLR needs is very small (really just a method to convert a public key to a public key token). It is not worth building a separate .lib for just this single method. Fold the strong name APIs into metadata and change the API to return HRESULT.
The allocate_in_free code path in allocate_in_expanded_heap incorrectly calculated the large (double) alignment padding size when limiting the plug size (SHORT_PLUGS) if set_padding_on_saved_p was true:

    set_padding_in_expand (old_loc, set_padding_on_saved_p, pinned_plug_entry); // Sets the padding flag on the saved plug
    ...
    pad += switch_alignment_size (is_plug_padded (old_loc)); // Reads the padding flag from the old (different!) plug

That caused an access violation during a later heap walk, since the g_gc_pFreeObjectMethodTable pointer marking the gap was not placed at the right address.
* gc, pal: Avoid linking libnuma.so on non-NUMA systems

In an effort to reduce the system calls performed during the
initialization, this seemed like another low-hanging fruit: only link
in libnuma.so if the system has more than one NUMA node.
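
A minimal, hypothetical sketch of the idea: probe sysfs for the node count first and only dlopen libnuma.so when there is more than one node. The probing method and names are assumptions, not the actual PAL code:

    #include <dirent.h>
    #include <dlfcn.h>
    #include <string.h>

    static void* LoadNumaLibraryIfNeeded()
    {
        int nodeCount = 0;
        DIR* dir = opendir("/sys/devices/system/node");
        if (dir != nullptr)
        {
            dirent* entry;
            while ((entry = readdir(dir)) != nullptr)
            {
                // Entries are named node0, node1, ... one per NUMA node.
                if (strncmp(entry->d_name, "node", 4) == 0 &&
                    entry->d_name[4] >= '0' && entry->d_name[4] <= '9')
                    nodeCount++;
            }
            closedir(dir);
        }

        // On single-node systems skip the dlopen (and its syscalls) entirely.
        if (nodeCount <= 1)
            return nullptr;

        return dlopen("libnuma.so", RTLD_LAZY);
    }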

* gc, pal: Remove unneeded dlsym() call

The result of dlsym() is never assigned to any variable, and the function is
never called, so it's safe to remove these dlsym() calls.
Improve comments and fix a couple of places where large alignment was not taken into account. Set pad_in_front to zero if SHORT_PLUGS is not defined.
cshung and others added 14 commits March 31, 2020 10:36
* refactoring

* ploh-->uoh

* fix Linux build
PR feedback

* some PR feedback

* more PR feedback

more PR feedback

* more PR feedback

* address comments in UpdatePreGCCounters

* removed a confusing comment

* stress fix in background sweep

* use `saved_sweep_ephemeral_seg` in background sweep.

* fixed `GCPerHeapHistory_V3` event.

* re-implemented dotnet/runtime#2103 on top of the refactoring (it was hard to merge)
* Use gcenv.windows.inl on Windows HOSTS

* Make two way pipe implementation depend on host

We have no ability to implement a cross-OS pipe. Allow the
two way pipe implementation to depend on the host.

* Use empty debug macro implementation on Windows hosts.

* Make HMODULE type depend on host compiler

* Configure long file path wrappers based on host

* Configure memcpy based on host

* Configure FreeLibrary based on host

* Configure host utility functions based on host

VirtualAlloc, CPU count, and NUMA config make the most sense as host
queries. Configure them based on the host.

* Configure exception holder based on host

* Prefer #if HOST_WINDOWS for longfilepathwrappers.cpp

* Prefer #if HOST_WINDOWS for src/coreclr/src/inc/holder.h

* Do not include GCToOSInterface::GetPageSize() in DAC

* Remove JIT64_BUILD

* !HOST_UNIX -> HOST_WINDOWS
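
For illustration, a tiny sketch of the convention adopted in the last two items: gate host-specific code on the positive HOST_WINDOWS macro instead of !HOST_UNIX (the macro names come from the list above; the surrounding code is hypothetical):

    #if HOST_WINDOWS
    #include <windows.h>
    typedef HMODULE ModuleHandle;   // the HMODULE type depends on the host compiler
    #else
    #include <dlfcn.h>
    typedef void* ModuleHandle;     // Unix hosts use dlopen/dlsym handles
    #endif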
Previously, there was an additional check
`if (!join_struct.wait_done)` at the beginning of an rjoin.

This usually had no effect as the join isn't usually done yet. But if
thread A is the first to enter the join and finishes before thread B
starts, then B will never enter the `if`.

The only effect this had was that B would not fire join events in these
rare cases. (After that we check `join_struct.wait_done` again anyway.)
As of this PR we will always fire the events, which makes them easier
to analyze consistently.

Before this PR, looking at the join events for a single thread would show
no traces of the rjoin happening, except for an extra restart event from
the other thread.
* Introducing Pinned Object Heap

* PR feedback

* reverted a test-only change
* The main difference of `AllocLHeap` is that it uses a per-heap acontext instead of a per-thread one. There is no advantage in that, and it results in allocations being under-reported in `GetAllocatedBytesForCurrentThread`.

This change unifies to one allocation entry point - `Alloc` (and its `AllocAlign8` variety)

* Removed AllocAlign8

* PR feedback - some refactoring to merge duplicate calls to `Alloc`

* Split an `ifdef/else` in two so it does not cross code blocks.

* No need to update `GC_INTERFACE_MAJOR_VERSION` more than once per release. And we already did for this one.
…(#33597)

* Delete TlsIdx_XXX

* Delete EnableTerminationOnHeapCorruption

* Delete redundant cant stop tracking

* Workaround compiler bug
* Fix native components build for Android

* Add cmake introspection for pthread_setcancelstate

* Address CR feedback

* Use calculated eth speed for Android

* Use #ifdef FEATURE_EVENT_TRACE
* Adding API for POH allocations and propagating flags all the way to Alloc.

* make `AllocateUninitializedArray` and `AllocateArray`  public

* Added NYI implementations to Mono

* moved tests to libraries

* Actually use POH and more tests.

* Disable tests for the new API on mono

* mop up remaining TODOs

* Fix build-breaking whitespace.

* Mono tabs and mark heavier tests as [Outerloop]

* Mono: space before opening parens and braces

* Refactored AllocateArray

* PR feedback

* XML Doc comments
runtime\src\coreclr\src\gc\gc.cpp(4343,1): warning C5208: unnamed class used in typedef name cannot declare members other than non-static data members, member enumerations, or member classes
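
For reference, a minimal sketch of what triggers C5208 and the usual fix; this is illustrative, not the actual gc.cpp type:

    // Before: warning C5208 - the unnamed class in a typedef declares a
    // member function, which is only allowed for named classes.
    // typedef struct
    // {
    //     int count;
    //     void reset() { count = 0; }
    // } alloc_stats;

    // After: naming the struct silences the warning (in C++ the typedef is
    // then unnecessary).
    struct alloc_stats
    {
        int count;
        void reset() { count = 0; }
    };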
There is an assert in FindCgroupPath that fires when hierarchy_root
and cgroup_path_relative_to_mount are equal, which is the case for
cgroups that are not named. The assert checks that the common path in
those two variables ends with '/', which is only the case for named
groups.

We have never seen this assert fire before because cgroup initialization
happens before the debugger support initialization in the PAL, so
asserts are disabled at that point. I am going to fix that in a
separate PR.

This problem was discovered with the standalone GC where the assert
actually fires as it uses a plain C assert function.

This change fixes the assert to account for the case when both paths
are the same.
@MichalStrehovsky
Member Author

CI is not triggering - need to check CI status before merging.

@jkotas jkotas closed this Mar 31, 2020
@jkotas jkotas reopened this Mar 31, 2020
@jkotas jkotas merged commit 17369d3 into dotnet:master Apr 1, 2020
@MichalStrehovsky MichalStrehovsky deleted the updateGc branch April 1, 2020 06:59