
Changing the free list from singly linked to doubly linked #43021

Merged

1 commit merged into dotnet:master on Oct 6, 2020

Conversation

@Maoni0 (Member) commented Oct 5, 2020

One of the problems with BGC sweep is that it zeros out the gen2 FL at the beginning, which means we might need to increase gen2 size before it builds up enough FL to accommodate gen1 survivors. To lift this limitation I'm changing the free list to a doubly linked list so we can easily take items off and thread new ones back on.

Note that this is the initial checkin for the feature - more stress testing and perf testing need to be done on it, so I'm checking in with the feature define DOUBLY_LINKED_FL undefined. I've done some stress testing but not a whole lot.

This is only used for the gen2 FL, not UOH - we already don't allow UOH allocations while we are sweeping UOH (which should be quite quick). In the future we will make it so UOH allocs are allowed while UOH is being swept, but that's beyond the scope of this feature (it would require work in the synchronization between the allocator and BGC sweep).

Two new bits on the method table were introduced -

Previously we didn't need to care about bgc mark bits at all since we can't possibly compact into the part of the heap that hasn't been swept. But now we can. So if we compact into a free item that hasn't been swept, we need to set the mark bits correctly. So we introduce a new bit:

// This bit indicates that we'll need to set the bgc mark bit for this object during an FGC.
// We only do this when we decide to compact.
#define BGC_MARKED_BY_FGC (size_t)0x2

Also, we no longer have the luxury of allocating a min obj in the plan phase if what's allocated in this alloc context is too short. Previously we had this situation:

SB|MT|L|N

and if we plan-allocated a min obj in this free item, we could allocate a min free obj right after it because the min free obj would not overwrite anything of that free item:

SB|MT|L|N
min free:
        SB|MT|Payload

since we don't touch SB. But now we have this:

SB|MT|L|N|P

and if we only allocated 3 ptr sizes into this free item and want to allocate a min free obj, we'd be overwriting P (the previous pointer of this free item):

SB|MT|L|N |P
        SB|MT|Payload
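
To make the two layouts concrete, here is an illustrative C++ view of the free item shapes (the structs and field names are mine, purely for illustration - in the GC these are just pointer-sized slots inside a free object on the heap):

#include <cstddef>

struct MethodTable;  // opaque here

// Before this change (singly linked): SB|MT|L|N - four pointer-sized slots.
struct free_item_single
{
    size_t            sync_block;    // SB - never overwritten by plan allocation
    MethodTable*      method_table;  // MT - the free object's method table
    size_t            length;        // L  - size of this free item
    free_item_single* next;          // N  - next item on the free list
};

// With DOUBLY_LINKED_FL: SB|MT|L|N|P - five pointer-sized slots. A min free
// obj (SB|MT|Payload) planted right after a short allocation in this item
// would land on top of P, clobbering the prev link.
struct free_item_double
{
    size_t            sync_block;
    MethodTable*      method_table;
    size_t            length;
    free_item_double* next;
    free_item_double* prev;          // P - the field we must not overwrite
};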

One note on this: we check the "allocated size" with (alloc_ptr - start_region), but start_region is updated every time we pad in the same free item, and the check is really only necessary for the actual alloc context start (because we just need to preserve that free item's prev). We are saved by the fact that in gen2 we don't pad; if we do start padding in gen2 it would be good to handle this.

This is handled by set_free_obj_in_compact_bit (which sets the new MAKE_FREE_OBJ_IN_COMPACT bit) in adjust_limit, where we set the bit and record the size of the "filler object"; we actually make the filler obj in gcmemcopy.

This means the feature is, as of now, ONLY FOR 64-bit, since with this bit we are taking up 3 bits in the MT pointer for our bookkeeping. We could make it work for 32-bit by finding bits in the gap - I haven't done that.
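
For illustration, here is a minimal sketch of the 3-bit bookkeeping on the 64-bit MT pointer. Only BGC_MARKED_BY_FGC (0x2) is quoted from this change; treating the existing mark bit as 0x1 and MAKE_FREE_OBJ_IN_COMPACT as 0x4 is an assumption here, and the helper names are mine rather than the GC's:

#include <cstdint>
#include <cstddef>

#define GC_MARKED                (size_t)0x1   // existing mark bit (assumed value)
#define BGC_MARKED_BY_FGC        (size_t)0x2   // from this change
#define MAKE_FREE_OBJ_IN_COMPACT (size_t)0x4   // assumed value of the second new bit

// On 64-bit the three low bits of the MT pointer are bookkeeping, hence the
// ~7 mask when we need the real method table back.
inline void* gc_safe_method_table(void* mt_with_bits)
{
    return (void*)((uintptr_t)mt_with_bits & ~(uintptr_t)7);
}

inline bool is_bgc_marked_by_fgc(void* mt_with_bits)
{
    return ((uintptr_t)mt_with_bits & BGC_MARKED_BY_FGC) != 0;
}

inline void* set_bgc_marked_by_fgc(void* mt_with_bits)
{
    return (void*)((uintptr_t)mt_with_bits | BGC_MARKED_BY_FGC);
}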

Major areas changed were -

  • allocate_in_older_generation - this also has a complication wrt repair and commit. I introduced the new added list concept so we can repair and commit quickly, with one exception for bucket 0. For b0, since we do discard items, we do need to set the prev of discarded items to PREV_EMPTY to indicate that they are not on the free list anymore. We could still recover by going through the original head and setting the prev of the discarded items one by one, but that wouldn't be fast, so I chose not to do it - the discarded items are generally very small anyway.

  • gcmemcopy - this needs to care about the bits we added.

  • background_sweep - this takes items off of the FL and threads new ones back on. Since this never happens at the same time as the gen1 GCs using these items, we don't need synchronization here.

  • allocator class - obviously this needs to care about the prev field now. The principle is that we don't touch the prev field in unlink_item (except for the special b0 case), so when we repair we don't need to go fix up the prev fields. When we commit, we do need to set the new prev field accordingly (except for the added list, for which we would have already set the prev correctly). See the sketch below.
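
A minimal sketch of that unlink/repair/commit principle, assuming a simplified node type - the real allocator keeps per-bucket heads and works on raw free objects, so this only illustrates the invariant that unlink leaves prev alone:

struct free_item
{
    free_item* next;
    free_item* prev;
};

// Unlink: patch only the next chain; no prev field is touched, so undoing
// the unlink later is just writing the old next pointer back.
inline free_item* unlink_item(free_item** slot)
{
    free_item* item = *slot;
    *slot = item->next;
    return item;
}

// Repair: planning was abandoned - restore the next pointer and the list is
// exactly as it was, since no prev fields were modified.
inline void repair_item(free_item** slot, free_item* item)
{
    *slot = item;
}

// Commit: we keep the new shape, so now fix the prev of the item that
// followed the one we removed.
inline void commit_unlink(free_item** slot, free_item* new_prev)
{
    if (*slot != nullptr)
        (*slot)->prev = new_prev;
}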


Fixed some existing bugs -

The refactor change stopped updating next_sweep_obj, which is incorrect because it's used by the verification code in SOS, so I brought that back.

More accounting to count free_list/obj spaces correctly.


TODO

Optimizations we can do in background_sweep -

  • Don't need to actually take items off if we are just going to thread back on the same one (and others from my design notes);

  • If there's a big item we may not want to remove it - imagine this simplified scenario: we have one 2 MB free item on the list and we just removed it. Now current_num_objs happens to be big enough that we let an FGC happen; this FGC is gen1, and now it doesn't find any free space and would have to grow gen2.

It's actually beneficial to switch to using the added list even for the singly linked list, so we could consider enabling it even when the feature is not on.
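
For context on the added list mentioned above: as I read the description, newly threaded items are collected on a separate per-bucket list during planning, so repair just forgets that list and commit is a single splice. A rough sketch under that assumption (names and layout are illustrative, not the allocator's actual fields):

struct free_item
{
    free_item* next;
    free_item* prev;
};

// Illustrative per-bucket state: the original free list plus a separate
// "added" list that collects items threaded on during planning.
struct alloc_bucket
{
    free_item* head;        // original free list
    free_item* added_head;  // items added speculatively during planning
    free_item* added_tail;
};

// Repair: planning is abandoned - simply drop the added list; the original
// list was never modified by these additions.
inline void repair_added(alloc_bucket* b)
{
    b->added_head = b->added_tail = nullptr;
}

// Commit: planning is kept - splice the added list onto the front of the
// original list in O(1), fixing the one prev link at the seam.
inline void commit_added(alloc_bucket* b)
{
    if (b->added_head == nullptr)
        return;
    b->added_tail->next = b->head;
    if (b->head != nullptr)
        b->head->prev = b->added_tail;
    b->head = b->added_head;
    b->added_head = b->added_tail = nullptr;
}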

@ghost commented Oct 5, 2020

Tagging subscribers to this area: @safern, @ViktorHofer
See info in area-owners.md if you want to be subscribed.

@Maoni0 (Member Author) commented Oct 5, 2020

@PeterSolMS this is the doubly linked free list feature we talked about. Please review (and don't forget to approve when you are ready so I can actually merge 😄 ). As discussed, I've disabled the feature with this PR. I did submit another PR with the feature on just to see how well it behaved with the CI tests. It has 6 failed tasks, but 4 of them look to be infra issues. There is a legit assert from CoreCLR Pri0 Runtime Tests Run Windows_NT arm64 checked and CoreCLR Pri0 Runtime Tests Run Linux arm64 checked from this test:

    GC\API\GC\GetAllocatedBytesForCurrentThread\GetAllocatedBytesForCurrentThread.cmd [FAIL]
      
      Assert failure(PID 12052 [0x00002f14], Thread: 9032 [0x2348]): item != head
      
      CORECLR! WKS::allocator::thread_item + 0x78 (0x00007ff9`2073e698)
      CORECLR! WKS::gc_heap::thread_gap + 0x124 (0x00007ff9`2073e5fc)
      CORECLR! WKS::gc_heap::background_sweep + 0x590 (0x00007ff9`2071da48)
      CORECLR! WKS::gc_heap::gc1 + 0x228 (0x00007ff9`20729570)
      CORECLR! WKS::gc_heap::bgc_thread_function + 0x124 (0x00007ff9`2071ebbc)
      CORECLR! WKS::gc_heap::bgc_thread_stub + 0x34 (0x00007ff9`2071ee24)
      CORECLR! <lambda_ee78ae1c7dabd5648ee282458d820e21>::<lambda_invoker_cdecl> + 0xA8 (0x00007ff9`205abb68)
      KERNEL32! BaseThreadInitThunk + 0x34 (0x00007ff9`56f4cbc4)
      NTDLL! RtlUserThreadStart + 0x44 (0x00007ff9`573ab154)
          File: F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp Line: 12252
          Image: D:\h\w\BD5909D9\p\CoreRun.exe

which I didn't have time to look at so if you could take a look that'd be great; I cannot repro it on x64.

@ghost commented Oct 5, 2020

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@@ -164,7 +164,11 @@ class Object

MethodTable * GetGCSafeMethodTable() const
{
#ifdef HOST_64BIT
return (MethodTable *)((uintptr_t)m_pMethTab & ~7);
@jkotas (Member) commented:

This will also need an update in https://github.com/dotnet/diagnostics/blob/2e2e6775d46a31e44218d62657b4169fa13ac74d/src/SOS/Strike/sos.cpp#L122 and potentially other places that have this mask hardcoded.

@Maoni0 (Member Author) replied:

Thanks @jkotas! We will be sure to change this (in another PR). For now I wanted to check this in so Peter can continue working on it while I take care of some other urgent work item.

@PeterSolMS (Contributor) commented:

I don't completely understand the need for the new BGC_MARKED_BY_FGC - could we instead have the foreground GC set the background mark bits for each object in the plug directly in this case? If that wouldn't work, could you explain why not?

@Maoni0 (Member Author) commented Oct 6, 2020

We could do that in gcmemcopy - the advantage of doing it in the plan phase is that we know exactly when we are fitting into a free list item, and only then do we need to check whether we need to set this bit at all; if we do this in a later phase we no longer know if a reloc address is in a gen2 FL item or not, so we'd need to check all the time.

@Maoni0 (Member Author) commented Oct 6, 2020

/azp run runtime

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

@PeterSolMS (Contributor) commented:

The above assert failure actually repros for me on amd64 with the feature enabled - still trying to figure out the root cause.

However, with the feature disabled, this survives a couple of hours of GC stress without issue, so it's fine to merge IMO.

@Maoni0 (Member Author) commented Oct 6, 2020

When the define is not defined there's very little code change :) I also ran stress overnight with no problems.

I'm gonna ignore the "CoreCLR Pri0 Runtime Tests Run Windows_NT arm64 checked" job timing out because it ran fine last time (the only reason I re-ran the whole thing was to get the fix for the OSX build issue).

@Maoni0 Maoni0 merged commit f099416 into dotnet:master Oct 6, 2020
jkotas added a commit to jkotas/runtimelab that referenced this pull request Oct 25, 2020
jkotas added a commit to dotnet/runtimelab that referenced this pull request Oct 25, 2020
@Maoni0 Maoni0 deleted the doubly_checkin branch November 24, 2020 22:39
@ghost ghost locked as resolved and limited conversation to collaborators Dec 24, 2020