-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GC heap hard limit for 32 bit #101024
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @dotnet/gc |
@gbalykov, I am wondering why you would want a I am working on #100380 (forgive the misleading title) that enabled the computation of To make
I was thinking about just limiting my work to just 64 bits, but if there is a reason to, we can figure how to get the calculation right on 32 bit platforms as well. |
It's the same as for 64bit: to be able to limit memory consumption of process, and, specifically, GC heap size. Setting the limit specifically on managed heap size allows for much better granularity than using limits for whole process with cgroups, etc. Even though that this change doesn't set default value for containerized environments ( Also here's related discussion about heap limits for 32 bit: #78128.
Can you share more details about this? During my anylysis of code (f84d33c) I didn't find any other related places that depend on bitness, and it seems that with this change behavior on 64 and 32 bit is pretty much the same. For example, I was also able to test this change with more complex apps, and physical size of heap (Private_Dirty of GC vmas) always remained in allowed limit as expected (of course, assuming that heap is able to fit in that limit with more frequent GC, for example). |
It does make sense to use To start with, the key difference between 32 bits and 64 bits is Here is where
Historical Context (Good for knowledge, but feel free to skip on first read)With respect to commit accounting, let's go back in time and see why we are doing it. By the time we introduced Then I introduced per object heap hard limit (#36731) by checking against By that time, the verification logic was unnecessarily complicated, and therefore we didn't merge those in. That was before Much of the verification logic was written with only Keeping track of these number is non-trivial, at the very least, we need to take the Then it comes to A technical challenge at that point of time is that if One simple way to get around it is to always keep the counters, which is what I am trying to do right now, we have to be careful with respect to performance (which I haven't checked yet). I am hoping it is okay. The known problem - overlapped commitWhile it appears that it is easy to just add when we commit and subtract when we decommit, it is not that simple. The problem is overlapped commits. Under
Under non- The double count on it own is wrong, but is not fatal. With that happened repeatedly, however, we will leak those bytes (while memory is fine), and so we ran out of bytes when you try to commit (but in fact you do have memory). One of the known case of overlapping commit is for mark arrays. We don't keep track of the exact bound where the mark array was committed, so any time we need to ensure mark array is available and it isn't fully committed, we commit the whole mark array range every time we need it. And there could be other unknown cases that we just don't know in the moment. The unknown problem - overflow?Beyond just overlapped commit, I am seeing other problems as well. In case where a heap hard limit is not provided, I suspect the counter overflowed. When I decommit, I am hitting an assert that we are decommitted more memory than we have, and the counter is suscipiously low at that point. Of course, that's just a guess. There could be other things going on as well. The only thing that I am sure is the number must be wrong and we can't decommit memory that wasn't committed. Plan of attackFundamentally, it is a calculation error. Unfortunately, we don't have any symptoms when the wrong happens, and failing at the end is too late. To make this easier, I think we can 1.) Track mark array usage, and make sure we don't do overlapped commit. Notice that Final wordsAs you can see, it is not a trivial problem to solve. With efforts, I think we can nail it. It will worth the effort to make 32 bits app work better. |
Thanks for such detailed description!
Does this mean that heap hard limit doesn't work correctly with segments even on 64bit on current main? Original PR to add heap hard limit support (dotnet/coreclr#22180) was merged back in 2019, and it seems that GC regions support (#59283) was enabled only in 2022, so during my work on this patch I thought that segments on 64 bit fully support heap hard limit. If heap hard limit with segment is not fully correct on 64 bit on main now, is it safe to assume that it was correct before GC regions support was enabled by default (e.g. in .net 6)? Or is this possible leak in bytes accounting (e.g. for mark array) a known bug of heap hard limit with segments that was present since 2019?
If I understand correctly, this logic for counting committed bytes by walking the heap data structures is needed for two things:
So, if both are disabled for 64 bit, it seems it should be the same as it was before GC regions were enabled by default. Is it correct?
Can you share why is it needed to always keep counters on 64bit with regions if there's already a way to update them by walking the heap data structures? Is it needed for better validation of counters and heap structures? If there's no way to update GC counters by walking the heap data structures with segments, then always keeping the counters seems the only way to fully support limits on 32 bit (including verification of counters and Thanks again for your interest in this! |
Sorry for the late reply, I took the last few days trying to figure out these. > Does this mean that heap hard limit doesn't work correctly with segments even on 64bit on current main?I confirm this is the case, to prove that, I have found a couple of issues, I plan to get these fixed. Bug 1When we get a new segment by committing memory for large object heap or pinned object heap, it is incorrectly accounted for as Gen 2, this will impact the gc_heap::get_segment (size_t size, gc_oh_num oh)
{
...
result = make_heap_segment ((uint8_t*)mem, size, __this, (uoh_p ? max_generation : 0));
...
} Bug 2When we release a segment, we used void gc_heap::virtual_free (void* add, size_t allocated_size, heap_segment* sg)
{
bool release_succeeded_p = GCToOSInterface::VirtualRelease (add, allocated_size);
if (release_succeeded_p)
{
reserved_memory -= allocated_size;
dprintf (2, ("Virtual Free size %zd: [%zx, %zx[",
allocated_size, (size_t)add, (size_t)((uint8_t*)add + allocated_size)));
}
} But when we call it in This will impact both > If heap hard limit with segment is not fully correct on 64 bit on main now, is it safe to assume that it was correct before GC regions support was enabled by default (e.g. in .net 6)?With the bugs I just shown above, It is pretty clear the answer is no. The bugs I have found above has nothing to do with > Can you share why is it needed to always keep counters on 64bit with regions if there's already a way to update them by walking the heap data structures? Is it needed for better validation of counters and heap structures?When I built it, it was meant to be a stepping stone. As you probably know, the logic is quite complicated and easy to make mistakes. Building this logic helped me a lot in validating that I am doing the right thing. With the PR, my goal was to get Except my investigations above proved otherwise, the bug is actually quite bad, the > Can you share the C# program that you use for testing?It is this failing test case in #100380. With that test case, I can:
Unfortunately, getting these committed counters correct for segments isn't a priority right now. It would be great if you could help with the debugging. Essentially, you want to find where in the code that is probably
I know, it is easier said than done, it took me quite a few days to find the couple of issues above. Let me know if there are things I could help with. I am more than happy to help you with this. |
With another week of debugging, I figured the bug related to server GC. It isn't related to threading, it is just that I missed one call site to With that, we can focus on the |
@cshung Great work! I've been on vacation and busy with other task right now, but plan to switch back to this starting from the next week. |
@gbalykov, welcome back from your vacation, hope you had a good time. Another update, I have got the bookkeeping numbers as well (sadly I gave up the mark array bytes for simplicity). With that, I have gone through a round of review with @Maoni0 with my change. I think the change is more or less good to go and I will merge it whenever I can. It would be great if you can rebase this change on top of that, test it on some 32 bit platforms and see if that still works? To stress test the calculation, you can turn on the |
@cshung great! I'll test with your change on 32bit this week |
82d6a05
to
d34af7e
Compare
Sorry for late response, took some time for additional testing. I've rebased your changes (#100380) before they were merged on top of 5474ab5 and then applied my patch. What I've tested on 32 bit:
In all cases there're no regressions and no asserts related to heap limit fire (with enabled heap hard limit some tests fail with OOM as expected, some tests work too long in debug and fail with 10min timeout). All of the above include HeapExpansion/plug test that you've mentioned previously (https://github.com/dotnet/runtime/blob/main/src/tests/GC/Features/HeapExpansion/plug.cs), it also passes without issues. I've not tested Thanks again for your fixes for segments. I've rebased this PR on latest main, please take a look. |
@cshung can you please review this? It seems that mark arrays are the only known bug in accounting with segments, so what do you think about reviewing and merging this first, and then focusing on mark arrays? |
My apologies with not responding promptly. For mark array, I think we are planning to just give it up like what we have right now, the worst impact of that is just inaccurate numbers. By making sure these numbers doesn't go into the accounting, it won't lead to accounted byte leaks. The accounting is still not perfect just yet, after that change is merged, we are starting to get random assertions (e.g. #102706 (comment)). I will be taking a look at that next, right now I am concentrated on #97316. For your change, I did skim through the code, the idea looks good to me. There are places where you need to handle potential overflow, these are good checks for 64 bits as well (64 bit number also overflows, it is just less likely). I wonder if we can just use the same logic for both 32 and 64 bits. Eventually, this needs to be signed off by @Maoni0. Having supporting data is likely to get this done more smoothly. Do you have a particularly motivating scenario, maybe we can run it with and without the hard limit set, and show us the behavior difference? |
Thanks for your feedback! Sorry for delay in response, I was busy with other tasks.
I wanted to minimize impact on 64bit to make this change simplier and kept checks for 32bit only since they were essential there. These checks might be useful on 64bit, though, not sure whether someone will try to set such high limits on 64bit. There's also additional assumption on 64bit, that physical size is smaller than virtual size in containerized environments, where by default heap hard limit is set to 75% of total physical memory available. This is fine, because virtual address space on 64 bit is much larger than actual physical size on devices. So when there comes a time when overflow checks are needed on 64bit, then this 75% setup logic will probably need to be updated too. Note that for 32bit I didn't enable this default value, only manual heap hard limit is added.
Motivating scenario are Tizen applications where app can limit it's own memory consumption, track it and refresh it (in future I want to add refresh of memory limit for 32bit too). In tests that I've done, on some apps I've seen significant reduction in used GC heap size (Private_Dirty) around 33%. This, of course, implies that actually used live-objects fit in specified limits (otherwise, there'll be OOM). @Maoni0 can you please take a look? |
@gbalykov, just checking whether this change is ready? Given that we are now past preview7 is this required for 9? |
@mangod9 yes, waiting for review from GC maintainers
It would be good to get this to .net 9 |
I didn't know this needed my attention till now. so I'll take a look today. sorry for the delay! |
I took a look at the changes and also chatted with @cshung about them. we are a little surprised by some of the changes, for example why the per heap limit is divided up this way (that it has to be a combination of 2, 1 and 1), also the change in |
@Maoni0 Thanks for taking a look!
This is needed to guarantee that there's no oveflow during computations on 32bit, for example, during Besides, this mimics what
This change is needed to add upper boundary on gen1 max size with enabled heap hard limit, which can be smaller than what was set for gen1 max size before that (for 32bit |
are you referring to this code in
but this is only used when hardlimit isn't specified. when hardlimit is specified we don't actually call
this is saying if you only specify the total hardlimit, we will divide this limit based on the number of heaps (and not let it fall below the min seg size). and for LOH/POH we'll double it unless we are using large pages. if you did specify the per-object-heap limits they will just be taken as it with the adjustment for number of heaps. it's very common that folks would specify a small value for POH.
we shouldn't have code enabled just because they don't produce a noticeable effect because later changes may cause a code path to have an effect inadvertently. so I would suggest to only have this code for 32-bit. this is not needed for functional because this size specifies the max size for gen1 budget. even if it was bigger than half a seg size we wouldn't be able to physically fit that much onto a segment anyway. I do think it's a good perf change for 32-bit to limit this. |
Yes. The idea is that we need to somehow limit values of segment size from above to not overflow with enabled heap hard limit on 32bit, and
Ok, I agree that it's better to keep this under ifdef if you have concerns for future. I'll update this.
During development I wasn't sure whether it's correct if we set gen1 max size larger than actual segment size, so that's why this change appeared. Thanks for clarification, it's good that this can also have positive effect on performance. |
d34af7e
to
21c343b
Compare
ahh you are right about the combination. 2/1/1 is the only combination to fit into 4gb. I was too used to thinking in 64-bit address space. there are some code styling issues that make the changes not consistent with the GC code. I'll make some suggestions. I'll do another pass on the changes later today. |
actually.... I don't think these sizes have to be power of 2 (I was just thinking about this a bit more since it seemed unfortunate; and it had been quite some time since I thought about seg size stuff). we just need these sizes to be multiples of a power of 2 number. I need to go to sleep but will give more detail later. this is how |
sorry I had a lot of distractions. not sure if you had a chance to look at how the segment size is set when
when
you could do the same for 32-bit. |
@Maoni0 thanks for your feedback
Can you please describe why powers of 2 are not ok in this case? You mean that they might differ significantly from what users specify? (i.e. user sets 780Mb for heap limit, but with this patch it'll round up to 1 Gb on both 32 and 64 bit) In terms of 2/1/1, mmap will probably fail anyway with smth around a bit higher than 1.5 Gb for SOH+LOH+POH, so higher border for SOH/LOH/POH heap-specific limits is just to prevent overflow of SOH+LOH+POH. So, 2/1/1 in different combinations should cover this case (e.g. 1.5 Gb for SOH and smth small for LOH/POH) and noone will actually be able to use 2/1/1 anyway. And same for same limit for all heaps, 1Gb*3 will lead to failing mmap anyway.
It seems that this can benefit both 32bit and 64bit, so maybe it's better to do it separately for both after finishing this one. What do you think about this? Besides same can be done in |
sorry again to reply late - I've just been swamped with urgent items and also was OOF for a while.
because it allows too little flexibility. same as large pages since for large pages we don't allow nearly as much virtual memory space, so we do this align on min seg size thing instead of requiring it to be a power of 2.
for 64-bit we allow 5x the hard limit if it's not using large pages. also for regions (since that's what 64-bit uses) it's not an issue since we don't do the reserve all segments on initialization for regions (that's only for segments).
I'll defer to @mangod9, in case someone else can work on this but from my side I've just been swamped with really urgent items and I imagine that will continue for the next few months too. |
@Maoni0 thanks for your feedback
I agree that there's still a lot of room for improvements in future, including removal of dependency on powers of 2 both for regions/segments, refresh for segments, etc. But this PR is an initial step that brings hard limit on 32bit with segments to the same level as on 64bit with segments. Segments, of course, are disabled by default on 64bit, but runtime can be built with them enabled. So, in this sense this PR is a finished change that focuses on specific problem to keep patch smaller, and further work will improve segments both on 32 and 64 bit. |
56db53a
to
10b0ae8
Compare
10b0ae8
to
f6bfa71
Compare
c588088
to
e2f009a
Compare
e2f009a
to
21708c7
Compare
Hello @gbalykov, is this ready for review? We will look into the feasibility of this change after the holidays. |
@mangod9 yes, this is ready for review, thank you. I'll rebase on current main to rerun CI. |
This change enables heap hard limit on 32 bit.
Approach is similar to
DOTNET_GCSegmentSize
(GCConfig::GetSegmentSize
), which allows to set size of segment for SOH/LOH/POH, and guarantees that there's no oveflow during computations (for example, duringsize_t initial_heap_size = soh_segment_size + loh_segment_size + poh_segment_size;
). WhenDOTNET_GCSegmentSize
is set on 32bit, it's rounded down to power of 2, so largest possible value of provided segment (SOH) is 2 Gb (4Mb<=soh_segment_size<=2Gb
). For LOH/POH same value is divided by 2 and then also rounded down to power of 2, so largest possible value of LOH/POH segment is 1 Gb (4Mb<=loh_segment_size<=1Gb, 0<=poh_segment_size<=1Gb
). So, segment size for SOH/LOH/POH never overflows, as well asinitial_heap_size
(except for oveflow ofinitial_heap_size
to 0, which will lead to failed allocation later anyway).Similar thing happens when
DOTNET_GCHeapHardLimit
orDOTNET_GCHeapHardLimitSOH
/DOTNET_GCHeapHardLimitLOH
/DOTNET_GCHeapHardLimitPOH
are set on 32bit. There're limits on these values:0 <= (heap_hard_limit = heap_hard_limit_oh[soh] + heap_hard_limit_oh[loh] + heap_hard_limit_oh[poh]) < 4Gb
a) 0 <= heap_hard_limit_oh[soh] < 2Gb, 0 <= heap_hard_limit_oh[loh] <= 1Gb, 0 <= heap_hard_limit_oh[poh] <= 1Gb
b) 0 <= heap_hard_limit_oh[soh] <= 1Gb, 0 <= heap_hard_limit_oh[loh] < 2Gb, 0 <= heap_hard_limit_oh[poh] <= 1Gb
c) 0 <= heap_hard_limit_oh[soh] <= 1Gb, 0 <= heap_hard_limit_oh[loh] <= 1Gb, 0 <= heap_hard_limit_oh[poh] < 2Gb
0 <= heap_hard_limit <= 1Gb
These ranges guarantee that calculation of soh_segment_size, loh_segment_size and poh_segment_size with alignment and round up won't overflow, as well as calculation of sum of them for allocation (overflow to 0 is allowed, same as for
DOTNET_GCSegmentSize
). When values specified by user with env variables don't meet the requirements above, runtime exits withCLR_E_GC_BAD_HARD_LIMIT
. When allocation (withmmap
on Linux) fails, runtime exits with same error as for large segment size specified withDOTNET_GCSegmentSize
.This patch doesn't enable heap hard limit on 32bit in containers (
is_restricted_physical_mem
), because current heap hard limit approach is to reserve one large GC heap segment with size equal to specified heap hard limit, and no new segments are reserved in future. On 64 bit in containers by default heap hard limit is set to 75% of total physical memory available, and this is fine, because virtual address space on 64 bit is much larger than actual physical size on devices. In contrast, on 32 bit virtual address space size might be the same as available physical size on device (e.g. 4 Gb for both). This means that reserving 75% of total physical mem will reserve 75% of whole virtual address space, which might be both undesirable (e.g. process later expects more available memory) andmmap
on Linux will probably fail anyway.I've ran release CLR tests on armel Linux with this patch on top of f84d33 with different limits, there seems to be no issues with 4Mb/8Mb/16Mb and 512Mb heap hard limits (some tests fail with "Out of memory"/"OOM" and "System.OutOfMemoryException", as expected, or "System.InvalidOperationException: NoGCRegion mode must be set" when NoGC region is larger than limit). Same for GCStresss 0xc and 0x3 with 4 Mb heap hard limit, and same for debug CLR tests on armel Linux with 4Mb heap hard limit.
@Maoni0 @cshung please share what you think. Thank you.