Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip ro segments which are not in range when walking heap for GC diagnostics #100002

Conversation

alexey-zakharov
Copy link
Contributor

@alexey-zakharov alexey-zakharov commented Mar 20, 2024

Fixed crash in gc_heap::walk_heap_per_heap when walking frozen segments

Motivation:

Enabling COR_PRF_MONITOR_GC GC Heap diagnostics flag causes crash after initiating AssemblyLoadContext unloading with the following callstack.

[Inline Frame] coreclr.dll!WKS::my_get_size(Object *) Line 11572	C++	Symbols loaded.
>	coreclr.dll!WKS::gc_heap::walk_heap_per_heap(bool(*)(Object *, void *) fn, void * context, int gen_number, int walk_large_object_heap_p) Line 51974	C++	Symbols loaded.
 	coreclr.dll!GCProfileWalkHeapWorker(int fProfilerPinned, int fShouldWalkHeapRootsForEtw, int fShouldWalkHeapObjectsForEtw) Line 727	C++	Symbols loaded.
 	coreclr.dll!GCToEEInterface::DiagGCEnd(unsigned __int64 fConcurrent, int index, int gen, bool reason) Line 818	C++	Symbols loaded.
 	coreclr.dll!WKS::gc_heap::do_post_gc() Line 50183	C++	Symbols loaded.
 	coreclr.dll!WKS::gc_heap::gc1() Line 22865	C++	Symbols loaded.
 	[Inline Frame] coreclr.dll!GCToOSInterface::GetLowPrecisionTimeStamp() Line 1091	C++	Symbols loaded.
 	coreclr.dll!WKS::gc_heap::garbage_collect(int n) Line 24329	C++	Symbols loaded.
 	coreclr.dll!WKS::GCHeap::GarbageCollectGeneration(unsigned int gen, gc_reason reason) Line 50526	C++	Symbols loaded.
 	coreclr.dll!WKS::GCHeap::GarbageCollect(int generation, bool low_memory_p, int mode) Line 49681	C++	Symbols loaded.
 	coreclr.dll!ETW::GCLog::ForceGCForDiagnostics() Line 492	C++	Symbols loaded.
 	[Inline Frame] coreclr.dll!ETW::GCLog::ForceGC(__int64) Line 431	C++	

I was suspecting invalid profiler setup, however the issue was also reproducible with PerfView tool without enabling COR_PRF_MONITOR_GC flag via profiler dll at startup - tool crashes on GC Heap dump once one of assemblies from AssemblyLoadContext gets unloaded.

I was suspecting then heap corruption, but was not able to gather any proof of that by running gc::verify_heap every GC and native/managed transition (the latter was painfully slow 😁). Further comparison with gc::verify_heap yielded that heap verification uses heap_segment_in_range/heap_segment_next_in_range methods to get the heap segment to iterate over. heap_segment_in_range checks that segment is not readonly/frozen (heap_segment_flags_readonly flag) or if it is “in range” (heap_segment_flags_inrange flag). At the same time gc_heap::walk_heap_per_heap scans through all segments.

Changes:

My hypothesis is that assembly unloading may free/rollback segments that are used for string/type/etc data in frozen segments and thus we need to skip such segments when traversing heap for diagnostics. I might be wrong given my limited understanding of USE_RANGES feature and would appreciate other ideas 😄

Copy link
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 20, 2024
@alexey-zakharov
Copy link
Contributor Author

@dotnet-policy-service agree company="Unity Technologies"

@alexey-zakharov alexey-zakharov marked this pull request as ready for review March 20, 2024 09:24
@EgorBo
Copy link
Member

EgorBo commented Mar 20, 2024

cc @cshung

My hypothesis is that assembly unloading may free/rollback segments that are used for string/type/etc data in frozen segments

We currently never unregister/remove such segments and never put stuff from unloadable ALCs there 🤔

@alexey-zakharov
Copy link
Contributor Author

We currently never unregister/remove such segments and never put stuff from unloadable ALCs there 🤔

thanks! ok, then my understanding was wrong and perhaps we do have segments corruption or there is another mechanism that may unmark segments
in case of corruption the question then would be how to diagnose it assuming verify_heap skips such segments and never crashes? 🤔 (perhaps protect frozen segments memory with READONLY flag)

@cshung
Copy link
Member

cshung commented Mar 20, 2024

@alexey-zakharov, is the crash something you can share with us a repro?

@alexey-zakharov
Copy link
Contributor Author

@alexey-zakharov, is the crash something you can share with us a repro?

@cshung yes, I think so - do you know what are the guidelines for sharing?
and should I make an issue instead?

@cshung
Copy link
Member

cshung commented Mar 21, 2024

Do you know what are the guidelines for sharing? and should I make an issue instead?

Thanks for planning to share! I don't think we have guidelines for sharing repro, but we have a template for opening bug issues and there are some general questions that you can answer there.

https://github.com/dotnet/runtime/issues/new?assignees=&labels=&projects=&template=01_bug_report.yml

@jkotas
Copy link
Member

jkotas commented Mar 29, 2024

Is this superseded by #100444 ?

@alexey-zakharov
Copy link
Contributor Author

Is this superseded by #100444 ?

yes, I've dug deeper into the issue and we've figured a better fix
I'm closing this PR

@alexey-zakharov alexey-zakharov deleted the upstream-walk_heap-skip-frozen-segments branch March 29, 2024 16:36
@alexey-zakharov
Copy link
Contributor Author

alexey-zakharov commented Mar 29, 2024

(the real issue we had was that SzArray of a collectible type was allocated on a frozen heap and once LoaderAllocator gets destroyed object's MethodTable points to a bad memory which leads to random consequences when iterating heap)

@github-actions github-actions bot locked and limited conversation to collaborators Apr 29, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-GC-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants