Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GC] Fix freelist crash - don't modify during check-only operation #104876

Merged
merged 1 commit into from
Jul 15, 2024

Conversation

markples
Copy link
Member

RegisterForFullGCNotification sets gc_heap::fgn_maxgen_percent, which guards calls to check_for_full_gc to determine if a notification should be sent. gc_heap::generation_to_condemn. One part of check_for_full_gc calls generation_to_condemn with the parameter check_only_p set to TRUE. generation_to_condemn calls try_get_new_free_region, which ensures that a free region is available. It checks the freelist, but if none are found it creates a new one and adds it to the freelist. check_for_full_gc does not (and should not) contain the necessary synchronization to make this safe. This can lead to corruption of the basic region freelist, often ending up with nullptr links between nodes.

This caused crashed in production and the cause was determined by using a custom GC build with additional runtime checks. The fix is to simply honor the check_only_p parameter and not call try_get_new_free_region when it is set. The problem can be reproed by changing try_get_new_free_region to always create a new region. This fix handles that repro.

This fix does not impact normal GC operation. It can cause check_for_full_gc to no longer detect that full collection is going to occur, but only if the failure to create or initialize a new region is the only reason why that full collection is going to occur. However, those cases are open to the race condition that can cause the freelist corruption is possible.

Copy link
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@markples
Copy link
Member Author

The build analysis failure is #104883, which wasn't open yet when this testing ran but is closed now so rerunning just build analysis doesn't work.

@@ -21507,10 +21507,13 @@ int gc_heap::generation_to_condemn (int n_initial,
}

#ifdef USE_REGIONS
if (!try_get_new_free_region())
if (!check_only_p)
Copy link
Member

@cshung cshung Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Using && to avoid changing many lines?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That annoyed me a bit too, but I followed the pattern at line 21544 instead. I assumed that it was to make it clear by separating the "mode" of the function and the check itself, which made sense to me.

Copy link
Member

@cshung cshung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@Maoni0 Maoni0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@markples
Copy link
Member Author

/ba-g see above comment about #104883

@markples markples merged commit f38e661 into dotnet:main Jul 15, 2024
86 of 89 checks passed
@markples
Copy link
Member Author

/backport to release/8.0-staging

Copy link
Contributor

Started backporting to release/8.0-staging: https://github.com/dotnet/runtime/actions/runs/9962984209

markples pushed a commit that referenced this pull request Jul 25, 2024
…104992)

Backport of #104876 to release/8.0-staging

/cc @markples

## Customer Impact

- [x] Customer reported
- [ ] Found internally

GC can crash in region code if GC.RegisterForFullGCNotification is used.

## Regression

- [x] Yes
- [ ] No

Regression is part of general move to regions, but specific code was introduced in #48691

## Testing

The behavior is very difficult to repro and likely hadn't occurred before because it relies on a race condition, a region allocation failing, and GC.RegisterForFullGCNotification be used.  I built a custom GC that always triggers the region allocation failure path and was able to repro and verify the fix in GCPerfSim.

## Risk

Low: only impacts GC.RegisterForFullGCNotification path; only impacts notifications; may cause a notification for a full GC to be skipped, but only in a case where no other reasons for full gc occur -and- the code would be subject to the race condition and crash
@github-actions github-actions bot locked and limited conversation to collaborators Aug 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants