Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an internal mode to Lock to have it use non-alertable waits #97227

Merged
merged 2 commits into from
Jan 24, 2024

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Jan 19, 2024

Fixes #97105

- Added an internal constructor that enables the lock to use non-alertable waits
- Non-alertable waits are not forwarded to `SynchronizationContext` wait overrides, are non-message-pumping waits, and are not interruptible
- Updated most of the uses of `Lock` in NativeAOT to use non-alertable waits
- Also simplified the fix in dotnet#94873 to avoid having to do the relevant null checks in various places along the wait path, by limiting the scope of the null checks to the initialization phase

Fixes dotnet#97105
@kouvel kouvel added this to the 9.0.0 milestone Jan 19, 2024
@kouvel kouvel self-assigned this Jan 19, 2024
@ghost
Copy link

ghost commented Jan 19, 2024

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #97105

Author: kouvel
Assignees: kouvel
Labels:

area-NativeAOT-coreclr

Milestone: 9.0.0

@kouvel
Copy link
Member Author

kouvel commented Jan 19, 2024

I think the main uses of Lock in NativeAOT that have showed up in issues and need to be fixed to use non-alertable waits are the uses during class construction. After brief scans of the other uses, I've updated most of the uses of Lock in NativeAOT to use non-alertable waits (except the one in SyncTable used by Monitor, which should be alertable). I'm not 100% sure about all of them though, please take a look.

@VSadov
Copy link
Member

VSadov commented Jan 19, 2024

What are the exact criteria for using alertable vs. nonalertable locks? Is that "may run user code while holding a lock" ?

@kouvel
Copy link
Member Author

kouvel commented Jan 19, 2024

What are the exact criteria for using alertable vs. nonalertable locks? Is that "may run user code while holding a lock" ?

Alertable waits are interruptible (for Thread.Interrupt), allow reentrance through APCs and message pumping, and allow overriding the wait behavior through SynchronizationContext. If any of those features are desirable, and the scenario is ok with all of those features, the wait probably should be alertable. If any of those features are undesirable, the wait should probably be non-alertable.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@VSadov
Copy link
Member

VSadov commented Jan 19, 2024

I understand the difference in the behavior. I was hoping if there could be some guidance on when one kind of lock should be used vs. another.

@kouvel
Copy link
Member Author

kouvel commented Jan 19, 2024

I understand the difference in the behavior. I was hoping if there could be some guidance on when one kind of lock should be used vs. another.

I'm thinking of it like, if the lock usage scenario needs one of those alertable features and is designed to handle all of them, then the lock should use alertable waits, and otherwise, the lock should use non-alertable waits. Perhaps a reasonable and limited guidance in low-level runtime code would be to use non-alertable waits by default unless there is a good reason to use alertable waits.

@jkotas
Copy link
Member

jkotas commented Jan 19, 2024

the lock usage scenario needs one of those alertable features and is designed to handle all of them

Properly constructed apps should never need the reentrant waits. Also, most code is not designed to handle the reentrant waits.

I think about the re-entrant waits as a legacy feature that works around COM STA related dead locks in buggy apps.

@VSadov
Copy link
Member

VSadov commented Jan 20, 2024

I think about the re-entrant waits as a legacy feature that works around COM STA related dead locks in buggy apps.

Me too. I can't think of a scenario that would deliberately rely on re-entrant waits.
I think the defaults regarding reentrant waits are wrong for lock, but I guess it is way too late to fix that.

So the rule for deciding whether to use non-alertable lock is really "everything in the runtime uses nonalertable locks, except the Monitor lock, for legacy reasons"

I wonder if there is a use case for exposing the parameter via public API? Maybe it will lead to people using synchronization context less often? Arguably, when people use synchronization context to override Wait behavior, they think about their locks (not the unknown locks in libraries), so an option to make their locks uninterruptible could be useful.

@@ -16,7 +16,7 @@ internal unsafe partial class FrozenObjectHeapManager
{
public static readonly FrozenObjectHeapManager Instance = new FrozenObjectHeapManager();

private readonly LowLevelLock m_Crst = new LowLevelLock();
private readonly Lock m_Crst = new Lock(useAlertableWaits: false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


private ushort WaiterStartTimeMs
{
get => (ushort)(_waiterStartTimeMsAndFlags >> 1);
Copy link
Member

@VSadov VSadov Jan 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ushort range in milliseconds is ~65 seconds. This reduces it by half. On the other hand we get these timings from ticks that are typically more coarse than 1 msec.

Overflows here are not dangerous per say and 30 seconds is still a long time, but perhaps it is still better to sacrifice precision rather than range?
I.E - store the value without shifting, just mask out the lowest bit when returning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the range matters much, the feature that relies on it operates on 100-ms granularity, which can be served just as well before and after this change.

I.E - store the value without shifting, just mask out the lowest bit when returning.

I didn't follow, can you elaborate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of something like the following:

        private ushort WaiterStartTimeMs
        {
            get => (ushort)(_waiterStartTimeMsAndFlags | 1);
            set => _waiterStartTimeMsAndFlags = (ushort)(value | (_waiterStartTimeMsAndFlags & 1));
        }

In fact the | 1 in the getter might not be necessary, considering ticks are coarser than 1msec.

It is ok either way though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of that?

Smaller chance of wrapping around if something gets stuck for 30 seconds. When it is a matter of seconds, things can realistically happen.

It is a very small benefit though since it would be rare and wrap around is not a huge deal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I don't think it's necessary to do something like that at the moment, there isn't an issue with the range.

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@kouvel
Copy link
Member Author

kouvel commented Jan 20, 2024

I wonder if there is a use case for exposing the parameter via public API?

It would probably be useful to expose non-alertable waits in general in public APIs. There are plenty of cases in the libraries where alertable waits were/are used but are not appropriate, and there are probably plenty more elsewhere. So far the issues involved may not have been common enough to incite a strong-enough need to do so, but the default wait behavior being alertable seems to have more potential to be problematic.

@jkotas
Copy link
Member

jkotas commented Jan 20, 2024

non-alertable waits in general in public APIs.

This PR bundles multiple features behind the nonAlertable

(1) Re-entrant COM waits
(2) EventSource instrumentation that leads to unexpected reentrancy
(3) Overriding via synchronization context
(4) Alertable options of Windows WaitForSingleObject/WaitForMulltipleObject APIs

(4) is not really problematic. The remaining 3 are the problematic behaviors.

Would be better to use a different name for this flag, and avoid associating with Windows alertable option? We have undocumented global switch https://github.com/search?q=repo%3Adotnet%2Fruntime%20APPDOMAIN_FORCE_TRIVIAL_WAIT_OPERATIONS that sounds like a reasonable name.

@kouvel
Copy link
Member Author

kouvel commented Jan 24, 2024

Looks like the remaining CI failures are known issues.

@kouvel kouvel merged commit e568f75 into dotnet:main Jan 24, 2024
176 of 179 checks passed
@kouvel kouvel deleted the LockFix2 branch January 24, 2024 23:09
agocke added a commit to agocke/runtime that referenced this pull request Feb 23, 2024
jkotas pushed a commit that referenced this pull request Feb 23, 2024
kouvel added a commit to kouvel/runtime that referenced this pull request Feb 23, 2024
kouvel added a commit to kouvel/runtime that referenced this pull request Feb 23, 2024
PR dotnet#97227 introduced a tick count masking issue where the stored waiter start time excludes the upper bit from the ushort tick count, but comparisons with it were not doing the appropriate masking. This was leading to a lock convoy on some heavily contended locks once in a while due to waiters incorrectly appearing to have waited for a long time.

Fixes dotnet#98021
@github-actions github-actions bot locked and limited conversation to collaborators Feb 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update some of NativeAOT's uses of Lock to use non-alertable waits
3 participants