Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port yield normalization from CoreCLR to Native AOT #103675

Merged
merged 23 commits into from
Jul 17, 2024

Conversation

eduardo-vp
Copy link
Member

@eduardo-vp eduardo-vp commented Jun 18, 2024

Porting the current way yield normalization is done to Native AOT.

The CoreCLR implementation was moved to src/coreclr/vm/yieldprocessornormalizedshared.cpp.

Both the CoreCLR file (src/coreclr/vm/yieldprocessornormalized.cpp) and the Native AOT file (src/coreclr/nativeaot/Runtime/yieldprocessornormalized.cpp) now share the same implementation.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 18, 2024
@eduardo-vp eduardo-vp added area-System.Threading and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 18, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

src/coreclr/inc/yieldprocessornormalized.h Outdated Show resolved Hide resolved
src/coreclr/inc/yieldprocessornormalized.h Outdated Show resolved Hide resolved
@@ -46,9 +46,6 @@ uint32_t WINAPI FinalizerStart(void* pContext)

g_pFinalizerThread = PTR_Thread(pThread);

// We have some time until the first finalization request - use the time to calibrate normalized waits.
EnsureYieldProcessorNormalizedInitialized();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the measurement going to be triggered when this is deleted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still trying to figure this out, I'm not very familiar with Native AOT in general so I'd appreciate any suggestions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we would need to call YieldProcessorNormalization::PerformMeasurement() from here or add a EnsureYieldProcessorNormalizedInitialized() entry point to the new code that simply calls YieldProcessorNormalization::PerformMeasurement()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you happen to know if this function is called every ~4 seconds or faster than that? Currently we let YieldProcessorNormalization::PerformMeasurement() run every ~4 s so if that's the case, I believe we may add here the same call as in CoreCLR

if (YieldProcessorNormalization::IsMeasurementScheduled())
    {
        GCX_PREEMP();
        YieldProcessorNormalization::PerformMeasurement();
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FinalizerStart function is called once per process. It is equivalent of FinalizerThreadStart function in regular CoreCLR.

I think you want to follow the same structure as in regular CoreCLR: Trigger the measurement from ScheduleMeasurementIfNecessary by calling RhEnableFinalization (it is equivalent of FinalizerThread::EnableFinalization in regular CoreCLR) and then add the measurement to loop in ProcessFinalizers().

Copy link
Member

@VSadov VSadov Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you happen to know if this function is called every ~4 seconds or faster than that?

I am not sure. The whole deal with measuring duration of something that is proportional to CPU cycle is not very precise, since the CPU cycle can change drastically and many times per second and will be different for every core. Unless machine is configured into HighPerformance power plan, every measurement is a bit of a coin toss and will produce the same result with the same error margins.

The main purpose of calibration is to continue using historically hard-coded spin counts in numerous places where we spinwait while allowing that to work on systems with vastly different pause durations (i.e. on post-skylake intel CPUs pause takes ~140 cycles, pre-skylake is about ~10 cycles). For such purpose the callibration is precise enough.

I am not sure about the value of redoing the measurement over and over.
Perhaps to support scenarios where a VM is migrated between pre/post skylake machines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can add a periodic call PerformMeasurement in NativeAOT and see what happens.

My guess - nothing will change, just a bit more time spent in PerformMeasurement.

Copy link
Member

@VSadov VSadov Jun 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is value in having the same behavior though.
If the re-measuring (or the whole calibration deal) could be somehow avoided or improved, it would make sense to do for both runtimes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC there's a good reason to keep re-doing measurements, so probably keeping this behaviour in Native AOT would be better, I believe @kouvel or @mangod9 may elaborate better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The measurements done are very short and can be perturbed by CPU activity, the rolling min helps to stabilize it over time.

src/coreclr/nativeaot/Runtime/yieldprocessornormalized.cpp Outdated Show resolved Hide resolved
@jkotas jkotas requested review from VSadov and kouvel June 22, 2024 01:34
@eduardo-vp eduardo-vp force-pushed the port-yield-norm-to-aot branch from af5ceaa to 6519b0b Compare July 2, 2024 18:48
src/coreclr/vm/yieldprocessornormalizedshared.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/synch.h Outdated Show resolved Hide resolved
src/coreclr/nativeaot/Runtime/FinalizerHelpers.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/yieldprocessornormalizedshared.cpp Outdated Show resolved Hide resolved
src/coreclr/utilcode/yieldprocessornormalized.cpp Outdated Show resolved Hide resolved
src/coreclr/nativeaot/Runtime/MiscHelpers.cpp Outdated Show resolved Hide resolved
src/coreclr/nativeaot/Runtime/startup.cpp Show resolved Hide resolved
Eduardo Manuel Velarde Polar added 2 commits July 2, 2024 14:40
@eduardo-vp eduardo-vp marked this pull request as ready for review July 2, 2024 23:09
Eduardo Manuel Velarde Polar added 2 commits July 2, 2024 17:04
@eduardo-vp eduardo-vp requested review from jkotas, kouvel and VSadov July 3, 2024 21:30
Copy link
Member

@kouvel kouvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@jkotas
Copy link
Member

jkotas commented Jul 4, 2024

What kind of testing you have done on the change to validate that it works as expected? Do we expect improvements in any perf benchmarks?

@kouvel
Copy link
Member

kouvel commented Jul 4, 2024

I don't think there would be any changes to benchmarks. I would expect that the CPU time spent during startup in the measurements would be a lot less (the new scheme measures lazily, and in narrower windows), that's about it. It would be good to measure that.

@eduardo-vp
Copy link
Member Author

I tested the following snippet and I checked that the 8 initial measurements were done and subsequent measurements every ~4 seconds were done as well.

using System;
using System.Threading;

int minutesToSpin = 10;
int startTicks = Environment.TickCount;
while (Environment.TickCount - startTicks < minutesToSpin * 60 * 1000)
{
    Thread.SpinWait(1000);
    Thread.Sleep(2000);
}

image

@eduardo-vp eduardo-vp merged commit d35f302 into dotnet:main Jul 17, 2024
87 of 89 checks passed
MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this pull request Jul 26, 2024
MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this pull request Jul 26, 2024
eduardo-vp pushed a commit to eduardo-vp/runtime that referenced this pull request Aug 13, 2024
eduardo-vp pushed a commit to eduardo-vp/runtime that referenced this pull request Aug 14, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants