Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] implement Interlocked.ReadMemoryBarrier intrinsic #86842

Merged
merged 10 commits into from
May 31, 2023
5 changes: 5 additions & 0 deletions src/coreclr/nativeaot/Runtime/portable.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -469,6 +469,11 @@ COOP_PINVOKE_HELPER(void, RhpMemoryBarrier, ())
PalMemoryBarrier();
}

COOP_PINVOKE_HELPER(void, RhpReadMemoryBarrier, ())
{
PalReadMemoryBarrier();
}

#if defined(USE_PORTABLE_HELPERS)
EXTERN_C NATIVEAOT_API void* __cdecl RhAllocateThunksMapping()
{
Expand Down
5 changes: 5 additions & 0 deletions src/coreclr/nativeaot/Runtime/unix/PalRedhawkInline.h
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,11 @@ FORCEINLINE void PalMemoryBarrier()
__sync_synchronize();
}

FORCEINLINE void PalReadMemoryBarrier()
{
__atomic_thread_fence(__ATOMIC_ACQUIRE);
}

#define PalDebugBreak() abort()

FORCEINLINE int32_t PalGetLastError()
Expand Down
17 changes: 13 additions & 4 deletions src/coreclr/nativeaot/Runtime/windows/PalRedhawkInline.h
Original file line number Diff line number Diff line change
@@ -1,6 +1,10 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

#if defined(HOST_ARM64)
#include <arm64intr.h>
#endif

// Implementation of Redhawk PAL inline functions

EXTERN_C long __cdecl _InterlockedIncrement(long volatile *);
Expand Down Expand Up @@ -121,6 +125,7 @@ EXTERN_C void __faststorefence();
#pragma intrinsic(__faststorefence)
#define PalMemoryBarrier() __faststorefence()

#define PalReadMemoryBarrier() __nop()

#elif defined(HOST_ARM)

Expand All @@ -130,11 +135,13 @@ EXTERN_C void __dmb(unsigned int _Type);
#pragma intrinsic(__dmb)
FORCEINLINE void PalYieldProcessor()
{
__dmb(0xA /* _ARM_BARRIER_ISHST */);
__dmb(_ARM_BARRIER_ISHST);
__yield();
}

#define PalMemoryBarrier() __dmb(0xF /* _ARM_BARRIER_SY */)
#define PalMemoryBarrier() __dmb(_ARM_BARRIER_ISH)

#define PalReadMemoryBarrier() __dmb(_ARM_BARRIER_ISH)

#elif defined(HOST_ARM64)

Expand All @@ -144,11 +151,13 @@ EXTERN_C void __dmb(unsigned int _Type);
#pragma intrinsic(__dmb)
FORCEINLINE void PalYieldProcessor()
{
__dmb(0xA /* _ARM64_BARRIER_ISHST */);
__dmb(_ARM64_BARRIER_ISHST);
__yield();
}

#define PalMemoryBarrier() __dmb(0xF /* _ARM64_BARRIER_SY */)
#define PalMemoryBarrier() __dmb(_ARM64_BARRIER_ISH)

#define PalReadMemoryBarrier() __dmb(_ARM64_BARRIER_ISHLD)

#else
#error Unsupported architecture
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -795,6 +795,10 @@ internal static unsafe partial void RhEventPipeInternal_WriteEventData(
[RuntimeImport(RuntimeLibrary, "RhpMemoryBarrier")]
internal static extern void MemoryBarrier();

[MethodImplAttribute(MethodImplOptions.InternalCall)]
[RuntimeImport(RuntimeLibrary, "RhpReadMemoryBarrier")]
internal static extern void ReadMemoryBarrier();

[Intrinsic]
[MethodImplAttribute(MethodImplOptions.InternalCall)]
[RuntimeImport(RuntimeLibrary, "acos")]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -204,6 +204,14 @@ public static void MemoryBarrier()
}
#endregion

#region ReadMemoryBarrier
[Intrinsic]
internal static void ReadMemoryBarrier()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intrinsic methods can be implemented by recursive expansion like this:

[Intrinsic]
internal static void ReadMemoryBarrier();

It saves you from all plumbing for FCalls, implementing it in the unmanaged PAL. Also, it guarantees that the intrinsic expansion matches the non-intrinsic implementation.

It would be better to use this here. I think you can redo both MemoryBarrier and ReadMemoryBarrier, for both naot and coreclr like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, this is the logic in the JIT that takes care of it:

// The recursive non-virtual calls to Jit intrinsics are must-expand by convention.
mustExpand = gtIsRecursiveCall(method) && !(methodFlags & CORINFO_FLG_VIRTUAL);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried putting failing implementations and it looks like Debug can reach them. Are we doing intrinsic substitution differently in Debug?

Copy link
Member Author

@VSadov VSadov May 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see - if recursive -> must always do intrinsic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you should be able to move these to src\libraries\System.Private.CoreLib\src\System\Threading\Interlocked.cs. It would be nice to add a test that creates a delegate pointing to MemoryBarrier, and calls the delegate - to make sure that the recursive intrinsic expansion is kicking in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can redo both MemoryBarrier and ReadMemoryBarrier, for both naot and coreclr like this.

And nonobject CompareExchange, I suppose.
I remember I did not like something about the asm implementation. It is hard to make it right - make sure it is a full fence, uses 8.1 atomics where available. It’d be better to just leave it to the compiler to emit the best.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And nonobject CompareExchange, I suppose.

We do not have CompareExchange intrinsic expansion implemented for all platforms in RyuJIT. It would have to be under ifdefs that must be in sync with what's implemented in the JIT. I would leave it for separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched MemoryBarrier and ReadMemoryBarrier to be self-referring intrinsics.
PalMemoryBarrier in NativeAOT is still needed as we call it from native code, but otherwise a lot of code was removed.

{
RuntimeImports.ReadMemoryBarrier();
}
#endregion

#region Read
public static long Read(ref long location)
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -152,32 +152,16 @@ internal static CastResult TryGet(nuint source, nuint target)
// we must read in this order: version -> [entry parts] -> version
// if version is odd or changes, the entry is inconsistent and thus ignored
uint version = Volatile.Read(ref pEntry._version);

#if CORECLR
// in CoreCLR we do ordinary reads of the entry parts and
// Interlocked.ReadMemoryBarrier() before reading the version
nuint entrySource = pEntry._source;
#else
// must read this before reading the version again
nuint entrySource = Volatile.Read(ref pEntry._source);
#endif

// mask the lower version bit to make it even.
// This way we can check if version is odd or changing in just one compare.
version &= unchecked((uint)~1);

if (entrySource == source)
{

#if CORECLR
// in CoreCLR we do ordinary reads of the entry parts and
// Interlocked.ReadMemoryBarrier() before reading the version
nuint entryTargetAndResult = pEntry._targetAndResult;
#else
// must read this before reading the version again
nuint entryTargetAndResult = Volatile.Read(ref pEntry._targetAndResult);
#endif

// target never has its lower bit set.
// a matching entryTargetAndResult would the have same bits, except for the lowest one, which is the result.
entryTargetAndResult ^= target;
Expand All @@ -189,10 +173,7 @@ internal static CastResult TryGet(nuint source, nuint target)
// - use acquires for both _source and _targetAndResults or
// - issue a load barrier before reading _version
// benchmarks on available hardware (Jan 2020) show that use of a read barrier is cheaper.

#if CORECLR
Interlocked.ReadMemoryBarrier();
#endif

if (version != pEntry._version)
{
Expand Down