
Change BulkMoveWithWriteBarrier to be GC suspension friendly #27642

Merged: jkotas merged 3 commits into dotnet:master from jkotas:gcmemcpy on Nov 5, 2019

Conversation

@jkotas (Member) commented Nov 3, 2019

No description provided.

@jkotas (Member, Author) commented Nov 3, 2019

Micro-benchmark results (both programs assume using System; and using System.Threading;):

// Benchmark 1: copying a large object[]
// Mean pause time:
// Baseline: 32.9ms
// This change: 21ms
static void Main()
{
    new Thread(() =>
    {
        var src = new object[10_000_000];
        var dst = new object[10_000_000];
        for (;;) Array.Copy(src, dst, src.Length);
    }).Start();
    for (;;) { GC.Collect(); Thread.Sleep(1); }
}

// Benchmark 2: cloning a large byte[]
// Mean pause time:
// Baseline: 33.1ms
// This change: 0.8ms
static void Main()
{
    new Thread(() =>
    {
        var a = new byte[100_000_000];
        for (;;) a.Clone();
    }).Start();
    for (;;) { GC.Collect(); Thread.Sleep(1); }
}
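How numbers like these can be gathered (one plausible approach, not necessarily the one used here): with a blocking, non-concurrent workstation GC, GC.Collect() does not return until the collection, including thread suspension, has finished, so timing it directly approximates the pause. A minimal sketch; PauseProbe and the iteration count are invented for illustration:

using System;
using System.Diagnostics;
using System.Threading;

static class PauseProbe
{
    static void Main()
    {
        // Background thread that keeps bulk-copying object references,
        // mirroring the first benchmark above. IsBackground lets the
        // process exit once the measurement loop completes.
        new Thread(() =>
        {
            var src = new object[10_000_000];
            var dst = new object[10_000_000];
            for (;;) Array.Copy(src, dst, src.Length);
        }) { IsBackground = true }.Start();

        // Time each blocking collection; the elapsed time approximates the
        // pause observed by managed threads (suspension wait + GC work).
        var sw = new Stopwatch();
        double totalMs = 0;
        const int Iterations = 100;
        for (int i = 0; i < Iterations; i++)
        {
            sw.Restart();
            GC.Collect();
            sw.Stop();
            totalMs += sw.Elapsed.TotalMilliseconds;
            Thread.Sleep(1);
        }
        Console.WriteLine($"Mean pause: {totalMs / Iterations:F1} ms");
    }
}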

@GrabYourPitchforks (Member) commented:
The managed part seems sound to me, but I'm not familiar enough with QCalls and GC cooperative modes to comment on the native components.

@jkotas added the area-VM label Nov 3, 2019
@jkotas requested a review from VSadov November 3, 2019 07:51
@VSadov (Member) commented Nov 3, 2019

I would expect latencies in object Array.Copy to improve more if that was solely due to copying. Any idea what is the long pole now? Suspension itself?

The byte Clone benchmark would not have anything to suspend most of the time, hence the shorter pauses?

@VSadov (Member) commented Nov 3, 2019

Could be interesting to measure impact on GC.Collect(0) as well.

@@ -714,13 +714,75 @@ void QCALLTYPE Buffer::Clear(void *dst, size_t length)
memset(dst, 0, length);
}

#define BULK_MOVE_WITH_WRITE_BARRIER_BLOCK_SIZE 0x8000

@VSadov (Member) commented on this diff:

I think it could be simpler if “chunking” happens on the managed side. Then it would be regular FCalls without polling and frames. (The move itself does not trigger a GC or throw.)

The bulk move-with-write-barrier would just need to assert that it is not asked to move too much at once.

@jkotas (Member, Author) replied:

It may be simpler, but less efficient. It can either add a new frame that does the chunking (throughput overhead, for small block sizes in particular), or aggressively inline the chunking into every callsite to avoid the frame (code-size overhead). Which tradeoff would you pick if the chunking happens on the managed side?

@VSadov (Member) replied:

I was thinking of something like:

ManagedHelperProbablyInlineable()
{
    // get refs, byte count, and whether we need to copy backwards
    nint count = . . .;
    if (count < CHUNK_SIZE)
    {
        NativeHelper(ref src, ref dst, count, backwards);
    }
    else
    {
        ManagedChunkingHelperDoNotInline(mt, ref src, ref dst, count, backwards);
    }
}

ManagedChunkingHelperDoNotInline(MethodTable* mt, ref byte src, ref byte dst, nint count, bool backwards)
{
    int elementSize = . . .;
    while (count > 0)
    {
        int chunk = Align(min(count, CHUNK_SIZE), elementSize);
        NativeHelper(ref src, ref dst, chunk, backwards);
        count -= chunk;
        src = . . .;
        dst = . . .;
    }
}
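To make the shape of this proposal concrete, a self-contained sketch of the chunking pattern. NativeBulkMove stands in for the FCall and simply forwards to Buffer.MemoryCopy; the names, the 64 KB chunk size, and the forward-only copy direction are assumptions for illustration, not the PR's actual code (compile with unsafe blocks enabled):

using System;

static class ChunkedMove
{
    // 64 KB, a multiple of the pointer size, so chunk boundaries always
    // stay aligned for reference-typed elements. Assumed value.
    const nuint ChunkSize = 0x10000;

    // Stand-in for the native bulk-move FCall (hypothetical name).
    static unsafe void NativeBulkMove(byte* dst, byte* src, nuint byteCount)
        => Buffer.MemoryCopy(src, dst, (long)byteCount, (long)byteCount);

    // Fast path: small copies go straight to the native helper. Large
    // copies are split so the thread reaches a GC-safe point between calls.
    static unsafe void Move(byte* dst, byte* src, nuint byteCount)
    {
        if (byteCount <= ChunkSize)
        {
            NativeBulkMove(dst, src, byteCount);
            return;
        }
        MoveChunked(dst, src, byteCount);
    }

    // Kept out of the fast path, mirroring the DoNotInline helper above.
    // Forward-only for simplicity; the real helper would also need to
    // copy backwards when the ranges overlap.
    static unsafe void MoveChunked(byte* dst, byte* src, nuint byteCount)
    {
        while (byteCount > ChunkSize)
        {
            NativeBulkMove(dst, src, ChunkSize);
            dst += ChunkSize;
            src += ChunkSize;
            byteCount -= ChunkSize;
        }
        NativeBulkMove(dst, src, byteCount);
    }
}

Splitting the size check from the loop reflects the tradeoff discussed above: the common small-copy path stays cheap and inlineable, while the chunking loop lives in a separate, rarely taken helper.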

@jkotas (Member, Author) commented Nov 3, 2019

I would expect latencies in object Array.Copy to improve more if that was solely due to copying. Any idea what is the long pole now? Suspension itself?

The Array.Copy example is copying object references in Gen2. The GC (whether it is a Gen0 GC or a Gen2 GC) has to visit these modified references. The 21ms pause is the cost of visiting these modified references.

In the baseline, the pause is 12ms of waiting for threads to suspend plus 21ms of the GC doing work. With this change, it is just the 21ms of GC work.

I have picked this example specifically to show that this change helps, but copying large arrays of object references still has less-than-ideal performance characteristics.

@VSadov (Member) commented Nov 4, 2019

The sample copies null references. And even if they were real objects, they'd be old. I did not realize we update cards unconditionally. Yes, that introduces a lot of roots to scan and explains the remaining pause.

Checking for copying null pointers is not too hard, but may not be a typical enough scenario. Checking whether writes are cross-generational (even in a conservative way) is not an easy thing to do right now, especially in server GC.
Something to think about.
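A toy model of the "check for null" idea, not runtime code: simulate the card table as a bool[] over the destination and dirty a card only when a non-null reference is actually stored. The 2 KB card granularity and all names are assumptions for illustration:

using System;

static class NullSkippingBarrierModel
{
    const int CardShift = 11; // assume 2 KB per simulated card

    // Copies references and marks a simulated card only for non-null
    // stores; a null store creates no cross-generation pointer, so the
    // corresponding card would not need to be scanned later.
    static void CopySkippingNulls(object[] src, object[] dst, bool[] cards)
    {
        for (int i = 0; i < src.Length; i++)
        {
            object val = src[i];
            dst[i] = val;
            if (val != null)
                cards[(i * IntPtr.Size) >> CardShift] = true;
        }
    }
}

Since the current barrier dirties cards unconditionally, the Array.Copy benchmark above still pays the ~21ms of scanning even though every copied reference is null.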

@jkotas (Member, Author) commented Nov 5, 2019

I think it could be simpler if “chunking” happens on the managed side

Pushed update with this change. @VSadov How does the new version look?

@VSadov (Member) commented Nov 5, 2019

Looks nice!!
Array elements are always aligned when they contain references? (I can't think of a case otherwise, but just in case.)

@jkotas (Member, Author) commented Nov 5, 2019

Array elements are always aligned when contain references?

Yes.

@@ -37,8 +39,53 @@ internal static unsafe void _ZeroMemory(ref byte b, nuint byteLength)
[DllImport(RuntimeHelpers.QCall, CharSet = CharSet.Unicode)]
private static extern unsafe void __ZeroMemory(void* b, nuint byteLength);

// The maximum block size for the __BulkMoveWithWriteBarrier FCall. This is required to avoid GC starvation.
private const uint BulkMoveWithWriteBarrierChunk = 0x10000;
@VSadov (Member) commented on this diff:

65K - not too big?

@jkotas (Member, Author) replied:

Copying 65k should take less than 50 microseconds. I can make it smaller.
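Back-of-the-envelope, assuming rough copy bandwidth rather than measured numbers: 0x10000 bytes is 65,536 B; even at a pessimistic ~1.3 GB/s for a copy going through the write barrier, 65,536 B / 1.3e9 B/s ≈ 50 µs per chunk, and at a more typical ~10 GB/s it is closer to 6.5 µs. So the chunk size bounds each uninterruptible FCall to tens of microseconds at worst.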

@jkotas (Member, Author) commented Nov 5, 2019

/azp run coreclr-ci

@azure-pipelines commented:
Azure Pipelines successfully started running 1 pipeline(s).

@jkotas merged commit 5e1ef69 into dotnet:master Nov 5, 2019
@jkotas deleted the gcmemcpy branch November 5, 2019 16:12
jkotas added a commit to jkotas/coreclr that referenced this pull request Nov 8, 2019
jkotas added a commit that referenced this pull request Nov 8, 2019
jkotas added a commit to jkotas/coreclr that referenced this pull request Nov 8, 2019
stephentoub pushed a commit that referenced this pull request Nov 9, 2019
* Revert "Revert "Change BulkMoveWithWriteBarrier to be GC suspension friendly (#27642)" (#27758)"

This reverts commit b06f8a7.

* Fix wrong argument order for Unsafe.ByteOffset

* Add comment