
fix pinning in quic #52368
Merged · 5 commits merged into dotnet:main on May 12, 2021

Conversation

wfurt (Member) commented May 6, 2021

This should fix the native crashes and corrupted buffers. It would be great if you could give it a try, @CarnaViire.

To avoid constantly pinning SendQuicBuffers, I decided to allocate native memory (AllocHGlobal), since that buffer is only ever passed to native msquic. I also added a Pending state to indicate that we are actively writing and waiting for the msquic callback.

I still see some timeouts on my 2-core VM.
I will submit a separate PR to make some adjustments and avoid hardcoded values (e.g. use PassingTestTimeout).

Contributes to #52047
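
A minimal sketch of the idea described above; QuicBuffer, SendState, and all member names here are simplified stand-ins for illustration, not the PR's actual types:

using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Illustrative stand-in for msquic's QUIC_BUFFER; the real interop struct lives
// in the System.Net.Quic interop layer.
internal struct QuicBuffer
{
    public uint Length;
    public IntPtr Buffer;
}

internal enum SendState { None, Pending, Finished }

internal sealed class SendBufferSketch : IDisposable
{
    // One unmanaged QuicBuffer slot, allocated once and reused for every send,
    // so nothing managed has to stay pinned between sends.
    private IntPtr _sendQuicBuffers = Marshal.AllocHGlobal(Marshal.SizeOf<QuicBuffer>());
    private MemoryHandle _payloadHandle;

    public SendState State { get; private set; }

    public unsafe IntPtr PrepareSend(ReadOnlyMemory<byte> payload)
    {
        // Pin the caller's payload only for the duration of this send.
        _payloadHandle = payload.Pin();

        var quicBuffer = (QuicBuffer*)_sendQuicBuffers;
        quicBuffer->Length = (uint)payload.Length;
        quicBuffer->Buffer = (IntPtr)_payloadHandle.Pointer;

        // Pending: the buffer has been handed to msquic; do not touch or free it
        // until the SEND_COMPLETE callback arrives.
        State = SendState.Pending;
        return _sendQuicBuffers;
    }

    // Called from the msquic SEND_COMPLETE callback.
    public void OnSendComplete()
    {
        _payloadHandle.Dispose();
        State = SendState.Finished;
    }

    public void Dispose()
    {
        if (_sendQuicBuffers != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_sendQuicBuffers);
            _sendQuicBuffers = IntPtr.Zero;
        }
    }
}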

ghost commented May 6, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: wfurt
Assignees: -
Labels: area-System.Net
Milestone: -

public SafeMsQuicBufferHandle(int count)
    : base(IntPtr.Zero, ownsHandle: true)
{
    IntPtr buffer = Marshal.AllocHGlobal(sizeof(QuicBuffer) * count);
Member

We shouldn't need to do this? What pattern is causing this to be required?

Member

Do you mean AllocHGlobal or wrapping it in SafeMsQuicBufferHandle?

davidfowl (Member), May 6, 2021

Both. We do a similar thing in the ASP.NET Core servers and in sockets and we don't use this pattern AFAIK. What makes MSQuic unique here?

Member

I agree that the SafeHandle should be unnecessary here. The code has to be careful about managing the lifetimes of the async buffers. SafeHandle does not make it easier.

It is an interesting question whether it is better to use unmanaged memory or POH-allocated memory here.

wfurt (Member Author)

Here is another example from SslStream.

const int NumSecBuffers = 4; // header + data + trailer + empty
Interop.SspiCli.SecBuffer* unmanagedBuffer = stackalloc Interop.SspiCli.SecBuffer[NumSecBuffers];
Interop.SspiCli.SecBufferDesc sdcInOut = new Interop.SspiCli.SecBufferDesc(NumSecBuffers)
{
    pBuffers = unmanagedBuffer
};
fixed (byte* outputPtr = output)

Because the buffers only need to live through the one native call, we can allocate the buffer array on the stack there. For msquic we need to wait until HandleEventSendComplete() is invoked, so we need to keep the memory alive past the sending function.

Anyway, we could use pinning if we wanted to, but as I mentioned, this looked simpler and means less work on each call.

Member

> For example you have to be careful about the situation when somebody disposes the object while the operation is still in flight.

Yep, I'm aware, and we also go back and forth on how useful SafeHandles are in certain places, but I'm surprised SafeHandle + native buffer is the solution here. Are we pooling these things?

wfurt (Member Author)

There is no pool at the moment. The main goal was to stabilize the test runs, avoid corruption and crashes, and unblock @ManickaP and @CarnaViire. I did 700+ runs and did not see any failure. I think it would be better to refactor and/or optimize once we have a solid CI in place.

I think we will need more work toward #5262 and #32142. The SafeHandle primarily helps with p/invokes. For writes, we will need to preserve the memory longer unless we change the handout model.

Member

SafeHandle does not help you much here. You are directly managing the lifetime of the MemoryHandles for the individual buffers, so you can directly manage the lifetime of the array of buffer pointers as well. SafeHandle here is a lot of overhead for negligible benefit.

wfurt (Member Author)

I understand, @jkotas. When we agree on #5262 and #32142, I will either (1) use native memory directly or (2) fall back to pinning on each call. I will also add tests for concurrent dispose(s) and IO; I'm pretty sure we still have gaps there (and possibly investigate and fix #52048 as well).

antonfirsov (Member) commented May 6, 2021

Why not use the Pinned Object Heap instead of native allocation & safe handles? (GC.AllocateUninitializedArray<QuicBuffer>(1, pinned: true))
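
A minimal sketch of that suggestion, with QuicBuffer as an illustrative stand-in for the interop struct (not the actual runtime type):

using System;
using System.Runtime.InteropServices;

// Illustrative stand-in for the interop QuicBuffer struct.
internal struct QuicBuffer
{
    public uint Length;
    public IntPtr Buffer;
}

internal static class PohSketch
{
    public static (QuicBuffer[] Buffers, IntPtr NativePtr) AllocatePinned(int count)
    {
        // pinned: true places the array on the Pinned Object Heap, so its address
        // is stable and can be handed to native code without a GCHandle.
        QuicBuffer[] buffers = GC.AllocateUninitializedArray<QuicBuffer>(count, pinned: true);
        IntPtr nativePtr = Marshal.UnsafeAddrOfPinnedArrayElement(buffers, 0);

        // Note: the managed reference must still be kept alive until the native
        // callback has finished with the buffer, otherwise the array can be collected.
        return (buffers, nativePtr);
    }
}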

CarnaViire (Member)

> It would be great if you could give it a try, @CarnaViire

I've tried the current fix and it worked for me -- I haven't seen either AVEs or buffer corruptions across multiple runs. System.Net.Quic.Tests.MsQuicTests.WriteTests also seems to be fixed by this, so the ActiveIssue attribute can be removed from it too.

jkotas (Member) commented May 6, 2021

> Why not use the Pinned Object Heap instead of native allocation & safe handles?

One still has to be very careful about lifetime management and ensuring that the pinned buffer is not collected while the unmanaged callback still works on it. Using POH won't make the code simpler.

Also, POH has problematic performance characteristics today. It always allocates in Gen2, and thus using POH will unnecessarily put pressure on the expensive Gen2 GC collections.

scalablecory (Contributor) left a comment

Let's get this merged to unblock other work. We have yet to do a full optimization pass, and I expect we will revisit this (likely removing the new safe handle type) at that point.

jkotas (Member) commented May 12, 2021

> Let's get this merged to unblock other work.

I do not see the point of merging this with the useless SafeHandle. It is like a 10-minute edit to delete it.

wfurt (Member Author) commented May 12, 2021

> Let's get this merged to unblock other work.
>
> I do not see the point of merging this with the useless SafeHandle. It is like a 10-minute edit to delete it.

and keep the native memory or go back to pinning?

jkotas (Member) commented May 12, 2021

Keep the native memory.

wfurt and others added 2 commits May 11, 2021 22:45
…ons/MsQuic/MsQuicStream.cs

Co-authored-by: Cory Nelson <phrosty@gmail.com>
wfurt (Member Author) commented May 12, 2021

I removed the SafeMsQuicBufferHandle and did ~30 test runs on the whole QUIC test suite. I will let the tests run in a loop overnight.


// Handle to pinned SendQuicBuffers.
public GCHandle SendHandle;
public IntPtr SendQuicBuffers = Marshal.AllocHGlobal(sizeof(QuicBuffer));
Member

It would be better to allocate this only once it is needed, with the right size.

As written, SendQuicBuffers may be allocated and never used, or it may leak if the initialization fails mid-flight.

wfurt (Member Author)

OK. I can move it to the constructor at some point or add logic to SendReadOnlyMemoryAsync(). The old code always allocated an array of size 1 to optimize for the ReadOnlyMemory case, e.g. use from HttpClient.

jkotas (Member) left a comment

LGTM otherwise

ManickaP (Member) left a comment

Just questions to understand the change better. Otherwise, LGTM.

@@ -781,15 +777,15 @@ private static uint HandleEventSendComplete(State state, ref StreamEvent evt)

 private static void CleanupSendState(State state)
 {
-    if (state.SendHandle.IsAllocated)
+    lock (state)
Member

Why the lock here?

wfurt (Member Author)

We lock the state in other places when we transition, so I think we should either be consistent or avoid the lock completely via some other mechanism (like preventing duplicate operations).
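
For illustration, a heavily simplified sketch of what consistently lock-guarded transitions could look like on both the send and the completion side; the names follow the PR's diff, but the shape, the SendState values shown, and the CleanupSendStateLocked helper are illustrative:

// Illustrative fragment only, not the PR's actual code.
private void BeginSend()
{
    lock (_state)
    {
        Debug.Assert(_state.SendState != SendState.Pending);
        _state.SendState = SendState.Pending;   // a native send is now in flight
    }
    // ... pin buffers, fill _state.SendQuicBuffers, call msquic's StreamSend ...
}

private static void OnSendComplete(State state)
{
    lock (state)
    {
        CleanupSendStateLocked(state);          // hypothetical helper: unpin the finished send's buffers
        state.SendState = SendState.Finished;
    }
}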

@@ -798,6 +794,12 @@ private static void CleanupSendState(State state)
    ReadOnlyMemory<byte> buffer,
    QUIC_SEND_FLAGS flags)
{
    lock (_state)
    {
        Debug.Assert(_state.SendState != SendState.Pending);
Member

I guess we want to prevent overlapping sends. Shouldn't this rather be a condition that throws a QuicException...?

wfurt (Member Author)

This was not meant as a guard against overlapping sends (I think that should be done much earlier).
I added this to make sure our internal state logic (and msquic) always moves on to some other state.

_state.BufferArrays[0] = handle;
if (_state.SendQuicBuffers == IntPtr.Zero)
{
    _state.SendQuicBuffers = Marshal.AllocHGlobal(sizeof(QuicBuffer));
Member

So the major difference here is that we allocate SendQuicBuffers, i.e. the QuicBuffer*, directly in unmanaged memory, while originally we were using pinned managed memory via GCHandle.Alloc(_state.SendQuicBuffers, GCHandleType.Pinned).

What was wrong with the original approach? Was it Marshal.UnsafeAddrOfPinnedArrayElement(_state.SendQuicBuffers, 0)? Should that have been _state.SendHandle.AddrOfPinnedObject()?

This comment is just me trying to properly understand what's going on here. It has absolutely no influence on mergeability of this PR 😄

wfurt (Member Author)

Yes, we could keep pinning. But then we do more operations on each send and we need to maintain the handle. So I felt this would be simpler as the SendQuicBuffers are really only consumed by native code.

Contributor

The general point here is that since we only use this array as unmanaged memory, there's not really any benefit to allocating it as managed memory, and some nontrivial cost (overhead of pin/unpin, additional GC overhead, etc).

Member

But how does this solve the problem with byte mixing and AVE? Do we know the exact root cause of that?

jkotas (Member), May 13, 2021

I have doubts that this change is actually fixing the root cause of the crash. It is probably just moving it around.

This type has a number of subtle issues, like #52048, that will lead to use-after-free and similar crashes. Somebody will need to do a focused pass over it to fix them.

Contributor

> But how does this solve the problem with byte mixing and AVE? Do we know the exact root cause of that?

Yeah, you are right. Seems like this probably doesn't fix the root cause. It might make it less likely to happen, because the memory will never move; but I don't think we actually understand the root cause yet.

(I had assumed we were previously allocating the array on the stack, per discussion above -- but we weren't, so that wasn't the root problem.)

It would probably help to clear out the BufferArray entries when we dispose them, here: https://github.com/dotnet/runtime/pull/52368/files#diff-55ed6a1c110b1a90d4900bce3075ab49f1d6212c223e69b48f711f4084d264acR787

If we have a use-after-free issue, that would help find it.
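
A sketch of that suggestion, shaped like the PR's CleanupSendState but simplified and illustrative:

// Illustrative only: dispose the pinned handles and clear the slots so a
// use-after-free shows up as a default/empty entry instead of silently reusing
// stale pinned memory.
private static void CleanupSendState(State state)
{
    lock (state)
    {
        MemoryHandle[] handles = state.BufferArrays;
        for (int i = 0; i < handles.Length; i++)
        {
            handles[i].Dispose();    // unpin the managed buffer (safe on a default handle)
            handles[i] = default;    // clear the entry to make accidental reuse visible
        }
    }
}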

Contributor

It seems suspicious that we are calling CleanupSendState in Dispose -- how do we know that msquic isn't still holding our buffer array?

Contributor

@jkotas

> This type has a number of subtle issues, like #52048, that will lead to use-after-free and similar crashes. Somebody will need to do a focused pass over it to fix them.

Yeah, I agree we need a pass here. Do you have specific concerns? Dispose in particular looks questionable to me; anything else?

Member

> anything else?

Handling of different failure modes. For example:

https://github.com/dotnet/runtime/pull/52368/files#diff-55ed6a1c110b1a90d4900bce3075ab49f1d6212c223e69b48f711f4084d264acR943-R945

- If AllocHGlobal fails, we will think that we have space for the old SendBufferMaxCount, but SendQuicBuffers is actually going to be null. This will lead to a NullReferenceException on retry.
- If new MemoryHandle[array.Length] fails, we will think that we have space for the new SendBufferMaxCount, but BufferArrays is actually only going to have space for the old SendBufferMaxCount. This will lead to an IndexOutOfRangeException on retry.
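
As an illustration of the ordering being described (field names follow the PR; the method itself is hypothetical): perform every allocation that can fail before mutating any fields, so a failed resize leaves the old, still-consistent state behind.

// Illustrative fragment: allocate first, publish the new state only after both
// allocations have succeeded.
private static unsafe void ResizeSendBuffers(State state, int newCount)
{
    // Both of these can throw (e.g. OutOfMemoryException); nothing has been mutated yet.
    IntPtr newQuicBuffers = Marshal.AllocHGlobal(sizeof(QuicBuffer) * newCount);
    MemoryHandle[] newHandles = new MemoryHandle[newCount];

    // Swap in the new buffers and only then record the new capacity.
    if (state.SendQuicBuffers != IntPtr.Zero)
    {
        Marshal.FreeHGlobal(state.SendQuicBuffers);
    }

    state.SendQuicBuffers = newQuicBuffers;
    state.BufferArrays = newHandles;
    state.SendBufferMaxCount = newCount;
}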

wfurt (Member Author)

I'm going to write new tests and clean up the shutdown path more. I would generally expect that if an allocation fails, the stream (or the whole system) will not be in a usable state. I can certainly split the allocations and the assignments to make sure the state stays consistent.

@wfurt wfurt merged commit 8eb7692 into dotnet:main May 12, 2021
@wfurt wfurt deleted the quic_52047 branch May 12, 2021 20:06
@karelz karelz added this to the 6.0.0 milestone May 20, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 19, 2021