SocketsHttpHandler: do not use DualMode sockets in default connection logic #45614

Closed

antonfirsov
Member

In some environments, IPv6 (and thus dual-stack) sockets do not work despite Socket.SupportsIPv6 returning true.

I'm reverting the default branch of HttpConnectionPool.ConnectToTcpHostAsync to use the logic in 628d99b (before #39524) to fix #44686.

The only thing I changed in the old code is comments.

@scalablecory @geoffkizer PTAL.

@ghost

ghost commented Dec 4, 2020

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: antonfirsov
Assignees: -
Labels:

area-System.Net.Http

Milestone: -

@antonfirsov antonfirsov requested a review from a team December 4, 2020 21:58
socket.NoDelay = true;
return new NetworkStream(socket, ownsSocket: true);
}
catch (Exception error) when (!(error is OperationCanceledException))
Member

Nit:

Suggested change
catch (Exception error) when (!(error is OperationCanceledException))
catch (Exception error) when (error is not OperationCanceledException)

Reads a bit nicer.

Member Author

@antonfirsov antonfirsov Dec 10, 2020

I prefer not to do refactors in this PR and to keep ConnectHelper.ConnectAsync as it was in its original 628d99b state; there are way too many things we may want to fix.

@stephentoub
Member

I don't understand something. If the problem is that the OS doesn't allow dual-stack sockets, why isn't Socket.DualMode throwing an exception when we try to set it?

return async ? ConnectAsync(host, port, cancellationToken) : new ValueTask<Stream>(Connect(host, port, cancellationToken));
}

private static async ValueTask<Stream> ConnectAsync(string host, int port, CancellationToken cancellationToken)
Member

This is all concerning to me. The premise of the SocketsHttpHandler.ConnectCallback's design (and not exposing the default implementation of how a socket is created or the default connect logic) was that it's just a couple of lines of code to emulate the default behavior... now it's over 100 lines?
cc: @geoffkizer

Contributor

We're going to introduce an easy mode static Socket.ConnectAsync API, so it'll go back to a 5 liner.

Member

Then shouldn't we add that and then use it here rather than doing it this way? I don't see why we're putting back this connect helper logic. If the concern is being able to more easily backport, that same static API can just be added as an internal SocketEx static in the backport or something like that.

Contributor

I agree with @stephentoub here

@scalablecory
Contributor

I don't understand something. If the problem is that the OS doesn't allow dual-stack sockets, why isn't Socket.DualMode throwing an exception when we try to set it?

It seems like the OS as a whole does, but the NIC driver or some firewall setting is blocking it at a later point.

@stephentoub
Member

So our message then is don't use DualMode because something somewhere may not support it even if the system says it does?

@wfurt
Member

wfurt commented Dec 5, 2020

From the conversation, it seems like the Azure VPN ignores the dual-mode socket, and the packet is then routed to the physical interface instead of the private VPN and rejected by the firewall (or lost because it missed the tunnel).
It seems like 3.1 worked by pure luck IMHO -> there is nothing really wrong with 5.0. The fundamental flaw lives in the Azure implementation as far as I can tell.

@stephentoub
Member

@wfurt, that's what it sounded like to me, too, which is why I'm skeptical of the direction of this change.

@davidfowl
Member

And AWS

@antonfirsov
Member Author

@stephentoub people are also hitting it in AWS:
https://twitter.com/bhop2112/status/1334943693179117570

So far I'd say there is a new user confirming the issue every week. Do we want to tell each of them to use the workaround? For some it's not possible, because they are using HttpClient through external libraries (2 users reported with Elastic Search).
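For reference, a minimal sketch of the kind of workaround being discussed, assuming the app can configure its own SocketsHttpHandler: a ConnectCallback that creates an IPv4-only socket instead of a dual-mode one. The names Ipv4OnlyHandler, Create and ConnectIpv4Async are made up for illustration, and this is not necessarily the exact snippet posted in #44686.

```csharp
using System.IO;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class Ipv4OnlyHandler
{
    // Returns a handler whose connections never use a dual-mode socket.
    public static SocketsHttpHandler Create() =>
        new SocketsHttpHandler { ConnectCallback = ConnectIpv4Async };

    private static async ValueTask<Stream> ConnectIpv4Async(
        SocketsHttpConnectionContext context, CancellationToken cancellationToken)
    {
        // Passing AddressFamily.InterNetwork explicitly yields an IPv4-only socket.
        var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
        {
            NoDelay = true
        };
        try
        {
            await socket.ConnectAsync(context.DnsEndPoint, cancellationToken).ConfigureAwait(false);
            return new NetworkStream(socket, ownsSocket: true);
        }
        catch
        {
            // Do not leak the socket if the connect fails or is canceled.
            socket.Dispose();
            throw;
        }
    }
}
```

Usage would be `new HttpClient(Ipv4OnlyHandler.Create())`. The obvious cost is that such a client cannot reach IPv6-only hosts, and it does not help users who get their HttpClient from a third-party library.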

@scalablecory
Contributor

I agree it seems like dual-mode should work fine here, and these environments are doing something wrong.

@stephentoub
Member

Do we want to tell each of them to use the workaround?

If the environments are broken, we should work with the powers that be to fix the environments. If alternatively we're saying dual-mode should never be used because it can't be trusted, we should be explicit about that, find and fix all such usage in all of our code, deprecate the relevant ctors/properties, etc etc.

@antonfirsov
Member Author

@stephentoub I don't think we are saying that dual-mode should never be used. In this solution we are switching to a default connect implementation that provides the highest compatibility, accounting for flawed environments. Users normally do not need to make such compatibility efforts in their ConnectCallbacks.

@antonfirsov
Member Author

antonfirsov commented Dec 5, 2020

I don't have a strong opinion here, I think I raised more or less the same concern in my #44686 (comment) as @stephentoub here in the PR, but @scalablecory @geoffkizer and @karelz voted for the trivial workaround. Now it seems opinions are changing.

There are basically 3 options:

  1. Do not "solve" this in .NET, keep using dual-stack. While waiting for the solution from the cloud providers, communicate the workaround, and leave the users who do not have control over HttpClient without help. Note that we don't have an ETA from Azure and no working relationship with AWS AFAIK. For sure this will block .NET 5 migration for many users living in the flawed Azure and AWS environments.
  2. Go with the cheapest possible solution (this PR), but probably communicate dual-stack ConnectCallback sample code in docs as suggested by @scalablecory .
  3. Something more sophisticated, that involves a new API in .NET 6, and an internal "SocketEx" utility in the backport. Obviously, this is gonna take longer, which is up to @karelz to prioritize then.

How do we plan to decide?

@stephentoub
Member

stephentoub commented Dec 5, 2020

I don't think we are saying that dual-mode should be never used

So when someone uses the Socket(SocketType, ProtocolType) ctor, directly or via some other library that does so, we do or do not expect that to work in their favorite cloud? I'm questioning the larger picture here.

@antonfirsov
Member Author

antonfirsov commented Dec 5, 2020

In my opinion the Socket(SocketType, ProtocolType) constructor is a very bad API, since it's not obvious from the API shape that (typically) it's going to create a dual-stack socket, which is still not (fully) supported in certain environments. I've seen us shoot ourselves in the foot several times because of that, even in our own code & tests. I would have created something like Socket.CreateDualMode(SocketType, ProtocolType) for that instead.

But I think we are mixing concerns in this discussion now:

  • How mature is dual-stack / IPv6 adoption? -- Users who interface with Socket directly usually know their environment and have their answers. I don't think we should obsolete fundamental APIs just because they don't work in certain flawed environments.
  • The problem comes when Socket, and the fact whether IPv6 works or not, becomes an implementation detail, like with HttpClient. We need to decide whether we want to provide a compatibility workaround or not.
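To make the constructor point concrete, here is a small sketch that reports the address family and dual-mode state of a socket. SocketCtorDemo and Describe are hypothetical names used only for this illustration. The result for the two-argument constructor depends on the machine, so no expected output is shown for it.

```csharp
using System.Net.Sockets;

// Hypothetical helper, for illustration only.
public static class SocketCtorDemo
{
    // DualMode is only meaningful for IPv6 sockets, so check the family first.
    public static string Describe(Socket socket)
    {
        bool dualMode = socket.AddressFamily == AddressFamily.InterNetworkV6 && socket.DualMode;
        return $"{socket.AddressFamily}, DualMode={dualMode}";
    }

    public static string DescribeImplicitFamily()
    {
        // No AddressFamily is passed here; on a machine where the OS reports
        // IPv6 support this is typically an IPv6 socket with DualMode enabled.
        using var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
        return Describe(socket);
    }

    public static string DescribeExplicitIpv4()
    {
        // Naming the family makes the choice visible at the call site.
        using var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        return Describe(socket);
    }
}
```

Comparing the two results on a given machine shows whether the short constructor silently opted into dual-mode there.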

@scalablecory
Contributor

To be clear, while I think this is an environment problem, I still feel we should make this change for compat purposes.

I do not think we should recommend against dual-mode, but I don't mind us making something a little bit more compatible by default when customers using common cloud environments are affected.

@scalablecory
Contributor

(That said, we also need to reach out to these platforms to understand why they do it this way, and ask them to fix it if there is no why and it's just a config problem)

@stephentoub
Member

stephentoub commented Dec 5, 2020

I'm pushing on this for a few reasons.

This isn't just about HttpClient. Yes, it's a key library, but there are others, and if new Socket(SocketType, ProtocolType) doesn't work in the clouds devs deploy their apps to, and if we're pushing developers to this (which we are, if for no other reason than it's the simplest ctor for this purpose, shows up first in IntelliSense because it's shorter than the three argument version, etc.), then we're setting ourselves up for failure as a cloud native stack that doesn't work in the cloud. It's easy to find other libraries relying on this as well, for example:

Are all of these broken if deployed to Azure or AWS? That sounds really strange, but if it's true, that's really bad. And if it's not true, then our understanding of the cause seems to have some gaps that we should fill in before pushing for a solution.

With regards to HttpClient, we've pushed the ConnectCallback as the thing that addresses everyone's woes. Need to bind to a particular interface? Use ConnectCallback. Need to configure TTL? Use ConnectCallback. Need to set Receive/SendTimeout? Use ConnectCallback. Etc. It's going to be used, and not infrequently. Which is a good thing, except that the solution we've put forth now apparently has implications we didn't previously understand. For the next year, developers that follow suit with what we've done in .NET 5 are either going to a) have to write more than a hundred lines of complicated code to get the "right" behavior, or b) suffer the same "this no longer works in the preeminent clouds" behavior, assuming that's the actual impact. That's not good. Even if we introduce a new Socket.ConnectAsync(...) static that puts this back to the simpler solution, that won't ship for another year. What is our recommendation until then?

I want to make sure we have our understanding and story straight before we rush in something "for compatibility", because while that may end up being the right answer, it has consequences. At the end of the day, the right answer may be to merge a fix into release/5.0 that reverts to the more complicated scheme, but let's really understand what's going on, having spoken with the folks at Azure and AWS, before we do so, as it has ramifications for lots of other things. And if it turns out that libraries and apps using new Socket(SocketType, ProtocolType) really are doomed for use in the cloud, then how can we not deprecate it? Who would we recommend use it at that point?

With regards to the actual fix, @scalablecory mentioned exposing a static Socket.ConnectAsync as the actual plan, but I don't see that API called out in #43935. Did I miss it? Also, this PR is targeting .NET 6, but if I'm understanding correctly, this is not what we'd intend to ship in .NET 6. Rather, we'd intend to add that "simple" static, and the change in this PR to System.Net.Http would be limited to tweaking this:

                    // Otherwise, create and connect a socket using default settings.
                    socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
                    if (async)
                    {
                        await socket.ConnectAsync(endPoint, cancellationToken).ConfigureAwait(false);
                    }
                    else
                    {
                        using (cancellationToken.UnsafeRegister(static s => ((Socket)s!).Dispose(), socket))
                        {
                            socket.Connect(endPoint);
                        }
                    }
                    return new NetworkStream(socket, ownsSocket: true);

to something like this:

                    // Otherwise, create and connect a socket using default settings.
                    if (async)
                    {
                        socket = await Socket.ConnectAsync(endPoint, cancellationToken).ConfigureAwait(false);
                    }
                    else
                    {
                        socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
                        using (cancellationToken.UnsafeRegister(static s => ((Socket)s!).Dispose(), socket))
                        {
                            socket.Connect(endPoint);
                        }
                    }

                    socket.NoDelay = true;
                    return new NetworkStream(socket, ownsSocket: true);

yes? So, is the thinking that the "simpler" solution is to churn all this code in master, port that back to release/5.0, and then immediately churn master again to the real .NET 6 solution?

@antonfirsov
Member Author

antonfirsov commented Dec 5, 2020

@stephentoub the API is indeed missing, we need to propose it: #44686 (comment)

having spoken with the folks at Azure and AWS

Just added you to the email thread we have with Azure about this. The problem is understood and a solution seems to be on the way, but there is no ETA. I have no idea who could contact AWS on this matter, or how, so that it gets priority.

@scalablecory
Contributor

@normj can you help us with AWS side of this?

@antonfirsov
Member Author

Context on AWS: #44686 (comment)

@davidfowl
Member

We had similar problems with Bind in Kestrel (it binds to both IPv4 and IPv6 by default), and we fall back to IPv4 if binding fails.

Cc @Tratcher @halter73
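A rough sketch of that kind of bind fallback, under the assumption that a failed dual-mode bind surfaces as a SocketException or NotSupportedException. This is not Kestrel's actual code; ListenerFactory and BindWithFallback are hypothetical names.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Hypothetical helper, for illustration only.
public static class ListenerFactory
{
    // Tries a dual-mode bind on IPv6Any first, then falls back to IPv4 only.
    public static Socket BindWithFallback(int port)
    {
        if (Socket.OSSupportsIPv6)
        {
            try
            {
                var dual = new Socket(AddressFamily.InterNetworkV6, SocketType.Stream, ProtocolType.Tcp);
                try
                {
                    dual.DualMode = true;
                    dual.Bind(new IPEndPoint(IPAddress.IPv6Any, port));
                    return dual;
                }
                catch
                {
                    dual.Dispose();
                    throw;
                }
            }
            catch (Exception ex) when (ex is SocketException || ex is NotSupportedException)
            {
                // Dual-mode bind failed; fall through to the IPv4 attempt below.
            }
        }

        var ipv4 = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        try
        {
            ipv4.Bind(new IPEndPoint(IPAddress.Any, port));
            return ipv4;
        }
        catch
        {
            ipv4.Dispose();
            throw;
        }
    }
}
```

Note that a bind fallback only helps when the failure is visible at bind time. The connect problem in this thread reportedly shows up later, once packets are routed, so the same trick may not carry over directly.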

@geoffkizer
Contributor

In my opinion the Socket(SocketType, ProtocolType) constructor is a very bad API, since it's not obvious from the API shape that (typically) it's going to create a dual-stack socket, which is still not (fully) supported in certain environments. I've seen us shoot ourselves in the foot several times because of that, even in our own code & tests. I would have created something like Socket.CreateDualMode(SocketType, ProtocolType) for that instead.

Totally agree. Doesn't help with the issue at hand, but something to consider for the future.

@geoffkizer
Contributor

@stephentoub

If the environments are broken, we should work with the powers that be to fix the environments. If alternatively we're saying dual-mode should never be used because it can't be trusted, we should be explicit about that,

I think we are saying both. First, the environments are broken and we should push them to fix this. But second, since this is how they behave today, you should never use dual-mode in these environments (or in code that could run in these environments) because it can't be trusted today.

find and fix all such usage in all of our code,

Yes -- given what we know about these environments, I don't think we should ever use dual mode sockets in our own code.

deprecate the relevant ctors/properties, etc etc.

I think that's going a little far. If you know your target environment supports dual-mode sockets, then feel free to use dual-mode sockets.

That said, as @antonfirsov pointed out above, the structure of the API makes it such that it's often not obvious that you are using dual-mode sockets. So perhaps we should do something to address that.

@geoffkizer
Contributor

@stephentoub

Are all of these broken if deployed to Azure or AWS? That sounds really strange, but if it's true, that's really bad.

I think the answer is: All of them are broken when deployed to Azure or AWS in certain deployment configurations.

What I don't have a good sense of is how common these configurations are. It would be good to get more data here.

But regardless, I think we have enough data to say that they are common enough to justify changing HttpClient to not use dual-mode sockets, and probably for other libraries as well, depending on how they are deployed.

{
// For synchronous connections, we can just create a socket and make the connection.
cancellationToken.ThrowIfCancellationRequested();
var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
Contributor

Isn't this creating a dual-mode socket?

Member Author

@antonfirsov antonfirsov Dec 10, 2020

Yeah ... I don't like this inconsistency either; I just took the state at the commit I mentioned. (Note that there was no sync path in 3.1.)

The bad thing is that the only alternatives I see are to either do sync over async or implement a sync version of DnsConnect within System.Net.Http.

{
// If a ConnectCallback was supplied, use that to establish the connection.
if (Settings._connectCallback != null)
try
Contributor

The old try catch wrapped both the ConnectCallback case as well as the default case. This one only wraps the ConnectCallback case. Why the change?

Member Author

There is an equivalent catch block in ConnectHelper.ConnectAsync:

catch (Exception error) when (!(error is OperationCanceledException))
{
    throw CreateWrappedException(error, host, port, cancellationToken);
}

internal static Exception CreateWrappedException(Exception error, string host, int port, CancellationToken cancellationToken)
{
    return CancellationHelper.ShouldWrapInOperationCanceledException(error, cancellationToken) ?
        CancellationHelper.CreateOperationCanceledException(error, cancellationToken) :
        new HttpRequestException($"{error.Message} ({host}:{port})", error, RequestRetryType.RetryOnNextProxy);
}

@geoffkizer
Contributor

It's going to be used, and not infrequently. Which is a good thing, except that the solution we've put forth now apparently has implications we didn't previously understand. For the next year, developers that follow suit with what we've done in .NET 5 are either going to a) have to write more than a hundred lines of complicated code to get the "right" behavior, or b) suffer the same "this no longer works in the preeminent clouds" behavior, assuming that's the actual impact.

I agree with all of this, but I'm not sure what to do about it.

I think we should be doing all of the following:

  • Push Azure/AWS/etc to support dual-mode sockets
  • Add a Task-based static Socket.ConnectAsync for 6.0, and use it in HttpClient
  • Backport that to 5.0 using private APIs so we aren't exposing new API

But unless we are willing to expose new API in 5.0, I think we are kinda stuck in a not ideal place here. Thoughts?

@antonfirsov
Member Author

antonfirsov commented Dec 10, 2020

Since it's our IPv6 support detection logic that is unreliable in these environments, I think we should also consider an environment variable to force-disable IPv6 as an alternative to this PR.

Pros: Not specific to HttpClient, fixes the problem for all users of Socket(SocketType, ProtocolType). Simple and quick solution. Communicates the fact that the problem is with the environment, not with .NET.

Cons: not automatic, .NET 5 migration is still a breaking change in such environments, that has to be actioned by the customers.

@stephentoub
Member

Since it's our IPv6 support detection logic that is unreliable in these environments

Can we improve that logic? For example, right now we only try to create the socket. What happens if we try to bind it? What happens if we try to connect it (over loopback)? Etc.
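As a sketch of what a stronger check could look like, the probe below goes beyond creating a socket: it binds an IPv6 listener on loopback and connects to it. Ipv6Probe and CanUseIpv6Loopback are hypothetical names, and this is an experiment rather than a proposed implementation. Whether a loopback probe would actually detect the failures reported here is unknown, since those appear to involve how traffic is routed off the machine.

```csharp
using System.Net;
using System.Net.Sockets;

// Hypothetical helper, for illustration only.
public static class Ipv6Probe
{
    // Returns true only if an IPv6 socket can be created, bound and
    // connected over the loopback interface.
    public static bool CanUseIpv6Loopback()
    {
        if (!Socket.OSSupportsIPv6)
        {
            return false;
        }

        try
        {
            using var listener = new Socket(AddressFamily.InterNetworkV6, SocketType.Stream, ProtocolType.Tcp);
            listener.Bind(new IPEndPoint(IPAddress.IPv6Loopback, 0)); // port 0: let the OS pick
            listener.Listen(1);

            using var client = new Socket(AddressFamily.InterNetworkV6, SocketType.Stream, ProtocolType.Tcp);
            client.Connect(listener.LocalEndPoint!);
            return client.Connected;
        }
        catch (SocketException)
        {
            // Any failure along the way is treated as "IPv6 is not usable".
            return false;
        }
    }
}
```

A probe like this would have to run once and be cached, because it opens real sockets; that cost is part of the trade-off against the current create-only check.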

@antonfirsov
Member Author

antonfirsov commented Dec 10, 2020

@stephentoub I would be happy to experiment with all of this ... if I had access to a repro environment, or if I knew how to create one.

There were some emails about this in the past 15 hours; we are working under time pressure now.

@ManickaP
Member

If the default connect logic will be whole of this:

private static async ValueTask<Stream> ConnectAsync(string host, int port, CancellationToken cancellationToken)
{
    // We use the static Socket.ConnectAsync with a SocketAsyncEventArgs, because this approach is:
    // 1. Cancellable
    // 2. Does not create Dual-stack sockets, which are unavailable in certain environments,
    //    see https://github.com/dotnet/runtime/issues/44686.
    var saea = new ConnectEventArgs();
    try
    {
        saea.Initialize(cancellationToken);

        // Configure which server to which to connect.
        saea.RemoteEndPoint = new DnsEndPoint(host, port);

        // Initiate the connection.
        if (Socket.ConnectAsync(SocketType.Stream, ProtocolType.Tcp, saea))
        {
            // Connect completing asynchronously. Enable it to be canceled and wait for it.
            using (cancellationToken.UnsafeRegister(static s => Socket.CancelConnectAsync((SocketAsyncEventArgs)s!), saea))
            {
                await saea.Builder.Task.ConfigureAwait(false);
            }
        }
        else if (saea.SocketError != SocketError.Success)
        {
            // Connect completed synchronously but unsuccessfully.
            throw new SocketException((int)saea.SocketError);
        }

        Debug.Assert(saea.SocketError == SocketError.Success, $"Expected Success, got {saea.SocketError}.");
        Debug.Assert(saea.ConnectSocket != null, "Expected non-null socket");

        // Configure the socket and return a stream for it.
        Socket socket = saea.ConnectSocket;
        socket.NoDelay = true;
        return new NetworkStream(socket, ownsSocket: true);
    }
    catch (Exception error) when (!(error is OperationCanceledException))
    {
        throw CreateWrappedException(error, host, port, cancellationToken);
    }
    finally
    {
        saea.Dispose();
    }
}

How are we going to communicate to people how to implement their own ConnectCallback if we're not providing this implementation on the outside and it's this long? I can imagine that some of the callback implementations will just want to enhance the logic, not completely replace it. With this, they'll have to copy paste 50 lines of code or reach for the method via reflection.

I understand this needs to be solved for 5.0 and rather quickly, but can't we at least think of something more user-friendly before we ship this? Also, isn't the same problem happening for the sync code path?

I really like the idea of solving this at the socket level.

@jonsagara

jonsagara commented Dec 10, 2020

I would be happy to experiment with all of this ... if I had access to a repro environment, or if I knew how to create one.

@antonfirsov Would it help if I gave you commit access to my repro repository? I think I'd need to change the deployment from kudu to a self-contained application via Azure DevOps so that you can deploy. kudu only supports specific, pre-installed versions of .NET. As long as your new test version of .NET 5 is available to be downloaded as a preview, I think I could make this work with a simple DevOps deployment.

Or, maybe I could give you publish access to the App Service so that you can push your own self-contained app directly?

@antonfirsov
Member Author

antonfirsov commented Dec 10, 2020

@ManickaP @jonsagara not decided yet, but we may ship a quickfix for the January update of .NET 5. If we do so we need to act ... very quickly.

If we want to catch that train, we need to merge either this PR or #45893 as is. We can then continue investigations about solving this at socket level either by making IPv6 detection more robust or by shipping & backporting new static Socket.ConnectAsync and Socket.Connect overloads.

@antonfirsov
Member Author

Would it help if I gave you commit access to my repro repository?

@jonsagara does pushing trigger deployment?

@jonsagara

@jonsagara does pushing trigger deployment?

It does, but I don't think that will work for this case. App Service kudu deployment currently only supports .NET SDK 5.0.100.

We can get around this by directly publishing your changes as a self-contained deployment. All I'd need to do is give you the Publish Profile from Azure Portal. Then, clone the repo, import the Publish Profile, and you should be able to publish and test your changes as a self-contained application.

I just tested this using a currently unsupported build (.NET SDK 5.0.101), and the app started and still reproduces the bug.

I can't share the Publish Profile publicly, so if you're interested, please let me know the best place to send it. Thanks!

@antonfirsov
Member Author

@jonsagara DM me on twitter: https://twitter.com/antonfrv

@jonsagara

@antonfirsov Will do. Will you please follow me back so that I can send you a message? https://twitter.com/jonsagara

@karelz
Member

karelz commented Dec 15, 2020

Closing the PR as we do not plan to fix the problem this way anymore -- see #44686 (comment) for details.

@karelz karelz closed this Dec 15, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 14, 2021
@karelz karelz added this to the 6.0.0 milestone Jan 26, 2021
Development

Successfully merging this pull request may close these issues.

Azure App Service HTTP requests to Azure VNet IP Addresses fail after upgrading to .NET 5.0