Endless OperationCancelledException errors in GrpcCall after Cancellation in SocketConnectivitySubchannelTransport.TryConnectAsync #2420
Comments
Thanks for the detailed report. I can reproduce the error by adding the code you suggested. However, that doesn't answer the question of how the subchannel got in that state. I haven't reproduced it through normal usage. I have a theory about what is causing this. PR: #2422
Hi, @JamesNK. Thank you for such a quick solution. We've tested it in our environment and the problem no longer reproduces. Could you please tell us when you are going to release a version of grpc-dotnet with this fix?
https://www.nuget.org/packages/Grpc.Net.Client/2.63.0-pre1 is on NuGet with this change. A non-preview version will come in about 2 weeks.
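If it helps, one way to pull in that preview build is to reference the prerelease version explicitly, for example:

dotnet add package Grpc.Net.Client --version 2.63.0-pre1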
Hi, @JamesNK. We have another case leading to the problem.
Reproduction: we use gRPC to connect to a third-party server written in Go (if it is important), with TLS certificates to authenticate. Code used to authenticate:

.ConfigureChannel((provider, options) =>
{
// very simple factory with very simple resolver
ResolverFactory factory = provider.GetRequiredService<MyResolverFactory>();
var grpcServices = new ServiceCollection();
grpcServices.AddSingleton(factory);
options.ServiceProvider = grpcServices.BuildServiceProvider();
options.Credentials = ChannelCredentials.SecureSsl;
options.ServiceConfig = new ServiceConfig
{
LoadBalancingConfigs = { new RoundRobinConfig() },
MethodConfigs = {
new MethodConfig
{
RetryPolicy = new RetryPolicy
{
MaxAttempts = ...,
RetryableStatusCodes =
{
StatusCode.Aborted,
StatusCode.DeadlineExceeded,
StatusCode.ResourceExhausted,
StatusCode.Unavailable
}
}
}
}
};
})
.ConfigurePrimaryHttpMessageHandler(serviceProvider =>
{
var socketsHandler = new SocketsHttpHandler
{
ConnectTimeout = ...,
PooledConnectionIdleTimeout = ...,
KeepAlivePingDelay = ...,
KeepAlivePingTimeout = ...,
EnableMultipleHttp2Connections = true
};
var clientX509Certificate = X509Certificate2.CreateFromPem(ClientCert, ClientKey);
var clientX509CertificatesCollection = new X509CertificateCollection(
new X509Certificate[]
{
clientX509Certificate
});
socketsHandler.SslOptions.ClientCertificates = clientX509CertificatesCollection;
var certificatesChain = ParseCertificates(certificates.CaCertChain);
socketsHandler.SslOptions.RemoteCertificateValidationCallback = CreateCertificateValidator(certificatesChain);
return socketsHandler;
});
...
private static RemoteCertificateValidationCallback CreateCertificateValidator(X509Certificate2[] ca)
{
return (_, cert, chain, _) =>
{
if (chain is null || cert is null)
{
return false;
}
chain.ChainPolicy.TrustMode = X509ChainTrustMode.CustomRootTrust;
chain.ChainPolicy.CustomTrustStore.AddRange(ca);
return chain.Build(new X509Certificate2(cert));
};
}

I use this simple code to reproduce the issue:

var client = host.Services.GetRequiredService<3rdPartyService.3rdPartyServiceClient>();
for (var i = 0; i < 200; i++)
{
try
{
logger.LogWarning("##### START LOOP");
var request = new 3rdPartyServiceRequest
{
SomeFieldName = ByteString.CopyFrom("lllllllll", Encoding.UTF8)
};
logger.LogWarning("##### BEFORE Unary call");
var response = await client.3rdPartyServicePerformRequestAsync(request);
logger.LogWarning("##### Response: {Response}", response.SomeResponceFieldName.ToString());
}
catch (Exception ex)
{
logger.LogError(ex, "##### SOME ERROR");
}
finally
{
logger.LogWarning("##### BEFORE Delay");
// await Task.Delay(1 * 1000);
logger.LogWarning("##### END LOOP");
}
}

To reproduce the issue we need to specify a well-formed TLS certificate but without the proper permissions (there were some cases when we used invalid certificates and got the same issue, but we can't reproduce that now). For this example I took only one endpoint in balancing, but if I have more, I get the same result for every endpoint. The log is as follows:
I waited for backoff to handle this, but it has no effect here. Could you please look at this issue too? The problem could be somewhere near the one you fixed.
So to be clear, the problem is you're sometimes getting the wrong exception? You expect to always get "The SSL connection could not be established" but sometimes you get "A task was canceled."
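For illustration, here is a minimal hedged sketch (placeholder method names, not the reporter's code) of how a caller can see which underlying exception produced the status, assuming Status.DebugException is populated the way recent Grpc.Net.Client versions do:

using System;
using System.Threading.Tasks;
using Grpc.Core;

// Sketch only: log which underlying exception produced the gRPC status,
// to tell an SSL handshake failure apart from a cancellation.
static async Task CallAndLogFailureAsync(Func<Task> unaryCall)
{
    try
    {
        await unaryCall();
    }
    catch (RpcException ex)
    {
        // Expected: AuthenticationException ("The SSL connection could not be established").
        // Sometimes observed instead: TaskCanceledException ("A task was canceled.").
        Console.WriteLine($"StatusCode: {ex.StatusCode}");
        Console.WriteLine($"Underlying: {ex.Status.DebugException?.GetType().FullName}");
    }
}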
Btw, this is a different bug and should be in a different issue. Please create a new one. The original issue should be fixed in 2.63.0.
Hi, @JamesNK
We detected the following endless errors from several pods when they are trying to perform gRPC calls to several endpoints:
The exception itself is always the same:
Luckily, we could capture the start of this in another service at Debug level:
After that, all requests to the URL related to Subchannel id 19-2 (see the picture above) started to throw the following errors (don't be confused by the log message order, because messages with the same timestamp can appear in mixed order):
We could reproduce this error with the following code:
Main routine:
In SocketConnectivitySubchannelTransport.TryConnectAsync() we synthetically cancel the context as if it had been cancelled from the outside:
The output is the following:
As you can see, SocketConnectivitySubchannelTransport.TryConnectAsync() is never called again. No backoff or subchannel recreation ever happens.
If we have no exception in TryConnectAsync(), then we expect to get the following output:
We found no workaround yet, so it would be great if you could take a look at this issue.
Best regards, Alexander