Failures in SSL Stream tests: There are no more endpoints available in the endpoint mapper. #74838
Tagging subscribers to this area: @dotnet/ncl, @vcsjones

Issue Details
This happened in the PR: #74808
Failure message:
Which translates to:
Tests with a similar failure:
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_SecondNegotiateClientCertificateAsync_Throws
Callstack example:
Looks like some environmental problem. cc @bartonjs in case he has some insight into the Cng...
We've seen it before on the es-ES test machine(s)... something about the RPC system is broken, and that breaks CNG, which breaks .NET crypto and also SChannel. It happened for a while, then stopped... and I guess it is happening again.
@dotnet/ncl This issue is a bit old. Should we close it in favor of a newer one? I think people will keep assigning their failures to this, but the root cause is probably going to be different.
It seems like we may add some checks into the test constructor and fail the whole suite if crypto is not functional, but that would probably not help with build stability.
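(A minimal sketch of what such a constructor-level sanity check could look like, assuming a Windows xunit run; the CryptoSanityCheck helper and its wiring are hypothetical, not existing code in the repo:)

using System;
using System.Security.Cryptography;

internal static class CryptoSanityCheck
{
    // Hypothetical fail-fast check: if CNG is broken on the machine, surface one clear
    // environmental error instead of hundreds of per-test CryptographicExceptions.
    public static void AssertCngIsFunctional()
    {
        try
        {
            // Creating an ephemeral CNG key should exercise the same NCrypt/RPC path
            // that the SslStream and certificate tests depend on.
            using CngKey key = CngKey.Create(CngAlgorithm.ECDsaP256);
        }
        catch (CryptographicException ex)
        {
            throw new InvalidOperationException(
                "CNG appears to be non-functional on this machine; crypto-dependent tests cannot run.", ex);
        }
    }
}

Calling this from a collection fixture (or the test class constructor) would turn the mysterious per-test failures into a single, obviously environmental one.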
Happy to investigate, but could you share the queue and ideally the specific log you're asking about so I don't have to trawl through all of them?
This seems to exclusively happen on
Seeing this happening again. Specifically, I found it in at least two unrelated
Also failing in another unrelated PR, affecting the System.Net tests using SSL: #81457
The OS is broken, not .NET. There's nothing we can do about it... this is an infrastructure problem (something, somehow, has messed up the es-ES machine(s)).
@bartonjs how can we help investigate? I remember this happening before, but the weird part this time seems to be that even in the same PR from above there are instances of the same test passing normally (e.g. https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-81492-merge-514b2709db814335b9/System.Security.Cryptography.Cng.Tests/1/console.59ad373c.log?helixlogtype=result). Every Helix machine in queues named windows.10.amd64.server2022.es.* runs the same image, and it seems to succeed more often than it fails. I can help you get a repro machine with the exact image used here, but are there other experiments or changes I could make to prevent this? Also note that we regenerated this image last Thursday for unrelated reasons.
There are failures like the following too, seen on #81634:
Do these fall under this issue too, or should I open a separate one for them?
Same issue, @radical.
These are still causing lots of failures. I thought there was a plan in place to fix the images... did that happen, and is this still occurring?
I pinged the Windows SChannel team but nothing useful came back. If we agree that we would like to have at least one non-English test run, I feel our choices are IMHO limited to:
We can also run more reports to see if there is a pattern of particular machines or if the failures are uniform across the pool.
Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones

Issue Details
Occurrences from Runfo based on last 30 days and on Test Monitor history:
This happened in the PR: #74808
Failure message:
Which translates to:
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_SecondNegotiateClientCertificateAsync_Throws
System.Net.Http.Functional.Tests.PlatformHandler_HttpClientHandler_ServerCertificates_Http2_Test.UseCallback_SelfSignedCertificate_ExpectedPolicyErrors
Callstack example:
Other exceptions:
let failedTests = (testNameSubstring : string, methodName : string, messageSubstr: string, includePR : bool, includePassedOnRerun : bool) {
cluster('engsrvprod.kusto.windows.net').database('engineeringdata').AzureDevOpsTests
| where TestName contains testNameSubstring
| where includePassedOnRerun or (Outcome == 'Failed')
| extend startOfTestName = indexof_regex(TestName, @"[^.]+$")
| extend Method = substring(TestName, startOfTestName)
| extend Type = substring(TestName, 0, startOfTestName - 1)
| project-away startOfTestName
| where (methodName == '') or (Method == methodName)
//| where Message contains messageSubstr
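// Note: the Spanish messages below translate (roughly) to "There are no more endpoints
// available from the endpoint mapper.", "Bad data.", and "The specified algorithm is invalid."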
| where (Message contains 'System.Security.Cryptography.CryptographicException : No hay más extremos disponibles desde el asignador de extremos.' or Message contains 'System.Security.Cryptography.CryptographicException : Datos incorrectos.' or Message contains 'System.Security.Cryptography.CryptographicException : Algoritmo especificado no es válido.')
| distinct JobId, WorkItemId, Message, StackTrace, Method, Type, Arguments, Outcome
| join kind=inner (cluster('engsrvprod.kusto.windows.net').database('engineeringdata').Jobs
//| where Branch == 'refs/pull//merge'
//| where Branch == 'refs/pull//merge'
| where Branch != 'refs/pull/71473/merge'
| where Branch != 'refs/pull/73057/merge'
| where JobId != '20283731' // 7/29
| where JobId != '20342414' // 8/6 ... 488x
| where JobId != '20342402' // 8/6 ... 781x
| where Branch != 'refs/pull/71405/merge'
| where Branch != 'refs/pull/72869/merge'
| where Branch != 'refs/pull/72814/merge'
| where Branch != 'refs/pull/72886/merge'
| where Branch != 'refs/pull/72882/merge'
| where Branch != 'refs/pull/73055/merge'
| where Branch != 'refs/pull/73200/merge'
| where Branch != 'refs/pull/62863/merge'
| where Branch != 'refs/pull/73061/merge'
| where ((Branch == 'refs/heads/main') or (Branch == 'refs/heads/master') or (includePR and (Source startswith "pr/")))
| where Type startswith "test/functional/cli/"
and not(Properties contains "runtime-staging")
| summarize arg_max(Finished, Properties, Type, Branch, Source, Started, QueueName) by JobId
| project-rename JobType = Type) on JobId
| extend PropertiesJson = parse_json(Properties)
| extend OS = replace_regex(tostring(PropertiesJson.operatingSystem), @'\((.*)\).*|([^\(].*)', @'\1\2')
| where OS contains "ES"
| extend Runtime = iif(PropertiesJson.runtimeFlavor == "mono", "Mono", iif(PropertiesJson.DefinitionName contains "coreclr", "CoreCLR", ""))
| extend TargetBranch = extractjson("$.['System.PullRequest.TargetBranch']", Properties)
| extend Architecture = PropertiesJson.architecture
| extend Scenario = iif(isempty(PropertiesJson.scenario), "--", PropertiesJson.scenario)
//| extend DefinitionName = PropertiesJson.DefinitionName
| project-away PropertiesJson
};
failedTests(
'System.Security.Cryptography', //testNameSubstring
'', //methodName
'',//ignored
true, //includePR
true); //includePassedOnRerun

Known Issue Error Message
Fill the error message using known issues guidance.

{
"ErrorPattern": ".*No hay.*extremos disponibles desde el asignador de extremos.*",
"BuildRetry": false,
"ExcludeConsoleLog": false
}

Report
Summary
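(As a quick sanity check that the ErrorPattern above really matches the localized exception text, something like the following hypothetical snippet could be used; it is not part of the known-issues tooling:)

using System;
using System.Text.RegularExpressions;

const string pattern = ".*No hay.*extremos disponibles desde el asignador de extremos.*";
const string logLine =
    "System.Security.Cryptography.CryptographicException : " +
    "No hay más extremos disponibles desde el asignador de extremos.";

// Prints True if the known-issue pattern would pick up this failure message.
Console.WriteLine(Regex.IsMatch(logLine, pattern));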
As of last week's (2/15) rollout, both the image we supply and the base image it is generated from have been recreated entirely from scratch, so if the problem persists after that, regenerating the images didn't help. (Given that we don't actually understand the problem, this seems like a predictable outcome.) I have another theory here: Azure Security Monitor does fun stuff to prep and scan the machine, and it may be impacting the machine's behavior. They're certainly aware of the problem, but it has taken longer than expected to teach Azure Security Monitor to work correctly on non-en-US OSes. It could be doing stuff in the background on the machine that causes your failures. Just an idea, since the problem is so mysterious.
Thanks. Until we can get to the bottom of it, then, we should switch this queue to using an en-US OS.
Sure, |
@MattGal This error came up in an unrelated discussion for me, and "the RPC system is overwhelmed" was pointed out as a possible meaning of this error. Do the machines running these non-English locales have similar specs and resources to the English ones? Are there fewer machines in the pool, so they might be running more jobs simultaneously?
They're running on the same size Azure VM as pretty much every Windows VM in Helix (Standard_D2a_v4). I say "pretty much" just because your Windows.10.Amd* machines run Intel 4-core setups, needed for compute-intensive and AVX-512-requiring workloads, but that's an irrelevant side note and only applies to a few queues; literally everything else is this size. While there is a tiny bit of generational and maintenance variance between server racks in Azure, these machines are close enough in spec for everything save performance testing. Same disks, same memory, same AMD EPYC processors.
Helix machines only run one job's work item at a time. I implemented the ability to have N Helix clients running on a given machine (and totally vestigial code for it remains today), but the work items people sent couldn't stop accessing the same parts of the file system or eating up all the processor capacity, so it's been one work item per agent for something like 6 years now. The reason I suspected AzSecMon is that it runs all sorts of executables (like the auditpol.exe example) while assuming they're the EN-US versions, and with lots of retries. It might be interesting to catch this exception and list all the processes running on the machine as part of the test output, to see if some common System32 executable is going nuts on the system.
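(A rough sketch of how a test helper could capture that process list on failure; the RunWithProcessDumpOnCryptoFailure name and its use are hypothetical, just to illustrate the idea:)

using System;
using System.Diagnostics;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper: when a CryptographicException surfaces, dump the process list so the
// test log shows whether some System32 executable (AzSecMon, auditpol.exe, ...) was busy
// on the machine at the time.
static void RunWithProcessDumpOnCryptoFailure(Action testBody)
{
    try
    {
        testBody();
    }
    catch (CryptographicException)
    {
        Console.WriteLine("CryptographicException caught; processes running on this machine:");
        foreach (Process p in Process.GetProcesses().OrderBy(proc => proc.ProcessName))
        {
            Console.WriteLine($"  {p.Id,6}  {p.ProcessName}");
        }
        throw; // re-throw so the original failure still fails the test
    }
}

Re-throwing keeps the original failure visible while attaching the process snapshot to the console log that Helix uploads.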
Occurrences from Runfo based on last 30 days and on Test Monitor history:
This happened in the release/7.0 branch. Can you please confirm if this will require a backported fix?
PR: #74808
Queue: Libraries Test Run release coreclr windows x86 Release
Job: https://dev.azure.com/dnceng/public/_build/results?buildId=1976097&view=logs&j=457f7e88-dfa2-5bd9-f871-fdf124c2477d&t=bfe52dfb-2099-5c7f-ee52-70a1d81c544e
Log: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-74808-merge-0a29e7160d114b13be/System.Net.Security.Tests/3/console.38a99609.log?helixlogtype=result
Failure message:
Which translates to:
System.Net.Security tests with this failure:
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_SecondNegotiateClientCertificateAsync_Throws
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_UntrustedCaWithCustomTrust_OK
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_NegotiateClientCertificateAsync_IncompleteIncomingTlsFrame_Throws
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_NegotiateClientCertificateAsync_ClientWriteData
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_NegotiateClientCertificateAsyncNoRenego_Succeeds
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_NestedAuth_Throws
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_NegotiateClientCertificateAsyncConcurrentIO_Throws
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_NegotiateClientCertificateAsync_PendingDecryptedData_Throws
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_NegotiateClientCertificateAsyncTls13_Succeeds
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_NegotiateClientCertificateAsync_Succeeds
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_TargetHostName_Succeeds
System.Net.Security.Tests.SslStreamNetworkStreamTest.SslStream_RandomSizeWrites_OK
System.Net.Security.Tests.CertificateValidationRemoteServer.ConnectWithRevocation_WithCallback
System.Net.Security.Tests.SslStreamMutualAuthenticationTest.SslStream_RequireClientCert_IsMutuallyAuthenticated_ReturnsTrue
System.Net.Security.Tests.SslStreamCredentialCacheTest.SslStream_SameCertUsedForClientAndServer_Ok
System.Net.Http tests with this failure:
System.Net.Http.Functional.Tests.PlatformHandler_HttpClientHandler_ServerCertificates_Http2_Test.UseCallback_SelfSignedCertificate_ExpectedPolicyErrors
System.Net.Http.WinHttpHandlerFunctional.Tests.ClientCertificateTest.UseClientCertOnHttp2_DowngradedToHttp1MutualAuth_Success
Callstack example:
Other exceptions:
Known Issue Error Message
Fill the error message using known issues guidance.
Report
Summary