There are no more endpoints available from the endpoint mapper. #3844

Open · lewing opened this issue Aug 21, 2024 · 17 comments
lewing (Member) commented Aug 21, 2024

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=781694

Build leg reported

Microsoft.NET.Build.Containers.UnitTests.RegistryTests.InsecureRegistry

Pull Request

https://github.com/dotnet/sdk.git/pull/42858

Known issue core information

Fill out the known issue JSON section by following the step-by-step documentation on how to create a known issue

{
    "ErrorMessage": "There are no more endpoints available from the endpoint mapper.",
    "BuildRetry": false,
    "ErrorPattern": "",
    "ExcludeConsoleLog": false
}

@dotnet/dnceng

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

Additional information about the issue reported

No response

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=781694
Error message validated: [There are no more endpoints available from the endpoint mapper.]
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 8/21/2024 1:53:12 AM UTC

Report

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 0 | 0 |
lewing (Member, Author) commented Aug 21, 2024

dotnet/aspnetcore#57416 but for everybody

lewing (Member, Author) commented Aug 22, 2024

cc @jeffschwMSFT

lewing (Member, Author) commented Aug 22, 2024

We suspect dotnet/sdk#42870 is another manifestation of the same problem

jeffhandley (Member) commented:

This seems to be an occurrence of the System.Security.Cryptography.X509Certificates test failures (dotnet/runtime#70320; also dotnet/runtime#20708, dotnet/runtime#74838, and dotnet/runtime#83226) that we encounter occasionally as a known build error. It is an external issue with Windows that appears when the "CNG Key Isolation" Windows service has crashed or been stopped.

The only known way to resolve this is to reboot the affected machine, as the underlying service may die and fail to restart when the host is severely resource-starved.

lewing (Member, Author) commented Aug 23, 2024

> This seems to be an occurrence of the System.Security.Cryptography.X509Certificates test failures (dotnet/runtime#70320; also dotnet/runtime#20708, dotnet/runtime#74838, and dotnet/runtime#83226) that we encounter occasionally as a known build error. It is an external issue with Windows that appears when the "CNG Key Isolation" Windows service has crashed or been stopped.
>
> The only known way to resolve this is to reboot the affected machine, as the underlying service may die and fail to restart when the host is severely resource-starved.

@jeffhandley so this is failing more than 30 times a day in CI right now; who can do that reboot, and how? The existing systems aren't working. There are multiple open issues across multiple repos and an active FR thread.

krwq (Member) commented Aug 23, 2024

I opened 4 random logs out of curiosity and they ran on 3 different machines, so it seems unlikely this is machine-specific. The only thing they had in common was that they used "Windows-10-10.0.20348-SP0"; possibly there is something special about that build? Could someone from the infra team check whether any recent updates landed before this started happening more frequently (i.e. the Aug 13/14 update)?

It may be true that once we hit this it keeps failing until a reboot, but there is definitely something that triggers it in the first place.

What the stack traces have in common is:

2024-08-22T23:00:15.1869805Z    at System.Security.Cryptography.CngHelpers.OpenStorageProvider(CngProvider provider)
2024-08-22T23:00:15.1869835Z    at System.Security.Cryptography.CngKey.Import(ReadOnlySpan`1 keyBlob, String curveName, CngKeyBlobFormat format, CngProvider provider)
2024-08-22T23:00:15.1869869Z    at System.Security.Cryptography.RSACng.ImportKeyBlob(ReadOnlySpan`1 rsaBlob, Boolean includePrivate)
2024-08-22T23:00:15.1869901Z    at System.Security.Cryptography.RSACng.ImportParameters(RSAParameters parameters)
2024-08-22T23:00:15.1869931Z    at System.Security.Cryptography.X509Certificates.CertificatePal.CopyWithPrivateKey(RSA rsa)
2024-08-22T23:00:15.1870051Z    at System.Security.Cryptography.X509Certificates.RSACertificateExtensions.CopyWithPrivateKey(X509Certificate2 certificate, RSA privateKey)
2024-08-22T23:00:15.1870085Z    at System.Security.Cryptography.X509Certificates.CertificateRequest.CreateSelfSigned(DateTimeOffset notBefore, DateTimeOffset notAfter)

I'm also seeing some of those tests call CreateSelfSigned from a static constructor, so it's possible that's related (it might be worth moving that logic out of the static constructor in ASP.NET and seeing if that changes anything; my theory is that something could get partially initialized). A rough sketch of the pattern is below.

So to me it sounds like one of the following is most likely:

  • Windows has a bug and it became more frequent recently after one of the updates
  • we (System.Security.Cryptography) have some obscure bug (possibly threading- or static-initialization-related); I haven't seen any recent changes to our code, so if that's the case, something in Windows or ASP.NET makes it more frequent
  • some other, unrelated thing has a bug (e.g. JIT, codegen for interop) and it happens to manifest here

ericstj (Member) commented Aug 23, 2024

Is it possible that we aren't disposing something (e.g. whatever that CreateSelfSigned call in the static constructor produces) and that thing accumulates resources in this external service? If it's happening more often, we might be able to get a repro more easily.

bartonjs (Member) commented:

When we've seen this in the past, it has meant that the "CNG Key Isolation" service has exited (crashed, probably). `sc stop keyiso` will cause it to happen, so maybe `sc start keyiso` will fix it... or, the easier answer is to reboot.

In the past I have tried quite a lot to cause one of these failures to happen using .NET, so I could see how we can stop encountering them. I've never been able to do so. The OS team responsible for it says they only ever see these errors on test machines, and that their recommended answer is to just pave the machine and watch the problem go away.

The best short-term fix I can think of is to aggressively reboot the test machines... assuming they're not aggressively rebooted/reimaged/spun-up-from-nothing already. If I could remember the right way to make that inquiry to the engineering first responders, I'd say so here...

garath (Member) commented Aug 23, 2024

Uptime for machines managed by Helix depends on the queue demand. The machines are kept alive until the load diminishes, then the scaleset scales down by removing the oldest machines first.

Out of curiosity, I pulled some machine-uptime data for the last 7 days for the windows.amd64.vs2022.pre.open queue (chosen randomly). I thought it would be interesting to drop it here.

[Kusto query link and chart of machine uptime percentiles omitted]

jeffhandley (Member) commented:

@garath Am I reading the data correctly that the longest that particular machine went between reboots was 177 minutes, with a p95 uptime less than 60 minutes on any given day this past week? If so, that does invalidate the assertion that rebooting will address it.

@bartonjs Are there folks you've talked to before from Windows that we could engage on this recurrence?

garath (Member) commented Aug 23, 2024

> Am I reading the data correctly that the longest that particular machine

Not any single machine; this is the 50th, 75th, and 95th percentile for the "uptime" of all machines ever allocated to that queue on a given day.

jeffhandley (Member) commented:

OK; thanks, @garath. I also checked the max in case there were machines outside the 95th percentile that could have had long uptimes and be the culprits. But the max I saw was 177 minutes; does that sound right to you that the longest any machine went without a reboot was 177 minutes?

bartonjs (Member) commented:

Huh, I had it in my head that the average Helix machine had an uptime more like several days (and the lowest-indexed ones in the scale... cluster... thing... having an uptime of "since patches were last installed").

I've reached out to Windows to see if it's a known thing with known workarounds (or potentially a thing they'd want to look at).

garath (Member) commented Aug 26, 2024

> had it in my head that the average Helix machine had an uptime more like several days

This could be true for the very, very busy queues. I also believe it was more true a few years ago before we created our own custom autoscaler service.

I'm happy to pull data if there are any questions. Just let me know what you'd like to see.

garath (Member) commented Aug 26, 2024

> But the max I saw was 177 minutes; does that sound right to you that the longest any machine went without a reboot was 177 minutes?

Looks like that's exactly correct. Below are the top 10 longest uptimes for the last seven days in the Windows.Amd64.VS2022.Pre.Open queue.


| Machine Name | Start | Finish | Uptime (min) |
| --- | --- | --- | --- |
| a000WZD | 2024-08-21T15:23:00.743Z | 2024-08-21T18:20:24.598Z | 177 |
| a000Y60 | 2024-08-21T20:21:50.615Z | 2024-08-21T23:07:53.706Z | 166 |
| a000VWL | 2024-08-21T06:19:29.636Z | 2024-08-21T08:54:44.321Z | 155 |
| a0010PB | 2024-08-22T17:59:14.076Z | 2024-08-22T20:26:39.141Z | 147 |
| a000XG7 | 2024-08-21T16:12:38.972Z | 2024-08-21T18:21:45.958Z | 129 |
| a000YR8 | 2024-08-21T22:28:16.549Z | 2024-08-22T00:34:19.332Z | 126 |
| a0016GG | 2024-08-26T12:45:37.372Z | 2024-08-26T14:46:43.211Z | 121 |
| a0016R6 | 2024-08-26T14:16:10.458Z | 2024-08-26T16:17:39.64Z | 121 |
| a0014XH | 2024-08-25T11:29:31.126Z | 2024-08-25T13:27:00.391Z | 117 |
| a000WWF | 2024-08-21T12:58:14.531Z | 2024-08-21T14:55:30.927Z | 117 |

sebastienros (Member) commented Sep 3, 2024

The exception on aspnetcore builds is slightly different. I doubt it will help, but in case someone is looking for it:

System.Security.Cryptography.CryptographicException: There are no more endpoints available from the endpoint mapper.
   at System.Security.Cryptography.X509Certificates.X509CertificateLoader.ImportPfx(ReadOnlySpan`1 data, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags)
   at System.Security.Cryptography.X509Certificates.X509CertificateLoader.LoadPkcs12NoLimits(ReadOnlyMemory`1 data, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags, Pkcs12Return& earlyReturn)
   at System.Security.Cryptography.X509Certificates.X509CertificateLoader.LoadPkcs12(ReadOnlyMemory`1 data, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags, Pkcs12LoaderLimits loaderLimits)
   at System.Security.Cryptography.X509Certificates.X509CertificateLoader.LoadFromFile[T](String path, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags, Pkcs12LoaderLimits loaderLimits, LoadFromFileFunc`1 loader)
   at System.Security.Cryptography.X509Certificates.X509CertificateLoader.LoadPkcs12PalFromFile(String path, ReadOnlySpan`1 password, X509KeyStorageFlags keyStorageFlags, Pkcs12LoaderLimits loaderLimits)
   at System.Security.Cryptography.X509Certificates.CertificatePal.FromBlobOrFile(ReadOnlySpan`1 rawData, String fileName, SafePasswordHandle password, X509KeyStorageFlags keyStorageFlags)
   at System.Security.Cryptography.X509Certificates.X509Certificate..ctor(String fileName, String password, X509KeyStorageFlags keyStorageFlags)
   at Microsoft.AspNetCore.InternalTesting.TestResources.GetTestCertificate(String certName) in /_/src/Shared/TestResources.cs:line 31
   at Interop.FunctionalTests.H2SpecTests.<>c__DisplayClass0_0.<RunIndividualTestCase>b__2(ListenOptions listenOptions) in /_/src/Servers/Kestrel/test/Interop.FunctionalTests/H2SpecTests.cs:line 41
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerOptions.Listen(IPAddress address, Int32 port, Action`1 configure) in /_/src/Servers/Kestrel/Core/src/KestrelServerOptions.cs:line 485
   at Interop.FunctionalTests.H2SpecTests.<>c__DisplayClass0_0.<RunIndividualTestCase>b__1(KestrelServerOptions options) in /_/src/Servers/Kestrel/test/Interop.FunctionalTests/H2SpecTests.cs:line 36
   at Microsoft.Extensions.Options.OptionsFactory`1.Create(String name)
   at Microsoft.Extensions.Options.UnnamedOptionsManager`1.get_Value()
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl.CreateServiceContext(IOptions`1 options, ILoggerFactory loggerFactory, DiagnosticSource diagnosticSource, KestrelMetrics metrics) in /_/src/Servers/Kestrel/Core/src/Internal/KestrelServerImpl.cs:line 80
   at Microsoft.AspNetCore.Server.Kestrel.Core.KestrelServerImpl..ctor(IOptions`1 options, IEnumerable`1 transportFactories, IEnumerable`1 multiplexedFactories, IHttpsConfigurationService httpsConfigurationService, ILoggerFactory loggerFactory, KestrelMetrics metrics) in 
