Test failure: Microsoft.Extensions.Hosting.WindowsServiceLifetimeTests.ExceptionOnStartIsPropagated #107671
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Looks like the same issue as #107589.
#88027 also seems to be related; it has been getting reported since last year, and the specific test case was disabled, but the underlying issue is not fixed.
Basically, it was closed because the WindowsServiceLifetimeTests test passed and the failure happened in RemoteExecutor, which might have been just a network issue.
Failed in: runtime-coreclr libraries-jitstress 20240911.1
Failed tests:
Error message:
Stack trace:
This test started failing very recently. The test ran on 9/6 without problems: https://dev.azure.com/dnceng-public/public/_build/results?buildId=800541&view=results. So we definitely do not need a backport.
Here the remote executor is actually tracking the running service process. I've been seeing failures of these tests from the JIT stress Arm64 pipelines (only) that don't reproduce on a local x64 machine, nor on arm64 lab machines. I suspect something about the queue or machine setup for these JIT stress runs is busted.

The stack you're seeing here is just the remote executor stack that's waiting for the service process to exit. To understand what's actually going on you need to see what the service process is doing. If I look at https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-212737db1d264eb29f/Microsoft.Extensions.Hosting.WindowsServices.Tests/1/console.534963dd.log?helixlogtype=result I see a dump was created. I'm having trouble getting it from Helix, though.
So the next action would be to look at that dump and answer the question "Why is this process hung?" You can use runfo to get all the files from the test:
I had a look and none of the tests seem to be running any user code. I can't see any of the managed stack info (how can I load SOS when debugging an ARM64 dump on amd64?). Here are all the native stacks:
Yes, I came to a similar conclusion when looking at the trace of this one and #107589. Additionally, I couldn't get SOS to load in my windbg to confirm what is going on.
Failed in: runtime-coreclr libraries-jitstress 20240912.1
Failed tests:
Error message:
Stack trace:
Something odd has been going on since last month, given that we had to point-fix the tests to make them pass on the JIT stress pipelines.
Ok, so I got @hoyosjs's help to get the managed stacks. I had to build an amd64_arm64 cross dac and then tell the debugger to use it. Here's the managed state:
So this looks to me like the service is still running. It should have thrown an exception on start and exited. The dump shows that it's the correct service running. It's almost as if the exception was swallowed. I'll look more closely at how this is supposed to work and see if I can build a theory.
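For reference, the debugger side of that kind of cross-architecture setup looks roughly like this in windbg (the paths are placeholders; `.cordll` is the standard command for pointing the debugger at a private DAC, and `!clrstack` is the SOS command for managed stacks):

```
.cordll -ve -u -lp C:\path\to\cross\dac    $$ point the debugger at the cross-DAC directory
.load C:\path\to\sos.dll                   $$ load a matching SOS build
!clrstack -all                             $$ dump managed stacks for all threads
```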
This looks wrong:
This callstack has the entrypoint running this service: runtime/src/libraries/Microsoft.Extensions.Hosting.WindowsServices/tests/UseWindowsServiceTests.cs, lines 36 to 44 at d58b1ca.
And if I check the command-line args:
However if I check the service name:
So it looks to me like the service was registered with the wrong command-line arguments! I think I found the problem. We create a single shared options instance here: runtime/src/libraries/Microsoft.Extensions.Hosting.WindowsServices/tests/WindowsServiceTester.cs, line 72 at d58b1ca.
Seems OK, right? An options class should just be immutable state to pass parameters around. Not so fast! This options type has a mutable property, and that shared instance is actually mutated before it's copied out into the final returned handle. So we've had this latent threading bug here forever: if we have multiple tests running, they'll race at mutating that shared instance. The fix here is simple: don't reuse the instance.
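The bug pattern described above can be sketched like this (a hypothetical minimal reconstruction; the type and member names are invented for illustration and are not the actual WindowsServiceTester code):

```csharp
// Sketch of the latent race: one mutable options instance shared by all
// callers vs. a fresh instance per registration.
using System;
using System.Linq;

class ServiceOptions
{
    // Mutable property: anyone holding a reference can change it later.
    public string Arguments { get; set; } = "";
}

static class Registrar
{
    // Buggy pattern: a single shared instance reused for every registration.
    static readonly ServiceOptions Shared = new ServiceOptions();

    public static string RegisterShared(string args)
    {
        Shared.Arguments = args;  // mutate shared state; a concurrent caller
                                  // can overwrite this before the next line...
        return Shared.Arguments;  // ...copies the value out "too late"
    }

    // Fix: build a fresh options instance per registration, so there is
    // nothing shared to race on.
    public static string RegisterFresh(string args)
        => new ServiceOptions { Arguments = args }.Arguments;
}

class Demo
{
    static void Main()
    {
        // With the fixed version, every registration sees its own arguments
        // no matter how many run in parallel.
        bool allCorrect = Enumerable.Range(0, 100)
            .AsParallel()
            .All(i => Registrar.RegisterFresh($"svc-{i}") == $"svc-{i}");
        Console.WriteLine(allCorrect ? "fresh: OK" : "fresh: RACE");
    }
}
```

With the shared instance, two tests registering services concurrently can each end up reading the other's arguments, which matches the symptom seen in the dump: the right service name but the wrong command line.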
Tagging subscribers to this area: @dotnet/area-extensions-hosting
Failed in: runtime-coreclr libraries-jitstress 20240910.1
Failed tests:
Error message:
Stack trace: