
Long Running Test - System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow #35506

Closed
ViktorHofer opened this issue Apr 27, 2020 · 17 comments · Fixed by #35584


@ViktorHofer
Member

https://dev.azure.com/dnceng/public/_build/results?buildId=618671&view=ms.vss-test-web.build-test-results-tab&runId=19376690&resultId=184383&paneView=attachments

console.a6c9799c.log:
https://helix.dot.net/api/2019-06-17/jobs/5607cdb0-5b11-499c-a476-0d70c6599ef4/workitems/System.Diagnostics.Process.Tests/files/console.a6c9799c.log

Configuration: netcoreapp5.0-Windows_NT-Release-x86-CoreCLR_release-Windows.7.Amd64.Open

C:\h\w\A661091E\w\A8EC08EF\e>"C:\h\w\A661091E\p\dotnet.exe" exec --runtimeconfig System.Diagnostics.Process.Tests.runtimeconfig.json --depsfile System.Diagnostics.Process.Tests.deps.json xunit.console.dll System.Diagnostics.Process.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing  
  Discovering: System.Diagnostics.Process.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Diagnostics.Process.Tests (found 233 of 253 test cases)
  Starting:    System.Diagnostics.Process.Tests (parallel test collections = on, max threads = 2)
    System.Diagnostics.Tests.ProcessStartInfoTests.ShellExecute_Nano_Fails_Start [SKIP]
      Condition(s) not met: "IsWindowsNanoServer"
Invalid number of parameters
0 File(s) copied
   System.Diagnostics.Process.Tests: [Long Running Test] 'System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow', Elapsed: 00:02:10
   System.Diagnostics.Process.Tests: [Long Running Test] 'System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow', Elapsed: 00:04:10
   System.Diagnostics.Process.Tests: [Long Running Test] 'System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow', Elapsed: 00:06:10
   System.Diagnostics.Process.Tests: [Long Running Test] 'System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow', Elapsed: 00:08:10
   System.Diagnostics.Process.Tests: [Long Running Test] 'System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow', Elapsed: 00:10:10
   System.Diagnostics.Process.Tests: [Long Running Test] 'System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow', Elapsed: 00:12:10
   System.Diagnostics.Process.Tests: [Long Running Test] 'System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow', Elapsed: 00:14:10
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Diagnostics.Process untriaged New issue has not been triaged by the area owner labels Apr 27, 2020
@ghost

ghost commented Apr 27, 2020

Tagging subscribers to this area: @eiriktsarpalis
Notify danmosemsft if you want to be subscribed.

@danmoseley
Member

It timed out; it wasn't just long-running. It's not obvious why. We need dump files on hangs (dotnet/dnceng#1216).

@wfurt
Member

wfurt commented Apr 28, 2020

I've seen it on OSX as well. Maybe in this case the test can enforce a reasonable timeout and assert, to trigger a core dump.

@danmoseley
Member

Not a bad idea, although the core dump won't help if the issue somehow is that the child process won't exit.

I'll make the change.

@wfurt
Member

wfurt commented Apr 28, 2020

I started on this as well, since I'm planning to dump the process list on failure. I'd be happy to stop and let you have some fun, @danmosemsft. I think for now the focus should be on capping the test duration and collecting useful info on failure.
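To illustrate "cap test duration and collect useful info on failure", here is a minimal Python sketch (not the actual .NET test code). `run_child_with_timeout` and its `on_timeout` hook are hypothetical names: the helper waits for a child process for a bounded time and, if the child is still alive, calls the hook with its PID (for example to dump the process list), kills and reaps the child, and fails with a clear message instead of hanging the whole run.

```python
import subprocess
import sys


def run_child_with_timeout(args, timeout_s, on_timeout=None):
    """Run a child process, but never wait longer than timeout_s for it to exit.

    on_timeout is a hypothetical diagnostics hook, called with the child's PID
    while the child is still alive (e.g. to record the process list).
    """
    proc = subprocess.Popen(args)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        if on_timeout is not None:
            on_timeout(proc.pid)  # collect diagnostics before killing
        proc.kill()
        proc.wait()  # reap the killed child so it does not linger
        raise AssertionError(
            f"child process {proc.pid} did not exit within {timeout_s}s"
        )


if __name__ == "__main__":
    # A child that exits immediately returns its exit code (0).
    print(run_child_with_timeout([sys.executable, "-c", "pass"], timeout_s=30))
```

As danmoseley points out, a dump of the test process itself would not explain a child that refuses to exit. That is why the hook receives the child's PID rather than the parent's.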

@danmoseley
Member

danmoseley commented Apr 28, 2020

Oh, you go ahead then, since that sounds way better than the simple thing I was going to do.

For the RemoteExec case, we have a pretty good setup right now for timeouts and for gathering info on hangs, active processes etc.
https://github.com/dotnet/arcade/blob/590a102630c7efc7ca6f652f7c6c47dee4c4086c/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs#L139-L219

Do we have a way to trigger this when we're not in a RemoteExec context? It would be super handy to have a general mechanism. I can think of all kinds of ways we could extend it. (I hope we can avoid 2+ implementations of it)

@danmoseley
Member

And ultimately, we need to find a way to hook into xunit so that it triggers for hangs in arbitrary tests, without special effort in each test.

@wfurt
Member

wfurt commented Apr 28, 2020

I think the in-process approach may be tricky. We may be able to mark the test as failed, but I don't know if there is a reliable and safe way to terminate the running function.

@danmoseley
Member

@wfurt one way would be to spawn the RemoteExecutor with a special flag "make a dump of my process". It would run the same code I linked above but against the PID provided. When it returned, the test could either throw XUnitException to fail itself and continue, or terminate the process.
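The following rough Python sketch shows the general shape of this idea; it is not RemoteExecutor's API. An in-process timer fires when a test overruns and launches a separate helper process against our own PID. After the test body returns, the test fails itself and the run continues. `HangWatchdog` and `dump_argv` are hypothetical names. A real helper might be a dump tool such as `dotnet-dump collect -p <pid>`, but the demo uses a harmless stand-in. As noted above, this cannot abort the hung body. It only gathers evidence and fails the test if the body eventually returns.

```python
import os
import subprocess
import sys
import threading
import time


class HangWatchdog:
    """Hypothetical per-test watchdog: run an external helper on overrun, then fail."""

    def __init__(self, timeout_s, dump_argv):
        self.timeout_s = timeout_s
        self.dump_argv = dump_argv  # callable: pid -> argv list for the helper process
        self.fired = False
        self._timer = None

    def _on_timeout(self):
        self.fired = True
        # Out-of-process helper: the process under test does not dump itself.
        subprocess.run(self.dump_argv(os.getpid()), check=False)

    def __enter__(self):
        self._timer = threading.Timer(self.timeout_s, self._on_timeout)
        self._timer.daemon = True
        self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._timer.cancel()
        self._timer.join()  # wait for an in-flight helper to finish
        if self.fired and exc_type is None:
            raise AssertionError(
                f"test exceeded {self.timeout_s}s; helper ran against PID {os.getpid()}"
            )
        return False  # never swallow the body's own exception


if __name__ == "__main__":
    try:
        # Stand-in helper that does nothing; the sleep simulates a hang.
        with HangWatchdog(0.2, lambda pid: [sys.executable, "-c", "pass"]):
            time.sleep(1.0)
    except AssertionError as e:
        print("failed as expected:", e)
```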

@danmoseley
Member

cc @stephentoub in case he has another suggestion.

@stephentoub
Member

Doesn't the vstest infrastructure we're about to switch to make it straightforward to get dumps, or am I misremembering?

@ViktorHofer
Member Author

Yes, VSTest can be configured to kill the testhost after a specified timeout and will then collect the dump. @nohwnd has the details and can talk about cross-plat support of the dump collection.

@danmoseley
Member

It might be interesting to be able to get a dump, fail only that test, and continue.

It would also be good to have a hook to gather other data if and when we need it.

It would also be good to understand what platforms we can create dumps on.

@nohwnd could you share more about what vstest offers?

@nohwnd
Member

nohwnd commented Apr 29, 2020

@danmosemsft

Update: Got my ARM and AMD abbreviations confused.

Hang dumps currently work only on Windows x86 and x64, not on ARM. So I don't think it will be much help here, based on the test name.

We are investigating how to make hang and crash dumps cross-platform in the helix prototype effort you are also part of.

  • It might be interesting to be able to get a dump, fail only that test, and continue.

That would be very nice, but it is complicated by the fact that the test host cannot dump itself, because doing so is not always safe across platforms. So the blame data collector would have to detect the hang (which already happens), dump the process when the test timeout is reached, and somehow associate that dump with the test, but NOT kill the process as it does now.

Instead, it would wait until all tests finish running, or until they are all past their timeout threshold. In the worst case, if every test hangs and a full dump is requested, this would generate one full dump per test.

Once all tests have finished or timed out, the test host would remain hung, because it cannot terminate until its threads finish running. Because .NET Core does not allow threads to be aborted, we need to kill the process externally (or it may be able to terminate itself; I am not sure right now).

The upside is that this approach should be test framework agnostic, because this happens above the test adapter level. And because there is a special parser for the Sequence file that is produced, AzDo is also able to mark those unfinished tests as aborted.
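The policy described above can be modeled as a small state machine. Each test is dumped once, when it passes its timeout, and the dump is recorded against that test. The host is not killed while any running test could still finish. It is killed only when every remaining test has timed out. This is a hypothetical Python model of that policy, not vstest's implementation. `HangMonitor`, `tick`, and `take_dump` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class HangMonitor:
    """Hypothetical model of the dump-but-don't-kill hang policy."""

    timeout_s: float
    running: dict = field(default_factory=dict)  # test name -> start time
    dumps: dict = field(default_factory=dict)    # test name -> dump id

    def test_started(self, name, t):
        self.running[name] = t

    def test_finished(self, name):
        self.running.pop(name, None)

    def tick(self, now, take_dump):
        """Dump each newly timed-out test once; return True when the host should be killed."""
        for name, started in self.running.items():
            if now - started >= self.timeout_s and name not in self.dumps:
                # Dump and associate the dump with this test, but do NOT kill yet.
                self.dumps[name] = take_dump(name)
        # Kill only when tests are still running and every one of them is past
        # its timeout (and therefore already dumped); otherwise keep waiting.
        return bool(self.running) and all(n in self.dumps for n in self.running)
```

A test that completes after it has been dumped is removed by `test_finished`, so it no longer blocks the kill decision. Its dump stays associated with it in `dumps`.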

  • It would also be good to have a hook to gather other data if and when we need it.

What kind of data would that be? I did not find that in this thread.

  • It would also be good to understand what platforms we can create dumps on.

Currently only Windows x86 and x64 (not ARM), because the procdump tool is used for both hang dumps and crash dumps. But with the efforts around Microsoft.Diagnostics.NETCore.Client, we should soon be able to produce hang dumps easily on Windows, Linux, and macOS, at least on modern .NET versions (3.1+ on Windows and Linux, and 5.0 on macOS).

I am experimenting with improving the overall experience here by collecting dumps via vstest.console across multiple operating systems. Ideally, by the end of the script, all the ticks would be green:

https://dev.azure.com/jajares/blame/_build/results?buildId=18&view=results

@danmoseley
Member

Thanks @nohwnd for working on this!

What kind of data would that be? I did not find that in this thread.

It would depend on the test (or the test failure), but one example that came up elsewhere was logging whether the machine was heavily loaded and what other processes were running. I could imagine we might want to log config files or registry keys, the versions of certain libraries, etc. Certainly this is secondary to reliably getting usable dumps.

@wfurt
Member

wfurt commented Apr 29, 2020

It seems like dotnet/dnceng#1216 is essentially a duplicate of https://github.com/dotnet/core-eng/issues/5380. The second one outlines steps for *nix platforms, and it should not depend on architecture. Alternatively, we could use tools from the diag repo.

As for the auxiliary information, load, a tail of the kernel log, and process lists with states and stats would be a good start IMHO. For networking tests, a dump of interfaces, DNS configuration, the routing table, and connections would be awesome.
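For the auxiliary data, a minimal Python sketch of a collector for the first items listed (load average and a process list with state and CPU usage) might look like the following. The sketch assumes a Unix-like host with a procps- or BSD-style `ps` on the PATH, and falls back to empty values otherwise. The function name and output shape are hypothetical. Kernel-log and network dumps are left out because they typically need extra permissions or platform-specific tools.

```python
import os
import platform
import subprocess


def collect_host_diagnostics(timeout_s=10):
    """Hypothetical helper: snapshot host load and running processes on a hang."""
    info = {"platform": platform.platform()}
    try:
        # 1/5/15-minute load averages; not available on Windows.
        info["loadavg"] = os.getloadavg()
    except (AttributeError, OSError):
        info["loadavg"] = None
    try:
        # One line per process: PID, state, CPU% and command name (procps/BSD ps).
        result = subprocess.run(
            ["ps", "-eo", "pid,stat,pcpu,comm"],
            capture_output=True, text=True, timeout=timeout_s, check=False,
        )
        info["processes"] = result.stdout.splitlines()
    except (OSError, subprocess.TimeoutExpired):
        info["processes"] = []  # ps missing or hung: record nothing rather than fail
    return info


if __name__ == "__main__":
    snapshot = collect_host_diagnostics()
    print(snapshot["loadavg"], len(snapshot["processes"]), "process lines")
```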

@ViktorHofer
Member Author

It would depend on the test (or the test failure), but one example that came up elsewhere was logging whether the machine was heavily loaded and what other processes were running. I could imagine we might want to log config files or registry keys, the versions of certain libraries, etc. Certainly this is secondary to reliably getting usable dumps.

@danmosemsft that could be achieved with a (diagnostics) data logger:

DataCollectors are used to monitor test execution. Getting CPU or memory usage info, taking screenshot, recording screen activity, measuring code coverage, etc. while executing tests are a few common scenarios that can be realised through DataCollectors.

from https://github.com/Microsoft/vstest-docs/blob/master/docs/extensions/datacollector.md

@jeffhandley jeffhandley removed the untriaged New issue has not been triaged by the area owner label Sep 17, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020

7 participants