iOS - Test suite crash despite all tests passing #11210

Closed
mdh1418 opened this issue Oct 10, 2022 · 14 comments
Labels: External Dependency (used to track FR issues for services that are not directly owned by DNCEng), Ops - First Responder

Comments

@mdh1418
Member

mdh1418 commented Oct 10, 2022

Build

https://dev.azure.com/dnceng-public/public/_build/results?buildId=44476&view=logs&j=a5078f86-b345-5a4a-85ee-f64916152c6f&t=b1d0531b-2b6c-5e64-cc59-e2e1ffcc72bf

Build leg reported

iOS arm64 Release AllSubsets_Mono

Pull Request

dotnet/runtime#76725

Action required for the engineering services team

To triage this issue (First Responder / @dotnet/dnceng):

  • Open the failing build above and investigate
  • Add a comment explaining your findings

If this issue is causing build breaks across multiple builds and would benefit from being listed on the build analysis check, follow these steps:

  1. Add the label "Known Build Error"
  2. Edit this issue and add an error string to the JSON below that can help match this issue with future build breaks. See the known issues documentation.
{
   "ErrorMessage" : "",
   "BuildRetry": false
}
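
As a purely hypothetical illustration (not the string actually chosen for this issue), a filled-in entry could match on one of the log lines quoted later in this thread, for example:

{
   "ErrorMessage" : "Process mlaunch exited with 137",
   "BuildRetry": false
}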

Additional information about the issue reported

The net.dot* log also seems to contain the XML that should have been generated:
https://gist.github.com/mdh1418/a47d8c99b49a200789e29ba3edcc11ce

Comparing with another build where the job passed, the same number of tests are run and reported; however, the XML is not injected into the net.dot* log:
https://gist.github.com/mdh1418/3b5fea30fdef61b6c5c5be86d9404107
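
For anyone checking logs locally, here is a minimal sketch (in Python; the file path argument and the assumption that the embedded results are xUnit-style XML with an <assemblies> root are both illustrative, not confirmed from the gists) of how to test whether a device log has the results XML dumped into it:

import re
import sys

def find_embedded_results(log_path: str):
    # Scan a net.dot* device log for an xUnit-style results document that was
    # written into the app's stdout instead of being delivered over the tunnel.
    with open(log_path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    match = re.search(r"<assemblies\b.*?</assemblies>", text, re.DOTALL)
    return match.group(0) if match else None

if __name__ == "__main__":
    xml = find_embedded_results(sys.argv[1])
    print("results XML found in log" if xml else "no results XML in log")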

@MattGal
Member

MattGal commented Oct 10, 2022

[11:05:10] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 54671 (36821) on device (error: 61)
[11:05:10] dbug: Detected test end tag in application's output
[11:05:10] dbug: Process mlaunch exited with 137
[11:05:10] dbug: Test run completed
[11:05:10] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 54671 on the device and the port 54671 (36821) on the mac: 61
[11:05:10] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 54671 (36821) on device (error: 61)

This is indicative of the test process OOMing, and while it may be an inconsistent repro, it is not an infrastructure issue. This issue should probably go in the dotnet/runtime repo, not dotnet/arcade.
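
For context, the 137 in "Process mlaunch exited with 137" follows the usual 128 + signal-number convention, which is what points at the process being killed rather than failing on its own; a quick sketch of the arithmetic (Python, purely illustrative):

import signal

exit_code = 137
if exit_code > 128:
    # Exit codes above 128 conventionally mean "killed by signal N", N = code - 128.
    sig = signal.Signals(exit_code - 128)
    print(f"killed by {sig.name}")  # -> killed by SIGKILL (9), e.g. an OOM kill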

@premun
Member

premun commented Oct 11, 2022

We have been seeing these TCP issues since July. We've taken several steps (rebooting the devices, updating Xcode, ...) to rule out some of the infrastructural reasons, but since this started happening across all device queues at roughly the same time, and since we didn't have these issues the previous year, I'm also very inclined to think that this is not of an infra nature.

It could be what Matt says, or it could be a networking regression.

cc @steveisok

@MattGal
Member

MattGal commented Oct 11, 2022

> We have been seeing these TCP issues since July. We've taken several steps (rebooting the devices, updating Xcode, ...) to rule out some of the infrastructural reasons, but since this started happening across all device queues at roughly the same time, and since we didn't have these issues the previous year, I'm also very inclined to think that this is not of an infra nature.
>
> It could be what Matt says, or it could be a networking regression.
>
> cc @steveisok

One other weird thing I noted is that the same errors appear in the same run in the work items that passed too, e.g. in the System.IO.MemoryMappedFiles test we see:

[11:06:03] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 54594 (17109) on device (error: 61)
[11:06:03] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 54594 on the device and the port 54594 (17109) on the mac: 0
[11:06:03] dbug: Test log server listening on: localhost:54594
[11:06:03] dbug: [TCP tunnel] Xamarin.Hosting: Created USB tunnel between the port 54594 on the device and the port 54594 on the mac.
...
[11:06:03] dbug: Test execution started
[11:06:04] dbug: Detected test end tag in application's output
[11:06:04] dbug: Process mlaunch exited with 137
[11:06:04] dbug: Test run completed
[11:06:09] dbug: Killing process tree of 37886...
[11:06:09] dbug: Pids to kill: 37886
[11:06:09] dbug: [TCP tunnel] Killing process 37892 as it was cancelled
[11:06:09] dbug: [TCP tunnel] Process mlaunch exited with 137
[11:06:09] dbug: Tests have finished executing

but somehow this coerces its exit code back to 0, reports tests, and considers itself succeeded. There may be "acceptable" and "unacceptable" crash scenarios here or something, but whatever this is merits investigation by someone with iOS / Xamarin expertise, for sure.
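
A rough sketch of the kind of logic that would produce this (illustrative Python, not the actual XHarness code): if success is decided from the parsed results rather than from the app's exit code, a crash that happens after the results were already captured still reports as a pass:

def decide_exit_code(app_exit_code: int, results_parsed: bool, failed_tests: int) -> int:
    # Hypothetical result handling: the reported outcome is driven by the
    # parsed test results, so a late crash (exit 137) gets coerced back to 0.
    if results_parsed and failed_tests == 0:
        return 0
    return app_exit_code if app_exit_code != 0 else 1

print(decide_exit_code(app_exit_code=137, results_parsed=True, failed_tests=0))  # -> 0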

@premun
Member

premun commented Oct 11, 2022

That's a nice catch. The way it works is that we start the TCP connection, send some initial message, and then nothing really gets sent until the end, when the whole XML with the results is sent.

It seems like the TCP connection drops somewhere in between, and it's a matter of luck whether it's still alive when we need it. We don't really have much code around keeping it alive / watching the status / retrying; the only thing we do is log whatever it outputs (with the [TCP tunnel] prefix).

The TCP tunnel is managed by mlaunch.
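
A rough sketch of the reporting flow as described above (illustrative Python; the host, port, and "ping" handshake are made up, not the actual XHarness/mlaunch protocol):

import socket

def report_results(host: str, port: int, results_xml: str) -> None:
    # Open the connection, send an initial message, then go quiet while the
    # tests run; the whole results XML is sent only at the very end. There is
    # no keep-alive, status check, or retry, so if the tunnel drops during the
    # idle window, the single final send is what fails.
    with socket.create_connection((host, port), timeout=30) as sock:
        sock.sendall(b"ping\n")
        # ... tests run for minutes with nothing sent on this socket ...
        sock.sendall(results_xml.encode("utf-8"))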

@steveisok
Member

We pulled a machine off of the iOS queue and are going to investigate further. I don't think this is an infrastructure issue either.

@missymessa
Member

@mdh1418 if this error is only occurring in your tests, we should turn this issue into a Test Known Issue in the dotnet/runtime repo instead of an infrastructure issue that the engineering services team would need to track.

@ulisesh, is it possible to migrate this to dotnet/runtime to turn it into a Test Known Issue, or would it be easier for Mitchell to open a new issue in that repo and for us to close this one?

@steveisok
Member

steveisok commented Oct 12, 2022

@missymessa I think I was being vague when I said I didn't think this was an infrastructure issue. I believe it may be some issue with the way we use usb-mux, which could be a problem in the mlaunch component we use to install / execute the mobile apps. Technically, that is an "infrastructure" component, but it is not managed within arcade; the xamarin-ios team is in control of it.

/cc @mandel-macaque

@missymessa
Member

@steveisok is the expectation that this issue should be tracked by dnceng until ownership of the error is determined?

@missymessa
Member

Also, given that this error message appears in the test logs, an Infrastructure Known Issue isn't going to be able to find it there. It would work better to create a Test Known Issue so that those errors can be found properly in the tests.

For example: Test Known Issue: dotnet/runtime#74488

And what it looks like in Build Analysis when a Known Issue for a test is found: https://github.com/dotnet/runtime/pull/73263/checks

@steveisok
Member

> @steveisok is the expectation that this issue should be tracked by dnceng until ownership of the error is determined?

For now, I think this is probably the best place.

@missymessa
Member

I'm going to move this to Tracking. There's nothing actionable for us (dnceng) to do on this issue; however, I would encourage y'all to turn this into a Test Known Issue in the dotnet/runtime repo so that it can be tracked properly with the Known Issue feature.

@ilyas1974
Contributor

Talked to @steveisok; his team is continuing to investigate the error and determine next steps.

ilyas1974 self-assigned this Oct 17, 2022
@ilyas1974
Contributor

No progress from the team on this issue.

@ilyas1974
Contributor

Closing this issue as it is a duplicate of issue #10820
