Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle faulty TCP connection between XHarness and Apple devices #11700

Closed
1 of 5 tasks
premun opened this issue Nov 22, 2022 · 13 comments
Closed
1 of 5 tasks

Handle faulty TCP connection between XHarness and Apple devices #11700

premun opened this issue Nov 22, 2022 · 13 comments
Assignees

Comments

@premun
Copy link
Member

premun commented Nov 22, 2022

Context

When we run mobile device tests on Apple platforms, we create a TCP tunnel through the USB cable that the AppleTV / iPhone is attached by. The TCP tunnel is created by a tool called mlaunch that XHarness uses to talk to the device. A separate mlaunch process is launched on the side that creates the tunnel and XHarness reads its stdout.

flowchart LR
    Step1[XHarness]
    Step2[mlaunch --tcp]
    Step3[TCP tunnel]
    Step4[Apple Device]
    Step5[mlaunch]

    Step1--spawns a backgroud mlaunch process-->Step2
    Step3--listens to TCP-->Step1
    Step2--creates TCP tunnel-->Step3
    Step4--writes data-->Step3
    Step1--calls -->Step5
    Step5--installs + runs app-->Step4
Loading

Problem with TCP

It can happen that the TCP tunnel between XHarness and the device fails:

[12:46:55] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 56661 on the device and the port 56661 (21981) on the mac: 61
[12:46:55] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 56661 (21981) on device (error: 61)

It can also happen that the tunnel is created but the device cannot connect to it (desribed in dotnet/xharness#934).

When either of thees happens, the app running on the device fails to connect as a plan B just writes into stdout instead the TCP connection (the stdout is then in the net.dot.System.Runtime.Tests.log log). So while the app did performs the actual unit tests fine, the run is considered an APP_CRASH because the app never talked back.

Goal

The subject of this issue is to not solve the TCP flakiness as that is somewhere between iOS, MacOS, mlaunch and the .NET runtime. The problem we want to solve here is that we qualify the TCP as APP_CRASH which puts it in the same boat as when the app actually crashes.

Instead, we'd like to recognize the TCP tunnel being the problem and exit with some new exit code (e.g. TCP_CONNECTION_FAILED).
We can then set retries for this specific case as this can be considered an infra failure and let APP_CRASH not retry.

To summarize:

  • Add a new exit code to XHarness TCP_CONNECTION_FAILED and return it instead of APP_CRASH when TCP dies
  • Allow retries for this new exit code + revert retries for APP_CRASH (see #11689)

Development instructions

It should be easy to stage a repro. You can do it like this:

  1. We create the TCP tunnel here: https://github.com/dotnet/xharness/blob/bc9877ac24c13ef8d4ad4c2c3652291a1b02b78f/src/Microsoft.DotNet.XHarness.Apple/AppOperations/AppTester.cs#L166-L172
  2. We get the port it was created it on https://github.com/dotnet/xharness/blob/bc9877ac24c13ef8d4ad4c2c3652291a1b02b78f/src/Microsoft.DotNet.XHarness.Apple/AppOperations/AppTester.cs#L177
  3. We send it to the mobile application via env variable https://github.com/dotnet/xharness/blob/bc9877ac24c13ef8d4ad4c2c3652291a1b02b78f/src/Microsoft.DotNet.XHarness.Apple/AppOperations/AppTester.cs#L223
  4. The app then connects to that port and this event is called: https://github.com/dotnet/xharness/blob/bc9877ac24c13ef8d4ad4c2c3652291a1b02b78f/src/Microsoft.DotNet.XHarness.Apple/AppOperations/AppTester.cs#L197

To repro, you can just send the wrong port to the app (e.g. deviceListenerPort + 1) and this way the phone/simulator won't connect.

You can use the XHarness E2E tests to create a Helix job from dotnet/xharness: https://github.com/dotnet/xharness/blob/main/tools/run-e2e-test.ps1

You should be able to work with this. More details are also here: dotnet/xharness#934

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

Improved handling of faulty TCP connection between XHarness and Apple devices

@premun
Copy link
Member Author

premun commented Nov 22, 2022

cc @steveisok @akoeplinger

@steveisok
Copy link
Member

/cc @mandel-macaque

@premun premun added the Needs Triage A new issue that needs to be associated with an epic. label Nov 29, 2022
@premun
Copy link
Member Author

premun commented Nov 29, 2022

[Async-Triage] Ilya suggested we take this through the FR and I was going to do it during my last rotation but the FR was super busy that week and I didn't get around to it. Ilya also mentioned it might be useful for someone else to have a go at this so that we spread the knowledge about mobile devices a bit while I would support the person. I tried to spec out the issue in enough detail for that to be hopefully possible.

@michellemcdaniel
Copy link
Contributor

michellemcdaniel commented Dec 1, 2022

Premek, can you discuss the impact of this, ie the business impact (more what benefit would introducing the new error code vs just retrying on APP_CRASH)?

@premun
Copy link
Member Author

premun commented Dec 5, 2022

The impact is a bit described in the linked issue where a customer reported failing rolling builds where this just happens often enough that it matters. This issue actually has been around since August and we thought we can ignore it but it just happens frequently enough that we shouldn't ignore it.

As Jan describes in the issue - ignoring all APP_CRASH results is not what we can afford and we should try to weed out infra failures from other problems as much as possible. The stack is just very complicated for the runtime team and if the runtime is flaky on iOS, we should know so that people's MAUI apps don't crash randomly. The way we would know this is by measuring the results as precisely as possible. If other (infra) issues creep into this, it becomes problematic.

Unfortunately, we cannot turn the original issue into a Known Issue to get better numbers as APP_CRASH means multiple things now. It's also super risky because we don't have any one single string to use for the Known Issue (the TCP connection can flop during a perfectly good run too) so it might mask other potential problems.

As an additional example of a potential business impact, we've just caught a big one similar to this 2 weeks ago which almost went into the 7.0.200 build. This alert literally saved our ass big time as we reverted a bad backport - #11605 (comment)
This was an APP_CRASH behaviour on Android emulators so if we were ignoring those, it would be a big problem.

@premun
Copy link
Member Author

premun commented Dec 15, 2022

Side note: I am on FR the next week

@premun premun removed the Needs Triage A new issue that needs to be associated with an epic. label Dec 21, 2022
@premun premun added Ops - First Responder Needs Triage A new issue that needs to be associated with an epic. and removed Needs Triage A new issue that needs to be associated with an epic. labels Dec 21, 2022
@ilyas1974
Copy link
Contributor

@garath, what is the current status of this work item?

@premun
Copy link
Member Author

premun commented Jan 4, 2023

Wasn't it @jonfortescue working on this?

@ilyas1974
Copy link
Contributor

John handed off to Stu last week

@tkapin
Copy link
Member

tkapin commented Jan 20, 2023

Kudos to you @AlitzelMendez for not being afraid to jump into the mobile space. Good work!

@premun premun assigned premun and unassigned AlitzelMendez Jan 23, 2023
@premun
Copy link
Member Author

premun commented Jan 23, 2023

Thanks @AlitzelMendez, assigning to myself and will monitor this through out the week

@premun
Copy link
Member Author

premun commented Jan 23, 2023

Actually, will keep the customer issue up for tracking of this: #11683

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants