Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[iOS] Infrastructure is failing to report successful test run #11683

Closed
2 tasks
jkotas opened this issue Nov 20, 2022 · 34 comments
Closed
2 tasks

[iOS] Infrastructure is failing to report successful test run #11683

jkotas opened this issue Nov 20, 2022 · 34 comments
Assignees
Labels
Detected By - Customer Issue was reported by a customer Ops - First Responder

Comments

@jkotas
Copy link
Member

jkotas commented Nov 20, 2022

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=89051

Build leg reported

System.Runtime.Tests.WorkItemExecution

Pull Request

dotnet/runtime#78288

Action required for the engineering services team

To triage this issue (First Responder / @dotnet/dnceng):

  • Open the failing build above and investigate
  • Add a comment explaining your findings

If this is an issue that is causing build breaks across multiple builds and would get benefit from being listed on the build analysis check, follow the next steps:

  1. Add the label "Known Build Error"
  2. Edit this issue and add an error string in the Json below that can help us match this issue with future build breaks. You should use the known issues documentation
{
   "ErrorMessage" : "[TCP tunnel] Xamarin.Hosting: Failed to connect to port",
   "BuildRetry": true
}

Additional information about the issue reported

System.Runtime.Tests are reported as failing. It looks like a failure in the reporting infrastructure. System.Runtime.Tests succeeded according to the net.dot.System.Runtime.Tests.log file:

Tests run: 50259 Passed: 50025 Inconclusive: 0 Failed: 0 Ignored: 133 Skipped: 101
Killing process 95249 as it was cancelled
[TerminateWithSuccess]

Report

Build Definition Test Pull Request
89138 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#78593

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
1 1 1
@jkotas
Copy link
Member Author

jkotas commented Nov 21, 2022

Another hit in dotnet/runtime#78593

@premun
Copy link
Member

premun commented Nov 21, 2022

Analysis

I can see that the TCP tunnel between XHarness and the device failed:

[12:46:55] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 56661 on the device and the port 56661 (21981) on the mac: 61
[12:46:55] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 56661 (21981) on device (error: 61)

When this happens, the app just dumps the testResults.xml into its stdout instead the TCP connection (the stdout is then in the net.dot.System.Runtime.Tests.log log). So while the app did perform the actual unit tests fine, the run was considered an APP_CRASH because the app never talked back.

These TCP problems happen every now and then and this is the issue for them: dotnet/xharness#934
We haven't had the time to handle the TCP flakiness yet. It is unclear whether the flakiness comes from MacOS, mlaunch or the runtime/networking stack.

Mitigation proposal

I can see that we have turned retries off for Apple devices due to a small amount of HW but maybe we can turn them on? We now have many more iPhones in the osx.1200.amd64.iphone.open queue. Maybe we can move onto there and turn on retries? This sounds like we should do it anyway? @steveisok what do you think?

Solution (?)

  • Proper solution is to figure out why the TCP connection fails but that will be costly.
  • We can also make the TCP tunnel logic smarter so that it understands the connection is flopping.
  • We can also consider getting the testResults.xml from stdout always but that will be quite problematic because we get the logs through a quite hacky way. All solutions seem quite costly.

@premun premun self-assigned this Nov 21, 2022
@premun
Copy link
Member

premun commented Nov 21, 2022

I don't think we should create "known issue" for this as it might cover potential problems. I think enabling retries is the best next step. I will add open a PR for that. I checked the telemetry and there are not many APP_CRASH results (just 3 for iPhones in the last week or so).

Regardless.. seems like we are not using the new iPhones though - there are 50 iPhones with no work - we should also enable tests for iPhones too probably, not just AppleTVs?

@steveisok
Copy link
Member

I don't think we should create "known issue" for this as it might cover potential problems. I think enabling retries is the best next step. I will add open a PR for that. I checked the telemetry and there are not many APP_CRASH results (just 3 for iPhones in the last week or so).

Regardless.. seems like we are not using the new iPhones though - there are 50 iPhones with no work - we should also enable tests for iPhones too probably, not just AppleTVs?

We can enable retries and see if they help. I thought we added iphones to the rolling build. I'll definitely do that if it's not currently.

@premun
Copy link
Member

premun commented Nov 21, 2022

We have agreed that we could make XHarness recognize TCP issues and return some extra exit code and then have a known build error for that case.

@jkotas
Copy link
Member Author

jkotas commented Nov 21, 2022

We have agreed that we could make XHarness recognize TCP issues and return some extra exit code and then have a known build error for that case.

+1

I believe that the logic you have added in #11689 is going to hide intermittent test crashes that is bad for getting a signal about the product reliability. Can it be deleted as part of fixing this?

@premun
Copy link
Member

premun commented Nov 21, 2022

I believe that the logic you have added in #11689 is going to hide intermittent test crashes that is bad for getting a signal about the product reliability. Can it be deleted as part of fixing this?

Yes, this is a good point. Let's start distinguishing TCP-caused failures and let the other blow up.
(there have been 12 APP_CRASH results in the last 3 days btw)

I will include this in the issue that I will open for the TCP exit code.

@premun
Copy link
Member

premun commented Nov 21, 2022

Removing Known Build Error as this log line might also appear on runs with unrelated issues.

@premun
Copy link
Member

premun commented Nov 22, 2022

I have logged #11700 with enough details for someone to try to fix this

@premun
Copy link
Member

premun commented Jan 23, 2023

Fixes for this were added through #11700. Keeping this open and will monitor the telemetry through out the week.

@premun
Copy link
Member

premun commented Jan 27, 2023

Seems like the new fixes are already working! There are the TCP problems being detected:

image

But what is more important is that the retries happening for these cases now dropped the overall failure rate of AppleTV jobs to 0:

image

cc @AlitzelMendez @ilyas1974

@premun premun closed this as completed Jan 27, 2023
@steveisok
Copy link
Member

@premun it looks like your retry logic did mitigate a bunch, but there are still some getting through.

@steveisok steveisok reopened this Feb 16, 2023
@kotlarmilos
Copy link
Member

@premun I've been hitting it for a coupe of days in dotnet/runtime#81319. I will test the CI from a dummy PR with the changes and make sure it is not related.

@premun
Copy link
Member

premun commented Feb 17, 2023

@kotlarmilos seems like the AzDO builds of your PR got deleted as a new ones got queued by new commits. I tried to find a failing leg in the commit history of your PR but no luck. Could you point me to one please? Can be just a failing Helix job.

@premun
Copy link
Member

premun commented Feb 17, 2023

I see a large increase in iOS device failures in the last 10 days:

image

@steveisok
Copy link
Member

@premun here's an example build

https://github.com/dotnet/runtime/runs/11396687519

@kotlarmilos
Copy link
Member

@premun @steveisok AppleApp.targets had an issue not including an additional assembly during the AOT compilation, causing a runtime initialization failure on a device. It seems that when a runtime initialization fail the helix reports the TCP tunnel error, instead of an error log.

If the increase in iOS device failures are related to dotnet/runtime#81319 only, this issue could be closed.

Thanks for the support!

@steveisok
Copy link
Member

@kotlarmilos there are still issues with the tcp tunnel even though your PR may have contributed to the spike of failures.

@premun
Copy link
Member

premun commented Mar 31, 2023

Okay, so we've now got a TCP_CONNECTION_FAILED categorized separately and it's the largest amount of errors happening. Maybe we should now mark this as known test issue and ignore those?

@steveisok
Copy link
Member

@premun I agree. However, I am noticing an odd pattern if you look at recent builds. When you dive into this log, notice how many retries there are:

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-83380-merge-a5362ecb5f8d43b98d/System.Text.Json.Tests.Attempt.3/1/console.774bab3d.log?helixlogtype=result

And if you look at the net.dot from any of the failures, they'll appear to be running and abruptly cut off as a "runtime crash". I don't think that's accurate and it appears that helix is cutting the run short.

https://dev.azure.com/dnceng-public/public/_build/results?buildId=218047&view=ms.vss-test-web.build-test-results-tab&runId=4079086&resultId=156355&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

@premun
Copy link
Member

premun commented Mar 31, 2023

When you dive into this log, notice how many retries there are

Which retries do you mean?

Overall I am seeing an improvement but I understand we're now moving queues and might be running a reduced number of workloads?

image

@premun
Copy link
Member

premun commented Mar 31, 2023

I think the crashes are expected - the app never connects over TCP so starts running without it but XHarness hits the launch timeout and kills the app because it doesn't know what's going on.

[05:56:42.1359790] 2023-03-27 02:56:42.237 System.Collections.Tests[30144:5229512] 	[PASS] System.Collections.Generic.Tests.UIntPtrComparersTests.Comparer_ComparerCreate
[05:56:42.1382500] 2023-03-27 02:56:42.239 System.Collections.Tests[30144:5229512] 	[PASS] System.Collections.Generic.Tests.UIntPtrComparersTests.EqualityComparer_EqualityComparerDefault
[05:56:42.1552990] Xamarin.Hosting: Process '30144' exited with exit code 1 or crashing signal .
[05:56:42.1557690] Application 'net.dot.System.Collections.Tests' terminated (with exit code '1' and/or crashing signal ').

@kotlarmilos
Copy link
Member

kotlarmilos commented Apr 11, 2023

I think the crashes are expected - the app never connects over TCP so starts running without it but XHarness hits the launch timeout and kills the app because it doesn't know what's going on.

In the build https://github.com/dotnet/runtime/pull/79169/checks?check_run_id=12668103836 the tests failed due to a problem with establishing a TCP connection. As a result, XHarness terminated the tests execution.

In the build https://github.com/dotnet/runtime/pull/84304/checks?check_run_id=12661274730 the tests may have failed during the startup due to missing runtime invoke wrappers. Also, it may have failed due to a problem with establishing a TCP connection. As a result, the issue was reported as a problem with establishing a TCP connection. Since the issue is not reproducible locally, it is currently unclear what the next steps should be.

@steveisok @premun I am considering some ideas for improving the overall experience. Here are some thoughts:

  • I don't understand why the TCP connection seems to be unstable
  • XHarness could print a stack trace and error message when an error occurs for better diagnostics
  • It might be possible that multiple TCP connection attempts being made within a short period of time could overload a device
  • If an app never connects over TCP, XHarness shouldn't execute tests to avoid timeouts

@premun
Copy link
Member

premun commented Apr 12, 2023

  • I don't understand why the TCP connection seems to be unstable
  • XHarness could print a stack trace and error message when an error occurs for better diagnostics
  • It might be possible that multiple TCP connection attempts being made within a short period of time could overload a device

It's actually not XHarness doing the TCP but mlaunch - a tool used by XHarness and VS to talk to Apple devices/simulators. We didn't use to have this many issues with TCP but it's possible something regressed between mlaunch and new MacOS versions.

  • If an app never connects over TCP, XHarness shouldn't execute tests to avoid timeouts

This happens in the TestRunner - https://github.com/dotnet/xharness/blob/389c851b0dc1d2c50d03e4aad000b7802d0ebed6/src/Microsoft.DotNet.XHarness.TestRunners.Common/iOSApplicationEntryPointBase.cs#L26

@kotlarmilos
Copy link
Member

Thank you for the explanation. Here are some additional points that may be worth considering.

  • I don't understand why the TCP connection seems to be unstable
  • XHarness could print a stack trace and error message when an error occurs for better diagnostics
  • It might be possible that multiple TCP connection attempts being made within a short period of time could overload a device

It's actually not XHarness doing the TCP but mlaunch - a tool used by XHarness and VS to talk to Apple devices/simulators. We didn't use to have this many issues with TCP but it's possible something regressed between mlaunch and new MacOS versions.

I've seen a number of TCP error logs, even on successfully executed tests where mlaunch connects to a device after N attempts. What would be a good way to measure occurrence of the connection attempts on the CI? We may update the pattern in this issue to search for TCP error logs, but it may collect only failed jobs.

  • If an app never connects over TCP, XHarness shouldn't execute tests to avoid timeouts

This happens in the TestRunner - https://github.com/dotnet/xharness/blob/389c851b0dc1d2c50d03e4aad000b7802d0ebed6/src/Microsoft.DotNet.XHarness.TestRunners.Common/iOSApplicationEntryPointBase.cs#L26

Probably I didn't fully understand your comment at first. Are you saying that if a TCP connection is not established, XHarness will not launch the app but instead hit a timeout? According to the logs, it seems that the app had started, and during its execution a timeout occured.

I think the crashes are expected - the app never connects over TCP so starts running without it but XHarness hits the launch timeout and kills the app because it doesn't know what's going on.

[05:56:42.1359790] 2023-03-27 02:56:42.237 System.Collections.Tests[30144:5229512] 	[PASS] System.Collections.Generic.Tests.UIntPtrComparersTests.Comparer_ComparerCreate
[05:56:42.1382500] 2023-03-27 02:56:42.239 System.Collections.Tests[30144:5229512] 	[PASS] System.Collections.Generic.Tests.UIntPtrComparersTests.EqualityComparer_EqualityComparerDefault
[05:56:42.1552990] Xamarin.Hosting: Process '30144' exited with exit code 1 or crashing signal .
[05:56:42.1557690] Application 'net.dot.System.Collections.Tests' terminated (with exit code '1' and/or crashing signal ').

Based on the failures discovered in dotnet/runtime#84304, it was observed that the error log of an application that might fails during startup is being masked by TCP connection failures. To proceed working on such improvements and improve overall CI test execution on ios platforms, I recommend considering the following action points:

  • Create a sample PR that intentionally breaks the test application during startup, and update the CI to display the corresponding error logs instead of TCP failure
  • Investigate why a TCP connection over mlaunch is unable to connect to a device on the first attempt in most cases; current assumption that it rarely/never connects on the first attempt
  • Confirm that the XHarness don't to run an application if a TCP connection failure occurs

Do you have any other ideas on what could be improved or do you think some of the action points listed are unnecessary?

I suggest updating the tracking issue dotnet/runtime#84254 with some of the above-mentioned action points.

@premun
Copy link
Member

premun commented Apr 13, 2023

Probably I didn't fully understand your comment at first. Are you saying that if a TCP connection is not established, XHarness will not launch the app but instead hit a timeout? According to the logs, it seems that the app had started, and during its execution a timeout occured.

It happens in this order:

  1. XHarness starts, checks the app
  2. Finds a device (via mlaunch)
  3. Installs app on the device (via mlaunch)
  4. In the background, starts another instance of mlaunch to create the TCP tunnel between MacOS and the device
  5. Starts a TCP listener on the port it gave mlaunch in the previous step
  6. Waits for a signal log line from the TCP mlaunch (step 4) and when it says "TCP tunnel waiting for connections", proceeds next.
  7. Starts the app on the phone (also mlaunch)

Now.. XHarness doesn't have visibility into the device so it doesn't know if the app started well. The app can now crash for instance so it never connects. The app also might fail to connect to the TCP tunnel. For XHarness it's all the same. If it doesn't connect in a specific time (argument --launch-timeout), it times out.

Now from the point of view of the app, the app can start and as argument (or envvar) it receives a port where the tunnel should be. If it can't connect there, it will log in it's stdout instead of the TCP tunnel (this is the TestRunner link I sent).

There's also this diagram for this - #11700

To me, it seems, that in the last few months, mlaunch opens the tunnel but then has issues keeping it open. If you check the logs, it flops from "awaiting connection on port XY" to "failed to open the tunnel" again. We try to detect this state with a recent change and categorize this as TCP failure.

  • Create a sample PR that intentionally breaks the test application during startup, and update the CI to display the corresponding > error logs instead of TCP failure

When making the last change, we did give the app a wrong port purposefully to test the TCP tunnel not being there.

  • Confirm that the XHarness don't to run an application if a TCP connection failure occurs

The problem is that the TCP tunnel starts in a while and then things can work fine. Example:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-arcade-refs-heads-release-60-ac03203fd7924eddad/zipped-apps/1/console.3dbd4769.log?helixlogtype=result

These are the main things:


### APP INSTALLATION

[15:46:40] dbug: Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app --list-crash-reports=/tmp/helix/working/B2D609CE/t/tmpcE9ZNU.tmp --devname DNCENGTVOS-026 -v -v -v -v -v
[15:46:40] dbug: Using Xcode 12.4 found in /Applications/Xcode124.app
[15:46:40] dbug: Xamarin.Hosting: Device discovery started
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:7e:84:2cfff)
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:82:ac:e7fff)
[15:46:40] dbug: Xamarin.Hosting: Connected to Vsm-xam-appletv-02 (fff44:c6:5d:7e:84:2cfff) in 00:00:00.0187767
[15:46:40] dbug: Xamarin.Hosting: Connected to Vsm-xam-appletv-01 (fff44:c6:5d:82:ac:e7fff) in 00:00:00.0178724
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b)
[15:46:40] dbug: Xamarin.Hosting: Connected to DNCENGTVOS-026 (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b) in 00:00:00.0014046
[15:46:41] dbug: Process mlaunch exited with 0

### STARTING TCP TUNNEL

[15:46:41] dbug: Starting TCP tunnel between mac port: 55899 and device port 55899
[15:46:41] dbug: [TCP tunnel]
[15:46:41] dbug: [TCP tunnel] Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app --tcp-tunnel=55899:55899 --devname 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b -v -v -v -v -v
[15:46:41] dbug: [TCP tunnel] Using Xcode 12.4 found in /Applications/Xcode124.app
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery started
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:7e:84:2cfff)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:82:ac:e7fff)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to Vsm-xam-appletv-02 (fff44:c6:5d:7e:84:2cfff) in 00:00:00.0186957
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to Vsm-xam-appletv-01 (fff44:c6:5d:82:ac:e7fff) in 00:00:00.0155840
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to DNCENGTVOS-026 (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b) in 00:00:00.0011462
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Creating USB tunnel between the port 55899 on the device and the port 55899 on the mac.
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61

...

[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Please connect the device 'Tcp tunnel started on device port 55899 to host port 55899 for device 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b'...
[15:46:46] dbug: TCP tunnel created on port 55899

### WE SAW THAT TCP TUNNEL STARTED WELL
### STARTING THE APP


[15:46:46] dbug: Starting the application
[15:46:46] dbug: 
[15:46:46] dbug: Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app -setenv=NUNIT_AUTOEXIT=true -setenv=NUNIT_HOSTPORT=55899 -setenv=NUNIT_ENABLE_XML_OUTPUT=true -setenv=NUNIT_XML_VERSION=xUnit -setenv=NUNIT_HOSTNAME=10.105.153.129,169.254.177.29,fe80::1%1,fe80::14ee:7dbb:3d47:cae1%4,2001:4898:9008:60:10a3:9ee6:c47a:ea01,2001:4898:9008:60:6131:89e0:ab6b:c9ac,2001:4898:9008:60:6071:4e8:3e9b:5ec1,2001:4898:9008:60:f449:e1d5:b345:31d6,2001:4898:9008:60:5dee:4dd2:58f:174,2001:4898:9008:60:bd42:319b:cd83:da0d,2001:4898:9008:60:4535:ba78:97ae:15cb,fe80::aede:48ff:fe00:1122%5,fe80::4f7:69d5:88c2:9f52%6,fe80::dcaf:aaff:fe16:42a1%10,fe80::dcaf:aaff:fe16:42a1%11,fe80::51a3:d95b:3171:8da0%17,fe80::a003:cdd2:b0b5:f493%18 --disable-memory-limits --devname 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b -setenv=USE_TCP_TUNNEL=true --launchdevbundleid net.dot.System.Buffers.Tests --wait-for-exit -v -v -v -v -v


### NOW TCP TUNNEL BREAKS AGAIN

[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
...

### NOW IT WORKS AGAIN AND APP CONNECTS ACTUALLY

[15:46:50] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 0
[15:46:50] dbug: Test log server listening on: localhost:55899

###

[15:46:50] dbug: [TCP tunnel] Xamarin.Hosting: Created USB tunnel between the port 55899 on the device and the port 55899 on the mac.
[15:46:50] dbug: Connection from 127.0.0.1:55899 saving logs to /tmp/helix/working/B2D609CE/w/AA2A0964/uploads/test-tvos-device-20230411_154640.log
[15:46:50] dbug: Test execution started

### THE TEST FINISHES FINE AFTER THIS

Something must have changed in mlaunch or in MacOS but we didn't see this behaviour ~6 months ago and running device tests was actually super smooth. I am happy to meet in the office and we can talk about this in person as it's quite complicated.

@kotlarmilos
Copy link
Member

@premun Good idea! Let's synchronize offline and then post an update here.

@kotlarmilos
Copy link
Member

The regression that occurred at the beginning of February 2023 appears to have affected iOS and tvOS devices. According to the telemetry data, no change was found in the OS version on the host machines.

Screenshot 2023-04-28 at 17 33 45

In the past month, there were 13 TCP failures on osx.1200.amd64.iphone.open, and 138 TCP failures on osx.1015.amd64.iphone.open. Additionally, there are four times more devices in osx.1200.amd64.iphone.open than in osx.1015.amd64.iphone.open. The findings for tvOS devices are different, with uniform distribution of the TCP failure and no pattern detected.

Queue osx.1200.amd64.iphone.open is used on rolling builds (fewer invocations), but more tests are executed. One solution could be to exclude the osx.1015.amd64.iphone.open queue to see if the error rate decreases.

Anyway, these findings suggest that there may be an issue with the OS/Xcode or runtime network stack. To investigate this further and potentially repro the issue, it may be helpful to run an iOS app using the same mlaunch flow that Helix is using to run it, and collect the output over TCP locally.

@steveisok
Copy link
Member

Anyway, these findings suggest that there may be an issue with the OS/Xcode or runtime network stack. To investigate this further and potentially repro the issue, it may be helpful to run an iOS app using the same mlaunch flow that Helix is using to run it, and collect the output over TCP locally.

When this was done in the past, no one could reproduce the issue locally. @akoeplinger suspect's the tcp session is done over wifi as opposed to usb-max. I don't think that's something we've done locally and is worthy of a try.

@kotlarmilos
Copy link
Member

Thank you. Do you think that the following is something worthy of a try?

Queue osx.1200.amd64.iphone.open is used on rolling builds (fewer invocations), but more tests are executed. One solution could be to exclude the osx.1015.amd64.iphone.open queue to see if the error rate decreases.

@steveisok
Copy link
Member

Thank you. Do you think that the following is something worthy of a try?

We've tried variations of this with the same result, so no tbh. I think, as you pointed out, it's an issue w/ mlaunch/xcode/osx or something within runtime. I don't believe it's the latter.

@premun premun assigned ilyas1974 and unassigned premun May 5, 2023
@ilyas1974
Copy link
Contributor

@steveisok just wanted to see if there was any update on this item.

@premun
Copy link
Member

premun commented Jun 19, 2023

This is now being tracked in dotnet/runtime#82637 as a known error

@premun premun closed this as completed Jun 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Detected By - Customer Issue was reported by a customer Ops - First Responder
Projects
None yet
Development

No branches or pull requests

5 participants