[iOS] Infrastructure is failing to report successful test run #11683
Comments
Another hit in dotnet/runtime#78593 |
Analysis
I can see that the TCP tunnel between XHarness and the device failed:
When this happens, the app just dumps the
These TCP problems happen every now and then and this is the issue for them: dotnet/xharness#934
Mitigation proposal
I can see that we have turned retries off for Apple devices due to a small amount of HW but maybe we can turn them on? We now have many more iPhones in the osx.1200.amd64.iphone.open queue. Maybe we can move onto there and turn on retries? This sounds like we should do it anyway? @steveisok what do you think?
Solution (?)
|
I don't think we should create a "known issue" for this as it might cover potential problems. I think enabling retries is the best next step. I will open a PR for that. I checked the telemetry and there are not many Regardless.. seems like we are not using the new iPhones though - there are 50 iPhones with no work - we should probably enable tests for iPhones too, not just AppleTVs? |
We can enable retries and see if they help. I thought we added iphones to the rolling build. I'll definitely do that if it's not currently. |
We have agreed that we could make XHarness recognize TCP issues and return some extra exit code and then have a known build error for that case. |
+1 I believe that the logic you have added in #11689 is going to hide intermittent test crashes, which is bad for getting a signal about the product reliability. Can it be deleted as part of fixing this? |
Yes, this is a good point. Let's start distinguishing TCP-caused failures and let the other blow up. I will include this in the issue that I will open for the TCP exit code. |
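The idea of retrying only TCP-caused failures and letting everything else fail loudly could be sketched roughly as follows. This is a minimal illustration, not the actual XHarness or Helix logic: the exit code value `EXIT_TCP_CONNECTION_FAILED`, the `Outcome` enum, and the `run_with_retries` helper are all hypothetical names invented for this example.

```python
from enum import Enum
from typing import Callable, Tuple

EXIT_SUCCESS = 0
# Hypothetical value: the real exit code XHarness would use is not stated in this thread.
EXIT_TCP_CONNECTION_FAILED = 86


class Outcome(Enum):
    PASSED = "passed"
    TCP_FAILURE = "tcp_failure"
    OTHER_FAILURE = "other_failure"


def classify(exit_code: int) -> Outcome:
    """Map a process exit code to a coarse outcome."""
    if exit_code == EXIT_SUCCESS:
        return Outcome.PASSED
    if exit_code == EXIT_TCP_CONNECTION_FAILED:
        return Outcome.TCP_FAILURE
    return Outcome.OTHER_FAILURE


def run_with_retries(run_once: Callable[[], int], max_attempts: int = 3) -> Tuple[Outcome, int]:
    """Run a job, retrying only when the failure is classified as TCP-related.

    Any other failure is returned immediately so that real crashes stay visible.
    Returns the final outcome and the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        outcome = classify(run_once())
        if outcome is not Outcome.TCP_FAILURE or attempts >= max_attempts:
            return outcome, attempts
```

Keeping the classification separate from the retry loop means a crash with any other exit code is reported on the first attempt, which preserves the reliability signal discussed above.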
Removing |
I have logged #11700 with enough details for someone to try to fix this |
Fixes for this were added through #11700. Keeping this open and will monitor the telemetry throughout the week. |
Seems like the new fixes are already working! Here are the TCP problems being detected: But what is more important is that the retries happening for these cases have now dropped the overall failure rate of AppleTV jobs to 0: |
@premun it looks like your retry logic did mitigate a bunch, but there are still some getting through. |
@premun I've been hitting it for a couple of days in dotnet/runtime#81319. I will test the CI from a dummy PR with the changes and make sure it is not related. |
@kotlarmilos seems like the AzDO builds of your PR got deleted as new ones got queued by new commits. I tried to find a failing leg in the commit history of your PR but no luck. Could you point me to one please? Can be just a failing Helix job. |
@premun here's an example build |
@premun @steveisok If the increase in iOS device failures is related to dotnet/runtime#81319 only, this issue could be closed. Thanks for the support! |
@kotlarmilos there are still issues with the tcp tunnel even though your PR may have contributed to the spike of failures. |
Okay, so we've now got a |
@premun I agree. However, I am noticing an odd pattern if you look at recent builds. When you dive into this log, notice how many retries there are: And if you look at the net.dot from any of the failures, they'll appear to be running and abruptly cut off as a "runtime crash". I don't think that's accurate and it appears that helix is cutting the run short. |
I think the crashes are expected - the app never connects over TCP so starts running without it but XHarness hits the launch timeout and kills the app because it doesn't know what's going on.
|
In the build https://github.com/dotnet/runtime/pull/79169/checks?check_run_id=12668103836 the tests failed due to a problem with establishing a TCP connection. As a result, XHarness terminated the test execution.
In the build https://github.com/dotnet/runtime/pull/84304/checks?check_run_id=12661274730 the tests may have failed during startup due to missing runtime invoke wrappers. They may also have failed due to a problem with establishing a TCP connection. As a result, the issue was reported as a problem with establishing a TCP connection.
Since the issue is not reproducible locally, it is currently unclear what the next steps should be. @steveisok @premun I am considering some ideas for improving the overall experience. Here are some thoughts:
|
It's actually not XHarness doing the TCP but mlaunch - a tool used by XHarness and VS to talk to Apple devices/simulators. We didn't use to have this many issues with TCP but it's possible something regressed between mlaunch and new macOS versions.
This happens in the TestRunner - https://github.com/dotnet/xharness/blob/389c851b0dc1d2c50d03e4aad000b7802d0ebed6/src/Microsoft.DotNet.XHarness.TestRunners.Common/iOSApplicationEntryPointBase.cs#L26 |
Thank you for the explanation. Here are some additional points that may be worth considering.
I've seen a number of TCP error logs, even on successfully executed tests where mlaunch connects to a device after N attempts. What would be a good way to measure occurrence of the connection attempts on the CI? We may update the pattern in this issue to search for TCP error logs, but it may collect only failed jobs.
Probably I didn't fully understand your comment at first. Are you saying that if a TCP connection is not established, XHarness will not launch the app but instead hit a timeout? According to the logs, it seems that the app had started, and during its execution a timeout occurred.
Based on the failures discovered in dotnet/runtime#84304, it was observed that the error log of an application that might fail during startup is being masked by TCP connection failures. To proceed with such improvements and improve overall CI test execution on iOS platforms, I recommend considering the following action points:
Do you have any other ideas on what could be improved or do you think some of the action points listed are unnecessary? I suggest updating the tracking issue dotnet/runtime#84254 with some of the above-mentioned action points. |
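On the question of measuring how often connection attempts fail, including in jobs that eventually pass, one option is to count tunnel-related messages per log regardless of the job result. The sketch below is only an illustration: the two regular expressions are assumed placeholders modeled loosely on messages quoted elsewhere in this thread, not verified mlaunch or XHarness log formats, and `count_tunnel_events` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed placeholder patterns; the exact log wording would need to be confirmed.
TUNNEL_FAILED = re.compile(r"failed to open the tunnel", re.IGNORECASE)
TUNNEL_WAITING = re.compile(r"awaiting connection on port \d+", re.IGNORECASE)


def count_tunnel_events(lines: Iterable[str]) -> Counter:
    """Count tunnel failures and wait messages in one log, independent of job outcome."""
    counts: Counter = Counter()
    for line in lines:
        if TUNNEL_FAILED.search(line):
            counts["failed"] += 1
        elif TUNNEL_WAITING.search(line):
            counts["waiting"] += 1
    return counts
```

Aggregating these counters over both passing and failing jobs would show how many attempts a typical run needs, which a failure-only search pattern cannot.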
It happens in this order:
Now.. XHarness doesn't have visibility into the device so it doesn't know if the app started well. The app can now crash for instance so it never connects. The app also might fail to connect to the TCP tunnel. For XHarness it's all the same. If it doesn't connect in a specific time (argument
Now from the point of view of the app, the app can start and as argument (or envvar) it receives a port where the tunnel should be. If it can't connect there, it will log to its stdout instead of the TCP tunnel (this is the TestRunner link I sent). There's also this diagram for this - #11700
To me it seems that in the last few months, mlaunch opens the tunnel but then has issues keeping it open. If you check the logs, it flops from "awaiting connection on port XY" to "failed to open the tunnel" again. We try to detect this state with a recent change and categorize this as TCP failure.
When making the last change, we did give the app a wrong port purposefully to test the TCP tunnel not being there.
The problem is that the TCP tunnel starts after a while and then things can work fine. Example: These are the main things:
Something must have changed in mlaunch or in macOS but we didn't see this behaviour ~6 months ago and running device tests was actually super smooth. I am happy to meet in the office and we can talk about this in person as it's quite complicated. |
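The app-side behaviour described in this comment (receive a port, try to connect to the tunnel, fall back to stdout if that fails) can be illustrated with a small sketch. This is not the TestRunner code, which is C#; it is a Python approximation under stated assumptions, and `open_result_stream` along with its parameters is a hypothetical name.

```python
import socket
import sys
from typing import TextIO


def open_result_stream(host: str, port: int, timeout_s: float = 5.0) -> TextIO:
    """Return a writable text stream for test results.

    Tries the TCP tunnel first; if the connection cannot be established,
    falls back to stdout so the results are still written somewhere.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return sys.stdout
    sock.settimeout(None)  # back to blocking mode for file-style writes
    stream = sock.makefile("w", encoding="utf-8")
    sock.close()  # the underlying socket stays open until the stream is closed
    return stream
```

From the host's point of view both fallback and crash look the same: nothing ever arrives on the tunnel, which is why the timeout alone cannot tell the two cases apart.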
@premun Good idea! Let's synchronize offline and then post an update here. |
The regression that occurred at the beginning of February 2023 appears to have affected iOS and tvOS devices. According to the telemetry data, no change was found in the OS version on the host machines. In the past month, there were 13 TCP failures on Queue Anyway, these findings suggest that there may be an issue with the OS/Xcode or runtime network stack. To investigate this further and potentially repro the issue, it may be helpful to run an iOS app using the same mlaunch flow that Helix is using to run it, and collect the output over TCP locally. |
When this was done in the past, no one could reproduce the issue locally. @akoeplinger suspects the TCP session is done over Wi-Fi as opposed to usb-max. I don't think that's something we've done locally and it is worth a try. |
Thank you. Do you think that the following is something worthy of a try?
|
We've tried variations of this with the same result, so no tbh. I think, as you pointed out, it's an issue w/ mlaunch/xcode/osx or something within runtime. I don't believe it's the latter. |
@steveisok just wanted to see if there was any update on this item. |
This is now being tracked in dotnet/runtime#82637 as a known error |
Build
https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=89051
Build leg reported
System.Runtime.Tests.WorkItemExecution
Pull Request
dotnet/runtime#78288
Action required for the engineering services team
To triage this issue (First Responder / @dotnet/dnceng):
If this is an issue that is causing build breaks across multiple builds and would get benefit from being listed on the build analysis check, follow the next steps:
Additional information about the issue reported
System.Runtime.Tests are reported as failing. It looks like a failure in the reporting infrastructure. System.Runtime.Tests succeeded according to the net.dot.System.Runtime.Tests.log file:
Report
Summary