runtime-staging (i.e. iOS and tvOS) legs are timing out in PRs #58549

Closed
ViktorHofer opened this issue Sep 2, 2021 · 12 comments

@ViktorHofer
Member

Looking at the runtime-staging builds from the last 24 hours, a lot of mobile legs (i.e. iOS and tvOS) timed out. Those legs already have a fairly large timeout (180 min). As this is turning PRs red, we should fix it immediately.

@steveisok @akoeplinger @MattGal

@dotnet-issue-labeler bot added the untriaged label Sep 2, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost

ghost commented Sep 2, 2021

Tagging subscribers to this area: @directhex
See info in area-owners.md if you want to be subscribed.


@ViktorHofer added this to the 6.0.0 milestone Sep 2, 2021
@steveisok
Member

@ViktorHofer when you say a lot, how many are we talking? Also, do some fall in the classification of getting a super slow mac? Or are they just getting slightly slower macs?

@akoeplinger Can you think of any immediate action we can take outside of shutting PR runs down temporarily?

@ViktorHofer
Member Author

I did a manual check as I don't know how to query for timed-out legs in AzDO or in Kusto. I looked at ~10 builds and half of them timed out.
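
(For reference: a rough, unverified Kusto sketch on the Helix side, reusing the WorkItems table and columns from MattGal's query further down this thread, might at least surface unusually long work items. The one-day window and 30-minute threshold below are placeholder assumptions, not values from this issue.)

// Rough sketch (unverified): find Helix work items that ran unusually long in the last day
WorkItems
| where Queued > ago(1d)
| extend runtime = Finished - Started
| where runtime > 30m
| project JobName, Started, Finished, runtime
| order by runtime desc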

@ViktorHofer
Member Author

Also, do some fall in the classification of getting a super slow mac?

The ones that I looked at weren't slow Mac machines. Building the repo didn't take longer than 20-30 minutes, but generating the app bundles took over two hours.

@MattGal
Member

MattGal commented Sep 2, 2021

I found these runs while trying to hunt super-slow Macs, but I don't see any evidence of real slowness when looking at the stages that should be basically constant, e.g. cloning the repo. There are two major problems IMO:

  • They always build all the tests even if they don't send them to Helix (@steveisok is checking that out).
  • Building the tests takes up most of the time, but when they do send to Helix, they need to understand that they're waiting on an external, load-sensitive system (other jobs send these machines work too), so they need to accommodate the amount of time that takes. We're talking 15+ GB of apps being created, uploaded to Azure Storage, downloaded in the Pacific Northwest to on-premises machines, installed on an emulator or device, run, and having the results re-uploaded, per pipeline in this job. That is a lot!

Some sort of system where we don't build one macOS app package per test assembly up front, and only build them when we actually need them, would improve performance a lot.

@jeffschwMSFT removed the untriaged label Sep 2, 2021
@ViktorHofer
Member Author

@MattGal
Member

MattGal commented Sep 8, 2021

Still timing out, e.g. https://github.com/dotnet/runtime/pull/58011/checks?check_run_id=3543565778.

Your archives here are just too big (it's something in the 15-20 GB range). I'd like to point out something about one of the listed runs to make my case.

2021-09-08T09:53:59.8783710Z   Sending Job to OSX.1015.Amd64.Open...
2021-09-08T10:32:17.8976280Z   Sent Helix Job; see work items at https://helix.dot.net/api/jobs/cb1b61c0-2f46-42f1-8def-876c46d0777f/workitems?api-version=2019-06-17

It took this hosted macOS machine 39 minutes to upload all those zips (wow), but the job it sent? It ran the entire 3 hours, 20 minutes of test work items in a mere 5 minutes, 18 seconds.

// Total per-work-item runtime vs. the wall-clock span from first queue to last finish for this job
WorkItems
| where JobName == "cb1b61c0-2f46-42f1-8def-876c46d0777f"
| extend runtime = Finished - Started
| summarize sum(runtime), min(Queued), max(Finished), entireTime = max(Finished) - min(Queued)

I can't tell you what to do here, but something has to make these payloads smaller somehow. I've long thought about combining test assemblies into batches for app packages, but the real decision will of course be owned by the team.

@steveisok self-assigned this Sep 8, 2021
@steveisok
Member

This is something we are actively working on. We're trying to prove/validate a couple of ways to reduce build times and maybe payload sizes as well. No lightning-quick solution, though.

@akoeplinger
Member

akoeplinger commented Sep 12, 2021

#58965 shaves off a considerable amount of the time it takes to build the app bundles.

I did also see what @MattGal noticed in that PR, but there's an interesting twist:

The Build iOSSimulator x64 Release AllSubsets_Mono job took 50 minutes between the Sending Job to OSX.1015.Amd64.Open... and Sent Helix Job messages to upload archives, while the Build tvOSSimulator x64 Release AllSubsets_Mono job only took two (!) minutes to do the same, even though the app bundle sizes should be nearly identical between the two jobs.

So there must be something more going on than "just" large archives.

@MattGal
Member

MattGal commented Sep 13, 2021

So there must be something more going on than "just" large archives.

Yes, though I think the largeness is part of the problem since network and disk bandwidth of a virtual machine running on a host is commonly shared between the two agents running there.

We've been hunting down examples of slow mac stuff recently and I realized I should re-enable the helix side of it, since perhaps that's what I'm looking for.

This job does seem to have interesting data points for our tracking issue, https://github.com/dotnet/core-eng/issues/14027

2 minutes (to upload 4.8 GB) version:

Name: Build tvOSSimulator x64 Release AllSubsets_Mono
Helix Job : 7ab7deeb-2941-4f82-a3a4-b9f2e7dd37c7
Job list: https://helixde8s23ayyeko0k025g8.blob.core.windows.net/helix-job-1367ec86-b6b9-4b48-ad51-0417ce3017b60a41aa8e71447a0ae/job-list-419c8afc-3d38-4a88-bd70-56f853bac9ef.json?sv=2019-07-07&se=2021-10-01T22%3A40%3A51Z&sr=c&sp=rl&sig=MOBx7MnChUDyPsIf2AcVa5ukLsX2rY7a8qC%2BtaDea9E%3D
Total payload size: 4.81 GiB

50 minutes (to upload 3.64 GB) version:

Name: Build iOSSimulator x64 Release AllSubsets_Mono
Helix Job : ec79c1e3-835b-4720-9b3a-c3168dd33a21
Job List: https://helixde8s23ayyeko0k025g8.blob.core.windows.net/helix-job-caed5aa9-afd8-4877-93da-dbb85816e60b7c81df05b5b450f96/job-list-20bb3e33-5022-4d35-bebf-e476c901a39c.json?sv=2019-07-07&se=2021-10-01T22%3A40%3A19Z&sr=c&sp=rl&sig=1m%2BSx%2FsWvyjEzmc8GVtLJI6nH5GxweLRLTnX0clCNCE%3D
Total Payload size: 3.64 GiB
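
In rough back-of-envelope terms, the fast job moved ~4.8 GiB in about 2 minutes (~40 MiB/s effective), while the slow one moved ~3.6 GiB in about 50 minutes (~1.2 MiB/s effective), so the difference is throughput rather than payload size. For what it's worth, the two jobs could also be compared side by side with the same kind of query as above; this is only a sketch, reusing the WorkItems columns already shown in this thread:

// Sketch: total work-item runtime vs. wall-clock span for the two jobs listed above
WorkItems
| where JobName in ("7ab7deeb-2941-4f82-a3a4-b9f2e7dd37c7", "ec79c1e3-835b-4720-9b3a-c3168dd33a21")
| extend runtime = Finished - Started
| summarize totalRuntime = sum(runtime), entireTime = max(Finished) - min(Queued) by JobName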

So the slower one actually uploaded 1.2 GB less; that's very interesting. We'll keep pushing on the IcM linked in the issue, but please be patient because there just isn't enough instrumentation yet to understand the problem.

steveisok pushed a commit that referenced this issue Oct 21, 2021
…on CI (#59154)

Backport of #58965 to release/6.0

This allows us to not run the CMake configure step separately for each libraries test suite, which speeds up the build.

Helps with #58549
@steveisok modified the milestones: 7.0.0, Future Jul 31, 2022
@akoeplinger
Member

We don't have runtime-staging anymore, closing this old issue.

@github-actions bot locked and limited conversation to collaborators Mar 18, 2024