
Some macos agents are slow. More than 2x slower #3885

Closed
andreineculau opened this issue Aug 11, 2021 · 22 comments

Labels
Area: Apple · bug report · investigate (Collect additional information, like space on disk, other tool incompatibilities etc.)

Comments

@andreineculau commented Aug 11, 2021

Description

Similar to #2707, I am noticing slower macOS agents. It's random but frequent: 10-25% of the runs are slow.

A good run can "uninstall homebrew" (i.e. mostly disk i/o) in 184 seconds https://github.com/rokmoln/support-firecloud/runs/3295490478?check_suite_focus=true#step:3:190

A slow run can "uninstall homebrew" roughly 2.6 times slower, in 483 seconds:
https://github.com/rokmoln/support-firecloud/runs/3295490501?check_suite_focus=true#step:3:190

Similarly, reinstalling homebrew (disk-, network- and cpu-bound) happens 1.8 times slower: 593 seconds https://github.com/rokmoln/support-firecloud/runs/3295490501?check_suite_focus=true#step:3:447 instead of 333 seconds https://github.com/rokmoln/support-firecloud/runs/3295490478?check_suite_focus=true#step:3:456

Overall, my builds time out even with a timeout set to roughly twice the normal duration (normally 23 minutes, timeout 45 minutes).

As requested by @miketimofeev, I have a repro workflow here: https://github.com/andreineculau/actions-ve-repro-2707 (the same one @smorimoto used in #2707).

You can see a good run ending after 1m30s, with brew install sysbench taking 29s (https://github.com/andreineculau/actions-ve-repro-2707/runs/3298362909?check_suite_focus=true), as opposed to a slow run ending after 3m09s, with brew install sysbench taking 2m13s (https://github.com/andreineculau/actions-ve-repro-2707/runs/3298362887?check_suite_focus=true).

Virtual environments affected

  • Ubuntu 16.04
  • Ubuntu 18.04
  • Ubuntu 20.04
  • macOS 10.15
  • macOS 11
  • Windows Server 2016
  • Windows Server 2019

Image version and build link

20210801.1

Is it a regression?

No response

Expected behavior

Consistent times. I guess a <25% deviation is expected, but not >100%.

Actual behavior

More than 2x slower run times.

Repro steps

https://github.com/andreineculau/actions-ve-repro-2707/blob/85c642afe6c8d19f3e30bfa51de7d2075d6a5414/.github/workflows/workflow.yml
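
For reference, a minimal sketch of what such a timing-repro workflow can look like; the commands and the 45-minute timeout are illustrative assumptions, not a copy of the linked file:

```yaml
# minimal sketch of a timing-repro workflow; the commands and the
# timeout value are illustrative assumptions, not the linked file
name: macos-timing-repro
on: [push]
jobs:
  timing:
    runs-on: macos-latest
    timeout-minutes: 45   # ~2x the normal ~23-minute build time
    steps:
      - name: uninstall homebrew (mostly disk i/o)
        run: |
          curl -fsSLO https://raw.githubusercontent.com/Homebrew/install/HEAD/uninstall.sh
          time /bin/bash uninstall.sh --force
      - name: reinstall homebrew (disk, network and cpu bound)
        run: |
          time /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```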

@LeonidLapshin (Contributor)

Hey, @andreineculau
I'll take a look and come back with new information soon :)
Thanks for reporting!

@andreineculau (Author)

I modified my workflow a bit, to better separate network i/o from disk i/o.

brew update in 2m31s https://github.com/andreineculau/actions-ve-repro-2707/runs/3299170641?check_suite_focus=true vs 41s https://github.com/andreineculau/actions-ve-repro-2707/runs/3299170575?check_suite_focus=true

You can see, though, how the numbers point towards disk i/o: e.g. brew update in 1m12s but brew install sysbench (with no network i/o!) in 57s https://github.com/andreineculau/actions-ve-repro-2707/runs/3299207551?check_suite_focus=true vs 28s and 19s in https://github.com/andreineculau/actions-ve-repro-2707/runs/3299207510?check_suite_focus=true
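
A sketch of how that split can be made even cleaner, assuming the bottle is fetched up front so the install step is disk-bound only:

```yaml
# sketch: isolate network i/o (update/fetch) from disk i/o (install)
name: brew-io-split
on: [push]
jobs:
  io-split:
    runs-on: macos-latest
    steps:
      - name: network-bound
        run: |
          time brew update          # mostly git/network traffic
          time brew fetch sysbench  # downloads the bottle, nothing else
      - name: disk-bound
        run: time brew install sysbench  # unpacks the already-fetched bottle
```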

@andreineculau (Author)

One last comment before I put this to rest: I wanted to print some system info, and hit a very big difference: 1m26s vs 4m43s.

https://github.com/andreineculau/actions-ve-repro-2707/runs/3299223287?check_suite_focus=true vs https://github.com/andreineculau/actions-ve-repro-2707/runs/3299223358?check_suite_focus=true

A command like system_profiler SPSoftwareDataType SPDeveloperToolsDataType ran in either 2s or 1m9s!
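
For anyone who wants to reproduce that check, a single step along these lines is enough (timing via the shell's time keyword):

```yaml
# sketch: time the system info dump that fluctuated between 2s and 1m9s
name: system-info-timing
on: [push]
jobs:
  info:
    runs-on: macos-latest
    steps:
      - name: print system info (seconds on a healthy agent, minutes on a slow one)
        run: time system_profiler SPSoftwareDataType SPDeveloperToolsDataType
```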

andreineculau added a commit to ysoftwareab/yplatform that referenced this issue Aug 11, 2021
LeonidLapshin added the Area: Apple and investigate (Collect additional information, like space on disk, other tool incompatibilities etc.) labels and removed the needs triage label Aug 12, 2021
@LeonidLapshin (Contributor)

Hey, @andreineculau
We have investigated this issue and created some internal tasks to fix it; we'll update this ticket when we get new information :)
Thanks!

@andreineculau (Author)

I haven't noticed flaky performance for a while, but I'm wondering whether a cause has been found and a permanent fix pushed, so that this issue could be closed. Thanks.

@miketimofeev (Contributor)

@andreineculau some environments were fixed and some are still in progress, so the performance can be flaky.

@andreineculau (Author)

@miketimofeev thanks for the update! Just to be sure, there isn't anything one can do as part of the job configuration as a local fix, right?

@miketimofeev (Contributor)

@andreineculau unfortunately, yes. It's all about underlying infrastructure at the moment.

@JJ (Contributor) commented Jul 20, 2022

Performance is so slow that any CPU-bound task, like caching, is rendered totally useless. Unpacking a 120 MB file takes longer than actually running all the different npm i invocations that create the node_modules directories (4x as slow).
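
A sketch of that comparison, assuming a tarball-based cache; node_modules.tar.gz is a hypothetical placeholder for the cache archive:

```yaml
# sketch: compare unpacking a pre-built node_modules cache against npm ci;
# node_modules.tar.gz is a hypothetical placeholder for the cache archive
name: cache-vs-install
on: [push]
jobs:
  compare:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - name: unpack cache (cpu/disk bound)
        run: time tar -xzf node_modules.tar.gz
      - name: clean install for comparison (network + cpu)
        run: |
          rm -rf node_modules
          time npm ci
```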

@JJ (Contributor) commented Jul 20, 2022

> @andreineculau unfortunately, yes. It's all about underlying infrastructure at the moment.

Is it possible that the runners' use of the C# runtime makes them perform less well in a macOS environment?

@smorimoto (Contributor)

@JJ That's unlikely. At most it executes system commands; in other words, there is no human-perceptible overhead in that part, and the cause is clear from my past investigations. I'm a little confused as to why this problem still exists. See #2707.

@sidekick-eimantas

What a rabbit hole of closed issues.
Related issues:
#2707
#1336
#6547

Control (ubuntu-22.04): [benchmark screenshots]

Macos 12: [benchmark screenshots]

Macos 11: [benchmark screenshot]
Anyone got an update on the state of this work?

@PatTheMav

On top of the observed fluctuations in performance (sometimes a Homebrew installation takes 10 seconds, sometimes the same installation takes 5 minutes), I have also observed very slow file I/O recently, and the Actions UI itself being unresponsive:

The runtime counter just stops, no visible log output is displayed, and it can take minutes until any output is shown. I do, however, experience the same issue with other runners and with the UI as a whole: it gets "stuck" showing a single job as active even though it has finished, and doesn't update the state of other jobs.

@dockay commented Nov 21, 2022

We are on Azure and have been having the same issues recently. The performance of the macOS VMs has never been great, but recently (the last 1-2 weeks) it dropped to a new low. We have runners that run into the 60-minute timeout, runners that no longer update the UI (even after a refresh etc.), and jobs not even getting a runner (timing out here too, even though runners are available). I/O performance is at a new low as well: we have build tasks for an app that used to take around 10 minutes, and now we see build times around 15-20 minutes, if the jobs don't just die in the middle of the task.

Please MS/GH fix your macOS setups, this is not usable anymore.

@connorjclark commented Dec 14, 2022

Random ~100x regression on CPU-heavy tasks. Here's an example where this happened for only one Mac job (the others had normal performance): https://github.com/connorjclark/ZeldaClassic/actions/runs/3690258663/jobs/6247167952. Hope this information helps.
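
One way to quantify this kind of regression is a fixed CPU benchmark at job start; the sysbench invocation below is a hedged sketch, not what the linked job ran:

```yaml
# sketch: a fixed cpu baseline whose events/sec can be compared across runs
name: cpu-baseline
on: [push]
jobs:
  baseline:
    runs-on: macos-latest
    steps:
      - name: cpu benchmark
        run: |
          brew install sysbench
          sysbench cpu --time=10 run   # prints events per second; dramatically lower on a bad agent
```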

@niall-shaw

@miketimofeev - is there any update on this? I use macos-latest, and downloading a (quite small) 26 MB file sometimes takes more than 9 minutes. This network problem is causing some builds to be much slower than others.
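
To measure the download in isolation, curl's -w timing variables are handy; the URL below is a placeholder:

```yaml
# sketch: time a download in isolation; the URL is a placeholder
name: download-timing
on: [push]
jobs:
  download:
    runs-on: macos-latest
    steps:
      - name: time a ~26 MB download
        run: |
          curl -fsSL -o /dev/null \
            -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s avg=%{speed_download}B/s\n' \
            https://example.com/sample-26mb.bin
```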

eleith pushed a commit to eleith/emailjs that referenced this issue May 12, 2023
our macOS tests fail more than half the time while linux and windows
pass. the errors are always due to timeout issues.

while we can improve the performance of our tests (particularly
test/messages.ts), sometimes the timeouts happen when testing against the
local SMTPServer

as of now, we can't get insight into whether our tests are passing or
failing, as the majority of tests fail when macOS is included.

github is aware of the issue: actions/runner-images#3885
@smorimoto (Contributor)

@miketimofeev Any updates on this?

@jozefizso

We are still experiencing performance issues with Xcode builds on GitHub runners.

@christiangal-indi

Any updates on this issue? It is still a problem, and it seems to get worse if you use Xcode 15 for testing and compiling.

We have seen cache download tasks that take more than double the time it would take to compile all the dependencies. We have also seen fluctuations in the overall run time of the agent as high as 2x for the same job.

@mikhailkoliada (Contributor)

Hey all! We have performed a lot of optimisation work on the images and have received many positive reports about image speed. The most significant speed-up can now be seen in the macOS 13 runners (both Intel and M1), and macOS 12 has been sped up as well thanks to some hardware modifications. We hope most customers will find the current situation suitable for their needs. I am going to close this ticket now, but we are always glad to hear feedback.

@erik-bershel (Contributor)

Separately, I should note that the performance of Xcode 15 on macOS 13 is tracked in another issue, #7971, and is not related to this problem or to hardware in general.

@smorimoto (Contributor)

Nothing seems to be fixed. For many workloads that need I/O performance, the macOS runners still have the worst performance.
