Building a NET6/NET7 iOS project on agent M2 ARM 64 hangs/freezes #17825

Closed
vincentcastagna opened this issue Mar 16, 2023 · 42 comments · Fixed by #18793
Labels
bug If an issue is a bug or a pull request a bug fix performance If an issue or pull request is related to performance
Milestone

Comments

@vincentcastagna

vincentcastagna commented Mar 16, 2023

Steps to Reproduce

  1. Create an agent (3.214.0) on an M2 ARM64 machine
  2. Build a .NET6/.NET7 iOS project
  3. Notice that the build sometimes hangs in the Apple Clang process

We don't see the issue on x64 on-prem agents, or on hosted agents.
There is no real consistency as to when the build hangs. It varies from run to run.

We already tried removing the trimmer, which doesn't seem to have any effect. With or without it, the behavior is the same.

Expected Behavior

Build should never hang

Actual Behavior

The build sometimes hangs and never finishes, until it times out

Environment

  • Xcode 14.2
  • Visual Studio for Mac 17.5.1
  • This is the .csproj that we try to build

AGENT CAPABILITIES :

Agent.Name MACOS-2C83F31C-42D1-4BA5-9686-611EB3632BD4    
  Agent.Version 3.214.0  
  _ ./externals/node16/bin/node  
  __CF_USER_TEXT_ENCODING 0x1F5:0x0:0x52  
 
  CP_HOME_DIR /Users/administrator/agent/_work/_temp/.cocoapods  
  curl /usr/bin/curl  
  dotnet /usr/local/share/dotnet/dotnet  
  DOTNET_ROOT /usr/local/share/dotnet  
  git /usr/bin/git  
  HOME /Users/administrator  
  InteractiveSession False  
  java /usr/bin/java  
  JDK /usr/bin/javac  
  LANG en_CA.UTF-8  
  LOGNAME administrator  
  make /usr/bin/make  
  MSBuild /Library/Frameworks/Mono.framework/Versions/Current/Commands/msbuild  
  NUGET_HTTP_CACHE_PATH /Users/administrator/agent/_work/_temp/.nuget-http-cache  
  NUGET_PACKAGES /Users/administrator/agent/_work/_temp/.nuget  
  PATH /Users/administrator/.rbenv/shims:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/share/dotnet:~/.dotnet/tools:/Library/Apple/usr/bin:/Library/Frameworks/Mono.framework/Versions/Current/Commands  
  PWD /Users/administrator/agent  
  python3 /usr/bin/python3  
  rake /Users/administrator/.rbenv/shims/rake  
  ruby /Users/administrator/.rbenv/shims/ruby  
  sh /bin/sh  
  SHELL /bin/zsh  
  SSH_AUTH_SOCK /private/tmp/com.apple.launchd.MgBJHUlv5M/Listeners  
  TMPDIR /var/folders/33/ph0v51hd30n2frx557550mnc0000gn/T/  
  USER administrator  
  VSTS_AGENT_SVC 1  
  Xamarin.iOS /Applications/Visual Studio.app/Contents/MacOS/vstool  
  Xamarin.iOS_Version 16.1.1  
  XamarinBuildDownloadDir /Users/administrator/agent/_work/_temp/.xbcache  
  xcode /Applications/Xcode.app/Contents/Developer  
  Xcode_Version 14.2  
  XPC_FLAGS 0x0  
  XPC_SERVICE_NAME 0

Build Logs

MSBuild binlog (seems to be corrupted ...)

build-net7.0-ios.zip

Example Project (If Possible)

https://github.com/nventive/UnoApplicationTemplate/blob/dev/vica/make-usage-new-agents-net7.0/src/app/ApplicationTemplate.Mobile/ApplicationTemplate.Mobile.csproj

@vincentcastagna vincentcastagna changed the title from "Building a NET6/NET7 iOS project on M2 ARM 64 hangs/freezes" to "Building a NET6/NET7 iOS project on agent M2 ARM 64 hangs/freezes" Mar 16, 2023
@rolfbjarne
Member

@vincentcastagna have you ever seen this on an M1 machine? or have you never tried on M1?

@rolfbjarne rolfbjarne added the need-info Waiting for more information before the bug can be investigated label Mar 16, 2023
@ghost

ghost commented Mar 16, 2023

Hi @vincentcastagna. We have added the "need-info" label to this issue, which indicates that we have an open question for you before we can take further action. This issue will be closed automatically in 7 days if we do not hear back from you by then - please feel free to re-open it if you come back to this issue after that time.

@rolfbjarne rolfbjarne added this to the Future milestone Mar 16, 2023
@vincentcastagna
Author

@vincentcastagna have you ever seen this on an M1 machine? or have you never tried on M1?

I don't believe we have tried an M1 machine with a .NET6/.NET7 iOS project (only Xamarin.iOS). We will try as soon as possible and report back here.

@ghost ghost added need-attention An issue requires our attention/response and removed need-info Waiting for more information before the bug can be investigated labels Mar 16, 2023
@vincentcastagna
Author

vincentcastagna commented Mar 16, 2023

I believe this might also be linked to an issue that I recently opened in the Azure Pipelines agent repo: microsoft/azure-pipelines-agent#4205

@mattjohnsonpint

Not sure if this is the same thing, but this hangs during the AOT compilation every time (running on an M1):

dotnet new ios
dotnet build --sc -r iossimulator-arm64 -c Release

Works fine for iossimulator-x64 or ios-arm64. Also works fine for iossimulator-arm64 in debug builds, just not in release builds. (I believe debug builds aren't AOT compiled, right?)

@mattjohnsonpint

Related: is there a reason that the AOT process on an arm64 machine has to run through x64 emulation? I see it's using Microsoft.NETCore.App.Runtime.AOT.osx-x64.Cross.iossimulator-arm64. I don't see any published packages for ...AOT.osx-arm64...
It seems strange to run under emulation just to cross-compile back to the original architecture. Certainly not the best for perf.

@vincentcastagna
Author

@rolfbjarne We have tested on an M1 machine. The behavior is exactly the same: sometimes it builds successfully, sometimes it just hangs. It seems to happen about half the time, exactly like on the M2.

iOS BUILD M1 - HANGS.txt

@rolfbjarne
Member

Related: is there a reason that the AOT process on an arm64 machine has to run through x64 emulation? I see it's using Microsoft.NETCore.App.Runtime.AOT.osx-x64.Cross.iossimulator-arm64. I don't see any published packages for ...AOT.osx-arm64...
It seems strange to run under emulation just to cross-compile back to the original architecture. Certainly not the best for perf.

Just time constraints. We're fixing it for .NET 8 (dotnet/runtime#74175).

@rolfbjarne
Member

Not sure if this is the same thing, but this hangs during the AOT compilation every time (running on an M1):

dotnet new ios
dotnet build --sc -r iossimulator-arm64 -c Release

Works fine for iossimulator-x64 or ios-arm64. Also works fine for iossimulator-arm64 in debug builds, just not in release builds. (I believe debug builds aren't AOT compiled, right?)

I think this is a different issue, because I believe this is just the build taking very long because a few things add up:

  • When building for ARM64, we use the AOT compiler (which is quite slow).
  • When building for the simulator, we disable the trimmer (so everything in the BCL has to be AOT compiled).
  • When building for Release, we enable LLVM (which is very slow) - this is the most significant change wrt the Debug configuration.

If you add <MtouchUseLlvm>false</MtouchUseLlvm> to the csproj, I believe your Release build will be faster.

This might also work for your device builds (for different reasons - we've seen llvm run into infinite loops in the past) - so could you try and see if you notice any difference?

@rolfbjarne rolfbjarne added need-info Waiting for more information before the bug can be investigated and removed need-attention An issue requires our attention/response labels Mar 21, 2023
@vincentcastagna
Author

@rolfbjarne I'm not sure what extra info you need, but I can provide it.
I believe mattjohnsonpint's comments are unrelated to this issue.

I forgot to mention that these agents work fine with ANY Xamarin project: the build is super fast and never fails. Only our .NET6/.NET7 builds randomly hang.

@ghost ghost added need-attention An issue requires our attention/response and removed need-info Waiting for more information before the bug can be investigated labels Mar 22, 2023
@mattjohnsonpint

@rolfbjarne - you were right on all counts. I just had to wait about 6 minutes instead of the normal 5 to 10 seconds. Setting MtouchUseLlvm=false returned it to normal speed.

@vincentcastagna - Sorry. I didn't mean to hijack this thread. Just thought it could be useful. Not sure if that's what's happening on your build agents or not. Thanks.

@rolfbjarne
Member

@vincentcastagna I'm assuming you only see this when building in Azure DevOps, and never locally?

One theory is that something pops up a permission dialog for some reason, and that blocks the build until it times out. Unfortunately these issues can be hard to track down unless you can access the build bot remotely (and catch it when the build is stuck).

One idea might be to make the build as verbose as possible, that should pinpoint a bit better exactly where it stops, and this is done by passing /v:diagnostic to the dotnet command:

dotnet build myapp.csproj /v:diagnostic

Could you do this and see what it shows?

@rolfbjarne rolfbjarne added need-info Waiting for more information before the bug can be investigated and removed need-attention An issue requires our attention/response labels Mar 23, 2023
@vincentcastagna
Author

vincentcastagna commented Mar 24, 2023

@rolfbjarne here are the logs with /v:diagnostic; you can see the switch at the top of the logs. I don't see a real difference with or without this instruction. I have access to the agent machine, and I have never seen a permission dialog pop up, in the CLI logs or anywhere else.

iOS BUILD diagnostics - HANGS.txt

iOS BUILD diagnostics - OK.txt

@ghost ghost added need-attention An issue requires our attention/response and removed need-info Waiting for more information before the bug can be investigated labels Mar 24, 2023
@rolfbjarne
Member

I don't see a real difference with or without this instruction.

Because right after /v:diagnostic it's changed again to -verbosity:n:

/v:diagnostic -verbosity:n
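
To illustrate the point: when several verbosity switches are passed, the later one takes effect, so /v:diagnostic followed by -verbosity:n ends up at normal verbosity. The sketch below is a hypothetical shell helper (effective_verbosity is not part of MSBuild or the dotnet CLI, and it only recognizes the -v:, /v:, -verbosity: and /verbosity: spellings). It reports which value would win for a given argument list, assuming that last-one-wins behavior.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: prints the verbosity value that
# would take effect, assuming a later verbosity switch overrides an earlier one.
effective_verbosity() {
    result="(default)"
    for arg in "$@"; do
        case "$arg" in
            -v:*|/v:*|-verbosity:*|/verbosity:*)
                # Keep everything after the first colon.
                result="${arg#*:}"
                ;;
        esac
    done
    printf '%s\n' "$result"
}

effective_verbosity build myapp.csproj /v:diagnostic -verbosity:n   # prints "n"
effective_verbosity build myapp.csproj /v:diagnostic                # prints "diagnostic"
```

Running a check like this against the full command line shown at the top of a pipeline log is a quick way to spot a task that appends its own verbosity switch after yours.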

@vincentcastagna
Author

vincentcastagna commented Mar 27, 2023

I don't see a real difference with or without this instruction.

Because right after /v:diagnostic it's changed again to -verbosity:n:

/v:diagnostic -verbosity:n

Oh, my bad, I misunderstood your previous comment. I will provide logs with the verbosity level set to diagnostic as soon as possible.

@rolfbjarne
Member

rolfbjarne commented Apr 20, 2023

@vincentcastagna I'm sorry I didn't answer earlier, but unfortunately I don't have any good ideas.

I see you're building the 'Release' configuration, does the same thing happen if you build 'Debug'? If so, one idea might be to turn off LLVM (by setting <MtouchUseLlvm>false</MtouchUseLlvm> in the project file or on the command line as /p:MtouchUseLlvm=false and see if that makes a difference).

@vincentcastagna
Author

@vincentcastagna I'm sorry I didn't answer earlier, but unfortunately I don't have any good ideas.

I see you're building the 'Release' configuration, does the same thing happen if you build 'Debug'? If so, one idea might be to turn off LLVM (by setting <MtouchUseLlvm>false</MtouchUseLlvm> in the project file or on the command line as /p:MtouchUseLlvm=false and see if that makes a difference).

We already tried deactivating LLVM when I created the issue, but I retried just in case. The behavior remains the same: sometimes it goes through, sometimes it just hangs.

@rolfbjarne
Member

@vincentcastagna I'm sorry I didn't answer earlier, but unfortunately I don't have any good ideas.
I see you're building the 'Release' configuration, does the same thing happen if you build 'Debug'? If so, one idea might be to turn off LLVM (by setting <MtouchUseLlvm>false</MtouchUseLlvm> in the project file or on the command line as /p:MtouchUseLlvm=false and see if that makes a difference).

We already tried deactivating LLVM when I created the issue, but I retried just in case. The behavior remains the same: sometimes it goes through, sometimes it just hangs.

What about a debug build that's not signed, so something like this (i.e. dotnet build instead of dotnet publish, and not passing /p:CodesignProvision=...):

dotnet build -f:net7.0-ios ...

@filipnavara
Contributor

If you happen to have a way to run something on the machine with the stuck process then dotnet-stack would be useful (more info here). You install the tool with dotnet tool install --global dotnet-stack and then run it with dotnet stack report -p <id of the stuck process>. Something like pgrep dotnet | xargs -L1 dotnet stack report -p would dump stacks of all the dotnet processes on the machine.
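
As a rough sketch of the same idea with a guard added: if one of the target processes never answers, a plain pipeline blocks on it, so the version below gives each dump a time limit. run_with_timeout and dump_all_dotnet_stacks are hypothetical helper names, and the 30-second limit is an arbitrary assumption; only pgrep and dotnet stack report -p come from the commands above.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in the background and sends it SIGTERM
# if it has not finished after SECONDS. Returns COMMAND's exit status
# (non-zero if it was terminated).
run_with_timeout() {
    rwt_limit=$1
    shift
    "$@" &
    rwt_cmd=$!
    # Watchdog: its output is discarded so a leftover sleep never holds a pipe open.
    ( sleep "$rwt_limit"; kill "$rwt_cmd" ) >/dev/null 2>&1 &
    rwt_watchdog=$!
    rwt_status=0
    wait "$rwt_cmd" || rwt_status=$?
    kill "$rwt_watchdog" >/dev/null 2>&1 || true
    return "$rwt_status"
}

# Dump the stacks of every process matched by `pgrep dotnet`, one at a time,
# giving each dump at most 30 seconds (an arbitrary limit).
dump_all_dotnet_stacks() {
    for pid in $(pgrep dotnet); do
        echo "=== dotnet process $pid ==="
        run_with_timeout 30 dotnet stack report -p "$pid" ||
            echo "no stack report for $pid (failed or timed out)"
    done
}

# Usage (requires the dotnet-stack tool to be installed as described above):
#   dump_all_dotnet_stacks > stacks.txt 2>&1
```

A process that has to be cut off by the timeout is itself a useful data point, since it suggests that process is not even servicing its diagnostics channel.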

@vincentcastagna
Author

vincentcastagna commented May 23, 2023

I ran dotnet stack report -p for each msbuild process I found running (using pstree) once a build hung. I don't see much information here, but hopefully it will be useful to you:

msbuildstack.zip

@filipnavara I tried to run pgrep dotnet | xargs -L1 dotnet stack report -p, but the command line freezes and nothing happens. I also tried writing the output to a file, but it just hangs.

@filipnavara
Contributor

Both of the stack traces contain OutOfProcNode.Run so they seem to be waiting for some other MSBuild (?) process.

I tried to run pgrep dotnet | xargs -L1 dotnet stack report -p, but the command line freezes and nothing happens.

There are two possible explanations for this. Either I messed up and it's trying to dump itself in a loop, or some process is stuck so badly that not even the diagnostic pipes work. The former is not very likely since I tested that very same command locally. The latter would likely imply hitting some .NET runtime bug (and there's only one thread-suspension bug that comes to mind which was fixed in .NET 7 iirc)...

@vincentcastagna
Author

vincentcastagna commented May 23, 2023

Thank you for your quick answer.

Both of the stack traces contain OutOfProcNode.Run so they seem to be waiting for some other MSBuild (?) process.

As you saw, I found two msbuild processes ... could it be that they wait on each other, resulting in an endless wait? Any advice on how to confirm that, or on how to look for other processes that msbuild might be waiting on?

I decided to let pgrep dotnet | xargs -L1 dotnet stack report -p run. It finally ended ...

[ERROR] System.IO.EndOfStreamException: Unable to read beyond the end of the stream.
   at System.IO.BinaryReader.InternalRead(Int32 numBytes)
   at System.IO.BinaryReader.ReadUInt16()
   at Microsoft.Diagnostics.NETCore.Client.IpcHeader.Parse(BinaryReader reader) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcHeader.cs:line 55
   at Microsoft.Diagnostics.NETCore.Client.IpcMessage.Parse(Stream stream) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcMessage.cs:line 117
   at Microsoft.Diagnostics.NETCore.Client.IpcClient.Read(Stream stream) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcClient.cs:line 107
   at Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessageGetContinuation(IpcEndpoint endpoint, IpcMessage message) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsIpc/IpcClient.cs:line 44
   at Microsoft.Diagnostics.NETCore.Client.EventPipeSession.Start(IpcEndpoint endpoint, IEnumerable`1 providers, Boolean requestRundown, Int32 circularBufferMB) in /_/src/Microsoft.Diagnostics.NETCore.Client/DiagnosticsClient/EventPipeSession.cs:line 34
   at Microsoft.Diagnostics.Tools.Stack.ReportCommandHandler.Report(CancellationToken ct, IConsole console, Int32 processId, String name, TimeSpan duration)
xargs: dotnet: exited with status 255; aborting

I'll also try to target latest .NET 7

@vincentcastagna vincentcastagna closed this as not planned Jun 5, 2023
@svaldetero

I think I'm running into this issue also. I recently moved from Microsoft-hosted agents to a self-hosted M2 Max Mac Studio. With nothing changed in the pipeline definition, the command line dotnet publish 'ProjectName.csproj' -f net7.0-ios --self-contained -r ios-arm64 -c Release -p:BuildIpa=True always freezes and eventually times out at 60 minutes, or I have to cancel it. I tried switching it to dotnet build 'ProjectName.csproj' -f net7.0-ios -c Release and it has the same result. What's frustrating is that I can copy the exact command to a terminal, run it in the same directory, and it works just fine.

I tried running dotnet stack but it just hung and never finished. I got the same EndOfStreamException when I finally cancelled the pipeline.

@rolfbjarne
Member

At this point I believe this is either a bug in msbuild or in the runtime, not in any of our MSBuild logic, so I'm moving to dotnet/msbuild.

@rolfbjarne
Member

This issue was moved to dotnet/msbuild#8970

@ghost ghost locked as resolved and limited conversation to collaborators Jul 28, 2023
@rolfbjarne rolfbjarne reopened this Aug 23, 2023
@rolfbjarne
Member

The MSBuild team analyzed this, and found that a potential culprit is that we're not limiting parallelization of AOT processes to the number of CPUs, so we can end up with hundreds of concurrent processes competing for resources.

Ref: dotnet/msbuild#8970 (comment)

So I'm reopening this issue to fix the parallelization problem. Note: this may not turn out to be the actual culprit, but it's a good thing to fix anyways.
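
To make the idea concrete, here is a minimal shell sketch of bounding concurrency to the CPU count. It is not the change that was made in xamarin-macios; cpu_count and run_bounded are hypothetical helpers, the .dll names are made up, and echo stands in for a real per-assembly compiler invocation.

```shell
#!/bin/sh
# Print the number of online CPUs (falls back to sysctl, then to 1).
cpu_count() {
    getconf _NPROCESSORS_ONLN 2>/dev/null ||
        sysctl -n hw.ncpu 2>/dev/null ||
        echo 1
}

# run_bounded COMMAND [ARGS...]
# Reads whitespace-separated items from stdin and runs COMMAND once per item,
# keeping at most cpu_count processes running at the same time.
run_bounded() {
    xargs -n 1 -P "$(cpu_count)" "$@"
}

# Stand-in workload: three fake "compile" jobs; completion order may vary.
printf '%s\n' A.dll B.dll C.dll | run_bounded echo compiled
```

Without the -P bound, launching one process per input at once is what lets the process count grow to the hundreds on a large app.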

@rolfbjarne rolfbjarne added performance If an issue or pull request is related to performance bug If an issue is a bug or a pull request a bug fix labels Aug 23, 2023
@rolfbjarne rolfbjarne modified the milestones: Future, .NET 8 Aug 23, 2023
rolfbjarne added a commit to rolfbjarne/xamarin-macios that referenced this issue Aug 23, 2023
…of processors.

This might fix xamarin#17825, but even if it doesn't, it's a good thing to do to not
overload machines.

Ref: xamarin#17825
rolfbjarne added a commit that referenced this issue Aug 25, 2023
…of processors. (#18793)

This might fix #17825, but even if it doesn't, it's a good thing to do
to not overload machines.

Ref: #17825
@rolfbjarne
Member

The fix to limit parallelization has been merged, and I'm closing this tentatively.

I'll try to get the fix in a service release for .NET 7 (it's too late for the next one, but it'll likely be in the one after that).

Otherwise it'll also be in .NET 8 RC 2 (not RC 1, too late for that too).

Feel free to reopen this issue if the hangs/freezes persist even with the fix.

rolfbjarne added a commit to rolfbjarne/xamarin-macios that referenced this issue Aug 25, 2023
…ers to the number of processors.

This might fix xamarin#17825, but even if it doesn't, it's a good thing to do to not
overload machines.

Ref: xamarin#17825

Backport of xamarin#18793.
rolfbjarne added a commit to rolfbjarne/xamarin-macios that referenced this issue Aug 25, 2023
…mpilers to the number of processors. (xamarin#18793)

This might fix xamarin#17825, but even if it doesn't, it's a good thing to do
to not overload machines.

Ref: xamarin#17825
rolfbjarne added a commit that referenced this issue Sep 13, 2023
…ers to the number of processors. (#18817)

This might fix #17825, but even if it doesn't, it's a good thing to do to not
overload machines.

Ref: #17825

Backport of #18793.
5 participants