
runtime-dev-innerloop leg is frequently timing out on PRs #49309

Closed
stephentoub opened this issue Mar 8, 2021 · 17 comments
Labels: area-Infrastructure, untriaged (New issue has not been triaged by the area owner)
Milestone: 6.0.0

Comments

@stephentoub
Member

e.g.
https://dev.azure.com/dnceng/public/_build/results?buildId=1027241&view=logs&jobId=e93c566c-b6be-5acb-6fc2-de2972d01fec
#49270 (comment)

@stephentoub stephentoub added this to the 6.0.0 milestone Mar 8, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 8, 2021
@ghost

ghost commented Mar 8, 2021

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.


@ViktorHofer
Member

ViktorHofer commented Mar 8, 2021

Sad to see that some of the devices on CI can't build dotnet/runtime in under an hour. As a short-term mitigation we probably want to bump the timeout to 90 minutes.

@safern
Member

safern commented Mar 8, 2021

Actually, looking at the timestamps, it seems like the build hung after creating the Microsoft.NETCore.App.Runtime package.

2021-03-08T17:19:28.5205084Z   Microsoft.NETCore.App.Ref -> 
2021-03-08T17:20:02.3356021Z   Microsoft.NETCore.App.Ref -> D:\workspace\_work\1\s\artifacts\packages\Release\Shipping\dotnet-targeting-pack-6.0.0-ci-win-x86.msi
2021-03-08T17:20:05.0806213Z   Microsoft.NETCore.App.Ref -> D:\workspace\_work\1\s\artifacts\packages\Release\NonShipping\VS.Redist.Common.NetCore.TargetingPack.x86.6.0.6.0.0-ci.nupkg
2021-03-08T17:20:05.0981076Z   Zipping directory "D:\workspace\_work\1\s\artifacts\obj\Microsoft.NETCore.App.Ref\Release\net6.0\win-x86\output\" to "D:\workspace\_work\1\s\artifacts\packages\Release\Shipping\dotnet-targeting-pack-6.0.0-ci-win-x86.zip".
2021-03-08T17:20:05.6861052Z   D:\workspace\_work\1\s\artifacts\obj\Microsoft.NETCore.App.Ref\Release\net6.0\win-x86\output/ -> D:\workspace\_work\1\s\artifacts\packages\Release\Shipping\/dotnet-targeting-pack-6.0.0-ci-win-x86.zip
2021-03-08T17:20:06.8078995Z   Successfully created package 'D:\workspace\_work\1\s\artifacts\packages\Release\Shipping\Microsoft.NETCore.App.Ref.6.0.0-ci.nupkg'.
2021-03-08T17:20:07.5170054Z   Successfully created package 'D:\workspace\_work\1\s\artifacts\packages\Release\Shipping\Microsoft.NETCore.App.Ref.6.0.0-ci.symbols.nupkg'.
2021-03-08T17:20:07.7619271Z   Microsoft.NETCore.App.Runtime -> 
2021-03-08T17:26:50.0751189Z Attempting to cancel the build...
2021-03-08T17:26:50.4984776Z Terminate batch job (Y/N)? 
2021-03-08T17:26:50.6479745Z ##[error]The operation was canceled.

It spent 6 mins doing nothing there.

@ViktorHofer
Member

ViktorHofer commented Mar 8, 2021

Why do you think it's doing nothing? That could be crossgen as part of the runtime pack, or any other component sequenced into the pack or after it that doesn't log anything. The binlog might tell.

@safern
Member

safern commented Mar 8, 2021

Why do you think it's doing nothing?

Hmm, I didn't expect crossgen to take 6 minutes, but based on the binlog it seems like that is what was running before the build was killed.

Is it intentional that the ReadyToRunCompiler task is invoked per assembly we want to compile (https://github.com/dotnet/sdk/blob/94d0b50de0b3e58242c7d1e2ec606e0f408a37ee/src/Tasks/Microsoft.NET.Build.Tasks/targets/Microsoft.NET.Publish.targets#L404), rather than invoking it once with the whole list?

cc: @trylek

@trylek
Member

trylek commented Mar 8, 2021

Yes, I believe this pause corresponds to Crossgen2. Right now RunReadyToRunCompiler is designed to process just a single file - and even if it did support multiple files, it would still need to spin a loop over the assemblies and call the compiler individually, as the compiler itself doesn't support multiple output files (I've always been told that doesn't belong in the compiler, as it duplicates the build orchestration msbuild should be used for). While Crossgen2 does support internal parallelism, that is much less efficient than running multiple instances of the compiler process, so I'd ideally expect RunReadyToRunCompiler to be run using msbuild's parallel process build; I'm not sure whether that actually happens.

@ViktorHofer
Member

The shared framework invokes the ReadyToRunCompiler once per runtime package, for all assemblies in it: https://github.com/dotnet/arcade/blob/6cc4c1e9e23d5e65e88a8a57216b3d91e9b3d8db/src/Microsoft.DotNet.SharedFramework.Sdk/targets/sharedfx.targets#L70. That means the ReadyToRunCompiler task in the SDK is only invoked once and gets all assemblies as input.

I believe we have two options:

  1. Invoke the ReadyToRunCompiler task per runtime assembly
  2. Invoke the ReadyToRunCompiler task once, but let the task handle the parallelization and invoke the compiler underneath multiple times.

@trylek thoughts?

@trylek
Member

trylek commented Mar 8, 2021

Hmm, I must admit I don't yet understand clearly where the loop over the input assemblies is getting expanded. The RunReadyToRunCompiler task apparently expects just a single CompilationEntry as its input:

https://github.com/dotnet/sdk/blob/94d0b50de0b3e58242c7d1e2ec606e0f408a37ee/src/Tasks/Microsoft.NET.Build.Tasks/RunReadyToRunCompiler.cs#L20

I guess it may be the case that this is somehow automatically expanded by msbuild when running the task from this place,

https://github.com/dotnet/sdk/blob/94d0b50de0b3e58242c7d1e2ec606e0f408a37ee/src/Tasks/Microsoft.NET.Build.Tasks/targets/Microsoft.NET.Publish.targets#L404

when it recognizes that it passes multiple items in CompilationEntry while the task parameter is just a single ITaskItem; adding @AntonLapounov who worked on this code before me in case I'm missing something. One way or another, I suspect that this automated expansion in msbuild probably doesn't support parallelization.

If we want to achieve parallelization, I guess there are two ways to go about it: the first is rewriting the bit in Microsoft.NET.Publish.targets that runs the RunReadyToRunCompiler task to use parallel execution of subordinate msbuild projects; the second is adjusting the RunReadyToRunCompiler task itself to implement the parallelization.

At first glance I think it would be much easier to modify the targets logic to use a parallel msbuild subproject build, as the entire implementation of ToolTask / RunReadyToRunCompiler is centered around running a single compilation process, and changing that concept would require basically throwing away all the underpinnings provided by ToolTask and manually open-coding the parallel process execution loop.
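
To make that first option concrete, here is a rough sketch (the project, target, and item names Crossgen2One.proj, CompileOne, and R2RInput are hypothetical, not a proposed patch to Microsoft.NET.Publish.targets) of fanning the compilations out to per-assembly inner builds that msbuild can run in parallel:

```xml
<!-- Rough sketch of the "parallel msbuild subproject" approach; only the
     mechanism matters, the names are made up. -->
<Target Name="_ExampleParallelR2R">
  <ItemGroup>
    <!-- One inner invocation per assembly; the %() references batch this
         ItemGroup so each invocation carries its own properties. -->
    <_R2RInvocation Include="Crossgen2One.proj">
      <AdditionalProperties>InputAssembly=%(R2RInput.Identity);OutputImage=%(R2RInput.OutputPath)</AdditionalProperties>
    </_R2RInvocation>
  </ItemGroup>

  <!-- BuildInParallel lets the inner builds run concurrently across the
       available msbuild nodes (/m) instead of one after another. -->
  <MSBuild Projects="@(_R2RInvocation)"
           Targets="CompileOne"
           BuildInParallel="true" />
</Target>
```

Something along those lines would keep RunReadyToRunCompiler itself single-file, as it is today, and push the parallelism up into msbuild.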

@safern
Member

safern commented Mar 8, 2021

That means the ReadyToRunCompiler task in the SDK is only invoked once and gets all assemblies as input.

Looking at the binlog, it suggests it runs per assembly; I believe it is because of the batching at the Inputs of the _CreateR2RImages target.
(binlog screenshot)
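
For anyone following along, this is standard MSBuild target batching: referencing item metadata with %() in a target's Outputs makes MSBuild split the items into per-item buckets and run the target once per bucket, sequentially. A minimal sketch of the mechanism, with a hypothetical R2RInput item standing in for the real compile list:

```xml
<!-- Minimal sketch of MSBuild target batching (hypothetical item/metadata
     names, not the actual SDK target). Because Outputs references
     %(R2RInput.OutputPath), MSBuild executes the target body once per item,
     one bucket after another. -->
<Target Name="_ExamplePerAssemblyR2R"
        Inputs="@(R2RInput)"
        Outputs="%(R2RInput.OutputPath)">
  <!-- Within each bucket, @(R2RInput) contains just the current assembly,
       so any task invoked here runs once per assembly, sequentially. -->
  <Message Importance="High"
           Text="Compiling %(R2RInput.Identity) -> %(R2RInput.OutputPath)" />
</Target>
```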

@AntonLapounov
Member

FYI, we plan to improve logging when running crossgen2; see the second item in dotnet/sdk#15466.

@ViktorHofer
Member

This is still happening very frequently, so we should apply the short-term fix (bumping the timeout).

@ViktorHofer
Member

Looking at the binlog, it suggests it runs per assembly; I believe it is because of the batching at the Inputs of the _CreateR2RImages target.

@safern you are right, I should have double-checked. The _CreateR2RImages target batching happens sequentially, which results in the non-parallel behavior. I wonder if that behavior is by design or just something that hasn't been optimized yet.

Aside from the crossgen discussion, we should diff builds that time out and ones from a few weeks ago that don't and see if perf somehow unintentionally regressed.

@trylek
Member

trylek commented Mar 10, 2021

In the absence of parallelization it is easily possible that Crossgen2 compilation of the framework assemblies is slower - according to perf repo measurements, for larger assemblies like Corelib we can actually be faster than Crossgen1 due to the internal parallelism, but for small builds the managed startup will likely always be outperformed by the native Crossgen1 app, at least until we start self-hosting (compiling Crossgen2 with itself) and optimizing the startup path. For now I'm afraid a certain degree of compilation throughput regression is known and expected; of course, if it's 3 times slower, that wouldn't be expected and likely signifies a more substantial problem. I switched the framework over in this PR:

#47349

@ViktorHofer
Member

The timeout was bumped to 90 minutes and a root cause analysis showed that machines got slower (probably related to core count reduction). There was no regression in the actual build of dotnet/runtime apart from the usual churn. Let's track potential optimizations to crossgen2 invocation in a separate issue. Closing.

@trylek
Member

trylek commented Mar 23, 2021

Thanks Viktor, that sounds good to me, especially in light of the fact that last Monday we found out that in the offending time period Crossgen2 wasn't being used at all, due to a bug I had made in the original check-in of the publishing switchover; it was only last Wednesday that I merged in the fix.

@ViktorHofer
Member

Do you know which version of the SDK the fix will be in? Preview 3?

@trylek
Member

trylek commented Mar 23, 2021

The fix is actually in the runtime repo - the installer script wasn't passing the ReadyToRunPublishUseCrossgen2 property to the SDK logic. I originally messed that up while resolving merge conflicts when I was addressing your PR feedback regarding the DOTNET_ROOT manipulation. And yes, it should ship as part of Preview 3.
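
For reference, the property in question is just an msbuild switch that has to flow from the runtime/installer build into the SDK's publish logic; a sketch of what was effectively missing (property name taken from the comment above, surrounding project context assumed):

```xml
<!-- Sketch only: if this property doesn't reach the SDK's ReadyToRun publish
     logic, the build quietly keeps using the old crossgen instead of Crossgen2. -->
<PropertyGroup>
  <ReadyToRunPublishUseCrossgen2>true</ReadyToRunPublishUseCrossgen2>
</PropertyGroup>
```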

@ghost ghost locked as resolved and limited conversation to collaborators Apr 22, 2021