
Opt Into Perf Logging #6274

Merged: 35 commits merged into dotnet:main on Apr 15, 2021

Conversation

@benvillalobos (Member) commented Mar 17, 2021:

Fixes #5900

Context

It's time to opt into performance logging: #5861

Changes Made

  • PerformanceLogEventListener creates the DOTNET_PERFLOG_DIR directory if it doesn't exist.
  • Opt into perf logging in eng\cibuild_bootstrapped_msbuild.cmd and the .sh equivalent.
  • Added a PowerShell step that runs after the windows/core and windows/full bootstrapped_msbuild scripts. It runs taskkill /f /im msbuild.exe and taskkill /f /im VBCSCompiler.exe, because those processes were observed holding locks on some of the generated perf files. If they aren't killed before the perf logs are copied, the entire perf-log upload fails because the out-of-proc node (node mode 1) still holds a lock on a file. (A sketch of the combined opt-in and cleanup follows this list.)
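
Put together, a minimal PowerShell sketch of the opt-in plus cleanup. The log path and the idea of a single standalone script are assumptions for illustration only; in the PR itself the pieces live in the cibuild scripts and the CI YAML.

    # Sketch only: opt into perf logging, run the bootstrapped build, then kill the
    # processes that were observed holding locks on the generated perf log files.
    $env:DOTNET_PERFLOG_DIR = Join-Path $PWD 'artifacts\log\Debug\PerformanceLogs'

    .\eng\cibuild_bootstrapped_msbuild.cmd

    # Force-kill lingering worker processes so the artifact upload can read the logs.
    taskkill /f /im msbuild.exe
    taskkill /f /im VBCSCompiler.exe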

Testing

We should see perf logs under artifacts/log/<configuration>/PerformanceLogs

Notes

Don't review commit by commit 😬. Needless to say, this PR should be squashed.

@brianrob (Member) left a comment:

LGTM. Thanks!

@benvillalobos (Member, Author) commented:

Note to self: contact dnceng and ask about the failures in copying perf-logging artifacts.

https://devdiv.visualstudio.com/DevDiv/_build/results?buildId=4565213&view=results shows a failure where it couldn't find the files to copy. Is BuildConfiguration not set at this point in vsts-dotnet.yml?
env: { DOTNET_PERFLOG_DIR: '$(Build.SourcesDirectory)\artifacts\log\$(BuildConfiguration)\perf' }

@benvillalobos (Member, Author) commented:

Perf logs can fail to publish if msbuild nodes hang onto them.

Fail to upload 'D:\a\1\s\artifacts\log\Debug\PerformanceLogs\perf-7724-f1fd5808709141d399664b2c8ca3a4f7.log' due to 'The process cannot access the file 'D:\a\1\s\artifacts\log\Debug\PerformanceLogs\perf-7724-f1fd5808709141d399664b2c8ca3a4f7.log' because it is being used by another process.'.
System.IO.IOException: The process cannot access the file 'D:\a\1\s\artifacts\log\Debug\PerformanceLogs\perf-7724-f1fd5808709141d399664b2c8ca3a4f7.log' because it is being used by another process.
   at System.IO.FileStream.ValidateFileHandle(SafeFileHandle fileHandle)
   at System.IO.FileStream.CreateFileOpenHandle(FileMode mode, FileShare share, FileOptions options)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
   at Microsoft.VisualStudio.Services.Agent.Worker.Build.FileContainerServer.UploadAsync(IAsyncCommandContext context, Int32 uploaderId, Boolean uploadToBlob, CancellationToken token)
Async Command End: Upload Artifact
##[error]File upload failed even after retry.

Not the end of the world. Looks like CoreOnWindows published its perf logs. CoreOnMac and CoreOnLinux don't seem to have captured any perf logs. @brianrob, is there any reason they wouldn't be?

@brianrob (Member) commented:

@benvillalobos, I would expect that if the environment variable is set to a writeable location, then it should just work. Are you able to reproduce this locally? If not, it might be worth adding some printf-style debugging and then looking at the log to see what happened.

@benvillalobos (Member, Author) commented:

The environment variable is being set properly, logs are being created and populated. The publish step is running into issues because of a lock on a log file.

Uploading perf logs status

Windows full: consistently fails
Windows core: sometimes fails
macOS core: works
Linux core: works

There's typically a lock on a log file, which is strange because msbuild runs with /nr:false, so the processes should be terminated when the build completes.

Fail to upload 'D:\a\1\s\artifacts\log\Debug\PerformanceLogs\perf-6700-556fe571db2441b38b38519d6a040b36.log' due to 'The process cannot access the file 'D:\a\1\s\artifacts\log\Debug\PerformanceLogs\perf-6700-556fe571db2441b38b38519d6a040b36.log' because it is being used by another process.'.
System.IO.IOException: The process cannot access the file 'D:\a\1\s\artifacts\log\Debug\PerformanceLogs\perf-6700-556fe571db2441b38b38519d6a040b36.log' because it is being used by another process.
   at System.IO.FileStream.ValidateFileHandle(SafeFileHandle fileHandle)
   at System.IO.FileStream.CreateFileOpenHandle(FileMode mode, FileShare share, FileOptions options)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
   at Microsoft.VisualStudio.Services.Agent.Worker.Build.FileContainerServer.UploadAsync(IAsyncCommandContext context, Int32 uploaderId, Boolean uploadToBlob, CancellationToken token)

I tried to get the publish task to allow multiple retries, but it looks like that parameter just isn't being used: microsoft/azure-pipelines-tasks#11451

Some ideas:
A. Set up the script so that a failed file upload doesn't prevent the rest from uploading (a rough sketch of this follows below).
B. Ensure PerformanceLogEventListener is disposed of earlier and therefore drops the lock on the file? /cc: @brianrob
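
For option A, a sketch of what a lock-tolerant copy step could look like. This is not what the PR ends up doing, and the staging path here is invented for illustration; it just shows the "keep going when one file is locked" idea.

    # Sketch of option A only: copy each perf log individually and continue when one
    # file is locked, so a single locked file doesn't sink the whole upload.
    $source = 'artifacts\log\Debug\PerformanceLogs'
    $dest   = 'artifacts\staging\PerformanceLogs'
    New-Item -ItemType Directory -Force -Path $dest | Out-Null
    Get-ChildItem -Path $source -Filter '*.log' | ForEach-Object {
        $file = $_
        try {
            Copy-Item $file.FullName -Destination $dest -ErrorAction Stop
        }
        catch {
            Write-Host "Skipping locked file $($file.Name): $($_.Exception.Message)"
        }
    }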

Total side note: I noticed our cibuilds still try to publish netcoreapp2.1 logs, when our logs are net5.0 now. I don't think those publish steps are necessary.

@brianrob (Member) commented:

Interesting. Ideally we don't make upload of the log conditional, as this is a nice functional test of the whole system.

I think it could be worth scripting a check to see what process is locking the file. It's going to be an MSBuild process, but perhaps the process is still alive even though we think it shouldn't be. The file name includes the PID for the process that owns the log. Could you fetch the command line for the PID and print it to the console?

Since this is limited to Windows, you could do something like:

Get-WmiObject -Query "SELECT CommandLine FROM Win32_Process WHERE ProcessID = <PID>"
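
(An editorial aside, not from the original thread: on PowerShell versions where Get-WmiObject isn't available, Get-CimInstance accepts the same WQL query.)

    Get-CimInstance -Query "SELECT CommandLine FROM Win32_Process WHERE ProcessID = <PID>"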

Resolved review threads on .vsts-dotnet-ci.yml (outdated) and src/MSBuild.UnitTests/PerfLog_Tests.cs.
{
Console.WriteLine("!!! We threw!\nInner Exception: {0}\nMessage: {1}", e.InnerException, e.Message);
Review comment (Member):

Are Console.WriteLines the right way to log? If so, are they meant to be user-facing? If so, shouldn't we localize them? Either way, why so many exclamation points? Also, this should be much simpler if it's actually needed:

Suggested change
Console.WriteLine("!!! We threw!\nInner Exception: {0}\nMessage: {1}", e.InnerException, e.Message);
Console.WriteLine(e.Message + "Inner Exception: " + e.InnerException);

Reply from @benvillalobos (Member, Author):

This was part of a debug commit that will be reverted when the PR is ready to go in.

@benvillalobos (Member, Author) commented:

Unfortunately it looks like passing parameters to this task (i.e., retry 5 times) isn't supported, as it's deprecated. And moving to a newer task would require updating the pool of machines we're running on. I'm not sure how involved that process would be, but I assume it's a yak to be shaved.

@benvillalobos (Member, Author) commented:

@brianrob
Here's the script:

      Get-ChildItem -Path "artifacts\log\Debug\PerformanceLogs\*" | Where-Object {$_.Extension -eq '.log'} | ForEach-Object {
        $s = $_.BaseName.Split('-')[1];
        Write-Host "Checking Process ID: $s"
        Get-WmiObject -Query "SELECT CommandLine FROM Win32_Process WHERE ProcessID = $s"
      }

Details from the processes that are locking the files

Windows Full

Looks legit.

Checking Process ID: 1096


__GENUS          : 2
__CLASS          : Win32_Process
__SUPERCLASS     : 
__DYNASTY        : 
__RELPATH        : 
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         : 
__NAMESPACE      : 
__PATH           : 
CommandLine      : D:\a\1\s\artifacts\bin\Microsoft.Build.Engine.UnitTests\Debug\net472\MSBuild.exe /nologo 
                   /nodemode:1 /nodeReuse:true /low:false
PSComputerName   : 

And

This one may be a newer process that spawned and happened to take the same PID.

Checking Process ID: 3232
__GENUS          : 2
__CLASS          : Win32_Process
__SUPERCLASS     : 
__DYNASTY        : 
__RELPATH        : 
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         : 
__NAMESPACE      : 
__PATH           : 
CommandLine      : \??\C:\Windows\system32\conhost.exe 0x4
PSComputerName   : 

Looks legit.

Checking Process ID: 5300
__GENUS          : 2
__CLASS          : Win32_Process
__SUPERCLASS     : 
__DYNASTY        : 
__RELPATH        : 
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         : 
__NAMESPACE      : 
__PATH           : 
CommandLine      : "D:\a\1\s\artifacts\bin\Microsoft.Build.Tasks.UnitTests\Debug\net472\Roslyn\VBCSCompiler.exe" 
                   "-pipename:NuYm3XeVYobbHIPpGrn+XHTco0Zkr6qZXvQrQA55Cqw"
PSComputerName   : 

Windows Core

Looks like it caught the previous PowerShell script running? Or it's catching itself? Likely catching itself.

Checking Process ID: 2668


__GENUS          : 2
__CLASS          : Win32_Process
__SUPERCLASS     : 
__DYNASTY        : 
__RELPATH        : 
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         : 
__NAMESPACE      : 
__PATH           : 
CommandLine      : "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe"  -NoLogo -NoProfile -NonInteractive 
                   -ExecutionPolicy Unrestricted -Command ". 'D:\a\_temp\ce3ad429-b778-4ca5-a071-347039ee03c9.ps1'"
PSComputerName   : 

Checking Process ID: 2668
__GENUS          : 2
__CLASS          : Win32_Process
__SUPERCLASS     : 
__DYNASTY        : 
__RELPATH        : 
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         : 
__NAMESPACE      : 
__PATH           : 
CommandLine      : "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe"  -NoLogo -NoProfile -NonInteractive 
                   -ExecutionPolicy Unrestricted -Command ". 'D:\a\_temp\ce3ad429-b778-4ca5-a071-347039ee03c9.ps1'"
PSComputerName   : 

This one looks legit.

Checking Process ID: 3252
__GENUS          : 2
__CLASS          : Win32_Process
__SUPERCLASS     : 
__DYNASTY        : 
__RELPATH        : 
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         : 
__NAMESPACE      : 
__PATH           : 
CommandLine      : "D:\a\1\s\artifacts\bin\Microsoft.Build.Engine.UnitTests\Debug\net472\Roslyn\VBCSCompiler.exe" 
                   "-pipename:8Zt8SJ9zCljCMApPK4hTcAue0XXCpOUXsyHtUwkrKSQ"
PSComputerName   : 

Checking Process ID: 5184
__GENUS          : 2
__CLASS          : Win32_Process
__SUPERCLASS     : 
__DYNASTY        : 
__RELPATH        : 
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         : 
__NAMESPACE      : 
__PATH           : 
CommandLine      : D:\a\1\s\artifacts\bin\Microsoft.Build.Engine.UnitTests\Debug\net472\MSBuild.exe /nologo 
                   /nodemode:1 /nodeReuse:true /low:false
PSComputerName   : 

@benvillalobos (Member, Author) commented Mar 31, 2021:

I have a killnodereuse script I often run that does what you'd expect whenever processes are hanging or colliding with each other.

It's quick and dirty, but we could insert a script that does:

taskkill /f /im msbuild.exe
taskkill /f /im dotnet.exe
taskkill /f /im vbcscompiler.exe
taskkill /f /im omnisharp.exe

After the build script runs. Alternatively, we could change how long a /nr:true MSBuild node waits before it dies off? I'm not sure exactly where that happens.

@benvillalobos (Member, Author) commented:

Ready for review. I recommend reviewing the diff rather than going commit by commit; lots of test commits were used, and the final diff looks a lot cleaner. I think it's good to go unless we have other ideas for getting around the failed perf-log upload.

@Forgind (Member) left a comment:

I don't love the taskkill, but this LGTM other than that!

@@ -34,6 +34,10 @@ jobs:
mergeTestResults: true
continueOnError: true
condition: always()
- powershell: |
taskkill /f /im msbuild.exe
Review comment (Member):

This very much looks like a workaround. Why are there hanging processes in the first place? Can we fix that? Also, this makes it a little cleaner:

Suggested change
taskkill /f /im msbuild.exe
taskkill /f /im msbuild.exe /im vbcscompiler.exe

Reply (Member):

I didn't know you can pass /im multiple times!

Reply (Member):

One thing to clarify here: my expectation is that there were always processes hanging around; I don't believe this change modifies process lifetime. You could try running dotnet build-server shutdown to kill the compiler server. One thing to confirm: are we getting full perf logs from these runs, or are they truncated because of the call to taskkill?
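
(For reference, a cleanup step using the compiler-server shutdown mentioned above, rather than taskkill, could be as simple as the sketch below; it assumes dotnet is on PATH.)

    # Sketch: ask the build servers (including the Roslyn compiler server, VBCSCompiler)
    # to exit gracefully instead of force-killing them.
    dotnet build-server shutdown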

Reply from @benvillalobos (Member, Author):

If we kill the processes, the files they generate aren't complete:

[2021-04-05T19:37:33.6451866Z] Event=Microsoft-Build/MSBuildExeStart ProcessID=7688 ThreadID=1	 commandLine="D:\a\1\s\artifacts\bin\Microsoft.Build.Engine.UnitTests\Debug\net472\MSBuild.exe /nologo /nodemode:1 /nodeReuse:true /low:false" 
[2021-04-05T19:37:34.0831946Z] Event=Microsoft-Build/RequestThreadProcStart ProcessID=7688 ThreadID=11	 
[2021-04-05T19:37:34.0872085Z] Event=Microsoft-Build/BuildProjectStart ProcessID=7688 ThreadID=11	 projectPath="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmp4965ff255b434b4e8c54fa1b6dc9cf10.tmp" 
[2021-04-05T19:37:34.0962009Z] Event=Microsoft-Build/LoadDocumentStart ProcessID=7688 ThreadID=11	 fullPath="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmp4965ff255b434b4e8c54fa1b6dc9cf10.tmp" 
[2021-04-05T19:37:34.1042039Z] Event=Microsoft-Build/LoadDocumentStop ProcessID=7688 ThreadID=11	 fullPath="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmp4965ff255b434b4e8c54fa1b6dc9cf10.tmp" 
[2021-04-05T19:37:34.1072967Z] Event=Microsoft-Build/ParseStart ProcessID=7688 ThreadID=11	 projectFileName="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmp4965ff255b434b4e8c54fa1b6dc9cf10.tmp" 
[2021-04-05T19:37:34.1142095Z] Event=Microsoft-Build/ParseStop ProcessID=7688 ThreadID=11	 projectFileName="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmp4965ff255b434b4e8c54fa1b6dc9cf10.tmp" 
[2021-04-05T19:37:34.3042076Z] Event=Microsoft-Build/TargetStart ProcessID=7688 ThreadID=11	 targetName="msbuild" 
[2021-04-05T19:37:34.3582095Z] Event=Microsoft-Build/LoadDocumentStart ProcessID=7688 ThreadID=11	 fullPath="D:\a\1\s\artifacts\bin\Microsoft.Build.Engine.UnitTests\Debug\net472\Microsoft.Common.tasks" 
[2021-04-05T19:37:34.3602064Z] Event=Microsoft-Build/LoadDocumentStop ProcessID=7688 ThreadID=11	 fullPath="D:\a\1\s\artifacts\bin\Microsoft.Build.Engine.UnitTests\Debug\net472\Microsoft.Common.tasks" 
[2021-04-05T19:37:34.3602064Z] Event=Microsoft-Build/ParseStart ProcessID=7688 ThreadID=11	 projectFileName="D:\a\1\s\artifacts\bin\Microsoft.Build.Engine.UnitTests\Debug\net472\Microsoft.Common.tasks" 
[2021-04-05T19:37:34.3621959Z] Event=Microsoft-Build/ParseStop ProcessID=7688 ThreadID=11	 projectFileName="D:\a\1\s\artifacts\bin\Microsoft.Build.Engine.UnitTests\Debug\net472\Microsoft.Common.tasks" 
[2021-04-05T19:37:34.4402082Z] Event=Microsoft-Build/TargetStop ProcessID=7688 ThreadID=11	 targetName="msbuild" 
[2021-04-05T19:37:34.4442095Z] Event=Microsoft-Build/BuildProjectStop ProcessID=7688 ThreadID=11	 projectPath="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmp4965ff255b434b4e8c54fa1b6dc9cf10.tmp" targets="msbuild" 
[2021-04-05T19:37:34.4462038Z] Event=Microsoft-Build/RequestThreadProcStop ProcessID=7688 ThreadID=11	 
[2021-04-05T19:37:34.5532097Z] Event=Microsoft-Build/RequestThreadProcStart ProcessID=7688 ThreadID=11	 
[2021-04-05T19:37:34.5532097Z] Event=Microsoft-Build/BuildProjectStart ProcessID=7688 ThreadID=11	 projectPath="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmp4965ff255b434b4e8c54fa1b6dc9cf10.tmp" 
[2021-04-05T19:37:34.5571985Z] Event=Microsoft-Build/TargetStart ProcessID=7688 ThreadID=11	 targetName="msbuild" 
[2021-04-05T19:37:34.5612210Z] Event=Microsoft-Build/TargetStop ProcessID=7688 ThreadID=11	 targetName="msbuild" 
[2021-04-05T19:37:34.5612210Z] Event=Microsoft-Build/BuildProjectStop ProcessID=7688 ThreadID=11	 projectPath="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmp4965ff255b434b4e8c54fa1b6dc9cf10.tmp" targets="msbuild" 
[2021-04-05T19:37:34.5612210Z] Event=Microsoft-Build/RequestThreadProcStop ProcessID=7688 ThreadID=11	 
[2021-04-05T19:40:57.3420042Z] Event=Microsoft-Build/RequestThreadProcStart ProcessID=7688 ThreadID=11	 
[2021-04-05T19:40:57.3420042Z] Event=Microsoft-Build/BuildProjectStart ProcessID=7688 ThreadID=11	 projectPath="C:\Users\VssAdministrator\AppData\Local\Temp\041qbmta.yej\tmpe37eba57a280484fb0a6015308bebaff.tmp" 
[2021-04-05T19:40:57.3430119Z] Event=Microsoft-Build/LoadDocumentStart ProcessID=7688 ThreadID=11	 fullPath="C:\Users\VssAdministrator\App

Notice how it ends on LoadDocumentStart and has no MSBuildExeStop.

I'll try dotnet build-server shutdown, which I didn't realize was an option!

Reply (Member):

If certain events hadn't fired yet, wouldn't they still not fire, since the node tries to get out of the build as fast as possible?

Reply from @benvillalobos (Member, Author):

Right. I'd like to get this into a state that is "good enough" so we can see this perf data and act on it, even if it's not the perfect solution. Part of me suspects digging into this node issue could take longer than expected; I've been digging into it without much result so far. The current idea is to set the MSBUILDDISABLENODEREUSE env var to 1 and see if the node will realize it needs to die.
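
A minimal sketch of that experiment, assuming the variable is set just for the CI invocation (the real change would live in the build scripts or pipeline YAML, not a one-off snippet like this):

    # Sketch: disable node reuse for this invocation so out-of-proc MSBuild nodes
    # exit when the build completes instead of lingering and holding file locks.
    $env:MSBUILDDISABLENODEREUSE = '1'
    .\eng\cibuild_bootstrapped_msbuild.cmd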

@benvillalobos (Member, Author) commented:

Notes from PR review:

We want to get Rainer's take on it.
Some options:

  1. Take the change as-is; killing the processes like this is hacky but doesn't impact any user scenario.
  2. Reduce the time it takes for OOP nodes to shut down after a /nr:false build.
  3. Update the task we're using to allow for multiple retries.
    • The task we're using, PublishBuildArtifacts, has a bug that prevents arguments from being passed. The newer task, PublishPipelineArtifacts, doesn't take any arguments and potentially doesn't use robocopy, so I'm not sure there are even arguments to pass.

@benvillalobos (Member, Author) commented:

Current status: with MSBUILDDISABLENODEREUSE set to 1 and no custom scripts, the node properly shuts down. Obviously the bigger issue here is a node being passed nodeReuse:true in the first place. Running this a second time to be safe, and I'll be fine merging it as-is and investigating further.

@Forgind (Member) commented Apr 5, 2021:

When did we switch from "waiting a week is OK" to merging despite known issues?

@benvillalobos (Member, Author) commented:

When did we switch from "waiting a week is OK" to merging despite known issues?

@Forgind What? We didn't. I want this in a state that is "acceptable to merge" should we decide that we want to, so we can act on the perf data. My last comment can be interpreted as "the node reuse thing looks like a bug, but setting the environment variable will allow us to act on this perf data while I dig into what the real issue here is".

@Forgind (Member) commented Apr 7, 2021:

I'll be fine merging it as-is and investigating further.

This sounded to me like merging early. If you meant that you think it's ready to merge but still intend to wait a week, then sure, though I'm less confident.

@benvillalobos (Member, Author) commented:

After mulling this over, I don't think this should be blocked on the hanging process. It sheds light on a potential issue but isn't the cause.

Running eng\cibuild_bootstrapped_msbuild.cmd locally shows that there will always be a hanging process launched with nodeReuse:true despite our build scripts passing /nr:false. I'll get an issue written up for it during PR review, but my best guess is that running our build scripts with -test causes node reuse to be true for at least one node to help tests run faster.

@benvillalobos (Member, Author) commented:

Feedback from PR review: Maybe the hanging node is actually spawned from a unit test. I didn't consider this and it sounds very plausible!

@benvillalobos (Member, Author) commented:

StopOnFirstFailureandBuildInParallelMultipleNode, PropertyOverrideContainsSemicolon, and all tests under TargetsFile_Test are the culprits.

StopOnFirstFailureandBuildInParallelMultipleNode spawns a process with /nr:true (the default for BuildManager).

PropertyOverrideContainsSemicolon and the tests under TargetsFile_Test spawn VBCSCompiler.

@benvillalobos (Member, Author) commented:

Resolved the issue by disabling DOTNET_PERFLOG_DIR for all unit test assemblies. Created an issue tracking hanging assemblies here: #6344.

Confirmed locally that no extra perf logs get generated from tests. CI should succeed with no issues.
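
(The shape of that fix, as a sketch rather than the actual diff; exactly where the PR clears the variable for the unit-test assemblies isn't shown in this thread.)

    # Sketch only (not the actual change): make sure test processes don't inherit
    # DOTNET_PERFLOG_DIR, so MSBuild nodes spawned by unit tests don't write perf logs
    # under the CI artifacts directory and hold locks on them.
    Remove-Item Env:\DOTNET_PERFLOG_DIR -ErrorAction SilentlyContinue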

@benvillalobos added the merge-when-branch-open label (PRs that are approved, except that there is a problem that means we are not merging stuff right now) Apr 14, 2021
@benvillalobos merged commit 369631b into dotnet:main Apr 15, 2021
@benvillalobos deleted the opt-into-perf-logging branch April 15, 2021 18:10
Labels: merge-when-branch-open (PRs that are approved, except that there is a problem that means we are not merging stuff right now.)

Linked issue this PR may close: Opt into and capture perf logging in second-stage bootstrap builds

5 participants