
Segfault during restore: CoreCLR Product Build Linux_musl arm release #43826

Closed
CoffeeFlux opened this issue Oct 26, 2020 · 15 comments
Labels
area-Infrastructure-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Comments

@CoffeeFlux
Contributor

At the start of "Build native test components":

Building step 'Restore product binaries (build tests)' via "/__w/1/s/eng/common/msbuild.sh"  --warnAsError false /__w/1/s/src/tests/build.proj /p:RestoreDefaultOptimizationDataPackage=false /p:PortableBuild=true /p:UsePartialNGENOptimization=false /maxcpucount "/flp:Verbosity=normal;LogFile=/__w/1/s/artifacts/log/Restore_Product.Linux.arm.Release.log" "/flp1:WarningsOnly;LogFile=/__w/1/s/artifacts/log/Restore_Product.Linux.arm.Release.wrn" "/flp2:ErrorsOnly;LogFile=/__w/1/s/artifacts/log/Restore_Product.Linux.arm.Release.err" /t:BatchRestorePackages /p:TargetArchitecture=arm /p:Configuration=Release /p:TargetOS=Linux /nodeReuse:false    
  [15:05:18.25] Restoring all packages...
  Segmentation fault (core dumped)
/__w/1/s/src/tests/build.proj(53,5): error MSB3073: The command ""/__w/1/s/.dotnet/dotnet" restore -r linux-musl-arm /__w/1/s/src/tests/Common/scripts/scripts.csproj  /p:SetTFMForRestore=true /p:TargetOS=Linux /p:TargetArchitecture=arm /p:Configuration=Release " exited with code 139.

Build FAILED.

https://dev.azure.com/dnceng/public/_build/results?buildId=865863&view=logs&jobId=2796eae7-6bff-580e-7515-5bfa4409543c&j=2796eae7-6bff-580e-7515-5bfa4409543c&t=925eae2f-7374-55ef-fc58-6001c38b9348

Hit in #43798

@CoffeeFlux CoffeeFlux added area-Infrastructure-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' labels Oct 26, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Oct 26, 2020
@CoffeeFlux CoffeeFlux added this to the 6.0.0 milestone Oct 26, 2020
@CoffeeFlux
Contributor Author

Not sure what area to put this under, so I labeled it with infrastructure for now to ensure it's seen by the right people. Feel free to move it wherever appropriate.

@safern
Member

safern commented Oct 26, 2020

cc: @trylek @jkoritzinsky

@trylek
Member

trylek commented Oct 26, 2020

@janvorli, am I right to recall that these are the runs you recently enabled? I guess we can either get the dump and work from there by identifying which component crashes (it looks like the NuGet downloader, but the log isn't detailed enough to be sure), or find out that there's no dump and track this primarily as an infra deficiency: not having a dump available.

@trylek trylek removed the untriaged New issue has not been triaged by the area owner label Oct 26, 2020
@janvorli
Member

Yes, I have recently added the linux-musl arm builds. It is a cross-build, so the crash happens on x64 Linux. Since we don't seem to capture core dumps of build crashes, this doesn't look actionable.

@trylek
Member

trylek commented Oct 26, 2020

@dotnet/runtime-infrastructure - do we know how to enable dump collection on AzDO build machines, and/or how to investigate why it doesn't work if it should already be enabled?

@hoyosjs
Member

hoyosjs commented Oct 26, 2020

Dumps are configured on Helix test machines, not build machines, as far as I know. @MattGal please correct me if I am wrong here. https://github.com/dotnet/runtime/blob/master/docs/design/coreclr/botr/xplat-minidump-generation.md describes the environment variables needed if the crash is happening somewhere in managed code; I'd say that's unlikely given this is an x64 process, but the logs point at it happening during a restore.

Otherwise, for native dumps we'd need to set ulimit -c unlimited, let the OS collect the core, and then move it to a place we'd upload. The last part would be to upload that folder conditionally if it's found. I also know Helix is sensitive to changes in the ulimit and core-filter settings, so I'd defer to dnceng on whether build machines have the same issues.
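A minimal sketch of that native-dump flow, under stated assumptions: CORE_DIR and WORKSPACE are hypothetical names (not anything the CI defines), and the default kernel core_pattern (plain "core" files in the crashing process's working directory) is assumed.

```shell
#!/bin/sh
# Sketch: lift the core-size limit before the build, then sweep up any
# core files afterwards so a later step can upload them.
# CORE_DIR and WORKSPACE are hypothetical placeholders.
ulimit -c unlimited 2>/dev/null || true   # remove the core-size soft limit if allowed

CORE_DIR="${CORE_DIR:-/tmp/collected-cores}"
mkdir -p "$CORE_DIR"

# ... the build step that may segfault would run here ...

# With the default kernel core_pattern, cores land in the crashing
# process's working directory; sweep the workspace for them.
find "${WORKSPACE:-.}" -maxdepth 3 -type f -name 'core*' 2>/dev/null |
while read -r f; do
    cp "$f" "$CORE_DIR/"
done

echo "cores collected in $CORE_DIR"
```

The ulimit change only affects the current shell and its children, which is why it has to run in the same step as (or before) the build command itself.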

@MattGal
Member

MattGal commented Oct 26, 2020

Dumps are configured on helix test machines, not build machines as far as I know. @MattGal please correct me if I am wrong here.

There isn't really a difference in how a Helix machine is provisioned. Specifically, we still create and upload dumps for anything that crashes even on the build machines, and if you know your build's info and have Kusto access you can definitely find these dumps.

This isn't likely to get any more helpful than it currently is, because the Helix work items in question come from an Azure DevOps pool provider, and there's no built-in part of this interface to promote files to be "part" of a build outside of the build's execution.

However, if you're keen to know how to get dumps (if they exist) off of a given AzDO build, send me the build in corpnet email and I'll walk you through how to find them.

@safern
Member

safern commented Oct 27, 2020

Is there a way to get the dump? This was executed inside a docker container and the container is not preserved, right?

@hoyosjs
Member

hoyosjs commented Oct 27, 2020

I tried, but didn't find any data in Kusto pointing to one (or even to a failure in that leg). I sent a message to Matt and will share any findings.

@MattGal
Member

MattGal commented Oct 27, 2020

Ah. If the Azure DevOps agent drove the build inside a docker container, then no, we definitely don't have any record of core dumps created inside that container.

@janvorli
Member

If we wanted to enable capturing dumps in containers, we could map a folder on the machine running the container into the container and let the dumps go there, and also pass the --ulimit core=-1 option to docker, which is equivalent to ulimit -c unlimited.
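A sketch of that docker invocation; HOST_DUMP_DIR and IMAGE are hypothetical placeholders, so the command is printed rather than executed:

```shell
#!/bin/sh
# Sketch of the invocation described above. HOST_DUMP_DIR and IMAGE are
# placeholders, not the values the CI actually uses.
HOST_DUMP_DIR="${HOST_DUMP_DIR:-/datadisks/core_dumps}"
IMAGE="${IMAGE:-build-image:placeholder}"

# --ulimit core=-1 is docker's equivalent of 'ulimit -c unlimited';
# -v maps a host folder in, so cores survive the container's removal.
DOCKER_CMD="docker run --rm --ulimit core=-1 -v $HOST_DUMP_DIR:/core_dumps $IMAGE ./build.sh"

# Print the command instead of running it, since this is only a sketch.
echo "$DOCKER_CMD"
```

The container would also need its core_pattern (or working directories) pointed at /core_dumps for the mapped folder to actually receive the files.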

@safern
Member

safern commented Oct 27, 2020

Would it make sense to instead add a step that tries to upload dumps as artifacts if one exists when the build fails?
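One way such a step could look, sketched with the Azure DevOps "artifact.upload" logging command; the dump location (DUMP_GLOB) and artifact name are hypothetical:

```shell
#!/bin/sh
# Sketch of a conditional post-build step: if any core files exist after
# a failed build, hand them to the agent for upload. DUMP_GLOB and the
# artifact name "CoreDumps" are hypothetical.
DUMP_GLOB="${DUMP_GLOB:-/tmp/collected-cores}"
FOUND=0
for f in "$DUMP_GLOB"/core*; do
    [ -f "$f" ] || continue
    FOUND=1
    # The AzDO agent parses this line and attaches the file to the build.
    echo "##vso[artifact.upload containerfolder=dumps;artifactname=CoreDumps]$f"
done
if [ "$FOUND" -eq 0 ]; then
    echo "no core dumps found"
fi
```

In a pipeline this would run with a failed()-style condition so it only executes when the build step before it crashed.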

@hoyosjs hoyosjs modified the milestones: 6.0.0, 7.0.0 Jul 22, 2021
@hoyosjs
Member

hoyosjs commented Jul 22, 2021

Haven't seen this recently.

@krwq
Member

krwq commented Sep 30, 2021

Should we close this issue and re-open it if we see further occurrences? If not, should we at least remove the blocking-clean-ci label, since it hasn't been seen for some time?

@jakobbotsch
Member

Closing this as we haven't seen it for a while.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 2, 2022