llvm-symbolizer not present in base queue #11631

Closed
1 of 5 tasks
kunalspathak opened this issue Nov 14, 2022 · 54 comments
Assignees
Labels
Ops - Service Maintenance Used to track issues related to maintaining the services .NET Eng Supports

Comments

@kunalspathak
Member

kunalspathak commented Nov 14, 2022

Build

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-77578-merge-965165820fec43e19e/JIT.Stress/1/console.f7c5d70b.log?helixlogtype=result

https://dev.azure.com/dnceng-public/public/_build/results?buildId=82793&view=ms.vss-test-web.build-test-results-tab&runId=1731386&resultId=102137&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Pull Request

dotnet/runtime#77578

Action required for the engineering services team

Additional information about the issue reported

To triage this issue (First Responder / @dotnet/dnceng):

  • Open the failing build above and investigate
  • Add a comment explaining your findings

In dotnet/runtime#77578, we are trying to generate the crash stack trace using llvm-symbolizer. While it is present in the containers, the base Linux and macOS queues don't have it, and we see an error when using it. See the logs I referenced in the issue. Can we get it and lldb installed on the base image?

CC: @hoyosjs @JulieLeeMSFT

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

Add llvm and llvm-symbolizer to Ubuntu.1804.Amd64 and RedHat.7.Amd64

@michellemcdaniel
Contributor

Hi Kunal, we will get on this. @hoyosjs do you know if this just comes built in with llvm? lldb 3.9 is already being installed on the base ubuntu.1804 queues. Do you need a different version? This is the test queue, so I don't think it would be an issue to upgrade that to something newer, but I'd like to check before making any major changes.

@michellemcdaniel michellemcdaniel added the Ops - Compliance First-responder-style issues handled by the Operations V-Team due to prioritization or urgency level label Nov 14, 2022
@hoyosjs
Member

hoyosjs commented Nov 14, 2022

Do you know why 3.9? And llvm sounds good.

@michellemcdaniel
Contributor

michellemcdaniel commented Nov 14, 2022

I do not know why 3.9. Possibly historic reasons? @MattGal it looks like we set our lldb version to 3.9 back in 2020. Do you know why we're using that?

Edit: Oh, actually, we set this in 2019.

Edit: that is also a lie. I am still digging into how long ago we chose 3.9 and never updated it.

@hoyosjs
Member

hoyosjs commented Nov 14, 2022

Probably for diagnostics...

@michellemcdaniel
Contributor

Yeah. I think that's also what's on the docker images that y'all are using, and upgrading to something more modern is also breaking things. I worry that updating it will break y'all.

@MattGal
Member

MattGal commented Nov 14, 2022

@kunalspathak we support several different linux distros, not all of which may have a usable version of llvm-symbolizer. Would it be acceptable if this were only added to Ubuntu Helix machines, or do you need it everywhere? Odds are it's not going to work with some of our more unusual linuxes.

@kunalspathak
Member Author

@kunalspathak we support several different linux distros, not all of which may have a usable version of llvm-symbolizer. Would it be acceptable if this were only added to Ubuntu Helix machines, or do you need it everywhere? Odds are it's not going to work with some of our more unusual linuxes.

@hoyosjs - what do you think?

@hoyosjs
Member

hoyosjs commented Nov 14, 2022

Updating the queues the runtime uses directly would be the first priority:

  • Ubuntu.1804.Amd64.Open
  • RedHat.7.Amd64.Open
  • OSX.1200.ARM64

We'll have to evaluate the helix containers, but those are much easier to update and we've even built the toolset in some of the containers historically.

@hoyosjs
Member

hoyosjs commented Nov 14, 2022

@MattGal do you know where the symbolizer might not be available? cc: @jkoritzinsky since this might be interesting for your *SAN work

@MattGal
Member

MattGal commented Nov 15, 2022

@MattGal do you know where the symbolizer might not be available? cc: @jkoritzinsky since this might be interesting for your *SAN work

Offhand I'd venture it might not be available on old SLES or Mariner. It's one of those things we don't know until we try.

@hoyosjs
Member

hoyosjs commented Nov 15, 2022

Those don't tend to impact our priority scenario - the PR analysis checks

@michellemcdaniel michellemcdaniel self-assigned this Nov 16, 2022
@michellemcdaniel
Contributor

PR to add them to the two linux based queues: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/27535

I think for OSX, we're going to have to get ddfun involved

@michellemcdaniel
Contributor

Opened https://portal.microsofticm.com/imp/v3/incidents/details/349676322/home to get llvm added to the OSX queue.

@michellemcdaniel
Contributor

(Moved to tracking while we wait for DDFun to update the systems)

@JulieLeeMSFT
Member

(Moved to tracking while we wait for DDFun to update the systems)

@michellemcdaniel do we have a time estimate for when DDFun will update the systems?

@michellemcdaniel
Contributor

I do not. I know it's been assigned, but I haven't seen any movement on it. I will ping the ICM.

@michellemcdaniel
Contributor

In general, it takes 1-2 weeks to get this many systems updated (100ish machines), and next week is Thanksgiving, so it's likely going to be at the longer end of that estimate.

@kunalspathak
Member Author

kunalspathak commented Nov 28, 2022

PR to add them to the two linux based queues: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/27535

Does this roll out llvm to our linux helix queues? I kicked off a run on #77578 that would consume it and still see a failure about llvm-symbolizer not being present. See https://dev.azure.com/dnceng-public/public/_build/results?buildId=94545&view=ms.vss-test-web.build-test-results-tab .

@ulisesh ulisesh assigned ulisesh and unassigned michellemcdaniel Nov 28, 2022
@michellemcdaniel
Contributor

We did not have a rollout last week due to the US holiday. The linux changes should roll out this week.

@michellemcdaniel
Contributor

Heads up: DDFun says the OSX queue has been updated and its machines now have llvm on them.

@kunalspathak
Member Author

I tried this out, but it seems there is still an issue.

Test Infrastructure Failure: System.ComponentModel.Win32Exception (2): An error occurred trying to start process 'llvm-symbolizer' with working directory '/private/tmp/helix/working/ADD7099B/w/A75E0909/e'. No such file or directory
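
For readers decoding the error above: the `(2)` appears to be the native error code, which on Linux and macOS corresponds to `ENOENT` ("No such file or directory"). That is what the OS reports when a bare command name cannot be resolved to an executable. The Python sketch below is not the test wrapper's code; it only reproduces the same failure mode. The command name `no-such-tool-for-demo` is made up for illustration.

```python
import errno
import os
import subprocess


def try_start(command):
    """Try to start `command`; return None on success or the errno on failure."""
    try:
        subprocess.run(
            [command],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        return None
    except OSError as exc:
        # A missing executable surfaces as FileNotFoundError (errno ENOENT).
        return exc.errno


# Hypothetical command name, chosen so that it will not resolve on PATH.
code = try_start("no-such-tool-for-demo")
print(code == errno.ENOENT, os.strerror(errno.ENOENT))
```

The working directory named in the message is most likely incidental: the thing that was not found is the executable, not the directory.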

@ulisesh
Contributor

ulisesh commented Dec 5, 2022

@kunalspathak the job was executed in the queue osx.1200.amd64.open but the request was to install llvm in OSX.1200.ARM64 so it is expected for it to not be available in the amd64 queue. In which queue do you need it?

@kunalspathak
Member Author

was executed in the queue osx.1200.amd64.open but the request was to install llvm in OSX.1200.ARM64 so it is expected for it to not be available in the amd64 queue. In which queue do you need it?

I just noticed this from @hoyosjs . I think we also need it for OSX x64, right @hoyosjs ?

Updating the queues the runtime uses directly would be the first priority:

  • Ubuntu.1804.Amd64.Open
  • RedHat.7.Amd64.Open
  • OSX.1200.ARM64

@hoyosjs
Member

hoyosjs commented Dec 6, 2022

Yes, sorry - it would be needed on osx.*.*.open

@missymessa missymessa removed their assignment Mar 6, 2024
@garath
Member

garath commented Mar 6, 2024

Thanks @hoyosjs and @JulieLeeMSFT. To be clear, this isn't blocking builds or preventing releases, but it is making it hard to diagnose test failures. Is there anything else we should know to help set priority? (Unfortunately our Ops team has a rather large backlog right now and we need to be very crisp to be sure we're handling issues in the best order.)

@hoyosjs
Member

hoyosjs commented Mar 7, 2024

These three and #11868 are queues where we can't enable blocking on build analysis for runtime easily, since no crash info will be available for those.

@garath
Member

garath commented Mar 7, 2024

Is it correct that the llvm package on, for example, Ubuntu would include llvm-symbolizer?

@garath
Member

garath commented Mar 7, 2024

Ah, I misunderstood. I see that Ubuntu.2204.Amd64.Open was not part of the original request, so it's a "new install" rather than "why are these missing" for that queue.

As for the state of the MacOS queues... I'll have to dig a bit deeper there.

@JulieLeeMSFT
Member

We will be blocking all PR merges on red starting 3/19 in dotnet/runtime. It will be a big pain for developers if they don't get traces to debug failures and unblock themselves to merge on green. We have worked on this feature for almost 2 years, and this is the last piece that needs to be in place to ensure a smooth developer experience when we enforce merge on green on 3/19.
We have been requesting this feature for many months, so please prioritize this support.

Thanks @hoyosjs and @JulieLeeMSFT. To be clear, this isn't blocking builds or preventing releases, but it is making it hard to diagnose test failures. Is there anything else we should know to help set priority? (Unfortunately our Ops team has a rather large backlog right now and we need to be very crisp to be sure we're handling issues in the best order.)

@hoyosjs
Member

hoyosjs commented Mar 7, 2024

Is it correct that the llvm package on, for example, Ubuntu would include llvm-symbolizer?

On ubuntu that's likely enough for now. But for macOS it's likely very different :)

@garath
Member

garath commented Mar 7, 2024

AzureDevOpsTests
| where Repository == 'dotnet/runtime' and RunCompleted > ago(10d)
| where Message contains "An error occurred trying to start process 'llvm-symbolizer' with working directory"
| extend QueueAndContainer = trim(' ', substring(TestRunName, indexof(TestRunName, '@') + 1))
| summarize count() by QueueAndContainer

@hoyosjs I'm not seeing any results from this query. Should it still be working?
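
For readers unfamiliar with Kusto, the `extend` and `summarize` steps of the query above can be roughly mirrored in Python as below. This is a sketch only: the sample run names are hypothetical, since the real `TestRunName` format is not shown in this thread, and the whitespace trimming is an approximation of Kusto's `trim`.

```python
from collections import Counter


def queue_and_container(test_run_name):
    """Roughly mirror trim(' ', substring(name, indexof(name, '@') + 1))."""
    # str.find returns -1 when '@' is absent, like indexof, so the slice
    # then starts at 0 and the whole name is kept.
    return test_run_name[test_run_name.find("@") + 1:].strip(" ")


def count_by_queue(test_run_names):
    """Rough equivalent of `summarize count() by QueueAndContainer`."""
    return Counter(queue_and_container(name) for name in test_run_names)


# Hypothetical run names, for illustration only.
sample = [
    "JIT.Stress @ osx.1200.amd64.open",
    "Interop @ osx.1200.amd64.open",
    "Interop @ OSX.1200.ARM64",
]
print(count_by_queue(sample))
```

Note that the grouping is case-sensitive here, so `osx.1200.amd64.open` and `OSX.1200.AMD64.Open` would count as different keys.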

@garath
Member

garath commented Mar 7, 2024

I don't have bandwidth to take up this issue yet, but in an effort to speed things up a bit I've opened a request to DDFUN asking them to check on the MacOS systems in question. I'll follow-up here with the results. -- ICM 479938683

@garath
Member

garath commented Mar 7, 2024

@hoyosjs DDFUN spot-checked a few machines in the MacOS queues and confirmed that llvm-symbolizer is installed and should be available on the path. I asked them for the specific paths to the binaries and they found these:

AMD64: /usr/local/opt/llvm/bin/llvm-symbolizer
ARM64: /opt/homebrew/Cellar/llvm/15.0.7_1/bin/llvm-symbolizer

Does this match what you're seeing in your builds?
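
One way to check this from inside a job would be a small script along these lines. It is a sketch, not something that exists in the Helix tooling: it tests whether the two reported locations exist and are executable, and whether their directories appear in a given `PATH` value. It compares directory strings literally and does not resolve symlinks, so a "not on path" result is only a hint.

```python
import os

# Locations reported by DDFUN in the comment above.
REPORTED_LOCATIONS = {
    "AMD64": "/usr/local/opt/llvm/bin/llvm-symbolizer",
    "ARM64": "/opt/homebrew/Cellar/llvm/15.0.7_1/bin/llvm-symbolizer",
}


def is_executable_file(path):
    """True if `path` is a regular file the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def dir_on_path(binary_path, path_value):
    """True if the binary's directory is one of the entries in `path_value`."""
    wanted = os.path.dirname(binary_path)
    entries = [e.rstrip("/") for e in path_value.split(os.pathsep) if e]
    return wanted in entries


def report(path_value=None):
    """Summarize both checks for each reported location."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return {
        arch: {
            "exists": is_executable_file(location),
            "dir_on_path": dir_on_path(location, path_value),
        }
        for arch, location in REPORTED_LOCATIONS.items()
    }


print(report())
```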

@hoyosjs
Member

hoyosjs commented Mar 7, 2024

Are these on the path? I still see hits on runs from today:

AzureDevOpsTests
| where Repository endswith('runtime') and RunCompleted > ago(10d)
| where Message contains "'llvm-symbolizer' with working directory"
| extend QueueAndContainer = trim(' ', substring(TestRunName, indexof(TestRunName, '@') + 1))

Processing /cores/coredump.96726.dmp.crashreport.json
Printing stacktrace from '/cores/coredump.96726.dmp.crashreport.json'
Invoking llvm-symbolizer --pretty-print
Errors while running llvm-symbolizer --pretty-print
System.ComponentModel.Win32Exception (2): An error occurred trying to start process 'llvm-symbolizer' with working directory '/private/tmp/helix/working/B2090961/w/A22F08D1/e/Interop/Interop'. No such file or directory
   at System.Diagnostics.Process.ForkAndExecProcess(ProcessStartInfo startInfo, String resolvedFilename, String[] argv, String[] envp, String cwd, Boolean setCredentials, UInt32 userId, UInt32 groupId, UInt32[] groups, Int32& stdinFd, Int32& stdoutFd, Int32& stderrFd, Boolean usesTerminal, Boolean throwOnNoExec) in /_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs:line 496
   at System.Diagnostics.Process.StartCore(ProcessStartInfo startInfo) in /_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs:line 456
   at CoreclrTestLib.CoreclrTestWrapperLib.TryPrintStackTraceFromCrashReport(String crashReportJsonFile, TextWriter outputWriter)


@garath
Member

garath commented Mar 8, 2024

Are these on the path? I still see hits on runs from today:

They've confirmed the right path is listed in /etc/paths.

I've extracted a random sample of failing machines and asked for those to be checked to rule out an inconsistent configuration.

Your query gives a good view of failing cases but I wonder if we can establish if there have been any successful cases. Do you know of a message that would be printed if it was successful?

@hoyosjs
Member

hoyosjs commented Mar 12, 2024

I tried looking - I see no successful invocations of it on macOS. On linux containers it looks like:

Processing /home/helixbot/dotnetbuild/dumps/coredump.2203.dmp.crashreport.json
Printing stacktrace from '/home/helixbot/dotnetbuild/dumps/coredump.2203.dmp.crashreport.json'
Invoking llvm-symbolizer --pretty-print
Stack trace:
----------------------------------
Thread Id: 0x89b
      Child SP               IP Call Site
 0x7ffca315f5f0 0x7f8585cfb1d8 libclrjit.so!?? at ??:0:0
 0x7ffca315f710 0x7f8585ea7d39 libclrjit.so!Compiler::impImportBlockCode(BasicBlock*) at /__w/1/s/src/coreclr/jit/importer.cpp:7987:56
 0x7ffca315f8f0 0x7f8585d0448a libclrjit.so!insTupleTypeInfos at emitxarch.cpp:0:0
 0x7ffca315f9f0 0x7f8585cfb479 libclrjit.so!?? at ??:0:0
 0x7ffca315fb10 0x7f8585e1ecce libclrjit.so!Compiler::fgSwitchToOptimized(char const*) at /__w/1/s/src/coreclr/jit/flowgraph.cpp:473:5
 0x7ffca315fb80 0x7f8585f616fa libclrjit.so!Compiler::fgMorphExpandCast(GenTreeCast*) at /__w/1/s/src/coreclr/jit/morph.cpp:562:9
 0x7ffca315fbb0
...
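
If it helps to tell a successful symbolization from a failed one automatically, frame lines in the format above could be parsed with something like the following Python sketch. The regular expression is inferred from this one sample only, so other outputs may need adjustments; frames whose symbol is `??` are treated as unresolved.

```python
import re

# Matches lines such as:
#  0x7ffca315f5f0 0x7f8585cfb1d8 libclrjit.so!?? at ??:0:0
FRAME = re.compile(
    r"^\s*(?P<sp>0x[0-9a-fA-F]+)\s+(?P<ip>0x[0-9a-fA-F]+)\s+"
    r"(?P<module>[^!\s]+)!(?P<symbol>.+) at (?P<location>\S+)$"
)


def parse_frames(lines):
    """Return one dict per line that looks like a complete stack frame."""
    frames = []
    for line in lines:
        match = FRAME.match(line)
        if match:
            frames.append(match.groupdict())
    return frames


def unresolved(frames):
    """Frames for which the symbolizer could not produce a symbol name."""
    return [frame for frame in frames if frame["symbol"] == "??"]


sample = [
    " 0x7ffca315f5f0 0x7f8585cfb1d8 libclrjit.so!?? at ??:0:0",
    " 0x7ffca315f710 0x7f8585ea7d39 libclrjit.so!Compiler::impImportBlockCode(BasicBlock*)"
    " at /__w/1/s/src/coreclr/jit/importer.cpp:7987:56",
    " 0x7ffca315fbb0",  # truncated line from the log; skipped by the parser
]
frames = parse_frames(sample)
print(len(frames), len(unresolved(frames)))
```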

@janvorli
Member

Instead of the symbolizer, macOS has the atos tool. An old note from my personal OneNote has an example:

atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654
EEStartupHelper() (in libcoreclr.dylib.dwarf) (ceemain.cpp:1001)
(use the dwarf file to get the source line)

Or
atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654 -fullPath
EEStartupHelper() (in libcoreclr.dylib.dwarf) (/Users/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001)
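
If a script wanted to fall back to atos, the two invocations above could be assembled roughly as follows. This is only a sketch: the helper name `atos_command` is made up, and it uses nothing beyond the `-o` and `-fullPath` options that appear in the note.

```python
import shlex


def atos_command(symbol_file, address, full_path=False):
    """Build the argv for the atos invocations shown in the note above."""
    argv = ["atos", "-o", symbol_file, hex(address)]
    if full_path:
        argv.append("-fullPath")
    return argv


cmd = atos_command(
    "artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf",
    0x7AC654,
    full_path=True,
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run` directly, which avoids shell quoting issues with paths.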

@garath
Member

garath commented Mar 12, 2024

DDFUN confirmed that llvm-symbolizer is callable from the home directory, so it must be present on the path.

There must be something different about the build, but without looking through YAML or debugging an actual build, I'm at a loss. Does anyone have any other suggestions on what to check?
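
One possibility worth ruling out is that the test process sees a different `PATH` than an interactive shell does. As an assumption not confirmed in this thread: on macOS, `/etc/paths` is normally applied by `path_helper` for login shells, so a process started some other way may not inherit those entries. The Python sketch below illustrates the effect with a fake tool in a temporary directory; the tool name `demo-symbolizer` and both `PATH` values are made up.

```python
import os
import shutil
import stat
import tempfile


def resolve(tool, path_value):
    """Return where `tool` would resolve under the given PATH value, or None."""
    return shutil.which(tool, path=path_value)


def compare(tool, shell_path, job_path):
    """Resolve the same tool under two different PATH values."""
    return {"shell": resolve(tool, shell_path), "job": resolve(tool, job_path)}


def demo():
    with tempfile.TemporaryDirectory() as tool_dir, \
            tempfile.TemporaryDirectory() as other_dir:
        # Create a fake executable standing in for the real tool.
        fake = os.path.join(tool_dir, "demo-symbolizer")
        with open(fake, "w") as handle:
            handle.write("#!/bin/sh\nexit 0\n")
        os.chmod(fake, os.stat(fake).st_mode | stat.S_IXUSR)
        # The "shell" PATH includes the tool's directory; the "job" PATH does not.
        shell_path = os.pathsep.join([other_dir, tool_dir])
        job_path = other_dir
        return compare("demo-symbolizer", shell_path, job_path)


print(demo())
```

Logging the value of `PATH` from inside a failing work item and running the same resolution against it would show whether this is what is happening.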

@garath garath self-assigned this Mar 12, 2024
@garath
Member

garath commented Mar 12, 2024

For recordkeeping, I've taken dci-mac-build-197 in queue osx.1200.amd64.open offline for use in the investigation.

@garath garath added the Ops - P2 Operations task, priority 2 label Mar 25, 2024
@garath garath removed the Ops - P2 Operations task, priority 2 label Apr 9, 2024
@riarenas riarenas added Ops - Service Maintenance Used to track issues related to maintaining the services .NET Eng Supports and removed Ops - Compliance First-responder-style issues handled by the Operations V-Team due to prioritization or urgency level labels Jun 5, 2024
@missymessa missymessa self-assigned this Jun 10, 2024
@missymessa
Member

@ilyas1974 ilyas1974 added Ops - P2 Operations task, priority 2 and removed Ops - P2 Operations task, priority 2 labels Jul 24, 2024
@missymessa
Member

DDFUN resolved the IcM. Please follow up on the IcM if this issue persists.
