Running dotnet in Linux chroot with FreeBSD host fails #33051

Closed
am11 opened this issue Mar 2, 2020 · 18 comments
Labels
area-PAL-coreclr needs-author-action An issue or pull request that requires more info or actions from the author.
Milestone

Comments

@am11
Member

am11 commented Mar 2, 2020

The helloworld app in Linux chroot of FreeBSD still failed when invoking the compiler (at dotnet exec /path/csc.dll..., during the build). Tested with 3.1.2, without the workaround. Here is the truss output: https://api.cirrus-ci.com/v1/task/6667788790005760/logs/emulate%20dotnet.log. With the sched_getcpu workaround, it continues to work.

Originally posted by @am11 in #13475 (comment)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Mar 2, 2020
@am11 am11 changed the title from "The helloworld app in Linux chroot of FreeBSD still failed when invoking the compiler (at dotnet exec /path/csc.dll..., during the build). Tested with 3.1.2, without the [workaround](https://github.com/dotnet/runtime/issues/13475#issuecomment-559854433). Here is the truss output: https://api.cirrus-ci.com/v1/task/6667788790005760/logs/emulate%20dotnet.log. With the sched_getcpu workaround, it continues to work." to "Running dotnet in Linux chroot with FreeBSD host fails" Mar 2, 2020
@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Apr 10, 2020
@mangod9 mangod9 added this to the 5.0 milestone Apr 10, 2020
@janvorli
Member

@am11 would you be able to check if this issue still occurs? It seems that it should be fixed now.

@am11
Member Author

am11 commented Jul 15, 2020

@janvorli, sorry, the above links are dead. I have tested with the latest SDK, and the build is failing with a socket issue: https://cirrus-ci.com/task/6736777027256320. The underlying issue is #39279. I will keep an eye on it and close this if it works after the socket issue is resolved.

@mangod9
Member

mangod9 commented Aug 12, 2020

@am11 any updates on this? If not, it might make sense to close this out and reopen if the issue repros. Thx.

@am11
Member Author

am11 commented Aug 12, 2020

The issue with dotnet build still persists, so this is waiting for a fix. I am not sure needs more info is the right tag for this issue. I have already provided the public CI link where the build is breaking.

@mangod9
Member

mangod9 commented Aug 12, 2020

Can you please clarify which version you are still seeing the issue on (the original comment mentions 3.1.2)? Are there more details on the failure? Thx

@am11
Member Author

am11 commented Aug 12, 2020

The issue is from March; at that time it was tested on 3.1. The last link I shared (#33051 (comment)) is using preview 8. At this point I cannot test whether it works without the sched_getcpu workaround due to another issue with the SDK, which the CI logs show.

@janvorli
Member

@am11 are you sure the underlying issue with the socket is #39279? Looking at the call stack in the Cirrus CI logs you linked to, it seems that there is a problem with a relatively recent change that uses socketpair to implement Process.Start redirection: #34861

@am11
Member Author

am11 commented Aug 12, 2020

@janvorli, you are right, it is a socketpair issue, not the socket one. There is another issue reported today for WSL, #40727, with a similar call stack.

@adamsitnik
Member

@am11 is there any chance you could see if #40851 has fixed the problem for you?

@am11
Member Author

am11 commented Aug 17, 2020

@adamsitnik, thanks for the reminder. Waiting for dotnet/sdk#12935 to pick up that change.

@ghost ghost added the no-recent-activity label Dec 7, 2020
@dotnet dotnet deleted a comment Dec 7, 2020
@ghost ghost removed the no-recent-activity label Dec 7, 2020
@danmoseley
Member

@am11 did you get a chance to check whether it's fixed?

@am11
Member Author

am11 commented Dec 7, 2020

I am sorry that I forgot to follow up here. With .NET 5.0 RTM, we can publish a simple .NET application for Linux and run it in Linuxulator (FreeBSD's Linux emulation layer) without the sched_getcpu workaround.

Green CI showing a working HelloWorld app: https://cirrus-ci.com/task/4833901384302592 (the logs are not permanent and will be deleted by the CI service once their retention period elapses).

However, for more complex applications (such as csc.dll, the C# compiler, or dotnet-build), there are still some missing/unimplemented syscalls in the Linuxulator environment that cause crashes like dotnet/roslyn#46772. Those missing calls are gradually being implemented by FreeBSD community members.

On the other hand, applications such as Firefox, go-lang's go(1) and others are known to be working in the Linuxulator environment: https://wiki.freebsd.org/Linuxulator. I would love to see .NET achieve a similar working status in Linuxulator (and similar limited / chroot-like / WSL-like subsystem) environments. 😎

It might require a few tweaks around syscalls (maybe mprotect, mlock etc.), which can be detected at run time (or build time, if it affects performance). The (strace equivalent) truss -f dotnet build tracer output, which shows what happens before exit code 139 (SIGSEGV), was captured from CirrusCI and saved at: https://gist.github.com/am11/ec04e27cc93884dbcadf94691340c3e4 ⬅️ this is very verbose.
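
For readers wondering how exit code 139 maps to SIGSEGV: shells conventionally report a process killed by signal N as 128 + N, and SIGSEGV is signal 11. The following is a minimal sketch of that arithmetic; the helper name describe_exit_code is hypothetical and not part of any .NET or FreeBSD tooling.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit status to a signal name when it is 128 + N."""
    if code > 128:
        try:
            # 139 - 128 == 11, which is SIGSEGV on Linux and FreeBSD.
            return signal.Signals(code - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit({code})"


print(describe_exit_code(139))  # prints SIGSEGV
```

An ordinary non-zero status such as 1 is reported unchanged as exit(1), since it does not encode a signal.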

@janvorli
Member

janvorli commented Dec 7, 2020

@am11 thank you for the details. What would be the tweaks around mprotect and mlock that you have mentioned?
As for the SIGSEGV, I assume we don't have any call stack available for that, do we?

@am11
Member Author

am11 commented Dec 7, 2020

@janvorli, I was guessing from this part of the trace that the unimplemented membarrier is causing the problem for mprotect/mlock:

 6618: munmap(0x801276000,13132)		 = 0 (0x0)
 6618: linux_sched_getaffinity(0x19da,0x80,0x7fffffffccf8) = 32 (0x20)
 6618: linux_membarrier(0x0,0x0)		 ERR#-38 'Function not implemented'
 6618: linux_mmap2(0x0,0x1000,0x3,0x22,0xffffffffffffffff,0x0) = 34379096064 (0x801276000)
 6618: mlock(0x801276000,4096)			 = 0 (0x0)
 6618: linux_mprotect(0x80396a000,0x1000,0x7)	 = 0 (0x0)
 6618: linux_madvise(0x80396a000,0x1000,0x11)	 = 0 (0x0)
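
Since the full truss output is very verbose, one way to find candidates like the membarrier line is to filter for failing calls. This is a rough sketch that assumes failures always follow the `ERR#<number> '<message>'` shape seen above; the regular expression and the function name failed_syscalls are illustrative, not part of truss.

```python
import re

# Matches lines such as:
#   6618: linux_membarrier(0x0,0x0)   ERR#-38 'Function not implemented'
ERR_LINE = re.compile(r"^\s*(\d+):\s+(\w+)\(.*\)\s+ERR#(-?\d+)\s+'([^']*)'")


def failed_syscalls(trace: str):
    """Yield (pid, syscall, error number, message) for each failing line."""
    for line in trace.splitlines():
        match = ERR_LINE.match(line)
        if match:
            pid, name, err, message = match.groups()
            yield int(pid), name, int(err), message


SAMPLE = """\
 6618: munmap(0x801276000,13132)   = 0 (0x0)
 6618: linux_membarrier(0x0,0x0)   ERR#-38 'Function not implemented'
 6618: mlock(0x801276000,4096)     = 0 (0x0)
"""

for entry in failed_syscalls(SAMPLE):
    print(entry)
```

Successful calls (those ending in `= 0 (0x0)` and similar) are skipped, so only the unimplemented or failing syscalls remain for inspection.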

Unfortunately, collecting a call stack is tricky and would probably require recompiling gdb on FreeBSD with a Linux coredump handler, as described in these notes: https://papers.freebsd.org/2018/bsdcan/tuffli-Running_Linux_applications_on_FreeBSD.files/tuffli-Running_Linux_applications_on_FreeBSD-notes.txt. The problems I was running into are:

Debugging

  • FreeBSD's gdb only shows child forks and their exit codes.
  • Linux gdb in linuxulator (distro: CentOS 7) fails with:
    Starting program: /home/newhdd/.dotnet/dotnet run
    warning: linux_test_for_tracefork: unexpected result from waitpid (3790, status 0x57f)
    Couldn't get CS register: Invalid argument.

CoreDump analysis

  • the coredump which gets produced is in FreeBSD format, so while FreeBSD LLDB and GDB do understand the dump, they do not recognize the dotnet binary.
  • Linux GDB does understand the binary format but not the FreeBSD coredump format.
  • dotnet dump analyze also does not understand the FreeBSD coredump format.

SOS plugin

  • fails in FreeBSD's lldb, as it is a linux shared object.

(I haven't yet tried installing lldb in linuxulator)
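
To check which flavor a given binary or core file claims to be, the ELF identification bytes can be inspected directly. The sketch below relies only on the generic ELF layout (OS/ABI byte at offset 7 of e_ident, e_type at offset 16, ET_CORE = 4). Whether a particular dump from this setup actually carries the FreeBSD OS/ABI value is an assumption that was not verified here, and describe_elf is a hypothetical helper.

```python
import struct

ELF_MAGIC = b"\x7fELF"
EI_DATA = 5    # byte order: 1 = little-endian, 2 = big-endian
EI_OSABI = 7   # OS/ABI identification byte within e_ident
ET_CORE = 4    # e_type value used for core files
OSABI_NAMES = {0: "SYSV/none", 3: "Linux", 9: "FreeBSD"}


def describe_elf(header: bytes) -> dict:
    """Report the OS/ABI and whether the file is a core dump."""
    if len(header) < 18 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF header")
    fmt = "<H" if header[EI_DATA] == 1 else ">H"
    (e_type,) = struct.unpack_from(fmt, header, 16)
    osabi = header[EI_OSABI]
    return {
        "osabi": OSABI_NAMES.get(osabi, f"unknown({osabi})"),
        "is_core": e_type == ET_CORE,
    }


# Synthetic 64-bit little-endian header marked FreeBSD with type ET_CORE.
fake = ELF_MAGIC + bytes([2, 1, 1, 9]) + bytes(8) + struct.pack("<H", ET_CORE)
print(describe_elf(fake))
```

For a real file, passing the first 18 bytes (for example `open(path, "rb").read(18)`) is enough for this check.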

@janvorli
Member

janvorli commented Dec 8, 2020

The missing membarrier should not be a problem; the mprotect / mlock sequence is a fallback that we use if membarrier doesn't work. Only arm64 would have problems, because the mprotect / mlock approach doesn't work on it.
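
The fallback order described here can be modeled as follows. This is only a toy sketch of "try membarrier, fall back on ENOSYS": the runtime's real implementation is native code and differs in detail, and the names flush_process_write_buffers, membarrier, and mprotect_flush are hypothetical stand-ins passed in as callables.

```python
import errno


def flush_process_write_buffers(membarrier, mprotect_flush) -> str:
    """Prefer membarrier; use the mprotect-based path if it is unimplemented."""
    try:
        membarrier()
        return "membarrier"
    except OSError as exc:
        if exc.errno != errno.ENOSYS:
            raise  # any other failure is unexpected and should surface
    mprotect_flush()
    return "mprotect"


def unimplemented_membarrier():
    # Simulates the ERR#-38 'Function not implemented' result from the trace.
    raise OSError(errno.ENOSYS, "Function not implemented")


calls = []
used = flush_process_write_buffers(
    unimplemented_membarrier, lambda: calls.append("mprotect_flush")
)
print(used, calls)
```

Under this model, an ENOSYS from membarrier alone does not abort the process; the subsequent mlock and mprotect calls in the trace are consistent with the fallback path being taken.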

@am11
Member Author

am11 commented Dec 8, 2020

Opened microsoft/clrmd#875 for dump analyze support.

@mangod9 mangod9 modified the milestones: 6.0.0, 7.0.0 Jul 9, 2021
@ghost

ghost commented Oct 9, 2021

This issue has been automatically marked no recent activity because it has been marked as needs more info but has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no recent activity.

Please refer to our contribution guidelines for tips on what information might be required.

@ghost

ghost commented Nov 5, 2021

This issue will now be closed since it had been marked no recent activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.

@ghost ghost closed this as completed Nov 5, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Dec 5, 2021
@eiriktsarpalis eiriktsarpalis added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 19, 2022
@ghost ghost removed the no-recent-activity label Jan 19, 2022