C++ crashes when calling any function in nativeaot shared library #110074

Closed
CeSun opened this issue Nov 22, 2024 · 47 comments · Fixed by #110082
Labels
area-PAL-coreclr in-pr There is an active PR which will close this issue when it is merged

Comments

@CeSun

CeSun commented Nov 22, 2024

The system developer said it was caused by selinux permissions. Is there a way to bypass this system call?

if (syscall(__NR_get_mempolicy, NULL, NULL, 0, 0, 0) < 0 && errno == ENOSYS)

Image

syscall Disassembly:201
NUMASupportInitialize() 0x0000005c8b963458
GCToOSInterface::Initialize() 0x0000005c8b962480
::PalInit() 0x0000005c8b96072c
::RhInitialize(bool) 0x0000005c8b91b1f0
InitializeRuntime() 0x0000005c8b914ebc
Thread::EnsureRuntimeInitialized() 0x0000005c8b91d0e8
Thread::ReversePInvokeAttachOrTrapThread(ReversePInvokeFrame*) 0x0000005c8b91d094
libavalonia_Entry_napi_init__RegisterEntryModule napi_init.cs:12
::RegisterAvaloniaNativeModule() napi_init.cpp:16

@dotnet-policy-service bot added the untriaged label (New issue has not been triaged by the area owner) on Nov 22, 2024
Contributor

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@MichalStrehovsky
Member

It doesn't look like there is a way to avoid calling into the NUMA API.

Cc @janvorli @am11 for ideas

Is this with the default SELinux policies or something more locked down?

@CeSun
Author

CeSun commented Nov 22, 2024

If I modify the source code and return directly in the first line of the NUMASupportInitialize function, will it work?
If it is theoretically possible, I will try to invest my energy in learning how to compile the dotnet sdk.

@janvorli
Member

@CeSun are you running in a docker container? And what is the distro you are using?

@CeSun
Author

CeSun commented Nov 22, 2024

@janvorli Hi, Thanks for your reply,

I am using HarmonyOS Next, a new mobile operating system developed by Huawei. This system is similar to Android. And on this system, you can call the native shared library of linux-musl.

Currently, this system is in the public beta stage.

I have two devices, one with enforcing selinux and the other with disabled selinux.

On the device with disabled selinux, the native shared library released by nativeaot works fine, but not on the other.

But in the future, the selinux status of the system used by users will be enforcing

@CeSun
Author

CeSun commented Nov 22, 2024

I have also posted a work order in the Huawei Developer Center to seek help from Huawei and am waiting for a response.

@huoyaoyuan
Member

> If I modify the source code and return directly in the first line of the NUMASupportInitialize function, will it work?

It should work as-if there's no NUMA support, like the non TARGET_LINUX path.

@CeSun
Author

CeSun commented Nov 22, 2024

> > If I modify the source code and return directly in the first line of the NUMASupportInitialize function, will it work?
>
> It should work as-if there's no NUMA support, like the non TARGET_LINUX path.

I also noticed the macro "TARGET_LINUX", but I know nothing about NUMA.
There are no assertions in the source code, so I guess it is logically allowed not to execute NUMA-related initialization code.

@huoyaoyuan
Member

NUMA refers to Non-Uniform Memory Access: different physical memory controllers are attached to different CPU core(s). Accessing memory or cache attached to a different memory controller requires going through the slower interconnect bus, as in systems with multiple CPU chips.
It was usually not a concern on consumer hardware before Ryzen 9 brought two chiplets.
Not initializing NUMA information will just increase the chance of inefficient memory accesses on HEDT and server CPUs with many memory channels.
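
To see whether a given Linux machine exposes more than one NUMA node at all, one option is to count the `node<N>` directories that sysfs publishes. The sketch below is only an illustration and is not how the .NET runtime detects NUMA; the helper names `is_numa_node_dir` and `count_numa_nodes` are made up for this note, and it assumes sysfs is mounted in the usual place.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <string.h>

/* Returns 1 if name has the form "node<digits>" (for example "node0"), else 0. */
static int is_numa_node_dir(const char *name)
{
    assert(name != NULL);
    if (strncmp(name, "node", 4) != 0 || name[4] == '\0')
        return 0;
    for (const char *p = name + 4; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;
    }
    return 1;
}

/* Counts "node<N>" entries under sysfs_path.
   Returns -1 if the directory cannot be opened (no NUMA info available). */
static int count_numa_nodes(const char *sysfs_path)
{
    DIR *dir = opendir(sysfs_path);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (is_numa_node_dir(entry->d_name))
            count++;
    }
    closedir(dir);
    return count;
}
```

Calling `count_numa_nodes("/sys/devices/system/node")` on a Linux system gives the number of nodes the kernel reports. A result of -1 means the directory is missing, which is the situation where skipping NUMA initialization loses nothing.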

@dotnet-policy-service bot added the in-pr label (There is an active PR which will close this issue when it is merged) on Nov 22, 2024
@janvorli
Member

@CeSun do you know if the crash happened while calling the syscall or at some later point?

@am11
Member

am11 commented Nov 22, 2024

I was testing with:

#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>

int main(void)
{
    if (syscall(__NR_get_mempolicy, NULL, NULL, 0, 0, 0) < 0)
        printf("syscall failed with errno %d: %s\n", errno, strerror(errno));
    else
        printf("didn't fail\n");

    return 0;
}

cc getmempolicy.c && ./a.out

@CeSun
Author

CeSun commented Nov 22, 2024

> @CeSun do you know if the crash happened while calling the syscall or at some later point?

when calling the syscall

@CeSun
Author

CeSun commented Nov 22, 2024

@am11
In addition, when I use a tool similar to adb to enter the shell environment, running the executable program published by NativeAOT works without any problems. It only crashes when the native shared library published by NativeAOT is bundled in this system's software package (similar to an Android APK).

@dotnet-policy-service bot removed the untriaged label (New issue has not been triaged by the area owner) on Nov 22, 2024
@am11
Member

am11 commented Nov 22, 2024

@CeSun, I couldn't figure out which "class" to use in the SELinux profile, so I went with the seccomp security model to repro it. To do that, the host kernel first needs to support the get_mempolicy syscall (e.g. the Linux host kernel used by Docker for Mac doesn't support it, so I created a Fedora VM and installed Docker in it). I built the repro (#110074 (comment)) in the VM and ran the container with and without the cap:

$ docker run -v$(pwd):/app --rm --cap-add=SYS_NICE fedora /app/a.out
didn't fail

$ docker run -v$(pwd):/app --rm fedora /app/a.out
syscall failed with errno 1: Operation not permitted

With docker mac (whose host doesn't have get_mempolicy syscall), I was getting:

syscall failed with errno 38: Function not implemented

We were handling errno 38 but not 1, so this is somewhat of a corner case (host kernel supports get_mempolicy and container does not enable the capability). The daily build with changes will be out in a few hours or by tomorrow, you can give it a try.
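
The corner case comes down to which errno values are taken to mean "NUMA is unavailable". The sketch below is not the code from the linked PR; the names `classify_mempolicy_result` and `numa_is_usable` are hypothetical. It only illustrates the idea of treating every failure of the probe, not just ENOSYS, as "run without NUMA".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical classification of the probe outcome (illustrative names). */
typedef enum {
    NUMA_PROBE_OK,            /* syscall succeeded */
    NUMA_PROBE_UNIMPLEMENTED, /* ENOSYS: kernel has no get_mempolicy */
    NUMA_PROBE_DENIED,        /* EPERM/EACCES: blocked by a sandbox policy */
    NUMA_PROBE_OTHER          /* any other failure */
} numa_probe_result;

static numa_probe_result classify_mempolicy_result(long rc, int err)
{
    assert(rc >= 0 || err > 0); /* a failed syscall always sets errno */
    if (rc >= 0)
        return NUMA_PROBE_OK;
    if (err == ENOSYS)
        return NUMA_PROBE_UNIMPLEMENTED;
    if (err == EPERM || err == EACCES)
        return NUMA_PROBE_DENIED;
    return NUMA_PROBE_OTHER;
}

/* Only a successful probe enables NUMA; every failure falls back to "no NUMA". */
static int numa_is_usable(void)
{
#if defined(__linux__) && defined(SYS_get_mempolicy)
    errno = 0;
    long rc = syscall(SYS_get_mempolicy, NULL, NULL, 0, 0, 0);
    return classify_mempolicy_result(rc, errno) == NUMA_PROBE_OK;
#else
    return 0; /* no such syscall on this platform */
#endif
}
```

A check like this only helps when the environment reports the failure through errno. It cannot help if the sandbox terminates the process at the moment the syscall is issued.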

@janvorli
Member

> when calling the syscall

@CeSun does it crash for you or does it print the "syscall failed with errno ..." message? If it crashes, then I think it is likely a bug in the syscall implementation. My theory would be that, for some reason, it may not properly handle the first argument being NULL.

@CeSun
Author

CeSun commented Nov 23, 2024

I have the assembly code at the crash site here; I don't know if it helps.
Image

@CeSun
Author

CeSun commented Nov 23, 2024

I tried calling the code from the pull request and it crashes too.
Image

@CeSun
Author

CeSun commented Nov 25, 2024

@am11 Is this what dailybuild is? https://aka.ms/dotnet/9.0/daily/dotnet-runtime-win-x64.exe

Image

@CeSun
Author

CeSun commented Nov 25, 2024

@janvorli @am11 I think this issue needs to be reopened, but I don't have permission.

@huoyaoyuan
Member

Does the syscall crash unconditionally, or does it return an error?

@janvorli janvorli reopened this Nov 25, 2024
@dotnet-policy-service bot added the untriaged label (New issue has not been triaged by the area owner) on Nov 25, 2024
@aadog

aadog commented Nov 25, 2024

@janvorli Can we work so hard?

@CeSun
Author

CeSun commented Nov 25, 2024

Hi, still crashing

@am11
Member

am11 commented Nov 25, 2024

> Still crashing😭 Image

Is it crashing on syscall though? You added error logging (OH_LOG_ERROR) which is probably not advancing to the next line on the device? Can you try changing it to info or debug log?

@aadog

aadog commented Nov 25, 2024

This seems to have nothing to do with permissions. I tested it on Android in C++ on arm64 and it crashed in the same way.

@am11
Member

am11 commented Nov 25, 2024

I meant change OH_LOG_ERROR to OH_LOG_DEBUG and retry

@CeSun
Author

CeSun commented Nov 25, 2024

> I meant change OH_LOG_ERROR to OH_LOG_DEBUG and retry

Image

@CeSun
Author

CeSun commented Dec 25, 2024

Huawei's reply to me is that the phone does not support the "__NR_get_mempolicy" system call

If ordinary Linux does not support __NR_get_mempolicy, how should we do defense?

@am11
Member

am11 commented Dec 25, 2024

> If ordinary Linux does not support __NR_get_mempolicy, how should we do defense?

Then syscall(__NR_get_mempolicy, ...) should return < 0 per the manpage and not terminate the process.
Assuming you have tried the daily build https://github.com/dotnet/sdk/blob/main/documentation/package-table.md (main (10.0.x Runtime)) with <TargetFramework>net10.0</TargetFramework> and it continues to be a problem, I have no idea.

@CeSun
Author

CeSun commented Dec 26, 2024

Latest news: HarmonyOS uses seccomp to limit the system calls of apps. When the system call is not in the allowed list, the process will be killed.

https://gitee.com/openharmony/startup_init/blob/master/services/modules/seccomp/seccomp_policy/app.seccomp.policy

What is the best solution for this problem? Push the system to modify the whitelist or find a way to determine whether the call can be made in runtime?

@am11 @janvorli @MichalStrehovsky
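
One conceivable way to find out at run time whether a syscall is survivable is to issue it in a throwaway child process and check whether that child was terminated by SIGSYS. This is a sketch under several assumptions and not something the .NET runtime does: the name `probe_syscall_in_child` is hypothetical, the platform must allow fork (which is itself one of the syscalls a sandbox may restrict), and forking from an already multi-threaded process has its own pitfalls.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef enum {
    PROBE_RETURNED, /* the syscall returned (success or errno): safe to call */
    PROBE_KILLED,   /* the child died from SIGSYS: the policy kills on this call */
    PROBE_UNKNOWN   /* fork/wait failed, or the child died some other way */
} probe_result;

/* Issues syscall `nr` with zeroed arguments in a child process.
   Only use this for syscalls that are harmless with all-zero arguments. */
static probe_result probe_syscall_in_child(long nr)
{
    assert(nr >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return PROBE_UNKNOWN;

    if (pid == 0) {
        /* Child: the return value does not matter, only whether we survive. */
        syscall(nr, 0L, 0L, 0L, 0L, 0L, 0L);
        _exit(0);
    }

    int status = 0;
    pid_t waited;
    do {
        waited = waitpid(pid, &status, 0);
    } while (waited < 0 && errno == EINTR);

    if (waited != pid)
        return PROBE_UNKNOWN;
    if (WIFEXITED(status))
        return PROBE_RETURNED;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSYS)
        return PROBE_KILLED;
    return PROBE_UNKNOWN;
}
```

A caller would run `probe_syscall_in_child(SYS_get_mempolicy)` once during startup and skip NUMA initialization unless the result is `PROBE_RETURNED`. The cost of a fork per probed syscall and the restrictions on fork itself are the main reasons to prefer a platform-specific build over this approach.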

@am11
Member

am11 commented Dec 26, 2024

Looks like there is more to it. .NET runtime is using the following syscalls:

__NR_copy_file_range
__NR_fork
__NR_get_mempolicy
__NR_getunwind
__NR_mbind
__NR_membarrier
__NR_memfd_create
__NR_perf_event_open
__NR_riscv_flush_icache
__NR_rt_sigreturn
__NR_sigreturn

HarmonyOS does not support some of these and kills the process in response, which is an unnecessary and paranoid level of security. The POSIX standard specifies in Section 2.3 (Error Numbers) and in the descriptions of various system calls:

    ENOSYS:

        ENOSYS - Function not implemented.
        An attempt was made to use a function that is not available on this system.

so I am not sure if we want to go out of our way to support this non-compliant system. If there are enough people using dotnet on HarmonyOS, then perhaps it could be considered, no idea. 🤷‍♀

@CeSun
Author

CeSun commented Dec 26, 2024

According to Huawei developers, previous systems developed based on Android already use this solution, so I plan to test .NET Android programs (Mono) and NativeAot programs on previous systems.

@CeSun
Author

CeSun commented Dec 26, 2024

I tried it and this system call also crashes on the Android platform.
So apart from adapting it specifically for HarmonyOS, there seems to be no other way to go?

@am11
Member

am11 commented Dec 26, 2024

Android support is being worked on in #106748, and we treat Android and Linux as separate platforms in a number of places. If HarmonyOS is based on Android, you should use dotnet publish -p:PublishAot=true -r linux-bionic-arm64 etc. and subscribe to #106748 for the complete end-to-end support.

@CeSun
Author

CeSun commented Dec 26, 2024

HarmonyOS 5.0 (maybe called Next version in some places) is a brand new operating system, and libc is not bionic but musl.

So it seems that the situation of HarmonyOS is similar to that of Android, and it needs to be compatible as an independent platform?

@CeSun
Author

CeSun commented Dec 26, 2024

HarmonyOS system versions 1.0~4.0 are operating systems developed based on Android. A set of HarmonyOS system APIs and development frameworks are added on the basis of Android, and they have been iterated and updated for many years.
HarmonyOS system 5.0 (called Next version in some places) only implements the HarmonyOS system standard and does not include the Android part.

In addition, many people will mention an OpenHarmony project. The OpenHarmony project is an open source project that only implements the HarmonyOS system standard. It has been updated from 1.0 to 5.0, and is compatible with the HarmonyOS source code level, maintaining the same API Level.

For me, a brand new mobile operating system is exciting, and after trying it for about a month, I think many interactions are more advanced than Android. Since there is no historical baggage, many obvious lags in the Android system no longer occur. So I want to try to make some contributions to the HarmonyOS ecosystem, for example, I am porting the Avalonia framework to the HarmonyOS platform.

As for the number of developers using .NET on HarmonyOS, I can’t give an exact number. In China, more and more companies will port their mobile apps to HarmonyOS. I believe C# is a better choice than TypeScript, which is natively supported by HarmonyOS.

@am11
Member

am11 commented Dec 27, 2024

Porting .NET to a new OS is usually a non-trivial task and it is not suitable for this issue. #103627 is tracking the HarmonyOS work. This is the first of many errors you have encountered.

As for this issue, a seccomp profile can be configured in various ways, ranging from whitelisting syscalls to taking actions when disallowed/unimplemented syscalls are made. .NET supports standard actions like "system will issue ENOSYS when disallowed/unimplemented syscall is made" and we adjust the code, but not "system will kill the entire process" unless it is specifically built for that seccomp profile (with many #ifdef THAT_PLATFORM). In its most restrictive form, an environment running under the seccomp model can disallow all syscalls and impose a SIGKILL penalty when one is attempted. That makes it pretty much unusable for any real-world application.
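
For reference, the "issue ENOSYS" action can be reproduced with a small seccomp-BPF filter. The following is a sketch under assumptions and not any platform's real policy: `install_enosys_filter` and `errno_under_enosys_filter` are hypothetical names, the filter skips the architecture check that a production filter should perform on `seccomp_data.arch`, and it runs in a forked child so that the caller stays unfiltered.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Installs a filter that fails syscall `nr` with ENOSYS and allows the rest.
   Returns 0 on success, -1 if the filter could not be installed. */
static int install_enosys_filter(long nr)
{
    assert(nr >= 0);
    struct sock_filter filter[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it matches, fall through to the ERRNO return; otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}

/* Returns the errno the child saw for get_mempolicy under the filter,
   0 if the call succeeded, or -1 if the experiment could not be set up. */
static int errno_under_enosys_filter(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_enosys_filter(SYS_get_mempolicy) != 0)
            _exit(255);
        long rc = syscall(SYS_get_mempolicy, NULL, NULL, 0, 0, 0);
        _exit(rc < 0 ? errno : 0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```

Replacing `SECCOMP_RET_ERRNO | ...` with `SECCOMP_RET_KILL_PROCESS` in the same filter gives the other behavior: the child never gets a chance to inspect errno, so no errno check in the calling code can handle it.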

For general-purpose seccomp support, we would need to build a mechanism that collects the profile data ahead of time (e.g. during deployment), containing the list of syscalls, the actions the environment has imposed (returning ENOSYS vs. SIGKILL) and the context in which the action would be taken (the thread or the whole process). Then we can make calls deterministically. It sounds simple, but consider the indirect calls (the runtime calls a library, and that library makes a prohibited syscall or calls libc, which makes the syscall). This will require an intense amount of testing for each permutation of the versatile seccomp options. Such a system does not exist in .NET and there are no plans to build one: #92196 (comment).
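
To make the idea of ahead-of-time profile data concrete, here is a sketch of what such a table might look like. Nothing like this exists in .NET; every type and function name is hypothetical, and the example entries are invented for illustration rather than taken from the real HarmonyOS policy.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>

/* What the environment does when a syscall is attempted. */
typedef enum { ACT_ALLOW, ACT_ERRNO, ACT_KILL_THREAD, ACT_KILL_PROCESS } sc_action;

typedef struct {
    long nr;          /* syscall number */
    sc_action action; /* action the sandbox takes for it */
} sc_rule;

typedef struct {
    const sc_rule *rules;
    size_t count;
    sc_action default_action; /* applied to syscalls that are not listed */
} sc_profile;

/* Returns the action the profile prescribes for syscall `nr`. */
static sc_action profile_lookup(const sc_profile *p, long nr)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->count; i++) {
        if (p->rules[i].nr == nr)
            return p->rules[i].action;
    }
    return p->default_action;
}

/* A call is safe to attempt if it is allowed or merely fails with an errno. */
static int profile_safe_to_call(const sc_profile *p, long nr)
{
    sc_action a = profile_lookup(p, nr);
    return a == ACT_ALLOW || a == ACT_ERRNO;
}

/* Invented example data, NOT a real platform policy. */
static const sc_rule example_rules[] = {
    { SYS_getpid,        ACT_ALLOW },
    { SYS_membarrier,    ACT_ERRNO },
    { SYS_get_mempolicy, ACT_KILL_PROCESS },
};
static const sc_profile example_profile = {
    example_rules,
    sizeof(example_rules) / sizeof(example_rules[0]),
    ACT_KILL_PROCESS, /* whitelist style: anything unlisted is fatal */
};
```

With a table like this, the runtime could guard a probe with `profile_safe_to_call(&example_profile, SYS_get_mempolicy)` before issuing it. The table says nothing about syscalls made indirectly through libc or third-party libraries, which is the hard part described in the comment.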

HarmonyOS has a well-known seccomp profile, and we can implement that profile properly as part of the port work (#103627). The runtime, its libraries and third-party libraries in the .NET ecosystem use many syscalls directly and indirectly, so you can expect many surprises.

@CeSun
Author

CeSun commented Jan 21, 2025

The final solution to this problem was to compile the runtime's static library myself and delete the NUMA syscall code.
https://github.com/dotnet/runtime/blob/main/docs/workflow/building/coreclr/nativeaot.md
https://github.com/dotnet/runtime/blob/main/docs/workflow/using-docker.md

I hope there will be a better official solution. Until there is one, this issue will stay open to track the latest situation.

@driver1998

driver1998 commented Feb 11, 2025

The "better solution" would be to just add ohos as a new RID and apply this change to it specifically. Tizen also has its own RID, so I don't see why (Open)Harmony can't.

That will require quite a lot of work in the build system though, so it is not something worth doing in the early stages. A new RID will also require all unmanaged NuGet packages, like libSkiaSharp, to specifically add support for it. Or can we somehow describe ohos-arm64 as compatible with linux-musl-arm64? How is it handled in android-* and linux-bionic-*?

@CeSun
Author

CeSun commented Feb 11, 2025

To be precise, the current issue has been resolved. In the branch I modified, I can successfully publish native shared libraries available for (Open) HarmonyOS. The issue of .NET support for HarmonyOS should be discussed in a more appropriate post.

@CeSun
Author

CeSun commented Feb 11, 2025

This is a branch created for adapting to HarmonyOS (informal, for experimental use only): https://github.com/CeSun/dotnet-runtime-openharmony
This is the compiled binary. By importing the targets of this repository, you can publish native dynamic libraries for HarmonyOS: https://github.com/CeSun/OpenHarmonyRuntime.Net

@CeSun CeSun closed this as completed Feb 11, 2025
8 participants