Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with musl libc #13

Closed
r-ricci opened this issue Jan 6, 2022 · 11 comments
Closed

Issues with musl libc #13

r-ricci opened this issue Jan 6, 2022 · 11 comments

Comments

@r-ricci
Copy link

r-ricci commented Jan 6, 2022

I tried to build brlcad-7.32.4-1 on a musl-based linux system (Void Linux) and I encountered various issues. These don't occur on the glibc version of Void Linux.

  1. #ifdef HAVE_CPU_SET_T
    cpu_set_t set_of_cpus; /* bsd */
    #else
    cpuset_t set_of_cpus; /* linux */
    #endif

    It looks like neither of musl and glibc have cpuset_t, while both have cpu_set_t, but for some reason HAVE_CPU_SET_T is not defined with musl. I think the ifdef could be removed completely.

  2. misc/tools/env2c/env2c.cxx segfaults. The problem is not related to brlcad and could be fixed by using a link flag such as -Wl,-z,stack-size=2097152. However the LDFLAGS environment variable is not used when building this program, is this intentional?

  3. gethostbyname(3), used in src/util/ttcp.c, is no longer in POSIX. glibc provides it by default, while musl requires an appropriate feature macro, e.g. #define _POSIX_C_SOURCE 200112L.

  4. Various programs crash at runtime during build with several errors of the kind Error relocating: symbol not found. Any ideas about what could be the cause?

The logs below illustrate the last problem, after I applied patches to fix the first three.

cmake log

cmake -DBRLCAD_ENABLE_STRICT=OFF -DCMAKE_BUILD_TYPE=Debug -DBRLCAD_BUNDLED_LIBS=SYSTEM -DBRLCAD_LEMON=ON -DBRLCAD_ITCL=ON -DBRLCAD_ITK=ON -DBRLCAD_IWIDGETS=ON -DBRLCAD_TKHTML=ON -DBRLCAD_TKTABLE=ON -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -G Ninja ..

cmake.txt

build log

ninja -v -j 1

build.txt

@r-ricci r-ricci closed this as completed Feb 13, 2022
@brlcad
Copy link
Member

brlcad commented Feb 22, 2022

@4ricci did you have a workaround for the issues you reported or was something else wrong on the system?

@brlcad brlcad reopened this Feb 22, 2022
@r-ricci
Copy link
Author

r-ricci commented Feb 22, 2022

These patches fix the first three problems I described:

--- a/src/libbu/affinity.c
+++ b/src/libbu/affinity.c
@@ -58,11 +58,7 @@ parallel_set_affinity(int cpu)
 
     /* Linux and BSD pthread affinity */
 
-#ifdef HAVE_CPU_SET_T
-    cpu_set_t set_of_cpus; /* bsd */
-#else
-    cpuset_t set_of_cpus; /* linux */
-#endif
+    cpu_set_t set_of_cpus;
     int ret;
     int ncpus = bu_avail_cpus();
 
--- a/misc/tools/env2c/CMakeLists.txt
+++ b/misc/tools/env2c/CMakeLists.txt
@@ -55,6 +55,7 @@ endif(BRLCAD_ENABLE_ADDRESS_SANITIZER AND ${BRLCAD_OPTIMIZED_BUILD} MATCHES "OFF
 
 add_executable(env2c env2c.cxx)
 target_link_libraries(env2c Threads::Threads)
+target_link_options(env2c PRIVATE -Wl,-z,stack-size=2097152)
 
 if (O3_COMPILER_FLAG)
   # If we have the O3 flag, use it

(See this. A proper solution could be to use LDFLAGS for all executables, not just those ending up in the installation)

--- a/src/util/ttcp.c
+++ b/src/util/ttcp.c
@@ -20,6 +20,7 @@
  */
 
 #undef _POSIX_C_SOURCE
+#define _POSIX_C_SOURCE 200112L
 #undef _XOPEN_SOURCE
 #define BSD43
 /* #define BSD42 */

I don't know what causes the symbol not found errors and I don't have a workaround, but I doubt there's something wrong on the system.
I tried to build 9a63f60 with -DBRLCAD_BUNDLED_LIBS=Bundled (on Void), and I got similar results. Building on Alpine 3.15 doesn't work either.

@brlcad
Copy link
Member

brlcad commented Feb 22, 2022

Thank you for the response; and interesting set of changes. Alas, the changes cannot be committed as-is as they're not portable -- other Linux distros need cpuset_t for example and -Wl,-z,stack-size is not valid on various compilers (e.g., visual studio, ibm compiler, ..). The posix_source change might be okay, but it's doing the complete opposite of what is intended (i.e., there is non-posix code in here that conforms with bsd43 instead).

With the first issue, what's going on wrong is our cmake test for cpu_set_t changed, and the preprocessor symbol used here (HAVE_CPU_SET_T) wasn't updated to reflect it. Looks like it was broken in 9649d61. The cmake test is arguably wrong, so the proper fix is change the cmake logic so HAVE_CPU_SET_T correctly toggles to the right construct and musl should just work. I'm going to test a change here for that and, if you would, let me know if it works for you. For the other two issues, can you provide more information on what went wrong? An error log would help.

@r-ricci
Copy link
Author

r-ricci commented Feb 23, 2022

For the other two issues, can you provide more information on what went wrong? An error log would help.

2.

After further investigation, I discovered env2c segfaults when std::regex_match() is used on long lines, such as:

void(*custom_fake)(ARG0_TYPE arg0, ARG1_TYPE arg1, ARG2_TYPE arg2, ARG3_TYPE arg3, ARG4_TYPE arg4, ARG5_TYPE arg5, ARG6_TYPE arg6, ARG7_TYPE arg7, ARG8_TYPE arg8, ARG9_TYPE arg9, ARG10_TYPE arg10, ARG11_TYPE arg11, ARG12_TYPE arg12, ARG13_TYPE arg13, ARG14_TYPE arg14, ARG15_TYPE arg15, ARG16_TYPE arg16, ARG17_TYPE arg17, ARG18_TYPE arg18, ARG19_TYPE arg19); \

Applying this change:

--- a/misc/tools/env2c/env2c.cxx
+++ b/misc/tools/env2c/env2c.cxx
@@ -89,7 +89,10 @@ process_file(env_outputs &env_t)
        return;
     }
     while (std::getline(fs, sline)) {
-       if (!std::regex_match(sline, getenv_regex)) {
+       std::cerr << "Doing line: " << sline << "\n";
+       bool match = std::regex_match(sline, getenv_regex);
+       std::cerr << "Done  line: " << sline << "\n";
+       if (!match) {
            continue;
        }
        std::smatch envvar;
@@ -217,7 +220,7 @@ main(int argc, const char *argv[])
        /*************************************************************/
        /*  Process the files (mulithreaded mapping for performance) */
        /*************************************************************/
-       unsigned int hwc = std::thread::hardware_concurrency();
+       unsigned int hwc = 1;
        if (!hwc) {
            hwc = 10;
        }

produces this output:

[...]
Doing line:             DECLARE_CUSTOM_FAKE_SEQ_VARIABLES \                                                                                                              
Done  line:             DECLARE_CUSTOM_FAKE_SEQ_VARIABLES \                                                                                                              
Doing line:             void(*custom_fake)(ARG0_TYPE arg0, ARG1_TYPE arg1, ARG2_TYPE arg2, ARG3_TYPE arg3, ARG4_TYPE arg4, ARG5_TYPE arg5, ARG6_TYPE arg6, ARG7_TYPE arg7, ARG8_TYPE arg8, ARG9_TYPE arg9, ARG10_TYPE arg10, ARG11_TYPE arg11, ARG12_TYPE arg12, ARG13_TYPE arg13, ARG14_TYPE arg14, ARG15_TYPE arg15, ARG16_TYPE arg16, ARG17_TYPE arg17, ARG18_TYPE arg18, ARG19_TYPE arg19); \                                                                                                                          
Segmentation fault                                                                  
ninja: build stopped: subcommand failed.

If I understand correctly, this happens because musl has a small default thread stack size, it's not the program's fault.

-Wl,-z,stack-size is not valid on various compilers

Of course, my patch was just a quick workaround, since the build system ignores LDFLAGS when building env2c (and others). I think it would be nice to provide a way to pass compile and link options when building this kind of programs, too.
Maybe the program could be modified to skip long lines instead.

3.

On musl, without an appropriate macro, src/util/ttcp.c implicitly declares gethostbyname(3), which was removed in POSIX 2008. GCC considers this an error if using -pedantic-errors and -std= >= C99.

The posix_source change might be okay, but it's doing the complete opposite of what is intended (i.e., there is non-posix code in here that conforms with bsd43 instead).

Actually it is POSIX, but an older version (200112L instead of 200809L). Anyway, defining _BSD_SOURCE or _DEFAULT_SOURCE should do the same here. BSD43 doesn't really mean anything on glibc and musl, but on glibc (2.32) it happens to work (but it shouldn't) because gethostbyname is always visible even when __STRICT_ANSI__ is defined.
I can't provide a build log because the build fails before arriving there, I don't remember how did I discover this. Manually running the commands should be enough to show the problem.

$ git checkout 9a63f606548d12e18f7f4b91b0a609c1862312f7
$ gcc --version
gcc (GCC) 10.2.1 20201203

# musl 1.1.24
$ gcc -std=c90 -pedantic-errors src/util/ttcp.c
src/util/ttcp.c: In function 'main':
src/util/ttcp.c:542:16: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  542 |      if ((addr=(struct hostent *)gethostbyname(host)) == NULL)
      |
$ gcc -std=c11 -pedantic-errors src/util/ttcp.c
src/util/ttcp.c: In function 'main':
src/util/ttcp.c:542:34: error: implicit declaration of function 'gethostbyname'; did you mean 'gethostname'? [-Wimplicit-function-declaration]
  542 |      if ((addr=(struct hostent *)gethostbyname(host)) == NULL)
      |                                  ^~~~~~~~~~~~~
      |                                  gethostname
src/util/ttcp.c:542:16: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
  542 |      if ((addr=(struct hostent *)gethostbyname(host)) == NULL)
      |
$ gcc -std=c11 -pedantic-errors -D_DEFAULT_SOURCE src/util/ttcp.c
# ok

# glibc 2.32
$ gcc -std=c90 -pedantic-errors src/util/ttcp.c
# ok
$ gcc -std=c11 -pedantic-errors src/util/ttcp.c
# ok

4.

I already provided the build log.

build.txt

@r-ricci
Copy link
Author

r-ricci commented Mar 4, 2022

The issue 4. is caused by librt, which has the same name of a libc component.
For various reasons, musl puts the entire libc in a single .so file and, for compatibility, lib(m|crypt|pthread|rt).so are always resolved to the libc, bypassing RPATH.
The musl developers will not change this because the name "librt" is reserved by POSIX, so it should not be used at all by other libraries; and things will break anyway (perhaps even on non-musl systems) if a program is linked against both the ray-trace library and the system librt, which could easily happen with external libraries.

The correct solution would be to rename the conflicting libraries (or at least just librt), but since AFAIK this is not going to happen, I fear BRL-CAD will never work with musl.

Fixing the other problems might still make sense on non-musl systems, if not, then this issue could be closed.

@brlcad
Copy link
Member

brlcad commented Mar 5, 2022

It will take some time to sort through each of the issues you've reported here, but I wanted to address this latest since it's a rather well-known issue, not new.

The issue 4. is caused by librt, which has the same name of a libc component. For various reasons, musl puts the entire libc in a single .so file and, for compatibility, lib(m|crypt|pthread|rt).so are always resolved to the libc, bypassing RPATH.

Speaking of which, I believe that's a POSIX violation in itself. In my draft, it's mentioned around page 2434 of 1003.1 how libraries are to be resolved in order to avoid naming collisions like this.

The musl developers will not change this because the name "librt" is reserved by POSIX, so it should not be used at all by other libraries; and things will break anyway (perhaps even on non-musl systems) if a program is linked against both the ray-trace library and the system librt, which could easily happen with external libraries.

Do you have a citation for the name being reserved? My understanding is the name of the library is not dictated by POSIX.1b (and it's not in my draft). It was merely Sun's original convention in the early 90s. Many platforms rolled the aio realtime interface into libc and no longer even have a librt. Other platforms are POSIX-conformant and do not have a librt (but do sport the headers and functions defined by POSIX.1b.

That said, BRL-CAD's librt library predates POSIX by more than 10 years and is our flagship library with many custom integrations that would be detrimentally and expensively affected by a name change.

The correct solution would be to rename the conflicting libraries (or at least just librt), but since AFAIK this is not going to happen, I fear BRL-CAD will never work with musl.

As I started off with, this is not a new issue and I dare say was even a much harder problem in the mid-90's. For all platforms thus far including incredibly isoteric systems with hard-coded librt's, we've been able to figure out solutions without renaming and impacting our integrations. There is almost certainly a solution that does not involve renaming librt.

Note, the comment you cite in our build system is regarding direct installation into /usr such that installing BRL-CAD could very well overwrite a system library (not just because of librt, there are several potential collisions). The general recommendation of LSB and other standards is to not have user apps in /usr regardless, instead defaulting to /opt or a subdir like /usr/local/.

To figure out a solution for this, I think the other issues will need to be addressed first, and then review docs on the compiler+linker to see how we can avoid collision (e.g., via versioning, naming, prioritization, location, etc).

@r-ricci
Copy link
Author

r-ricci commented Mar 5, 2022

The musl developers will not change this because the name "librt" is reserved by POSIX, so it should not be used at all by other libraries; and things will break anyway (perhaps even on non-musl systems) if a program is linked against both the ray-trace library and the system librt, which could easily happen with external libraries.

Do you have a citation for the name being reserved? My understanding is the name of the library is not dictated by POSIX.1b (and it's not in my draft). It was merely Sun's original convention in the early 90s. Many platforms rolled the aio realtime interface into libc and no longer even have a librt. Other platforms are POSIX-conformant and do not have a librt (but do sport the headers and functions defined by POSIX.1b.

On a system with posix man-pages installed, you can look at the descriptions of -L and -l rt in c99(1P).

-L directory
Change the algorithm of searching for the libraries named in the -l objects to look in the directory named by the directory pathname before looking in the usual places. Directories named in -L options shall be searched in the order specified. If the -L option is used to specify a directory that is one of the usual places searched by default, the results are unspecified. Implementations shall support at least ten instances of this option in a single c99 command invocation. If a directory specified by a -L option contains files with names starting with any of the strings "libc.", "libl.", "libpthread.", "libm.", "librt.", "libtrace.", "libxnet.", or "liby.", the results are unspecified.

-l rt This option shall make available all interfaces referenced in <aio.h>, <mqueue.h>, <sched.h>, <semaphore.h>, and <spawn.h>, interfaces marked as optional in <sys/mman.h>, interfaces marked as ADV (Advisory Information) in <fcntl.h>, and interfaces beginning with the prefix clock_ and timer_ in <time.h>. An implementation may search this library in the absence of this option.

This is the behavior of the compiler, not the dynamic linker, but the argument that conforming libraries should not use those names still stands, in my opinion. But I understand renaming would be unfeasible for BRL-CAD.

@starseeker
Copy link
Member

A couple quick comments:

On the "-Wl,-z,stack-size" issue - if the CMake logic tests at configure time to see if those options are valid for the current compiler, it should be fine to go ahead and apply them when they are valid. I'd much rather do that than modifying to skip long lines, since the point of that utility is to survey the code looking for environment flags the user can set and I don't want to miss one because the line is long - very unexpected user behavior. I doubt it's an issue right now, but that would be exactly the kind of weird corner case problem that would have a developer pulling their hair out down the road wondering why it doesn't work... I don't know if we have a linker options test set up at the moment, but it certainly should be possible to do (even up to running a long line regex match test case, if necessary.)

On the librt issue, I should perhaps mention that there is an option short of an all-encompassing, across-the-board rename (if it comes to that and musl in the end truly does preclude a functional solution with the librt name.) CMake gives us an option to change the output file name of a target without otherwise having to alter the build logic, and that command could be hidden behind a user level CMake option to allow the output name to be changed if necessary. That will of course still not work for situations where external integrations expect BRL-CAD's librt to be present, but for use cases where such integrations are not a concern that is a technically feasible answer.

@r-ricci
Copy link
Author

r-ricci commented Mar 5, 2022

CMake gives us an option to change the output file name of a target without otherwise having to alter the build logic

Oh, nice. I was able to build successfully and it seems to work, I didn't expect it to be as easy as set_target_properties(librt PROPERTIES OUTPUT_NAME "libbrlcadrt"). Thanks for the hint. A different solution is required in order to make extensions work, but for my use case it's enough.

Regarding the env2c issue, just to make sure it can't be fixed in the source, in C++ there's no equivalent for pthread_attr_setstacksize(3), so the application cannot request a custom stack size through the API, right?

@brlcad
Copy link
Member

brlcad commented Jul 27, 2022

Hi @r-ricci I just wanted to update you that I've been working on a musl-based build recently, and have been working to get all the necessary changes incorporated. Thank you for all your efforts and debugging the issues. Having gone through it myself, now, I'm impressed how far along you got in discerning the issues and finding fixes.

As an aside, I found another simpler workaround to the issue of ld resolving to /lib/ld-musl-x86_64.so.1 for librt. If you export LD_PRELOAD=/path/to/build/lib/librt.so, that avoids the relocation symbol errors and compilation succeeds. Of course, that's just a temporary workaround and has to be updated and set post-install too, but it is an option that seems to work.

What'll probably be needed unless/until musl changes their ld is to create a cmake test that checks if ld resolves librt so anything other than what we specify and, if it does, modify the output name. That said, I did find the specific line in musl's ld code where it skips libraries named 'librt' (and 'libpthread' and 'libm' and 'libdl' and 'libutil' and 'libxnet'), so I'll see if they're possibly amenable to a patch that simply doesn't.

Thanks again for your efforts and feedback!

@brlcad
Copy link
Member

brlcad commented Aug 3, 2022

We're now building cleanly out of the box on musl, so this issue can be closed. Please feel free to open a new tickets if you observe any additional issues. Thank you again for all the excellent rundown details on each issue. They definitely were helpful.

@brlcad brlcad closed this as completed Aug 3, 2022
zhuodannychen added a commit to zhuodannychen/brlcad that referenced this issue May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants