Skip to content
This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

Segfault in macOS message_filters tests #189

Closed
clalancette opened this issue May 8, 2019 · 6 comments
Closed

Segfault in macOS message_filters tests #189

clalancette opened this issue May 8, 2019 · 6 comments
Assignees

Comments

@clalancette
Copy link

Some tests are failing with a segfault on macOS, e.g. https://ci.ros2.org/view/nightly/job/nightly_osx_debug/1243/testReport/junit/(root)/projectroot/message_filters_test_simple/

I did some debugging on this on mini2. The crash always happens when the program is quitting. Further, the crash always happens while trying to cleanup log4cxx via rcl_logging_log4cxx.cpp::rcl_logging_external_shutdown.

From poking around in the debugger and at log4cxx directly, the eventual result is that during cleanup log4cxx fails to acquire a mutex with an EINVAL. The only reason that this can really happen is if the mutex is uninitialized, or was destroyed. By adding some additional debugging, I could see that there are some object lifetime issues in log4cxx which cause the mutex in question to be destroyed before this final cleanup. Going deeper, there is a "pool" of memory that the mutex is allocated from at the apr level (Apache Portable Runtime, which log4cxx uses). There should be several pools and subpools, but for some reason when one of the subpools gets destroyed, it actually goes and frees the memory for all of the pools, which renders the mutex invalid. This feels like a bug in log4cxx or apr (1.6.5 on macOS), but it may be possible to work around it by some judicious pointer wrangling. More debugging needed.

@clalancette clalancette self-assigned this May 8, 2019
@dirk-thomas
Copy link
Member

@nburek can you please investigate this problem and work on a fix. Thanks.

@nburek
Copy link

nburek commented May 8, 2019

Yes, I'll start investigating.

@clalancette
Copy link
Author

Update: I poked quite a bit at the source for log4cxx and its underlying library, APR. The basic problem has to do with the C++ static object destruction order across translation units. Essentially the "APRInitializer" destructor here can get called way before the rest of the shutdown of the system. If that happens, then anything allocated out of the APR "pools" (which is basically everything in log4cxx) is now invalid, and future accesses to them may result in a segmentation fault. The problem is reported on the upstream bug tracker here and here.

There does seem to be some changes in newer log4cxx that tries to avoid this. However, there are no current releases with that fix available in them (so it is not available in homebrew or in the Linux packages).

With all of that said, the issues above have been open for a decade, so it seems like we should probably start considering options for a different logging backend.

@nburek
Copy link

nburek commented May 13, 2019

Great information, @clalancette . Given that this is breaking a number of things it seems like the safest action is to roll back making rcl_logging_log4cxx the default 3rd party logger for the dashing release.

Reading all the information in the bug tracker, it looks like the recent changes to avoid the issue are based on the patch that Ubuntu made when they switched to log4cxx as their logger. They note that this creates a memory leak on shutdown, but that they were okay with that because running on a modern operating system allows the memory to be reclaimed when the process dies. I don't think that this is a fully acceptable solution for the default logging backend in ROS2.

We could explore creating a log4cxx vendor package to pickup a lot of the bug fixes from the past couple of years since the last release and also implement the dynamic loading feature for logging backends. That would allow ROS 2 to ship with rcl_logging_noop and users who are okay with the pitfalls could use rcl_logging_log4cxx. However I think this work will take us past the Dashing release.

@clalancette
Copy link
Author

FWIW, I played around with using another logging library today, spdlog. I have a backend implementation in the spdlog branch on rcutils, rcl, and rcl_logging that seems to work on both Linux and Windows (I haven't tried macOS yet, and I haven't put it through its full paces yet). The only feature that this implementation is missing is the ability to configure it externally using a config file. Since spdlog doesn't have a file format for configuration, we'd have to come up with something and then parse it ourselves.

While that spdlog implementation may be an option going forward, I have to agree with @nburek that the lowest-risk option is to go back to noop and revisit this after Dashing.

@clalancette
Copy link
Author

This is now "fixed" by ros2/rcl#436 (message filter tests stopped segfaulting in https://ci.ros2.org/view/nightly/job/nightly_osx_debug/1251), so I'll close this one out.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants