Fix pthread_mutex_trylock deadlock in jemalloc #2727

Merged
merged 1 commit into from
Aug 20, 2024

@chenBright chenBright commented Aug 6, 2024

What problem does this PR solve?

Issue Number: resolve #2726

Problem Summary:

#2692 did not use _dl_sym because the unit tests could not run with it, failing with: symbol lookup error: ./libbrpc.so: undefined symbol: pthread_mutex_trylock. Related issues: #2266 #1086

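Root cause in brief: libpthread.so is loaded before libbrpc.dbg.so, so a _dl_sym lookup with RTLD_NEXT, which only searches the shared libraries loaded afterwards, cannot find the pthread_mutex_trylock symbol.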

There are two ways to fix this:

  1. Load libbrpc.dbg.so before libpthread.so.
  2. Statically link the bRPC static library into the final executable.

Detailed analysis:

The man page describes RTLD_NEXT as follows:

Find the next occurrence of the desired symbol in the search order after the current object.

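In this scenario, that roughly means the pthread_mutex_* symbols are looked up in the shared libraries that come after libbrpc.dbg.so in the load order. So libbrpc.dbg.so must be loaded before libpthread.so for the pthread_mutex_* family of symbols to be found.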

The brpc_channel_unittest binary was built from the master branch for debugging. For readability, the output below has been lightly processed (filtered and trimmed).

Checking the load order with LD_DEBUG=libs shows that libpthread.so is loaded before libbrpc.dbg.so.

LD_DEBUG=libs ./brpc_channel_unittest 2>&1 | grep 'needed by\|generating link map'

file=libgflags.so.2.2 [0];  needed by ./brpc_channel_unittest [0]
file=libgflags.so.2.2 [0];  generating link map
file=libprotobuf.so.17 [0];  needed by ./brpc_channel_unittest [0]
file=libprotobuf.so.17 [0];  generating link map
==================================================
file=libpthread.so.0 [0];  needed by ./brpc_channel_unittest [0]
file=libpthread.so.0 [0];  generating link map
==================================================
file=libssl.so.1.1 [0];  needed by ./brpc_channel_unittest [0]
file=libssl.so.1.1 [0];  generating link map
file=libcrypto.so.1.1 [0];  needed by ./brpc_channel_unittest [0]
file=libcrypto.so.1.1 [0];  generating link map
file=libdl.so.2 [0];  needed by ./brpc_channel_unittest [0]
file=libdl.so.2 [0];  generating link map
file=libz.so.1 [0];  needed by ./brpc_channel_unittest [0]
file=libz.so.1 [0];  generating link map
file=librt.so.1 [0];  needed by ./brpc_channel_unittest [0]
file=librt.so.1 [0];  generating link map
file=libleveldb.so.1d [0];  needed by ./brpc_channel_unittest [0]
file=libleveldb.so.1d [0];  generating link map
file=libtcmalloc_and_profiler.so.4 [0];  needed by ./brpc_channel_unittest [0]
file=libtcmalloc_and_profiler.so.4 [0];  generating link map
==================================================
file=libbrpc.dbg.so [0];  needed by ./brpc_channel_unittest [0]
file=libbrpc.dbg.so [0];  generating link map
==================================================
file=libstdc++.so.6 [0];  needed by ./brpc_channel_unittest [0]
file=libstdc++.so.6 [0];  generating link map
file=libm.so.6 [0];  needed by ./brpc_channel_unittest [0]
file=libm.so.6 [0];  generating link map
file=libgcc_s.so.1 [0];  needed by ./brpc_channel_unittest [0]
file=libgcc_s.so.1 [0];  generating link map
file=libc.so.6 [0];  needed by ./brpc_channel_unittest [0]
file=libc.so.6 [0];  generating link map
file=libsnappy.so.1 [0];  needed by /usr/lib/x86_64-linux-gnu/libleveldb.so.1d [0]
file=libsnappy.so.1 [0];  generating link map
file=libunwind.so.8 [0];  needed by /usr/lib/x86_64-linux-gnu/libtcmalloc_and_profiler.so.4 [0]
file=libunwind.so.8 [0];  generating link map
file=libprotoc.so.17 [0];  needed by ./libbrpc.dbg.so [0]
file=libprotoc.so.17 [0];  generating link map
file=liblzma.so.5 [0];  needed by /lib/x86_64-linux-gnu/libunwind.so.8 [0]
file=liblzma.so.5 [0];  generating link map

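It also turned out that dlsym hits the same error, but dlsym does not make the process exit; it reports the error through dlerror instead (the deadlock in #2726 is caused by the memory allocation on this path).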

calling init: ./libbrpc.dbg.so
./libbrpc.dbg.so: error: symbol lookup error: undefined symbol: pthread_mutex_trylock (fatal)

So at this point sys_pthread_mutex_trylock is NULL. The unit tests presumably did not crash only because neither the tests nor the libraries they depend on call pthread_mutex_trylock.
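To make the non-fatal dlsym path concrete, here is a minimal sketch of an RTLD_NEXT lookup that checks dlerror. The names NextLookup and lookup_next are hypothetical and are not part of bRPC; the sketch only illustrates that a failed lookup yields a NULL pointer plus an error string instead of terminating the process.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT is a GNU extension declared in <dlfcn.h>
#endif
#include <dlfcn.h>
#include <string>

// Hypothetical result type: the resolved address plus dlerror()'s message.
struct NextLookup {
    void* address;      // nullptr when the lookup did not resolve
    std::string error;  // empty when dlerror() reported nothing
};

// Look up `symbol` in the objects that follow the caller in the search order.
NextLookup lookup_next(const char* symbol) {
    dlerror();  // clear any stale error left by an earlier dl* call
    void* address = dlsym(RTLD_NEXT, symbol);
    const char* message = dlerror();  // non-null only if the lookup failed
    return NextLookup{address, message ? message : ""};
}
```

A caller would need to test the returned address before using it as a function pointer. On glibc versions older than 2.34 the program has to be linked with -ldl.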

On the other hand, there are no errors for pthread_mutex_lock and pthread_mutex_unlock; in other words, their symbols can be found. So where do these two symbols come from?

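Add one extra dlsym lookup of pthread_mutex_trylock, so that the binding information for pthread_mutex_lock and pthread_mutex_unlock is easy to spot between the two resulting errors.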

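static void init_sys_mutex_lock() {
    if (_dl_sym) {
        // pthread_mutex_trylock is looked up twice on purpose: each lookup fails and
        // logs an error, and the two errors bracket the bindings of interest.
        sys_pthread_mutex_trylock = (MutexOp)dlsym(RTLD_NEXT, "pthread_mutex_trylock");

        // The lookups whose bindings we want to see in the LD_DEBUG output.
        sys_pthread_mutex_lock = (MutexOp)_dl_sym(RTLD_NEXT, "pthread_mutex_lock", (void*)init_sys_mutex_lock);
        sys_pthread_mutex_unlock = (MutexOp)_dl_sym(RTLD_NEXT, "pthread_mutex_unlock", (void*)init_sys_mutex_lock);

        sys_pthread_mutex_trylock = (MutexOp)dlsym(RTLD_NEXT, "pthread_mutex_trylock");
    }
    ...
}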

With LD_DEBUG=bindings,libs, it turns out that the pthread_mutex_lock and pthread_mutex_unlock symbols come from libc.so.6 (see the output between the two pthread_mutex_trylock errors).

LD_DEBUG=bindings,libs ./brpc_channel_unittest 

......
7240:	calling init: ./libbrpc.dbg.so
7240:	binding file ./libbrpc.dbg.so [0] to /usr/lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_mutex_lock'
7240:	binding file ./libbrpc.dbg.so [0] to /usr/lib/x86_64-linux-gnu/libpthread.so.0 [0]: normal symbol `pthread_mutex_unlock'
==================================================
7240:	./libbrpc.dbg.so: error: symbol lookup error: undefined symbol: pthread_mutex_trylock (fatal)
7240:	binding file ./libbrpc.dbg.so [0] to /usr/lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_lock'
7240:	binding file ./libbrpc.dbg.so [0] to /usr/lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `pthread_mutex_unlock'
7240:	./libbrpc.dbg.so: error: symbol lookup error: undefined symbol: pthread_mutex_trylock (fatal)
==================================================

Searching libc.so.6 for pthread_mutex_* symbols confirms that it has no pthread_mutex_trylock symbol.

nm -D /usr/lib/x86_64-linux-gnu/libc.so.6 | grep pthread_mutex

0000000000094480 T pthread_mutex_destroy
00000000000944b0 T pthread_mutex_init
00000000000944e0 T pthread_mutex_lock
0000000000094510 T pthread_mutex_unlock
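The observations so far can be replayed with a toy model of the RTLD_NEXT search. This is only a sketch: LoadedObject, find_next and make_order are hypothetical names, the load order is reduced to the three libraries that matter here, and it assumes libpthread.so.0 exports all three mutex functions. The real dynamic linker also deals with scopes and symbol versions, which the model ignores.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model of one loaded shared object: its name and its exported symbols.
struct LoadedObject {
    std::string name;
    std::set<std::string> exports;
};

// Models dlsym(RTLD_NEXT, symbol) called from `caller`: scan only the objects
// that come after `caller` in the search order and return the name of the
// first one exporting `symbol`, or an empty string if none does.
std::string find_next(const std::vector<LoadedObject>& order,
                      const std::string& caller,
                      const std::string& symbol) {
    std::size_t i = 0;
    while (i < order.size() && order[i].name != caller) ++i;
    for (++i; i < order.size(); ++i) {
        if (order[i].exports.count(symbol)) return order[i].name;
    }
    return "";
}

// Reduced search order. libc.so.6 exports only lock/unlock, matching the nm
// output; libpthread.so.0 exporting all three is an assumption of this sketch.
std::vector<LoadedObject> make_order(bool brpc_first) {
    LoadedObject pthread_lib{"libpthread.so.0",
                             {"pthread_mutex_lock", "pthread_mutex_unlock",
                              "pthread_mutex_trylock"}};
    // Exports omitted: RTLD_NEXT never searches the calling object itself.
    LoadedObject brpc{"libbrpc.dbg.so", {}};
    LoadedObject libc{"libc.so.6",
                      {"pthread_mutex_lock", "pthread_mutex_unlock"}};
    if (brpc_first) return {brpc, pthread_lib, libc};
    return {pthread_lib, brpc, libc};
}
```

With make_order(false), a lookup from libbrpc.dbg.so resolves pthread_mutex_lock to libc.so.6 and finds nothing for pthread_mutex_trylock, as in the LD_DEBUG output. With make_order(true), pthread_mutex_trylock resolves to libpthread.so.0.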

The pthread_mutex_* functions in libc.so are presumably stub functions; see [1] [2] [3].

A stub function is a function which cannot be implemented on a particular machine or operating system. Stub functions always return an error, and set errno to ENOSYS (Function not implemented).

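In this scenario, even though pthread_mutex_lock and pthread_mutex_unlock resolve to the wrong functions and pthread_mutex_trylock is NULL, the process is not affected. Because libpthread.so is loaded first, the pthread_mutex_* symbols the process uses all come from libpthread.so; in other words, the hooks in libbrpc.dbg.so never take effect.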

What is changed and the side effects?

Changed:

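  1. Use _dl_sym to load pthread_mutex_trylock, which avoids the deadlock in the malloc library. Using it requires one of the following:
    1. libbrpc.dbg.so is loaded before libpthread.so. (The unit tests use this approach.)
    2. The final executable statically links the bRPC static library.
  2. For scenarios like "Urgent help: symbol lookup error: /usr/local/lib/libbrpc.so: undefined symbol: pthread_mutex_lock" #2266, where the link order cannot be changed or the static library cannot be used, the NO_PTHREAD_MUTEX_HOOK macro can be defined to disable the pthread_mutex_* hooks. With the hooks disabled, the only loss is that the contention profiler cannot sample pthread_mutex contention, which is acceptable.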
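As a rough illustration of what a compile-time switch like NO_PTHREAD_MUTEX_HOOK does, the sketch below compiles the profiling step out when the macro is defined. Only the macro name comes from this PR; profiled_lock, g_profiled_locks and g_demo_mutex are hypothetical, and this is not how bRPC implements its hooks.

```cpp
#include <pthread.h>

// Hypothetical counter standing in for the contention profiler's sampling.
static int g_profiled_locks = 0;

// Hypothetical mutex used only to exercise the wrapper.
static pthread_mutex_t g_demo_mutex = PTHREAD_MUTEX_INITIALIZER;

// Locks `m`; records the call only when the hook is compiled in.
int profiled_lock(pthread_mutex_t* m) {
#ifndef NO_PTHREAD_MUTEX_HOOK
    ++g_profiled_locks;  // profiling step, removed by -DNO_PTHREAD_MUTEX_HOOK
#endif
    return pthread_mutex_lock(m);
}
```

Built without the macro, every call is counted. Built with -DNO_PTHREAD_MUTEX_HOOK, locking still works but nothing is recorded, which mirrors the trade-off described above.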

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

  • Please make sure your changes are compilable.
  • When providing us with a new feature, it is best to add related tests.
  • Please follow Contributor Covenant Code of Conduct.

@chenBright
Contributor Author

@wwbmmm please take a look when you have time.

@wwbmmm wwbmmm merged commit 25130b4 into apache:master Aug 20, 2024
20 checks passed
@chenBright chenBright deleted the fix_jemalloc_trylock_deadlock branch August 20, 2024 06:08
yiguolei pushed a commit to apache/doris that referenced this pull request Oct 15, 2024
…ock (#41891)

BRPC contention profiler hooks pthread mutex, which may deadlock when
used with Jemalloc.
This PR remove pthread mutex hook and disable BRPC contention profiler.


![image](https://github.com/user-attachments/assets/62ccc04c-718a-43db-8354-b1bbc0565958)

similar issue: apache/brpc#2726
reference fix: apache/brpc#2727
xinyiZzz added a commit to xinyiZzz/incubator-doris that referenced this pull request Oct 16, 2024
…ock (apache#41891)

Development

Successfully merging this pull request may close these issues.

Suspected deadlock with jemalloc after overriding pthread_mutex_trylock
2 participants