This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

1.7.0.post2: consistent segfault in ~LibraryInitializer() with loaded extensions on OSX Mojave #20411

Open
david-seiler opened this issue Jul 1, 2021 · 4 comments · Fixed by #20523

@david-seiler
Contributor

david-seiler commented Jul 1, 2021

Description

I've recently started using the C++ custom operator framework. It's great and I like it a lot, but for some reason mxnet always segfaults on program exit when I've loaded a custom operator. This happens with all custom operators, even the tutorial operator gemm_lib.

This is on OSX Mojave, g++ version "Apple LLVM version 10.0.1 (clang-1001.0.46.4)". mxnet comes from pip, version 1.7.0.post2.

Error Message

Since it's a segfault there's not a backtrace as such, but here's the tail of a representative lldb session:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000118105283 libmxnet.dylib`mxnet::LibraryInitializer::~LibraryInitializer() + 67
libmxnet.dylib`mxnet::LibraryInitializer::~LibraryInitializer:
->  0x118105283 <+67>: movq   (%rcx), %rcx
    0x118105286 <+70>: testq  %rcx, %rcx
    0x118105289 <+73>: jne    0x118105280               ; <+64>
    0x11810528b <+75>: jmp    0x1181052b0               ; <+112>
Target 0: (python) stopped.
(lldb) thread backtrace
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000118105283 libmxnet.dylib`mxnet::LibraryInitializer::~LibraryInitializer() + 67
    frame #1: 0x00007fff786ef3cf libsystem_c.dylib`__cxa_finalize_ranges + 319
    frame #2: 0x00007fff786ef6b3 libsystem_c.dylib`exit + 55
    frame #3: 0x00007fff786493dc libdyld.dylib`start + 8
    frame #4: 0x00007fff786493d5 libdyld.dylib`start + 1

To Reproduce

make gemm_lib && python test_gemm.py from the custom op tutorial shows the error for me.
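
Because the crash happens at interpreter exit, it is easy to miss in scripted runs. The following is a minimal sketch, assuming a POSIX system, that runs a command as a child process and reports whether it exited normally or was killed by a signal. The helper name `run_and_report` is hypothetical and not part of MXNet or the tutorial.

```python
import signal
import subprocess
import sys


def run_and_report(argv):
    """Run argv as a child process and describe how it exited.

    Hypothetical helper for illustration; not part of MXNet.
    """
    proc = subprocess.run(argv)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was
        # terminated by the signal with that (negated) number.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exited with status %d" % proc.returncode
```

For example, `run_and_report([sys.executable, "test_gemm.py"])` would report a signal name rather than an exit status if the interpreter crashes during shutdown.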

Things That Didn't Work

Building my own mxnet 1.7.0 from source didn't help. The content of the custom op doesn't seem to matter either.

Environment

Environment Information

----------Python Info----------
Version : 3.6.5
Compiler : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
Build : ('default', 'Apr 26 2018 08:42:37')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 21.1.3
Directory : /Users/davseile/miniconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/Users/davseile/miniconda3/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Version : 1.7.0
Directory : /Users/davseile/miniconda3/lib/python3.6/site-packages/mxnet
Commit Hash : 64f737c
Library : ['/Users/davseile/miniconda3/lib/python3.6/site-packages/mxnet/libmxnet.dylib']
Build features:
✖ CUDA
✖ CUDNN
✖ NCCL
✖ CUDA_RTC
✖ TENSORRT
✔ CPU_SSE
✔ CPU_SSE2
✔ CPU_SSE3
✔ CPU_SSE4_1
✖ CPU_SSE4_2
✖ CPU_SSE4A
✖ CPU_AVX
✖ CPU_AVX2
✖ OPENMP
✖ SSE
✖ F16C
✖ JEMALLOC
✖ BLAS_OPEN
✖ BLAS_ATLAS
✖ BLAS_MKL
✔ BLAS_APPLE
✔ LAPACK
✔ MKLDNN
✔ OPENCV
✖ CAFFE
✖ PROFILER
✖ DIST_KVSTORE
✖ CXX14
✖ INT64_TENSOR_SIZE
✔ SIGNAL_HANDLER
✖ DEBUG
✖ TVM_OP
----------System Info----------
Platform : Darwin-18.7.0-x86_64-i386-64bit
system : Darwin
node : 00e04c166570
release : 18.7.0
version : Darwin Kernel Version 18.7.0: Mon May 3 20:41:19 PDT 2021; root:xnu-4903.278.68~1/RELEASE_X86_64
----------Hardware Info----------
machine : x86_64
processor : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0430 sec, LOAD: 0.7044 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0961 sec, LOAD: 0.5755 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>, DNS finished in 0.08476710319519043 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0732 sec, LOAD: 0.3876 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0459 sec, LOAD: 1.0816 sec.
Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.06121397018432617 sec.
----------Environment----------

@github-actions

github-actions bot commented Jul 1, 2021

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue.
Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly.
If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

@david-seiler
Contributor Author

We've also seen this crash on 1.8.0, but only on OSX; Linux seems to be fine.

@samskalicky
Contributor

Hi @david-seiler, thanks for filing the issue. We have not done much testing of C++ custom ops on Mac (if any); mostly on Linux and very slightly on Windows. I'm not aware of how OS X differs from Linux in its handling of dynamic libraries, so any help you can provide in that area would be great.

In 1.7.x we have some code in the LibraryInitializer::~LibraryInitializer() destructor that tries to close the custom libraries that have been opened:

https://github.com/apache/incubator-mxnet/blob/a22abce0ce576ef4630aaea00cc9ad4d844f99f9/src/initialize.cc#L100-L102

But in 2.0 we removed this, so now the destructor doesn't do anything:

https://github.com/apache/incubator-mxnet/blob/9ed058202ac1f299a1b11caf74c2a719650bf89f/src/initialize.cc#L100

And then we expect the user to close the library (by calling dlclose or letting it get cleaned up by the kernel when the process exits).

Can you try commenting out the close_open_libs(); call in the destructor and see if that works for you?
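
The two teardown policies described above can be modeled in a few lines. This is a toy Python model, not MXNet's C++ code: `LibraryRegistry`, `load`, and `teardown` are hypothetical names, and the "handles" are plain objects standing in for what dlopen would return.

```python
class LibraryRegistry:
    """Toy model of a process-wide registry of loaded extension libraries.

    Hypothetical names throughout; this is not MXNet's implementation.
    """

    def __init__(self, close_on_teardown):
        self.close_on_teardown = close_on_teardown
        self.open_handles = {}   # path -> opaque handle
        self.closed_paths = []   # paths this registry closed itself

    def load(self, path):
        # Stand-in for dlopen(): remember an opaque handle for the path.
        return self.open_handles.setdefault(path, object())

    def teardown(self):
        # Stand-in for the destructor that runs at process exit.
        if not self.close_on_teardown:
            # 2.0-style policy: leave open handles to the caller or the OS.
            return
        for path in list(self.open_handles):
            # 1.7.x-style policy: explicitly close every opened library.
            del self.open_handles[path]
            self.closed_paths.append(path)
```

With `close_on_teardown=False`, `teardown()` touches nothing, which mirrors the suggested experiment of commenting out the close call.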

@samskalicky samskalicky self-assigned this Aug 12, 2021
@samskalicky
Contributor

samskalicky commented Aug 12, 2021

Update: I built from source (1.7.x) with the close_open_libs(); call commented out, and the example works for me on my Mac.
