LightGBM is incompatible with libomp 12 and 13 on macOS #4229

Closed
SchantD opened this issue Apr 26, 2021 · 28 comments

Comments

@SchantD

SchantD commented Apr 26, 2021

Description

LightGBM cannot be used to fit multiple models in parallel using threads with the latest libomp.
On 2014 MacBook Pro:

OMP: Error #13: Assertion failure at kmp_runtime.cpp(3689).
OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[1]    17358 abort      python myfile2.py

On 2019 MacBook Pro:

OMP: Error #131: Thread identifier invalid.

Setting nthreads=1 doesn't solve the problem.

Reproducible example

from lightgbm import LGBMClassifier
import numpy as np
from concurrent.futures import ThreadPoolExecutor

x = np.random.random((200, 4))
y = x.sum(axis=1) >= 2


def myfunc(a=7):
    test = LGBMClassifier().fit(x, y)
    print(test.predict(x))


with ThreadPoolExecutor(20) as tpe:
    print(list(tpe.map(myfunc, range(20))))

Environment info

LightGBM version or commit hash: 3.1.1 (with python 3.7.3) and 3.2.1 (with python 3.9.4)

brew install libomp

libomp: stable 12.0.0 (bottled)
LLVM's OpenMP runtime library
https://openmp.llvm.org/
/usr/local/Cellar/libomp/12.0.0 (9 files, 1.5MB) *
Poured from bottle on 2021-04-26 at 11:06:26
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/libomp.rb

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

The code does work with libomp version 11. Downgraded using

wget https://raw.githubusercontent.com/Homebrew/homebrew-core/fb8323f2b170bd4ae97e1bac9bf3e2983af3fdb0/Formula/libomp.rb
brew unlink libomp
brew install libomp.rb
@StrikerRUS
Collaborator

All our tests are passing with libomp 12: https://github.com/microsoft/LightGBM/runs/2437586276

==> Pouring libomp--12.0.0.catalina.bottle.tar.gz

...

-- The C compiler identification is AppleClang 12.0.0.12000032
-- The CXX compiler identification is AppleClang 12.0.0.12000032
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode_12.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode_12.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -Xclang -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -Xclang -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Performing Test MM_PREFETCH
-- Performing Test MM_PREFETCH - Success
-- Using _mm_prefetch
-- Performing Test MM_MALLOC
-- Performing Test MM_MALLOC - Success
-- Using _mm_malloc
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/runner/work/LightGBM/LightGBM/build
[  5%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt_model_text.cpp.o
[  5%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/boosting.cpp.o
[  8%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt.cpp.o
[ 11%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt_prediction.cpp.o
[ 14%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/prediction_early_stop.cpp.o
[ 17%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/bin.cpp.o
[ 20%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/config.cpp.o
[ 23%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/config_auto.cpp.o
[ 26%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/dataset.cpp.o
[ 29%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/dataset_loader.cpp.o
[ 32%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/file_io.cpp.o
[ 35%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/json11.cpp.o
[ 38%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/metadata.cpp.o
[ 41%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/parser.cpp.o
[ 44%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/train_share_states.cpp.o
[ 47%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/tree.cpp.o
[ 50%] Building CXX object CMakeFiles/_lightgbm.dir/src/metric/dcg_calculator.cpp.o
[ 52%] Building CXX object CMakeFiles/_lightgbm.dir/src/metric/metric.cpp.o
[ 55%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/ifaddrs_patch.cpp.o
[ 58%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/linker_topo.cpp.o
[ 61%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/linkers_mpi.cpp.o
[ 64%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/linkers_socket.cpp.o
[ 67%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/network.cpp.o
[ 70%] Building CXX object CMakeFiles/_lightgbm.dir/src/objective/objective_function.cpp.o
[ 73%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/cuda_tree_learner.cpp.o
[ 76%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/data_parallel_tree_learner.cpp.o
[ 79%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/feature_parallel_tree_learner.cpp.o
[ 82%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/gpu_tree_learner.cpp.o
[ 85%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/linear_tree_learner.cpp.o
[ 88%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/serial_tree_learner.cpp.o
[ 91%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/tree_learner.cpp.o
[ 94%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/voting_parallel_tree_learner.cpp.o
[ 97%] Building CXX object CMakeFiles/_lightgbm.dir/src/c_api.cpp.o
[100%] Linking CXX shared library ../lib_lightgbm.so
[100%] Built target _lightgbm

...

====== 234 passed, 4 skipped, 2 xfailed, 79 warnings in 120.17s (0:02:00) ======

I'm not sure LightGBM was ever able to "fit multiple models in parallel using threads". Refer to https://lightgbm.readthedocs.io/en/latest/FAQ.html#lightgbm-hangs-when-multithreading-openmp-and-using-forking-in-linux-at-the-same-time.

I think you can migrate to the bug-free Intel toolchain or compile the threadless version: https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-threadless-version-not-recommended.

@Zahlii

Zahlii commented Apr 26, 2021

"I'm not sure LightGBM was ever able to fit multiple models in parallel using threads"

Well, we have been using a similar approach to the one above (ThreadPool + fit) successfully in production for quite some time, and we are now facing this problem too. As already noted, downgrading to the older libomp version quickly solves the issue without any side effects. Maybe this usage was never covered by tests and only now happens to fail?

I would also like to point out that this issue could also happen with different scikit-learn wrappers using the joblib/delayed approach. The default here is to use multiprocessing (which works), but threading (in order to save memory etc) does not.

from joblib import parallel_backend
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_validate

x = np.random.random((200, 5))
y = x.sum(axis=1) > 2.5


def run(*args, **kwargs):
    estimator = LGBMClassifier()
    estimator.fit(x, y)
    return estimator.predict(x)


with parallel_backend('threading', n_jobs=5):
    print(cross_validate(LGBMClassifier(), x, y, n_jobs=5, cv=5))

Refer to https://lightgbm.readthedocs.io/en/latest/FAQ.html#lightgbm-hangs-when-multithreading-openmp-and-using-forking-in-linux-at-the-same-time.

Interestingly, it seems that this (somewhat) fixes the problem. Setting n_jobs=1 works for me, but also higher values (up to around n_jobs=5) seem to work. Maybe this is simply a question of spawning too many threads in total?

Some results:

For whatever reason it seems the threshold is between 40 (working) and 42 (failing).
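
If the failure really does depend on the total number of threads, as the numbers above hint, one way to experiment is to cap the product of pool workers and per-model threads. The sketch below is only an illustration: the budget of 40 comes from the observation above and is not a documented libomp limit, and `threads_per_model` is a hypothetical helper, not part of LightGBM.

```python
def threads_per_model(n_workers: int, thread_budget: int = 40) -> int:
    """Return a per-model thread count so that n_workers * result <= thread_budget.

    The result is never below 1, so a very large pool can still exceed the budget.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return max(1, thread_budget // n_workers)


# Example: 5 worker threads sharing an assumed budget of 40 threads in total.
n_jobs = threads_per_model(n_workers=5)
print(n_jobs)
```

The returned value could then be passed as `n_jobs` to each `LGBMClassifier` created inside the thread pool. Whether this avoids the crash on a given machine has to be checked empirically.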

@trivialfis

On XGBoost we are also facing issues with updated libomp. It has internal error: https://github.com/dmlc/xgboost/pull/6912/checks?check_run_id=2459890229

@StrikerRUS
Collaborator

Upstream bug report: https://bugs.llvm.org/show_bug.cgi?id=50579.

@seahrh

seahrh commented Jul 2, 2021

Moving the import statement "import lightgbm as lgb" to line 1 of my file got rid of the error, as suggested in dmlc/xgboost#7039 (comment).

libomp version /usr/local/Cellar/libomp/12.0.0

Error dump when loading booster model. Putting it out here in case it is useful:

Process:               Python [6481]
Path:                  /Library/Frameworks/Python.framework/Versions/3.7/Resources/Python.app/Contents/MacOS/Python
Identifier:            Python
Version:               3.7.3 (3.7.3)
Code Type:             X86-64 (Native)
Parent Process:        zsh [511]
Responsible:           iTerm2 [403]
User ID:               501

Date/Time:             2021-07-02 10:35:36.911 +0800
OS Version:            macOS 11.4 (20F71)
Report Version:        12
Bridge OS Version:     5.4 (18P4663)

Time Awake Since Boot: 14000 seconds
Time Since Wake:       1500 seconds

System Integrity Protection: enabled

Crashed Thread:        41

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000048
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Segmentation fault: 11
Termination Reason:    Namespace SIGNAL, Code 0xb
Terminating Process:   exc handler [6481]

VM Regions Near 0x48:
--> 
    __TEXT                      10388e000-10388f000    [    4K] r-x/rwx SM=COW  /Library/Frameworks/Python.framework/Versions/3.7/Resources/Python.app/Contents/MacOS/Python

Thread 0:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	0x00007fff204dc206 _kernelrpc_mach_vm_protect_trap + 10
1   libsystem_kernel.dylib        	0x00007fff204df1da mach_vm_protect + 33
2   libsystem_pthread.dylib       	0x00007fff20512589 _pthread_create + 533
3   libomp.dylib                  	0x0000000183c99568 __kmp_create_worker + 264
4   libomp.dylib                  	0x0000000183c6f2a4 __kmp_allocate_thread + 954
5   libomp.dylib                  	0x0000000183c6ac21 __kmp_allocate_team + 1311
6   libomp.dylib                  	0x0000000183c6c51c __kmp_fork_call + 5365
7   libomp.dylib                  	0x0000000183c61295 __kmpc_fork_call + 293
8   lib_lightgbm.so               	0x00000001838d5036 LightGBM::ParallelPartitionRunner<int, false>::ParallelPartitionRunner(int, int) + 118
9   lib_lightgbm.so               	0x00000001838c9379 LightGBM::GBDT::GBDT() + 777
10  lib_lightgbm.so               	0x00000001838be0f1 LightGBM::Boosting::CreateBoosting(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, char const*) + 1745
11  lib_lightgbm.so               	0x0000000183abf490 LightGBM::Booster::Booster(char const*) + 400

@mldeveloper01

Facing exactly the reported issue. Subscribed for more updates.

@mkos

mkos commented Jul 9, 2021

I have the same issue and did some testing: libomp 12.0 works on Catalina but segfaults on Big Sur. Downgrading to 11.1 worked on Big Sur (tested on an Intel MBP and on an M1 MBP via Rosetta 2).

@StrikerRUS
Collaborator

StrikerRUS commented Jul 9, 2021

Unfortunately, the LLVM developers haven't fixed this bug (#4229 (comment)) in the 12.0.1 release.

@StrikerRUS
Collaborator

One workaround suggested in the upstream bug report, which avoids downgrading libomp, is to set the following environment variables:

LIBOMP_USE_HIDDEN_HELPER_TASK=0
LIBOMP_NUM_HIDDEN_HELPER_THREADS=0

https://bugs.llvm.org/show_bug.cgi?id=50579#c1
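
For Python users, a minimal sketch of applying this workaround from inside a script might look like the following. The variable names and values are the ones quoted above; the helper name `apply_libomp_workaround` is hypothetical, and the sketch assumes libomp reads these variables when it initializes, so they are set before `lightgbm` is imported.

```python
import os

# Names and values are taken from the upstream bug report quoted above.
LIBOMP_WORKAROUND = {
    "LIBOMP_USE_HIDDEN_HELPER_TASK": "0",
    "LIBOMP_NUM_HIDDEN_HELPER_THREADS": "0",
}


def apply_libomp_workaround(env=os.environ):
    """Set the workaround variables in `env` and return the names that were set."""
    for name, value in LIBOMP_WORKAROUND.items():
        env[name] = value
    return sorted(LIBOMP_WORKAROUND)


apply_libomp_workaround()
# Assumption: import lightgbm only after this point, e.g. `import lightgbm as lgb`,
# so that libomp has not been initialized before the variables are set.
```

Exporting the same variables in the shell before starting Python should have the same effect and does not depend on import order.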

@StrikerRUS
Collaborator

StrikerRUS commented Oct 4, 2021

A new major LLVM version, 13, was released 4 days ago: https://github.com/llvm/llvm-project/releases/tag/llvmorg-13.0.0. The latest Homebrew libomp formula now points to that version: https://github.com/Homebrew/homebrew-core/blob/4343aee9c28d28b9ed3208b5933df54c29b916fb/Formula/libomp.rb#L4.

Unfortunately, this bug (#4229 (comment)) wasn't fixed in the stable 13 release.
I'm going to reflect this fact in the issue's title.

StrikerRUS changed the title from "LightGBM incompatible with libomp 12.0 (MacOS)" to "LightGBM is incompatible with libomp 12 and 13 on macOS" on Oct 4, 2021
devernay added a commit to NatronGitHub/Natron that referenced this issue Oct 26, 2021
@StrikerRUS
Collaborator

LLVM has moved from Bugzilla to GitHub Issues as its main issue tracker.

New replies to the original bug report contain the following:

This bug is being removed from the LLVM 13.0.1 release milestone. If you have a fix or think this bug is important enough to block the release, please explain why in a comment and add the bug back to the LLVM 13.0.1 release milestone.

I cannot reproduce the failure on macOS with trunk as well. Besides, helper thread should be disabled on macOS. I don't know if the latest HomeBrew version has already covered newer code base. Please let me know if the problem still exists.

Everyone who is subscribed to this issue and has easy access to macOS, please check the latest available libomp version from Homebrew (stable ✅ 13.0.0 at the moment of writing this comment) and report your results here: llvm/llvm-project#49923.

@guolinke
Collaborator

guolinke commented Mar 1, 2022

@StrikerRUS I saw llvm/llvm-project#49923 is closed, is this problem solved?

@StrikerRUS
Collaborator

@guolinke I haven't seen this. According to the conversation in llvm/llvm-project#49923, they closed that issue due to the inability to reproduce the issue.

Please, anyone subscribed to this issue, check whether the error occurs with the most recent libomp version 13.0.1.

@nickordoodle

This worked for me in a local DataSpell notebook on M1 ARM. It looks like you may be in luck if you only need the tabular package:

!pip install -U pip
!pip install -U setuptools wheel
!pip install "mxnet<2.0.0"
!pip install "autogluon.tabular"

@jameslamb
Collaborator

@nickordoodle Thanks for posting this. Can you please explain how your post is related to the topic "LightGBM is incompatible with OpenMP 12 and 13 on macOS"?

@jameslamb
Collaborator

I found tonight that after upgrading to the latest libomp shipped by Homebrew (v15.0.6), I was able to compile LightGBM, build the Python package, and run all of its tests without issue on my MacBook (Intel chip, macOS 12.2.1).

brew install libomp
cd ./python-package
pip install .
cd ..
pytest tests/python_package_tests

@jameslamb
Collaborator

Assigning this to myself... I'll prioritize this for the next release of LightGBM (after v4.2.0).

I observed a deadlock in this simple example tonight:

rm -rf ./dist
sh build-python.sh sdist
pip install ./dist/lightgbm-*.tar.gz

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000)
dtrain = lgb.Dataset(X, label=y)
dtrain.construct()

With the following:

  • OS: macOS 14.1.2 (Sonoma)
  • CPU: M2 chip
  • compiler: AppleClang 15.0.0
  • Python: 3.11.7
  • OpenMP: 17.0.6

brew info libomp
==> libomp: stable 17.0.6 (bottled) [keg-only]
LLVM's OpenMP runtime library
https://openmp.llvm.org/
/opt/homebrew/Cellar/libomp/17.0.6 (7 files, 1.7MB)
  Poured from bottle using the formulae.brew.sh API on 2023-12-19 at 22:06:33
From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/lib/libomp.rb
License: MIT

Installing with OpenMP turned off, I didn't experience any deadlocks or other issues.

pip install \
    --config-settings=cmake.define.USE_OPENMP=OFF \
    ./dist/lightgbm-*.tar.gz

For more details: #6191 (comment)

@borchero
Collaborator

FYI (not sure if this is common knowledge yet): when developing LightGBM on Apple Silicon, I never turned off OpenMP but compiled with gcc instead of clang. For me, that was:

export CXX=g++-13 CC=gcc-13

This fixed any problems I had 😅

@jameslamb
Collaborator

jameslamb commented Dec 28, 2023

Thanks @borchero , that's helpful!

Looking into this a bit today, I also think that some of these failures might not actually be about incompatibility with particular versions of OpenMP, but rather related to #5106.

Fixing the search paths embedded in lib_lightgbm.so on macOS might eliminate some of these cases where programs segfault because multiple versions of libomp have been loaded.


Tried the following today on my intel mac:

  • OS: macOS 14.1.2 (Sonoma)
  • CPU: intel chip
  • compiler: AppleClang 13.0.0
  • Python: 3.11.7
  • OpenMP: 17.0.6
  1. build lib_lightgbm

rm -rf ./build
mkdir ./build
cd ./build
cmake ..
make -j2 _lightgbm
cd ..

  2. check what it linked against

# check what it's linked to
otool -L lib_lightgbm.so
# 
../lib_lightgbm.so:
    @rpath/lib_lightgbm.so (compatibility version 0.0.0, current version 0.0.0)
    /usr/local/opt/libomp/lib/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1200.3.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.0.0)

Notice that even though I was building in an active conda environment, it found Homebrew's OpenMP, /usr/local/opt/libomp/lib/libomp.dylib.

  3. install the Python library

sh build-python.sh install --precompile

  4. run an example

This segfaults, I think because it's finding the llvm-openmp from conda:

python ./examples/python-guide/logistic_regression.py
Performance of `binary` objective with binary labels:
Segmentation fault: 11

Running with some debugging stuff set... it looks like that's exactly what's happening. 2 versions of OpenMP are being loaded.

DYLD_PRINT_LIBRARIES=1 \
python examples/python-guide/logistic_regression.py 2>&1 \
| grep libomp
dyld[32037]: <891B2F9B-F926-3D67-AA9C-D58D47668AFB> /Users/jlamb/mambaforge/envs/lgb-dev/lib/libomp.dylib
dyld[32037]: <C91365F6-6644-300A-9277-1946696E9E86> /usr/local/Cellar/libomp/17.0.4/lib/libomp.dylib
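
The same check can be scripted. The sketch below parses captured `DYLD_PRINT_LIBRARIES=1` output and collects the distinct `libomp.dylib` paths. It assumes lines have the `dyld[PID]: <UUID> PATH` shape shown above, which may differ on other macOS versions, and `loaded_libomp_paths` is a hypothetical helper name.

```python
import re
from typing import List

# Matches lines such as:
#   dyld[32037]: <UUID> /path/to/lib/libomp.dylib
_DYLD_LINE = re.compile(r"^dyld\[\d+\]: <[0-9A-Fa-f-]+> (?P<path>\S.*)$")


def loaded_libomp_paths(dyld_output: str) -> List[str]:
    """Return the distinct libomp.dylib paths found in DYLD_PRINT_LIBRARIES output."""
    paths: List[str] = []
    for line in dyld_output.splitlines():
        match = _DYLD_LINE.match(line.strip())
        if match is None:
            continue  # not a library-load line in the assumed format
        path = match.group("path").strip()
        if path.endswith("/libomp.dylib") and path not in paths:
            paths.append(path)
    return paths


# The two lines from the output above.
sample = (
    "dyld[32037]: <891B2F9B-F926-3D67-AA9C-D58D47668AFB> "
    "/Users/jlamb/mambaforge/envs/lgb-dev/lib/libomp.dylib\n"
    "dyld[32037]: <C91365F6-6644-300A-9277-1946696E9E86> "
    "/usr/local/Cellar/libomp/17.0.4/lib/libomp.dylib\n"
)
paths = loaded_libomp_paths(sample)
if len(paths) > 1:
    print("more than one libomp loaded:", paths)
```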

Looking a bit more closely, it seems that scikit-learn comes with an sklearn/utils/_openmp_helpers.cpython-311-darwin.so which has an RPATH entry that causes conda's libomp.dylib to be loaded.

otool -L /Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/sklearn/utils/_openmp_helpers.cpython-311-darwin.so
/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/sklearn/utils/_openmp_helpers.cpython-311-darwin.so:
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
	@rpath/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
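
To compare listings like the `otool -L` outputs in this comment programmatically, a small parser is enough. This is a sketch under the assumption that the output keeps the layout shown here (first line names the file, each following line is an install name followed by a parenthesized version note); `linked_libraries` and `hardcoded_libomp` are hypothetical helper names.

```python
from typing import List


def linked_libraries(otool_output: str) -> List[str]:
    """Extract the install names listed by `otool -L` (the first line is the file itself)."""
    names: List[str] = []
    for line in otool_output.splitlines()[1:]:
        line = line.strip()
        if not line:
            continue
        # Each entry looks like "<install name> (compatibility version ..., current version ...)".
        names.append(line.split(" (compatibility version", 1)[0])
    return names


def hardcoded_libomp(otool_output: str) -> List[str]:
    """Return libomp entries that use an absolute path rather than @rpath."""
    return [
        name
        for name in linked_libraries(otool_output)
        if name.endswith("libomp.dylib") and name.startswith("/")
    ]


# Listing for lib_lightgbm.so as shown earlier in this comment.
lightgbm_listing = """../lib_lightgbm.so:
    @rpath/lib_lightgbm.so (compatibility version 0.0.0, current version 0.0.0)
    /usr/local/opt/libomp/lib/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1200.3.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.0.0)
"""
print(hardcoded_libomp(lightgbm_listing))
```

An empty result, as the scikit-learn listing above would give, means libomp is resolved through `@rpath` rather than a fixed absolute path.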

Patching out lib_lightgbm's corresponding entry so that it will end up not loading a different version, the example runs without segfaulting.

install_name_tool \
    -change /usr/local/opt/libomp/lib/libomp.dylib \
    @rpath/libomp.dylib \
    /Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so
otool -L \
    /Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so
/Users/jlamb/mambaforge/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.so:
	@rpath/lib_lightgbm.so (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1200.3.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.0.0)
python examples/python-guide/logistic_regression.py
Performance of `binary` objective with binary labels:
{'time': 0.031093120574951172, 'correlation': 0.6012584922759894, 'logloss': 0.15545640415178236}
Performance of `xentropy` objective with binary labels:
{'time': 0.0031642913818359375, 'correlation': 0.6012584922759894, 'logloss': 0.15545640415178236}
Performance of `xentropy` objective with probability labels:
{'time': 0.006477832794189453, 'correlation': 0.884189150816587, 'logloss': 0.1551448517607808}
Best `binary` time: 0.002405881881713867
Best `xentropy` time: 0.0023250579833984375
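
The `install_name_tool` step used earlier in this comment can also be driven from Python. The sketch below only builds the argument list; the helper name `install_name_change_cmd` and the library path are placeholders, and actually running the command requires macOS with the Xcode command line tools.

```python
import shlex
from typing import List


def install_name_change_cmd(
    library: str, old: str, new: str = "@rpath/libomp.dylib"
) -> List[str]:
    """Build the argv for `install_name_tool -change OLD NEW LIBRARY`."""
    return ["install_name_tool", "-change", old, new, library]


cmd = install_name_change_cmd(
    library="lib_lightgbm.so",  # placeholder path, adjust to your environment
    old="/usr/local/opt/libomp/lib/libomp.dylib",
)
print(shlex.join(cmd))
# On macOS, the command could then be run with: subprocess.run(cmd, check=True)
```

Building the list separately from running it keeps the sketch inspectable on any platform and avoids shell quoting issues with paths that contain spaces.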

Just stopping here for now to post my notes. I'll continue working on this.

@jameslamb
Collaborator

Adding another relevant link: bacpop/pp-sketchlib#42 (comment)

@jameslamb
Collaborator

"Fixing the search paths embedded in lib_lightgbm.so on macOS might eliminate some of these cases where programs segfault because multiple versions of libomp have been loaded."

We did this in #6391. As of lightgbm==4.4.0, lightgbm's macOS wheels should no longer segfault in the presence of other libomp.dylib already loaded in the process.
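
A quick way to check whether an installed version is at least the release named above is sketched below. The helpers `release_tuple` and `has_search_path_fix` are hypothetical, the parsing is deliberately simplistic (it ignores pre-release suffixes), and the check says nothing about how the package was built: the statement above is about the macOS wheels, not source builds.

```python
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '4.4.0' -> (4, 4, 0); pads to 3 parts."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_search_path_fix(version: str) -> bool:
    """True if `version` is at least 4.4.0, the release named above."""
    return release_tuple(version) >= (4, 4, 0)


print(has_search_path_fix("3.2.1"), has_search_path_fix("4.4.0"))
```

With LightGBM installed, the installed version is available as `lightgbm.__version__` and can be passed to `has_search_path_fix`.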

I'm going to mark this awaiting response, so it'll be closed automatically in 30 days if there are not any other comments. Doing that to leave some time for others to discover issues and post follow-up comments here.

Thank you all very much for the patience and helpful comments. Please come by and contribute again some time, we'd love the help!


This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
