[python-package] SegFault on MacOS when pytorch is installed #6595

connortann · 2024-08-07T12:51:06Z

Description

A segmentation fault occurs on MacOS when lightgbm and pytorch are both installed, depending on the order of imports.

Possibly related: #4229

Reproducible example

To reproduce the issue on GH actions:

# run_tests.yml
jobs:
  run_tests:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: brew install libomp
      - run: pip install pytest torch scikit-learn lightgbm
      - run: pip list
      - run: pytest --noconftest test_bug.py

# test_bug.py
import time

import lightgbm  # Issue only occurs if this import is present
import torch
from sklearn.datasets import fetch_california_housing


def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

This leads to Fatal Python error: Segmentation fault. Full output:

Run pytest --noconftest tests/test_bug121101.py
============================= test session starts ==============================
platform darwin -- Python 3.11.9, pytest-8.3.2, pluggy-1.5.0
rootdir: /Users/runner/work/shap/shap
configfile: pyproject.toml
collected 1 item

Fatal Python error: Segmentation fault

Thread 0x0000000204c1cc00 (most recent call first):
tests/test_bug121101.py
  File "/Users/runner/work/shap/shap/tests/test_bug121101.py", line 12 in test_something
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 159 in pytest_pyfunc_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 1627 in runtest
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 174 in pytest_runtest_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 242 in <lambda>
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 241 in call_and_report
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 132 in runtestprotocol
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 113 in pytest_runtest_protocol
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 362 in pytest_runtestloop
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line Fatal Python error: Segmentation fault

337 in _main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 283 in wrap_session
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 330 in pytest_cmdline_main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/Users/runner/hostedtoolcache/Python/3.11.9/arm64/bin/pytest", line 8 in <module>

Extension modules: numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, scipy._lib._ccallback_c, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator
Extension modules: , numpy._core._multiarray_umathscipy.sparse._sparsetools, numpy._core._multiarray_tests, _csparsetools, numpy.linalg._umath_linalg, scipy.sparse._csparsetools, scipy._lib._ccallback_c, scipy.linalg._fblas, numpy.random._common, scipy.linalg._flapack, numpy.random.bit_generator, , scipy.linalg.cython_lapacknumpy.random._bounded_integers, , scipy.linalg._cythonized_array_utilsnumpy.random._mt19937, , scipy.linalg._solve_toeplitznumpy.random.mtrand, , numpy.random._philoxscipy.linalg._decomp_lu_cython, numpy.random._pcg64, scipy.linalg._matfuncs_sqrtm_triu, numpy.random._sfc64, scipy.linalg.cython_blas, numpy.random._generator, scipy.linalg._matfuncs_expm, scipy.sparse._sparsetools, scipy.linalg._decomp_update, _csparsetools, , scipy.sparse._csparsetoolsscipy.sparse.linalg._dsolve._superlu, , scipy.linalg._fblasscipy.sparse.linalg._eigen.arpack._arpack, scipy.linalg._flapack, , scipy.linalg.cython_lapackscipy.sparse.linalg._propack._spropack, scipy.linalg._cythonized_array_utils, scipy.sparse.linalg._propack._dpropack, scipy.linalg._solve_toeplitz, scipy.sparse.linalg._propack._cpropack, scipy.linalg._decomp_lu_cython, scipy.sparse.linalg._propack._zpropack, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.sparse.csgraph._tools, scipy.linalg._matfuncs_expm, scipy.sparse.csgraph._shortest_path, scipy.linalg._decomp_update, scipy.sparse.csgraph._traversal, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, , scipy.sparse.csgraph._min_spanning_treescipy.sparse.linalg._propack._spropack, , scipy.sparse.csgraph._flowscipy.sparse.linalg._propack._dpropack, , scipy.sparse.csgraph._matchingscipy.sparse.linalg._propack._cpropack, , scipy.sparse.csgraph._reorderingscipy.sparse.linalg._propack._zpropack, , scipy.sparse.csgraph._toolssklearn.__check_build._check_build, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, 
scipy.sparse.csgraph._matching, , scipy.sparse.csgraph._reorderingscipy.special._ufuncs_cxx, , sklearn.__check_build._check_buildscipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._ufuncs, scipy.spatial._ckdtree, scipy.special._specfun, scipy._lib.messagestream, scipy.special._comb, scipy.spatial._qhull, scipy.special._ellip_harm_2, scipy.spatial._voronoi, scipy.spatial._ckdtree, , scipy.spatial._distance_wrapscipy._lib.messagestream, , scipy.spatial._hausdorffscipy.spatial._qhull, scipy.spatial._voronoi, , scipy.spatial._distance_wrapscipy.spatial.transform._rotation, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.special.cython_special, scipy.stats._stats, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, , scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNCscipy.stats._mvn, scipy.optimize._cobyla, 
scipy.stats._rcont.rcont, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.stats._unuran.unuran_wrapper, scipy.optimize._lsq.givens_elimination, , scipy.optimize._zeros, scipy.ndimage._nd_imagescipy.optimize._highs.cython.src._highs_wrapper, , scipy.optimize._highs._highs_wrapper_ni_label, , scipy.optimize._highs.cython.src._highs_constantsscipy.ndimage._ni_label, scipy.optimize._highs._highs_constants, sklearn.utils._isfinite, scipy.linalg._interpolative, sklearn.utils.sparsefuncs_fast, scipy.optimize._bglu_dense, sklearn.utils.murmurhash, scipy.optimize._lsap, , sklearn.utils._openmp_helpersscipy.optimize._direct, scipy.integrate._odepack, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, scipy.integrate._quadpack, sklearn.metrics._dist_metrics, scipy.integrate._vode, sklearn.metrics._pairwise_distances_reduction._datasets_pair, scipy.integrate._dop, scipy.integrate._lsoda, sklearn.utils._cython_blas, scipy.interpolate._fitpack, sklearn.metrics._pairwise_distances_reduction._base, scipy.interpolate._dfitpack, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, scipy.interpolate._bspl, sklearn.utils._heap, scipy.interpolate._ppoly, sklearn.utils._sorting, scipy.interpolate.interpnd, sklearn.metrics._pairwise_distances_reduction._argkmin, scipy.interpolate._rbfinterp_pythran, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, scipy.interpolate._rgi_cython, scipy.special.cython_special, sklearn.utils._vector_sentinel, scipy.stats._stats, , sklearn.metrics._pairwise_distances_reduction._radius_neighborsscipy.stats._biasedurn, , sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmodescipy.stats._levy_stable.levyst, , scipy.stats._stats_pythransklearn.metrics._pairwise_fast, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, sklearn.utils._random, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, 
torch._C, scipy.stats._rcont.rcont, , scipy.stats._unuran.unuran_wrappertorch._C._fft, , scipy.ndimage._nd_imagetorch._C._linalg, , _ni_labeltorch._C._nested, , scipy.ndimage._ni_labeltorch._C._nn, , sklearn.utils._isfinitetorch._C._sparse, , sklearn.utils.sparsefuncs_fasttorch._C._special, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.utils._random, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, , scipy.io.matlab._mio_utilstorch._C._nn, torch._C._sparse, scipy.io.matlab._streams, torch._C._special, scipy.io.matlab._mio5_utils, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, , sklearn.datasets._svmlight_format_fastscipy.io.matlab._mio5_utils, sklearn.datasets._svmlight_format_fast, sklearn.feature_extraction._hashing_fast (total: 130, )sklearn.feature_extraction._hashing_fast
 (total: 130)
/Users/runner/work/_temp/7013399c-b6ff-43a4-b289-cc08191dbadb.sh: line 1:  2783 Segmentation fault: 11  pytest --noconftest tests/test_bug121101.py

Environment info

LightGBM version or commit hash: 4.5.0

Result of pip list:

Package           Version
----------------- --------
certifi           2024.7.4
filelock          3.15.4
fsspec            2024.6.1
iniconfig         2.0.0
Jinja2            3.1.4
joblib            1.4.2
lightgbm          4.5.0
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.3
numpy             2.0.1
packaging         24.1
pip               24.2
pluggy            1.5.0
pytest            8.3.2
scikit-learn      1.5.1
scipy             1.14.0
setuptools        65.5.0
sympy             1.13.1
threadpoolctl     3.5.0
torch             2.4.0

Additional Comments

We came across this issue over at the shap repo while trying to run tests with the latest versions of both pytorch and lightgbm. We initially raised it on the pytorch issue tracker: pytorch/pytorch#121101.

However, the underlying issue doesn't seem to be specific just to pytorch or lightgbm, but rather it relates to the mutual compatibility of pytorch and lightgbm. The issue seems to relate to multiple ~~OpenML~~ OpenMP runtimes being loaded.

So, I thought it would be worth raising the issue here too in the hope that it helps us collectively find a fix.


jameslamb · 2024-08-07T14:46:49Z

Thanks for the excellent report @connortann !

Since #6391, import lightgbm on macOS will try to use the already-loaded OpenMP if there is one. So it shouldn't be the case that import lightgbm can cause "multiple OpenMP runtimes being loaded".

(assuming that was a typo in your original report and you really meant "OpenMP", not "OpenML")

Since you have scikit-learn in the environment, import lightgbm will import sklearn. I suspect that scikit-learn may be contributing to this problem. In the past, we've seen that library's handling of its OpenMP dependency contribute to this "multiple OpenMP runtimes being loaded" situation.

To narrow it down further, could you try 2 other tests?

import sklearn before / after torch (no lightgbm involved)
pip uninstall --yes scikit-learn and then testing import lightgbm before / after torch

I'm sorry to possibly involve yet a THIRD project in your investigation. I'm familiar with these topics and happy to help us all reach a resolution.

You may also find these relevant:

connortann · 2024-08-07T15:29:11Z

Thanks for the response! Yes I think you're right about sklearn being relevant: the bug seems not to occur if sklearn is not imported.

Here's what I tried: the tests pass in all these situations

import sklearn, then torch (no lightgbm involved). Tests pass.

import time

import sklearn
import torch
from sklearn.datasets import fetch_california_housing

def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

import torch then sklearn (no lightgbm involved). Tests pass.

import time

import torch
import sklearn
from sklearn.datasets import fetch_california_housing

def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

Without sklearn installed; import lightgbm then torch. Tests pass

import time

import lightgbm
import torch
import numpy as np
# from sklearn.datasets import fetch_california_housing


def test_something():
    # X, y = fetch_california_housing(return_X_y=True)
    X = np.ones(shape=(200, 20))
    torch.tensor(X)
    time.sleep(3)

Without sklearn installed; import torch then lightgbm. Tests pass

# ruff: noqa
# fmt: off
import time

import torch
import lightgbm
import numpy as np
# from sklearn.datasets import fetch_california_housing


def test_something():
    # X, y = fetch_california_housing(return_X_y=True)
    X = np.ones(shape=(200, 20))
    torch.tensor(X)
    time.sleep(3)

So, I think the example above is the minimal reproducer: lightgbm, torch and sklearn!

vnherdeiro · 2024-08-28T18:13:14Z

Adding my two cents to this issue. I managed to reproduce the bug following the setup given by @connortann.

Running the following command raises the segfault
python -m pytest test_bug.py
with
torch==2.2.2 scikit-learn==1.5.1 numpy==1.26.4 lightgbm==4.5.0

However, prepending OMP_NUM_THREADS=1 to the command (forcing single-threaded operation) makes the segfault go away.
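The same workaround can be applied from Python when the test run is launched as a subprocess. This is only a sketch: the helper name `run_single_threaded` is made up for illustration, and the only thing taken from the comment above is the `OMP_NUM_THREADS=1` setting itself.

```python
import os
import subprocess
import sys


def run_single_threaded(cmd):
    """Run `cmd` with OMP_NUM_THREADS=1 set in the child's environment.

    Hypothetical helper, equivalent to `OMP_NUM_THREADS=1 <cmd>` in a shell.
    """
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = "1"  # limit OpenMP to a single thread
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Demonstration: the child process sees the variable.
result = run_single_threaded(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
)
print(result.stdout.strip())
```

For the reproducer, the call would look like `run_single_threaded([sys.executable, "-m", "pytest", "test_bug.py"])`. Setting the variable only changes the thread count; it does not change which copies of libomp.dylib get loaded.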

lorentzenchr · 2024-09-12T21:17:52Z

@lesteve ping as scikit-learn is involved in the minimal reproducer (openmp related).

lesteve · 2024-09-13T15:08:41Z

Honestly, @jeremiedbb may be a better person for this on the scikit-learn side. To be honest, this is quite a tricky topic at the interface of different projects, which make different choices about how to handle OpenMP in wheels, and OpenMP in itself is already tricky.

The root cause is generally that multiple OpenMP runtimes are loaded, and threadpoolctl can highlight this; see this doc and below.

One known work-around is to use conda-forge which will use a single OpenMP and avoid most of these issues. I wanted to mention it, even if I understand using conda rather than pip is a non-starter in some use cases.

In this particular case, I played a bit with the code and can reproduce without scikit-learn, i.e. only with LightGBM and PyTorch. To be honest, I have heard of cases that go wrong with PyTorch and scikit-learn for similar reasons, but it's generally a bit hard to get a reproducer ...

I put together a quick repo: https://github.com/lesteve/lightgbm-pytorch-macos-segfault.

In particular, see the build log (which shows a segfault), the Python file, and the workflow YAML file. Importing pytorch before lightgbm works fine, see build log.

Python file:

import pprint
import sys
import platform

import lightgbm
import torch
import threadpoolctl

print('version: ', sys.version, flush=True)
print('platform: ', platform.platform(), flush=True)
pprint.pprint(threadpoolctl.threadpool_info())

print('before torch tensor', flush=True)
t = torch.ones(200_000)
print('after torch tensor', flush=True)

Output:

version:  3.12.5 (v3.12.5:ff3bc82f7c9, Aug  7 2024, 05:32:06) [Clang 13.0.0 (clang-1300.0.29.30)]
platform:  macOS-14.6.1-arm64-arm-64bit
[{'architecture': 'armv8',
  'filepath': '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/numpy/.dylibs/libopenblas64_.0.dylib',
  'internal_api': 'openblas',
  'num_threads': 3,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.23.dev'},
 {'filepath': '/opt/homebrew/Cellar/libomp/18.1.8/lib/libomp.dylib',
  'internal_api': 'openmp',
  'num_threads': 3,
  'prefix': 'libomp',
  'user_api': 'openmp',
  'version': None},
 {'filepath': '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/torch/lib/libomp.dylib',
  'internal_api': 'openmp',
  'num_threads': 3,
  'prefix': 'libomp',
  'user_api': 'openmp',
  'version': None}]
before torch tensor
/Users/runner/work/_temp/558d95ac-031b-4858-bfb0-b7bb4841e27b.sh: line 1:  1924 Segmentation fault: 11  python test.py

From the threadpoolctl info, you can tell that there are multiple OpenMP runtimes in use: the brew one (from LightGBM) and the PyTorch one bundled in the wheel.
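That check can be automated. The sketch below filters a `threadpool_info()`-style list for entries whose `user_api` is `openmp`; the function name `openmp_runtimes` is an assumption for illustration, and the sample data is a trimmed copy of the output above (only the keys used here).

```python
def openmp_runtimes(threadpool_info):
    """Return the file paths of all OpenMP runtimes in a threadpool_info() result."""
    return [
        entry["filepath"]
        for entry in threadpool_info
        if entry.get("user_api") == "openmp"
    ]


# Trimmed copy of the threadpoolctl output shown above.
info = [
    {
        "user_api": "blas",
        "filepath": "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/numpy/.dylibs/libopenblas64_.0.dylib",
    },
    {
        "user_api": "openmp",
        "filepath": "/opt/homebrew/Cellar/libomp/18.1.8/lib/libomp.dylib",
    },
    {
        "user_api": "openmp",
        "filepath": "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/torch/lib/libomp.dylib",
    },
]

paths = openmp_runtimes(info)
if len(paths) > 1:
    print("multiple OpenMP runtimes loaded:")
    for path in paths:
        print("  ", path)
```

In a live process, the same function can be given `threadpoolctl.threadpool_info()` directly, after all the libraries of interest have been imported.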

pip list

Package           Version
----------------- ---------
certifi           2024.8.30
filelock          3.16.0
fsspec            2024.9.0
Jinja2            3.1.4
lightgbm          4.5.0
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.3
numpy             1.26.4
pip               24.2
scipy             1.14.1
setuptools        74.1.2
sympy             1.13.2
threadpoolctl     3.5.0
torch             2.4.1
typing_extensions 4.12.2

(Edit: sorry pinged the wrong Jérémie originally ...)

jameslamb · 2024-09-15T06:05:30Z

Thanks very much for that! Your example has helped to clarify the picture for me a lot.

Short Summary

torch vendors a libomp.dylib (without library or symbol name mangling) and always prefers that vendored copy to a system installation.

lightgbm searches for a system installation.

As a result, if you've installed both these libraries via wheels on macOS, loading both will result in 2 copies of libomp.dylib being loaded. This may or may not show up as runtime issues... unpredictable, because symbol resolution is lazy by default and therefore depends on the code paths used.

Even if all copies of libomp.dylib loaded into the process are ABI-compatible with each other, there can still be runtime segfaults as a result of mixing symbols from libraries loaded at different memory addresses, I think.

Longer Summary


I investigated this by running the following on my M2 Mac, with Python 3.11. Note that the versions are identical to those from the previous comment.

mkdir ./delete-me
cd ./delete-me

pip download \
  --no-deps \
  'lightgbm==4.5.0' \
  'torch==2.4.1'

unzip ./lightgbm*.whl
unzip ./torch*.whl

otool -l ./lightgbm/lib/lib_lightgbm.dylib
otool -l ./torch/lib/libtorch_cpu.dylib

lightgbm wheels have exactly one library, lib_lightgbm.dylib, with an OpenMP dependency like this:

@rpath/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)

And the following LC_LOAD_DYLIB / LC_RPATH entries

cmd LC_LOAD_DYLIB
name @rpath/libomp.dylib (offset 24)
current version 5.0.0
compatibility version 5.0.0
...
cmd LC_RPATH
path /opt/homebrew/opt/libomp/lib (offset 12)
...
cmd LC_RPATH
path /opt/local/lib/libomp (offset 12)

torch wheels vendor libomp.dylib but without mangling the library name or its symbols.

libc10.dylib
libomp.dylib
libshm.dylib
libtorch.dylib
libtorch_cpu.dylib
libtorch_global_deps.dylib
libtorch_python.dylib

libtorch_cpu.dylib expresses its OpenMP dependency like this:

@rpath/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)

And has the following LC_LOAD_DYLIB / LC_RPATH entries:

cmd LC_LOAD_DYLIB
name @rpath/libomp.dylib (offset 24)
current version 5.0.0
compatibility version 5.0.0
...
cmd LC_RPATH
path @loader_path (offset 12)

So lightgbm will search for libomp.dylib in various places (including where Homebrew likes to put it) and loads the first one found.

torch will ONLY and ALWAYS load exactly the one that its wheels vendor.

💥 2 copies of OpenMP loaded at the same time, and all the issues that come with that.
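The LC_LOAD_DYLIB / LC_RPATH entries quoted above can also be pulled out of `otool -l` text with a few lines of Python, which makes it easier to compare several wheels. This is a minimal sketch: `parse_load_commands` is a hypothetical helper, it assumes paths contain no spaces, and the sample strings are the entries shown above with the `...` lines left out.

```python
def parse_load_commands(otool_output):
    """Collect LC_LOAD_DYLIB names and LC_RPATH paths from `otool -l` text."""
    dylibs, rpaths = [], []
    current_cmd = None
    for line in otool_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "cmd":
            # Start of a new load command, e.g. "cmd LC_RPATH"
            current_cmd = fields[1]
        elif fields[0] == "name" and current_cmd == "LC_LOAD_DYLIB":
            dylibs.append(fields[1])
        elif fields[0] == "path" and current_cmd == "LC_RPATH":
            rpaths.append(fields[1])
    return dylibs, rpaths


LIGHTGBM_SAMPLE = """\
cmd LC_LOAD_DYLIB
name @rpath/libomp.dylib (offset 24)
current version 5.0.0
compatibility version 5.0.0
cmd LC_RPATH
path /opt/homebrew/opt/libomp/lib (offset 12)
cmd LC_RPATH
path /opt/local/lib/libomp (offset 12)
"""

TORCH_SAMPLE = """\
cmd LC_LOAD_DYLIB
name @rpath/libomp.dylib (offset 24)
current version 5.0.0
compatibility version 5.0.0
cmd LC_RPATH
path @loader_path (offset 12)
"""

for label, sample in [("lightgbm", LIGHTGBM_SAMPLE), ("torch", TORCH_SAMPLE)]:
    dylibs, rpaths = parse_load_commands(sample)
    print(label, "needs", dylibs, "and searches", rpaths)
```

Both libraries ask for the same `@rpath/libomp.dylib`; only the search paths differ, which is what leads to two different files being loaded.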

Why didn't @connortann observe this same behavior?

Not sure why @connortann was not able to reproduce this in #6595 (comment). That comment shows:

Without sklearn installed; import torch then lightgbm. Tests pass

Probably because that example uses different codepaths in torch. Many OpenMP symbols would be resolved only at the first call site (as described in this Stack Overflow answer and the macOS docs it links to), so different code paths can lead to different behavior in terms of which copies of libomp.dylib certain symbols are found in.

How do we fix this?

I think some mix of the following would make this better for users.

Option 1: `torch` could more aggressively isolate its OpenMP dependency

If torch wants to vendor its own OpenMP in this way, it could further isolate that dependency to only torch's own uses, by doing one of the following:

statically linking instead of vendoring
mangling the library name and its symbols (but this might be difficult, see "[Question] Reasons to mangle and hash .so names after copy", pypa/auditwheel#409 (comment))

Option 2a: `lightgbm` could vendor OpenMP like `torch` does, but with the added strictness described above

I really do not want to do this, for the reasons mentioned in #6391 and the things linked to it.

Option 2b: `torch` could stop vendoring OpenMP and use the same LC_RPATH search order `lightgbm` does

I don't know if this would be palatable for torch. It comes with its own challenges.

Option 3: `lightgbm` could add something like `@loader_path/../../torch/lib` earlier in its list of RPATHS

This only works as long as torch is vendoring a version of libomp.dylib that lightgbm is ABI-compatible with.

And it only helps for the narrow case of lightgbm and torch with no other OpenMP-using dependencies. Every other library depending on OpenMP (e.g. xgboost, scikit-learn) would need to do something similar for them to all reliably use that same copy of libomp.dylib at runtime.
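To make Option 3 concrete, the sketch below expands `@rpath/libomp.dylib` against each library's LC_RPATH list, in order. It rests on several assumptions: the site-packages location `SITE` is made up, `expand_rpaths` is a hypothetical helper that only does textual substitution, and the real dynamic loader uses the first candidate that actually exists on disk, which this code does not check.

```python
import posixpath


def expand_rpaths(install_name, rpaths, loader_dir):
    """List the paths tried for an @rpath install name, in LC_RPATH order."""
    prefix = "@rpath/"
    if not install_name.startswith(prefix):
        return [install_name]
    leaf = install_name[len(prefix):]
    candidates = []
    for rpath in rpaths:
        # @loader_path stands for the directory of the binary holding the rpath
        base = rpath.replace("@loader_path", loader_dir)
        candidates.append(posixpath.normpath(posixpath.join(base, leaf)))
    return candidates


SITE = "/venv/lib/python3.11/site-packages"  # hypothetical install location
lightgbm_dir = SITE + "/lightgbm/lib"
torch_dir = SITE + "/torch/lib"

# RPATH entries reported by otool for the lightgbm 4.5.0 wheel
current = ["/opt/homebrew/opt/libomp/lib", "/opt/local/lib/libomp"]
# Same list with the extra entry proposed in Option 3 placed first
with_option_3 = ["@loader_path/../../torch/lib"] + current

torch_choice = expand_rpaths("@rpath/libomp.dylib", ["@loader_path"], torch_dir)
lightgbm_now = expand_rpaths("@rpath/libomp.dylib", current, lightgbm_dir)
lightgbm_opt3 = expand_rpaths("@rpath/libomp.dylib", with_option_3, lightgbm_dir)

print("torch:              ", torch_choice)
print("lightgbm (today):   ", lightgbm_now)
print("lightgbm (option 3):", lightgbm_opt3)
```

With the extra entry first, lightgbm's first candidate is the same file torch loads whenever torch is installed alongside it, and the existing system locations remain as fallbacks when it is not.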

Option 4: OpenMP could be packaged as a wheel that all of these projects depend on (and dynamically link to)

As described in https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp/#potential-solutions-or-mitigations. This is the wheel-based equivalent of how conda handles this case, as @lesteve alluded to... you download a single copy of the library into the environment, and everything else dynamically links to it.

I personally would be willing to help with this community effort, though I don't feel qualified to lead it.

Some related discussions (about shared-library-only wheels, not OpenMP) that have been happening in RAPIDS libraries:

connortann mentioned this issue Aug 7, 2024

Segmentation error for torch==2.2.1 on MacOs pytorch/pytorch#121101

Open

jameslamb added the bug label Aug 7, 2024

jameslamb mentioned this issue Aug 29, 2024

[cmake] [R-package] include R-for-macOS vendored libs dir in OpenMP search path (fixes #6628) #6629

Merged

jameslamb changed the title ~~SegFault on MacOS when pytorch is installed~~ [python-package] SegFault on MacOS when pytorch is installed Sep 15, 2024
