Replace libcudart.so with PyTorch's CUDA APIs #346

Closed

@rapsealk rapsealk commented Apr 26, 2023

This PR resolves #264

I have been struggling to use bitsandbytes in my environment: the nvcr.io/nvidia/pytorch:22.05-py3 image running on Docker. Running nvidia-smi successfully detects a GPU device.

work@main1[bitsandbytes-torch]:~$ nvidia-smi
Wed Apr 26 01:27:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.216.04   Driver Version: 450.216.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  CUDA GPU            On   | 00000000:90:00.0 Off |                    0 |
| N/A   32C    P0    58W / 400W |     48MiB /  1689MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, importing bitsandbytes fails to detect a GPU device and falls back to the CPU-only library.

work@main1[bitsandbytes-torch]:~$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bitsandbytes

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/work/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/work/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/work/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/cuda/extras/CUPTI/lib64'), PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/cuda/lib'), PosixPath('/usr/local/cuda-11.6/include')}
  warn(msg)
CUDA exception! Error code: OS call failed or operation not supported on this OS
CUDA exception! Error code: initialization error
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
/home/work/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/work/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so...

After some investigation, I found that libcudart.cudaGetDeviceCount() fails when called after libcuda.cuInit(), returning error code 304, which means "OS call failed or operation not supported on this OS".
https://github.com/TimDettmers/bitsandbytes/blob/9e7cdc9ea95e9756d9f5621a0e2c7e2538363fae/bitsandbytes/cuda_setup/main.py#L345

work@main1[bitsandbytes-torch]:~/.local/lib/python3.8/site-packages/bitsandbytes$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ctypes
>>> libcuda = ctypes.CDLL("libcuda.so")
>>> libcudart = ctypes.CDLL("libcudart.so")
>>> count = ctypes.c_int()
>>> libcuda.cuInit(0)
0
>>> libcudart.cudaGetDeviceCount(ctypes.byref(count))
304
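
The same probe can be wrapped in a small helper that also asks the runtime for the human-readable message behind the numeric code. This is only a sketch: the helper name `probe_device_count` is made up for illustration, and it assumes `cudaGetErrorString` is exported by whichever `libcudart.so` gets loaded.

```python
import ctypes
from typing import Optional, Tuple


def probe_device_count(cudart_name: str = "libcudart.so") -> Optional[Tuple[int, int, str]]:
    """Return (error_code, device_count, message), or None if the library cannot be loaded.

    Hypothetical helper, not part of bitsandbytes.
    """
    try:
        cudart = ctypes.CDLL(cudart_name)
    except OSError:
        return None  # library not found on the loader path

    # cudaGetErrorString returns a const char*, so declare the return type explicitly.
    cudart.cudaGetErrorString.restype = ctypes.c_char_p
    cudart.cudaGetErrorString.argtypes = [ctypes.c_int]

    count = ctypes.c_int(0)
    code = cudart.cudaGetDeviceCount(ctypes.byref(count))
    message = cudart.cudaGetErrorString(code).decode()
    return code, count.value, message
```

On a healthy setup the first element should be 0; on the container described above it would presumably be 304 together with the message quoted earlier.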

While looking for a way to solve this problem, I found that PyTorch provides similar APIs that work on my machine. So I replaced several libcuda- and libcudart-related calls with their torch equivalents. I think this is acceptable since bitsandbytes already depends on torch.

work@main1[bitsandbytes-torch]:~$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch._C._cuda_getCompiledVersion(), torch.version.cuda, torch.cuda.device_count()
(11070, '11.7', 1)
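
As a rough illustration of the approach (not the code in this PR), the values bitsandbytes needs at setup time could be gathered through public `torch` calls. The function name `query_cuda_via_torch` and the shape of the returned dictionary are assumptions made for this sketch.

```python
from typing import Any, Dict


def query_cuda_via_torch() -> Dict[str, Any]:
    """Collect CUDA details through PyTorch; values stay empty if torch or a GPU is missing."""
    info: Dict[str, Any] = {"cuda_version": None, "device_count": 0, "capabilities": []}
    try:
        import torch
    except ImportError:
        return info  # torch not installed: report nothing rather than fail

    info["cuda_version"] = torch.version.cuda  # e.g. '11.7'; None on CPU-only builds
    if torch.cuda.is_available():
        info["device_count"] = torch.cuda.device_count()
        # get_device_capability returns a (major, minor) tuple per device.
        info["capabilities"] = [
            "%d.%d" % torch.cuda.get_device_capability(i)
            for i in range(info["device_count"])
        ]
    return info


print(query_cuda_via_torch())
```

Because PyTorch has already located and initialized its own CUDA runtime, this avoids loading `libcudart.so` a second time through `ctypes`.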

Titus-von-Koeller and others added 30 commits August 1, 2022 09:32
@rapsealk

The version from my branch works as shown below:

work@main1[bitsandbytes-torch]:~/.local/lib/python3.8/site-packages/bitsandbytes$ python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bitsandbytes

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/work/.local/lib/python3.8/site-packages/bitsandbytes/bitsandbytes/libbitsandbytes_cuda117.so
/home/work/.local/lib/python3.8/site-packages/bitsandbytes/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib'), PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/cuda-11.6/include'), PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/cuda/extras/CUPTI/lib64')}
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 11.7
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/work/.local/lib/python3.8/site-packages/bitsandbytes/bitsandbytes/libbitsandbytes_cuda117.so...
>>> 

@rapsealk rapsealk changed the title fix: Replace raw libcudart.so with torch APIs Replace libcudart.so with PyTorch's CUDA APIs Apr 26, 2023
CUDASetup.get_instance().add_log_entry(f'CUDA SETUP: libcudart.so path is {cudart_path}')
CUDASetup.get_instance().add_log_entry(f'CUDA SETUP: Is seems that your cuda installation is not in your path. See https://github.com/TimDettmers/bitsandbytes/issues/85 for more information.')
- version = int(version.value)
+ version = torch._C._cuda_getCompiledVersion()

@lizelive lizelive Apr 26, 2023

Why not use torch.version.cuda here?

@rapsealk

@lizelive Thanks for taking a look! I was trying to keep the rest of the code unchanged as much as possible. In fact, the whole get_cuda_version() method could be rewritten using torch.version.cuda instead, for example torch.version.cuda.replace(".", ""), as you suggested.
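
To make the two options concrete, here is a minimal sketch of both conversions to the short form seen in the logs ("117"). The helper names are hypothetical, and the integer decoding assumes the usual CUDA encoding of 1000 * major + 10 * minor, which matches the 11070 shown earlier.

```python
from typing import Optional


def short_version_from_string(cuda_version: Optional[str]) -> Optional[str]:
    """'11.7' -> '117'; None (as on a CPU-only torch build) stays None."""
    if cuda_version is None:
        return None
    major, minor = cuda_version.split(".")[:2]
    return f"{major}{minor}"


def short_version_from_compiled(compiled: int) -> str:
    """11070 -> '117', assuming the 1000 * major + 10 * minor encoding."""
    major = compiled // 1000
    minor = (compiled % 1000) // 10
    return f"{major}{minor}"


print(short_version_from_string("11.7"), short_version_from_compiled(11070))
```

The string route is shorter, but it needs a guard for `None`, which is what `torch.version.cuda` holds when PyTorch was built without CUDA.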

@rapsealk rapsealk requested a review from lizelive April 28, 2023 02:29
@expbox77

Thanks. Maybe this works for me.

My system

Ubuntu 20.04 (not WSL)
RTX 3090
CUDA 11.7

I added the following to my .bashrc.
(The paths are the result of find / -name libcuda.so 2>/dev/null.)

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/expbox77/miniconda3/envs/textgen/lib/stubs/:/usr/lib/x86_64-linux-gnu/stubs/:/usr/local/cuda-11.7/targets/x86_64-linux/lib/stubs/

Interestingly, python -m bitsandbytes does not work properly when run from my home directory.

(textgen) expbox77@expbox-ai:~$ python -m bitsandbytes

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/expbox77/miniconda3/envs/textgen/lib/libcudart.so.11.0'), PosixPath('/home/expbox77/miniconda3/envs/textgen/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /home/expbox77/miniconda3/envs/textgen/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++ ANACONDA CUDA PATHS ++++++++++++++++++++
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/gptq_llama/quant_cuda.cpython-310-x86_64-linux-gnu.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg/quant_cuda.cpython-310-x86_64-linux-gnu.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so
/home/expbox77/miniconda3/envs/textgen/lib/stubs/libcuda.so
/home/expbox77/miniconda3/envs/textgen/lib/libcudart.so
/home/expbox77/miniconda3/envs/textgen/nsight-compute/2022.2.0/target/linux-desktop-glibc_2_11_3-x64/libcuda-injection.so
/home/expbox77/miniconda3/envs/textgen/nsight-compute/2022.2.0/target/linux-desktop-glibc_2_19_0-ppc64le/libcuda-injection.so
/home/expbox77/miniconda3/envs/textgen/nsight-compute/2022.2.0/target/linux-desktop-t210-a64/libcuda-injection.so

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-11.7/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudart.so

Traceback (most recent call last):
  File "/home/expbox77/miniconda3/envs/textgen/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/expbox77/miniconda3/envs/textgen/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/__main__.py", line 95, in <module>
    generate_bug_report_information()
  File "/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/__main__.py", line 60, in generate_bug_report_information
    paths = find_file_recursive(os.getcwd(), '*cuda*so')
  File "/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/__main__.py", line 37, in find_file_recursive
    raise RuntimeError('Something when wrong when trying to find file. Maybe you do not have a linux system?')
RuntimeError: Something when wrong when trying to find file. Maybe you do not have a linux system?

But when I'm in any other directory, python -m bitsandbytes seems to work normally.

(textgen) expbox77@expbox-ai:~/test$ python -m bitsandbytes

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/expbox77/miniconda3/envs/textgen/lib/libcudart.so.11.0'), PosixPath('/home/expbox77/miniconda3/envs/textgen/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /home/expbox77/miniconda3/envs/textgen/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++ ANACONDA CUDA PATHS ++++++++++++++++++++
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/gptq_llama/quant_cuda.cpython-310-x86_64-linux-gnu.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/quant_cuda-0.0.0-py3.10-linux-x86_64.egg/quant_cuda.cpython-310-x86_64-linux-gnu.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/expbox77/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.0-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so
/home/expbox77/miniconda3/envs/textgen/lib/stubs/libcuda.so
/home/expbox77/miniconda3/envs/textgen/lib/libcudart.so
/home/expbox77/miniconda3/envs/textgen/nsight-compute/2022.2.0/target/linux-desktop-glibc_2_11_3-x64/libcuda-injection.so
/home/expbox77/miniconda3/envs/textgen/nsight-compute/2022.2.0/target/linux-desktop-glibc_2_19_0-ppc64le/libcuda-injection.so
/home/expbox77/miniconda3/envs/textgen/nsight-compute/2022.2.0/target/linux-desktop-t210-a64/libcuda-injection.so

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-11.7/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-11.7/targets/x86_64-linux/lib/libcudart.so

+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++


++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
 /home/expbox77/miniconda3/envs/textgen/lib/stubs/ CUDA PATHS
/home/expbox77/miniconda3/envs/textgen/lib/stubs/libcuda.so
+++++++ /usr/lib/x86_64-linux-gnu/stubs/ CUDA PATHS ++++++++
/usr/lib/x86_64-linux-gnu/stubs/libcuda.so
 /usr/local/cuda-11.7/targets/x86_64-linux/lib/stubs/ CUDA PATHS
/usr/local/cuda-11.7/targets/x86_64-linux/lib/stubs/libcuda.so

++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['8.6', '8.6']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable


WARNING: Please be sure to sanitize sensible info from any such env vars!

SUCCESS!
Installation was successful!
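
The thread does not establish why the run from the home directory fails. The traceback only shows `find_file_recursive(os.getcwd(), '*cuda*so')` raising while it searches the working directory. One possibility, stated here purely as an assumption, is that the search trips over something unreadable beneath the home directory. A sketch of a search that tolerates such entries, with the hypothetical name `find_files`, could look like this:

```python
import fnmatch
import os
from typing import List


def find_files(folder: str, pattern: str) -> List[str]:
    """Recursively collect files under folder matching pattern, skipping unreadable directories.

    Hypothetical replacement sketch, not the bitsandbytes implementation.
    """
    matches: List[str] = []
    # os.walk silently skips directories it cannot list unless an onerror callback is given.
    for root, _dirs, files in os.walk(folder):
        for name in fnmatch.filter(files, pattern):
            matches.append(os.path.join(root, name))
    return sorted(matches)
```

Calling `find_files(os.getcwd(), "*cuda*so")` would then return an empty list instead of raising when nothing readable matches.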

@rapsealk

rapsealk commented May 8, 2023

@TimDettmers Hi! It seems that the main branch has been overwritten by force-pushed commits.
May I open a new PR with the same changes, converting the libcudart APIs to their torch.cuda equivalents? I would be glad to hear your opinion. Thanks!

@atkinson

@TimDettmers @rapsealk This makes a lot of sense. I'm happy to help.

@TimDettmers

Thank you for this, and sorry for being slow to respond. This has been merged with #375.

Successfully merging this pull request may close these issues.

bug: OS call failed or operation not supported on NGC PyTorch