
cuda runtime error (304) : OS call failed or operation not supported on this OS #31

Open
nehap25 opened this issue Apr 30, 2021 · 3 comments


nehap25 commented Apr 30, 2021

When running the following example with Falkon, I run into a cuda runtime error.

Example:
```python
import numpy as np
import torch
from falkon.models import Falkon
from falkon.kernels import GaussianKernel

Xtrain = np.random.randn(80000, 1536)
Xtest = np.random.randn(10000, 1536)

Ytrain = np.random.randn(80000, 20)
Ytest = np.random.randn(10000, 20)

Xtrain = torch.from_numpy(Xtrain)
Xtest = torch.from_numpy(Xtest)
Ytrain = torch.from_numpy(Ytrain)
Ytest = torch.from_numpy(Ytest)

print("TRAIN SHAPES: ", Xtrain.shape, Ytrain.shape, "TEST SHAPES: ", Xtest.shape, Ytest.shape)

kernel = GaussianKernel(sigma=5)
flk = Falkon(kernel=kernel, penalty=1e-5, M=Xtrain.shape[0])

flk.fit(Xtrain, Ytrain)
```

Error:
```
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1616554827596/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=304 : OS call failed or operation not supported on this OS
Traceback (most recent call last):
  File "falkon_test.py", line 26, in <module>
    flk.fit(Xtrain, Ytrain)
  File "/home/nehap/anaconda3/envs/falkon/lib/python3.7/site-packages/falkon/models/falkon.py", line 197, in fit
    ny_points = ny_points.pin_memory()
RuntimeError: cuda runtime error (304) : OS call failed or operation not supported on this OS at /opt/conda/conda-bld/pytorch_1616554827596/work/aten/src/THC/THCCachingHostAllocator.cpp:278
```

Here is my .yml file:

```yaml
name: falkon
channels:
  - conda-forge
  - pytorch
  - anaconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - blas=1.0=mkl
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2020.10.14=0
  - certifi=2020.6.20=py37_0
  - cmake=3.18.2=ha30ef3c_0
  - cudatoolkit=10.1.243=h6bb024c_0
  - expat=2.2.10=he6710b0_2
  - ffmpeg=4.3=hf484d3e_0
  - freetype=2.10.4=h5ab3b9f_0
  - gmp=6.2.1=h2531618_2
  - gnutls=3.6.15=he1e5248_0
  - intel-openmp=2020.2=254
  - joblib=1.0.1=pyhd8ed1ab_0
  - jpeg=9b=h024ee3a_2
  - krb5=1.18.2=h173b8e3_0
  - lame=3.100=h7b6447c_0
  - lcms2=2.11=h396b838_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libblas=3.9.0=1_h6e990d7_netlib
  - libcblas=3.9.0=3_h893e4fe_netlib
  - libcurl=7.71.1=h20c2e04_1
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.5.0=h14aa051_19
  - libgfortran4=7.5.0=h14aa051_19
  - libiconv=1.15=h63c8f33_5
  - libidn2=2.3.0=h27cfd23_0
  - liblapack=3.9.0=3_h893e4fe_netlib
  - libpng=1.6.37=hbc83047_0
  - libssh2=1.9.0=h1ba5d50_1
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtasn1=4.16.0=h27cfd23_0
  - libtiff=4.1.0=h2733197_1
  - libunistring=0.9.10=h27cfd23_0
  - libuv=1.40.0=h7b6447c_0
  - lz4-c=1.9.2=heb0550a_3
  - mkl=2020.2=256
  - mkl-service=2.3.0=py37he8ac12f_0
  - mkl_fft=1.2.0=py37h23d657b_0
  - mkl_random=1.1.1=py37h0573a6f_0
  - ncurses=6.2=he6710b0_1
  - nettle=3.7.2=hbbd107a_1
  - ninja=1.10.2=py37hff7bd54_0
  - numpy=1.19.2=py37h54aff64_0
  - numpy-base=1.19.2=py37hfa32c7d_0
  - olefile=0.46=py_0
  - openh264=2.1.0=hd408876_0
  - openssl=1.1.1h=h7b6447c_0
  - pillow=8.0.1=py37he98fc37_0
  - pip=20.3.3=py37h06a4308_0
  - python=3.7.9=h7579374_0
  - python_abi=3.7=1_cp37m
  - pytorch=1.8.1=py3.7_cuda10.1_cudnn7.6.3_0
  - readline=8.0=h7b6447c_0
  - rhash=1.4.0=h1ba5d50_0
  - scikit-learn=0.23.2=py37hddcf8d6_3
  - scipy=1.5.3=py37h8911b10_0
  - setuptools=51.0.0=py37h06a4308_2
  - six=1.15.0=py37h06a4308_0
  - sqlite=3.33.0=h62c20be_0
  - threadpoolctl=2.1.0=pyh5ca1d4c_0
  - tk=8.6.10=hbc83047_0
  - torchaudio=0.8.1=py37
  - torchvision=0.9.1=py37_cu101
  - typing_extensions=3.7.4.3=py_0
  - wheel=0.36.2=pyhd3eb1b0_0
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.5=h9ceee32_0
  - pip:
    - falkon==0.6.3
    - psutil==5.8.0
    - pykeops==1.4.2
prefix: /home/nehap/anaconda3/envs/falkon
```

I'm currently using a single TITAN RTX GPU with 24 GB of memory, and my machine has 128 GB of RAM. The example works if I reduce the number of dimensions from 1536 to 20, but with larger datasets it runs into this issue. We would appreciate any help with this issue - thank you!


Giodiro commented May 3, 2021

Hi!
This seems to be a problem with not having enough pinnable memory. I'm not an expert on how exactly the OS determines the amount of pinnable memory, but from what I've observed it seems related to the amount of free RAM on your machine.
What OS are you on, and how much free RAM do you have when running the example?
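
To get a rough sense of how much memory your system will let PyTorch pin, you could run a small probe like the following (my own sketch, independent of Falkon):

```python
import torch

# Try to pin progressively larger host buffers; the first failure gives a
# rough idea of the OS's pinnable-memory limit.
for gib in (1, 2, 4, 8, 16, 32):
    try:
        buf = torch.empty(gib * 1024**3 // 4, dtype=torch.float32).pin_memory()
        print(f"pinned {gib} GiB OK")
        del buf
    except RuntimeError as err:
        print(f"pinning {gib} GiB failed: {err}")
        break
```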

I see a couple of other issues in your script though:

  1. The number of centers (M) should be much lower than the number of points. Falkon's cost grows cubically with the number of centers, so it makes sense to start with a low M and gradually increase it until you see performance plateau.
  2. If you generate your data with numpy it will be in float64 precision, which your GPU will process very slowly. An easy fix for the slowness is to reduce the precision of your data (e.g. Xtrain = torch.from_numpy(Xtrain).to(dtype=torch.float32); see the sketch below).
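
Putting both points together, a corrected version of your script could look like this (a minimal sketch using the same API as your example; M=1000 is just an illustrative starting value):

```python
import numpy as np
import torch
from falkon.models import Falkon
from falkon.kernels import GaussianKernel

# Use float32: numpy defaults to float64, which GPUs process very slowly.
Xtrain = torch.from_numpy(np.random.randn(80000, 1536)).to(dtype=torch.float32)
Ytrain = torch.from_numpy(np.random.randn(80000, 20)).to(dtype=torch.float32)

kernel = GaussianKernel(sigma=5)
# Keep the number of Nystrom centers M well below the number of points,
# then increase it gradually until performance stops improving.
flk = Falkon(kernel=kernel, penalty=1e-5, M=1000)
flk.fit(Xtrain, Ytrain)
```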


nehap25 commented May 4, 2021

Thank you so much for your response! For the same example, after changing the precision to float32 I was only able to use <= 400 centers; anything more resulted in the error above. I then tried setting pin_memory to False in falkon/preconditioner/flk_preconditioner.py, falkon/models/falkon.py, and falkon/mmv_ops/fmm_cuda.py, and that seemed to help quite a bit: I was able to use 50K centers for the same example without running into the error. Here is the output of `free` while the example was running:

```
              total       used       free     shared  buff/cache  available
Mem:      131944080   33383436   65672288     954712    32888356   96054764
Swap:        999420     998744        676
```

I was wondering if you had any other suggestions on how to deal with this issue.


Giodiro commented May 5, 2021

Hi again, and sorry for the slow replies.

I cannot explain the fact that changing precision changes behaviour so drastically.
May I ask what operating system you are using?

Short term, the fix you applied -- disabling memory pinning -- is fine! Just repeat the process of setting pin_memory to False in other places if you encounter the error again.

Long term, if it turns out that certain hardware/software configurations don't support pinning more than a small amount of RAM, I can wrap the calls in a try/except so that the whole thing doesn't crash and instead falls back to unpinned memory.
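
For reference, the fallback could be as simple as the following sketch (`_try_pin` is a hypothetical helper name, not something currently in the library):

```python
import torch

def _try_pin(tensor: torch.Tensor) -> torch.Tensor:
    # Attempt to page-lock host memory; fall back to the original pageable
    # tensor if the OS refuses (e.g. the "cuda runtime error (304)" above).
    try:
        return tensor.pin_memory()
    except RuntimeError:
        return tensor
```

Call sites such as `ny_points = ny_points.pin_memory()` in falkon/models/falkon.py would then become `ny_points = _try_pin(ny_points)`.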

Thanks for the bug report :)

P.S. While float32 won't necessarily help with the pinning issue itself, you should find that it improves Falkon's running time by quite a bit once you get past the pinning problem.
