Vast.ai instance - **No module named 'upfirdn2d_plugin'** #72

Closed
dokluch opened this issue Mar 18, 2021 · 18 comments

Comments

@dokluch

dokluch commented Mar 18, 2021

Stuck here big time with ImportError: No module named 'upfirdn2d_plugin'

I am using a vast.ai instance nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   30C    P0    35W / 250W |      0MiB / 16160MiB |      0%      Default |

Conda environment is set with
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch --yes
(doesn't matter if I try a newer one)

What I've tried

First I made sure my VM has CUDA 11.2 installed. Then I installed a newer torch with CUDA 11.1.1, which did not help, so I rolled back (made a new env).

Removed torch_extensions
Just as described here:
#11

Didn't help
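
For readers retrying this step: PyTorch JIT-compiles each extension into its own subdirectory of a build root, and removing a stale or half-built subdirectory forces a rebuild on the next run. The following is a minimal sketch, not the procedure from the linked issue. `TORCH_EXTENSIONS_DIR` is the documented override for the build root; the `~/.cache/torch_extensions` fallback is an assumption that varies between PyTorch versions and platforms, and the helper names `default_build_root` and `clear_plugin_cache` are hypothetical.

```python
import os
import shutil
from pathlib import Path


def default_build_root() -> Path:
    # TORCH_EXTENSIONS_DIR is the documented override. The fallback path is
    # an assumption and differs between PyTorch versions and platforms.
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    return Path(override) if override else Path.home() / ".cache" / "torch_extensions"


def clear_plugin_cache(build_root: Path, plugin_name: str) -> list:
    """Delete cached build directories named `plugin_name`; return what was removed."""
    removed = []
    if not build_root.is_dir():
        return removed
    # Collect the matches first so nothing is deleted while still iterating.
    for candidate in sorted(build_root.rglob(plugin_name)):
        if candidate.is_dir():
            shutil.rmtree(candidate)
            removed.append(candidate)
    return removed


if __name__ == "__main__":
    print(clear_plugin_cache(default_build_root(), "upfirdn2d_plugin"))
```

Clearing the cache only helps when the cached build itself is broken. It cannot help when the compiler or the CUDA toolkit is missing, which is what this thread eventually turns out to be about.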

gcc
I found this thread and
#35

And tried installing gcc7
conda install -c conda-forge/label/gcc7 gcc_linux-64 (didn't help)

and even gcc5
conda install -c psi4 gcc-5
The latter sent me in a weird loop and I've abandoned this path.

This does not help either
#2 (comment)

Google Colab works fine and has Ubuntu 18.04 with gcc 7.5.0 installed, which I am trying to mimic. Hope that is the correct logic.

UPD:
Another instance with gcc 7.5.0 throws the same error as well

gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.

UPD2
Installing gcc 5 as described here: https://askubuntu.com/questions/1087150/install-gcc-5-on-ubuntu-18-04
Did not help either

UPD3
Sorry for not including the traceback originally

Traceback (most recent call last):
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'upfirdn2d_plugin'

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'upfirdn2d_plugin'

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())

Please advise on any possible next steps. I have no idea where to go next.

Originally posted by @dokluch in #2 (comment)

@nurpax
Contributor

nurpax commented Mar 18, 2021

Please post the full stacktrace for the "No module named 'upfirdn2d_plugin'" exception, as requested in the issue template too:

2. See error (please copy&paste full log and stacktraces).

@dokluch
Author

dokluch commented Mar 18, 2021

Just updated the original post with the traceback for generate.py

@nurpax
Contributor

nurpax commented Mar 18, 2021

Somehow the real reason why the cpp extension build fails is not shown. Can you confirm this is on the latest version from GitHub? Can you also post the git commit ID?

See if you get any more information if you apply the suggestion from #39 (comment)

@dokluch
Author

dokluch commented Mar 18, 2021

I have followed the advice to modify those files and what I got is:

Traceback (most recent call last):
  File "generate.py", line 127, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "generate.py", line 119, in generate_images
    img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 490, in forward
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 221, in forward
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 109, in forward
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 84, in bias_act
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 47, in _init
    _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1273, in _write_ninja_file_and_build_library
    check_compiler_abi_compatibility(compiler)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 265, in check_compiler_abi_compatibility
    if not check_compiler_ok_for_platform(compiler):
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 225, in check_compiler_ok_for_platform
    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
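
The last frame above shows PyTorch shelling out to `which c++` and getting a non-zero exit status, meaning no `c++` binary is on `PATH`. A minimal sketch of the same lookup from Python, extended to the other tools that appear in the build logs in this thread (the helper name `find_build_tools` is hypothetical). Note that the build log invokes `nvcc` by absolute path under `/usr/local/cuda/bin`, so a `PATH` hit or miss for `nvcc` is only a hint.

```python
import shutil


def find_build_tools(tools=("c++", "nvcc", "ninja")):
    """Map each tool name to its resolved path, or None when it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_build_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

Any `NOT FOUND` entry for `c++` or `ninja` points at a missing host toolchain rather than at a problem in the repository itself.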

I ran it on the machine with gcc 5.5 installed and got a different error message:

Traceback (most recent call last):
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "/usr/local/envs/stylegan/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "generate.py", line 127, in <module>
    generate_images() # pylint: disable=no-value-for-parameter
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "generate.py", line 119, in generate_images
    img = G(z, label, truncation_psi=truncation_psi, noise_mode=noise_mode)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 490, in forward
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 221, in forward
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "<string>", line 109, in forward
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 88, in bias_act
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.py", line 51, in _init
    _plugin = custom_ops.get_plugin('bias_act_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
    error_prefix="Error building extension '{}'".format(name))
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'bias_act_plugin': [1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o 
FAILED: bias_act.cuda.o 
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o 
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
[2/3] c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o 
FAILED: bias_act.o 
c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o 
In file included from /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp:10:0:
/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:30: fatal error: cuda_runtime_api.h: No such file or directory
compilation terminated.
ninja: build stopped: subcommand failed.

P.S. The irony is that my Windows machine is happily working with this repository while Ubuntu fails.

@nurpax
Contributor

nurpax commented Mar 18, 2021

Are you sure you can't run Docker on this machine? It's usually an easy way to fix stuff like this.

Anyway, your run with GCC 5.5 gets a lot further, so at least there's some progress.

This error:

c++ -MMD -MF bias_act.o.d -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp -o bias_act.o 
In file included from /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cpp:10:0:
/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:30: fatal error: cuda_runtime_api.h: No such file or directory
compilation terminated.

seems to suggest the compilation cannot find some cuda headers. In my containers it's here:

root@7367a65ac3a5:/workspace# ls /usr/local/cuda/include/cuda_runtime_api.h 
/usr/local/cuda/include/cuda_runtime_api.h

Do you have CUDA installed in the first place? There's another error here that indicates it can't even find the CUDA compiler:

/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=bias_act_plugin -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/TH -isystem /usr/local/envs/stylegan/lib/python3.7/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/local/envs/stylegan/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' --use_fast_math -std=c++14 -c /root/stylegan2-ada-pytorch/torch_utils/ops/bias_act.cu -o bias_act.cuda.o 
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found

@dokluch
Author

dokluch commented Mar 18, 2021

Vast.ai support answered that I can't reinstall CUDA; I can only get a new instance with the CUDA version of my choice, which I did.
I am going to try to use Docker for this, but first I need a crash course on it, since I've never used it in a real-world scenario.

UPD. I can't run Docker since their instances are already inside Docker.

@nurpax
Contributor

nurpax commented Mar 18, 2021

Bummer that you can't use Docker. I'm not sure how much more help I can give apart from what I've already given above.

I guess you'll have to work through the CUDA compilation issues on these instances. For example, why is nvcc not found when the extension gets built? Look through the file system on the vast.ai instance: does /usr/local/cuda exist, can you find nvcc in the expected location, and are the CUDA header files there too?

If the CUDA toolkit is installed in some non-standard location, maybe you can point PyTorch to it by setting CUDA_HOME appropriately? See https://pytorch.org/docs/stable/cpp_extension.html and torch.utils.cpp_extension.load for additional clues.
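
The file-system checks suggested here can be scripted. A minimal sketch, assuming the usual toolkit layout of `bin/nvcc` and `include/cuda_runtime_api.h` under the CUDA root (the two paths that fail in the build log above); the helper name `check_cuda_home` is hypothetical.

```python
import os
from pathlib import Path


def check_cuda_home(cuda_home):
    """Report whether the compiler and the runtime header exist under `cuda_home`."""
    root = Path(cuda_home)
    return {
        "nvcc": (root / "bin" / "nvcc").is_file(),
        "cuda_runtime_api.h": (root / "include" / "cuda_runtime_api.h").is_file(),
    }


if __name__ == "__main__":
    # CUDA_HOME is the variable mentioned above; /usr/local/cuda is the path
    # that appears in the failing build log.
    home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    print(home, check_cuda_home(home))
```

If both entries come back `False` for every plausible root, the toolkit is most likely not installed at all, and setting `CUDA_HOME` will not help.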

@dokluch
Author

dokluch commented Mar 19, 2021

Thank you for your time. I am going to go back to square one and try to do this all over again and hope it works. Or rent an instance somewhere else.

@dokluch
Author

dokluch commented Mar 20, 2021

By the way, I just analyzed my Windows logs and found that upfirdn2d is indeed not building properly there either. Though this is a one-time error and it doesn't spam like in the previous cases:

C:\Code\ML\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1539, in _run_ninja_build
    env=env)
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Code\ML\stylegan2-ada-pytorch\torch_utils\ops\upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "C:\Code\ML\stylegan2-ada-pytorch\torch_utils\custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1202, in _jit_compile
    with_cuda=with_cuda)
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1300, in _write_ninja_file_and_build_library
    error_prefix="Error building extension '{}'".format(name))
  File "C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1555, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'upfirdn2d_plugin': [1/1] "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64/link.exe" upfirdn2d.o upfirdn2d.cuda.o /nologo /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib torch_python.lib /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\libs /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\lib/x64" cudart.lib /out:upfirdn2d_plugin.pyd

FAILED: upfirdn2d_plugin.pyd 

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\Hostx64\x64/link.exe" upfirdn2d.o upfirdn2d.cuda.o /nologo /DLL c10.lib c10_cuda.lib torch_cpu.lib torch_cuda.lib -INCLUDE:?warp_size@cuda@at@@YAHXZ torch.lib torch_python.lib /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\libs /LIBPATH:C:\Users\admin\.conda\envs\stylegan-pytorch\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\lib/x64" cudart.lib /out:upfirdn2d_plugin.pyd

LINK : fatal error LNK1104: cannot open file 'upfirdn2d_plugin.pyd'

ninja: build stopped: subcommand failed.



  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

@dokluch
Author

dokluch commented Mar 21, 2021

UPD. Vast.ai issue fixed by choosing a "devel" type Ubuntu image instead of "runtime": the runtime image does not have nvcc and gcc, and it's impossible to properly install them.
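
To make the distinction concrete: the image named at the top of this issue, `nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04`, carries the flavor in its tag. A minimal sketch that reads the flavor out of an image name before renting an instance, assuming the tag uses dash-separated components as in the names quoted in this thread (the helper name `image_flavor` is hypothetical).

```python
def image_flavor(image):
    """Return 'devel', 'runtime', or 'unknown' based on the tag of a Docker image name."""
    # Split off the tag: "nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04" -> "11.2.1-..."
    _, _, tag = image.rpartition(":")
    parts = tag.split("-")
    for flavor in ("devel", "runtime"):
        if flavor in parts:
            return flavor
    return "unknown"


if __name__ == "__main__":
    print(image_flavor("nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04"))  # runtime
    print(image_flavor("1.8.0-cuda11.1-cudnn8-devel"))                    # devel
```

Only a `devel` result indicates an image that should be able to compile the custom CUDA ops.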

dokluch closed this as completed Mar 21, 2021

@NotNANtoN

@dokluch Hi, could you share how exactly you set the vast.ai instance up for stylegan training? It would be amazing if you could share the exact name of the image you used and the on-start script!

Is it as simple as choosing 1.8.0-cuda11.1-cudnn8-devel as the image, or do I need to install nvidia-cuda-toolkits, gcc etc. on top of it?

@dokluch
Author

dokluch commented Apr 23, 2021

That's pretty much it. You choose an nvidia/cuda image with the appropriate CUDA version.

You don't have to install gcc, toolkit etc. Docker won't let you anyway. Then SSH to the instance and start training.

I install miniconda and then run

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 tensorboard -c pytorch --yes

pip install click psutil scipy requests tqdm pyspng ninja imageio imageio-ffmpeg==0.4.3 ipywidgets jupyterlab

If you need UI, then start jupyter lab from SSH. Here's a guide on that: https://gist.github.com/hsed/197ded8431bb545dffefb742dab5efb8
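
Once the packages above are installed, it may be worth confirming what PyTorch actually sees before starting a long training run. A minimal sketch, assuming nothing beyond the standard `torch` attributes used below (the helper name `describe_torch_env` is hypothetical); it returns a placeholder instead of failing when PyTorch is not importable.

```python
def describe_torch_env():
    """Collect the versions relevant to building the custom CUDA ops."""
    try:
        import torch
        from torch.utils.cpp_extension import CUDA_HOME
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "built_for_cuda": torch.version.cuda,     # CUDA version the wheel targets
        "gpu_visible": torch.cuda.is_available(),
        "cuda_home": CUDA_HOME,                   # None when no toolkit was located
    }


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

A `cuda_home` of `None` means PyTorch could not locate a toolkit, in which case the extension build is expected to fail in the same way as earlier in this thread.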

@flyywh

flyywh commented May 29, 2021

The solution is cool.

@gtnbssn

gtnbssn commented May 30, 2021

Banging my head on this issue too... Which miniconda did you install? The StyleGAN docs say we should use python3.7 64 bits, but that installer is missing on the miniconda installers page... https://docs.conda.io/en/latest/miniconda.html#linux-installers it's got 32 bits for python3.7.

Also that docker instance comes very bare bones, no man, no vim. But your conda and pip commands should be enough?

Thanks a lot for all the pointers! I might finally see this through tonight...

@jannehellsten

Later Python versions should work fine too. I regularly run StyleGAN2 pytorch with Python 3.8 and 3.9.

@gtnbssn

gtnbssn commented May 30, 2021

It is finally working, phewwww. Thank you so much!

So indeed, future confused users, just go straight for the docker image and enjoy your training!

@Mo-Irene

@dokluch Hi, I encountered exactly the same problem as you. My error showed that nvcc could not be found, and the file cuda_runtime_api.h could not be found either. But other compilation tasks with nvcc work without problems, so I don't know why this build fails. I am running on my local host; this is my machine information:

Ubuntu 16.04, PyTorch 1.9.0, Python 3.7, CUDA 11.3, gcc 5.4.0, RTX Titan

I have tried all the methods in this issue but the problem is still not solved. I don’t know if something is wrong with my Ubuntu system. I hope to get some of your comments and opinions. I haven’t tried Docker yet, and I don’t know whether moving to Docker for training is my only option as a next step.

I look forward to any advice and suggestions.

@alittlecanarybird

@dokluch Hi, could you share how exactly you set the vast.ai instance up for stylegan training? It would be amazing if you could share the exact name of the image you used and the on-start script!
Is it as simple as choosing 1.8.0-cuda11.1-cudnn8-devel as the image, or do I need to install nvidia-cuda-toolkits, gcc etc. on top of it?

That's pretty much it. You choose an nvidia/cuda image with the appropriate CUDA version.

You don't have to install gcc, toolkit etc. Docker won't let you anyway. Then SSH to the instance and start training.

I install miniconda and then run

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 tensorboard -c pytorch --yes

pip install click psutil scipy requests tqdm pyspng ninja imageio imageio-ffmpeg==0.4.3 ipywidgets jupyterlab

If you need UI, then start jupyter lab from SSH. Here's a guide on that: https://gist.github.com/hsed/197ded8431bb545dffefb742dab5efb8

I can add that miniconda with Python 3.9 doesn't work (current latest version), while miniconda with Python 3.8 works like a charm.
