Failed CUDA.jl initialization breaks Flux? #1952

Closed
cirobr opened this issue Jun 12, 2023 · 12 comments
Labels
bug (Something isn't working), installation (CUDA is easy to install, right?), needs information (Further information is requested)

Comments

@cirobr

cirobr commented Jun 12, 2023

Cheers,

Using the latest Julia (1.9.1) on an AArch64 instance with no GPU, running Ubuntu Server 22.04 and the latest CUDA package.

Pkg.add("CUDA") gives the following warning:
1 dependency had warnings during precompilation:
┌ Random123 [74087812-796a-5b5d-8853-05524746bad3]
│ ┌ Warning: AES-NI is not enabled, so AESNI and ARS are not available.
│ └ @ Random123 ~/.julia/packages/Random123/u5oEp/src/Random123.jl:55

Despite that, the CUDA package precompiles.

using CUDA, however, is not possible:
julia> using CUDA
┌ Error: Failed to initialize CUDA
│ exception =
│ CUDA error (code 100, CUDA_ERROR_NO_DEVICE)

Regards,

cirobr added the bug label Jun 12, 2023
@maleadt
Member

maleadt commented Jun 12, 2023

> julia> using CUDA
> ┌ Error: Failed to initialize CUDA
> │ exception =
> │ CUDA error (code 100, CUDA_ERROR_NO_DEVICE)

That's just an informational message. If you have the NVIDIA driver installed, CUDA.jl is going to assume that you want the package to work, so it'll let you know if it isn't functional. If you aren't interested in GPU functionality, you should try to avoid loading CUDA.jl (i.e. using Revise, package extensions, preferences, etc).

maleadt closed this as not planned Jun 12, 2023
@cirobr
Author

cirobr commented Jun 12, 2023

> julia> using CUDA
> ┌ Error: Failed to initialize CUDA
> │ exception =
> │ CUDA error (code 100, CUDA_ERROR_NO_DEVICE)
>
> That's just an informational message. If you have the NVIDIA driver installed, CUDA.jl is going to assume that you want the package to work, so it'll let you know if it isn't functional. If you aren't interested in GPU functionality, you should try to avoid loading CUDA.jl (i.e. using Revise, package extensions, preferences, etc).

I am actually using Flux, which calls CUDA by default. Thus, not using CUDA is not an option. The application is a small ML model designed to run on multi-core only.

@cirobr
Author

cirobr commented Jun 12, 2023

Extensive details discussed at Julia Discourse:

https://discourse.julialang.org/t/bring-julia-code-to-embedded-hardware-arm/19979/79?u=cirobr

@maleadt
Member

maleadt commented Jun 12, 2023

AFAIU Flux will be moving to package extensions at some point in the future. For now, you can safely ignore the message, or uninstall the NVIDIA driver to get rid of it.

@cirobr
Author

cirobr commented Jun 12, 2023

The "Error: Failed to initialize CUDA" persists in all three scenarios tested: no driver installed, with nvidia-530-driver, and with nvidia-530-driver-open installed.

In a few attempts, code execution breaks at "using Flux". By resuming from the next line of code, the rest of the code and all Flux calls were executed.

@maleadt
Member

maleadt commented Jun 12, 2023

> The "Error: Failed to initialize CUDA" persists in all three scenarios tested: no driver installed

That is impossible. The NO_DEVICE error comes from libcuda, which is part of the NVIDIA driver.

> In a few attempts, code execution breaks at "using Flux".

The output you quoted above comes from

    @error "Failed to initialize CUDA" exception=(err,catch_backtrace())
    _initialization_error[] = "CUDA initialization failed"

which is an error log message. It does not change execution, so it does not affect the behavior of using Flux (i.e., it does not 'break execution').
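
To illustrate the point about log messages, here is a minimal sketch in plain Julia. It is a hypothetical stand-in, not CUDA.jl's actual initialization code: the names `try_init` and `init_error` are made up for this example, and the failure is simulated with `error`. It only shows that an `@error` call inside a `catch` block records the problem and lets execution continue.

```julia
# Hypothetical stand-in for a package initialization routine (not CUDA.jl code).
const init_error = Ref{Union{Nothing,String}}(nothing)

function try_init()
    try
        # Simulate a failing driver call.
        error("no device")
    catch err
        # Log the failure; this prints a message but does not rethrow.
        @error "Failed to initialize" exception=(err, catch_backtrace())
        init_error[] = "initialization failed"
    end
    return nothing
end

try_init()

# Reached even though initialization failed and an error was logged.
println("execution continues; recorded state: ", init_error[])
```

If code after such a log message stops running, the cause is more likely a separate exception, which should come with its own stack trace.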

@cirobr
Author

cirobr commented Jun 12, 2023

The evidence shows otherwise. Using Flux became impossible without a GPU. Thanks anyway.

@maleadt
Member

maleadt commented Jun 12, 2023

Can you help me understand then? I linked to the source code that generates the output you presented, and nothing there should break precompilation as the messages are purely informational. And the NO_DEVICE error really is generated by libcuda, so I don't see how it can be generated without having (parts of) the NVIDIA/CUDA driver installed.

@maleadt
Member

maleadt commented Jun 12, 2023

> without having (parts of) the NVIDIA/CUDA driver installed.

And to add some more to this point, as we've recently had another user run into that: the NVIDIA driver is often split into multiple parts, so it's possible you removed what provides nvidia.ko (e.g. nvidia-dkms on Arch Linux) without removing what provides libcuda.so (that could be the cuda package on Arch).
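
One way to check for such a partial installation is sketched below, under stated assumptions: the system is Linux, the user-space library can be found under the name `libcuda` or `libcuda.so.1`, and the kernel module appears in `/proc/modules` with a name starting with `nvidia`. The variable names are illustrative only, and the result is a hint rather than a definitive diagnosis.

```julia
using Libdl

# find_library returns the first name that could be loaded, or "" if none could.
libcuda = Libdl.find_library(["libcuda", "libcuda.so.1"])
has_libcuda = !isempty(libcuda)

# Linux-only: loaded kernel modules are listed one per line in /proc/modules.
has_kernel_module = isfile("/proc/modules") &&
    any(line -> startswith(line, "nvidia "), eachline("/proc/modules"))

println("libcuda loadable:     ", has_libcuda)
println("nvidia module loaded: ", has_kernel_module)

if has_libcuda && !has_kernel_module
    # Assumption: this combination matches the split-driver situation described above.
    println("User-space driver found without the kernel module; the driver may be only partly removed.")
end
```

If `has_libcuda` is true on a machine where the driver was supposedly removed, the package that provides `libcuda.so` is probably still installed.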

I agree that it's confusing that CUDA.jl generates an error message only when you have the NVIDIA driver installed, but that's just the heuristic we've currently come up with, as people are likely to want a functional CUDA stack when they have the NVIDIA driver installed. In the future, I plan to make this message an error, "forcing" downstream users to only install CUDA.jl when they want to use GPU support. That will make the situation much clearer; however, it requires that downstream packages like Flux use package extensions. That's not the case yet, so we remain in the situation where CUDA.jl needs to be importable on systems without a GPU, because Flux.jl depends on it unconditionally.

@maleadt
Member

maleadt commented Jun 12, 2023

Let's re-open this until we figure out your problem.

@maleadt maleadt reopened this Jun 12, 2023
maleadt changed the title from "CUDA not working on hardware without GPU" to "Failed CUDA.jl initialization breaks Flux?" Jun 12, 2023
maleadt added the needs information and installation labels Jun 12, 2023
@cirobr
Author

cirobr commented Jun 12, 2023

I have already spent several days on this topic, documenting the issue as well as I could so that an expert can reproduce it. As the claim has already been dismissed (couple of seconds after being opened) and you are certain that "this is impossible", let's wait for comments from others. Sorry for the inconvenience, and thanks for your time.

@maleadt
Member

maleadt commented Jun 12, 2023

> As the claim has already been dismissed (couple of seconds after being opened) and you are certain that "this is impossible"

I did not (intend to) dismiss your problem, only certain elements of your report, like the fact that you mention the exact problem re-occurs after removing the NVIDIA driver, which is in fact impossible. Stating that is a matter of debugging the issue, hoping that it would help you to e.g. fully remove the driver, or otherwise resolve the problem.

In any case, I re-opened this issue in order to help you. Without additional information from your end, it won't be possible to resolve this, so I'll close this again. Note that this is just to prevent unresolved issues from lingering on; feel free to post more information when you want to resolve this issue again.

maleadt closed this as not planned Jun 12, 2023