
ConvTranspose can cause Julia crash on GPU #2193

Closed
yl4070 opened this issue Feb 18, 2023 · 11 comments
@yl4070

yl4070 commented Feb 18, 2023

Initially, I thought it only caused an error when the channel counts were mismatched. But on 0.13.12, it crashes Julia even when the channels match.

ConvTranspose seems to work regardless of the channel count on CPU, but on GPU it returns an opaque CUDA error (CUDNN_STATUS_BAD_PARAM (code 3)) on older versions (0.13.4), and crashes Julia on 0.13.12.

A minimal example:

using Flux

x = rand(Float32, 32, 32, 4, 1) |> gpu
nn = ConvTranspose((4,4), 2 => 2) |> gpu
nn(x)

This returns the CUDA error mentioned above, or just crashes, but runs normally on CPU. On 0.13.12, even after I changed the ConvTranspose parameter from 2 => 2 to 4 => 2, it still crashes Julia; it may require more parameter validity checks.
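For reference, the matching-channel variant (which still crashes for me on 0.13.12) looks like this; the expected output shape comes from a setup where it does run:

```julia
using Flux

# Input with 2 channels, matching the layer's 2 => 2 channel spec
x2 = rand(Float32, 32, 32, 2, 1) |> gpu
nn = ConvTranspose((4,4), 2 => 2) |> gpu
nn(x2)  # on a working setup, a 35×35×2×1 output
```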

@ToucheSir
Member

Your input has 4 channels but the conv layer only expects to take 2. CUDNN_STATUS_BAD_PARAM is warning about that, but those errors are not very useful. Luckily, we recently merged some functionality which should catch these size mismatches early and generate much better errors. I've just tagged a new version with this change.

@yl4070
Author

yl4070 commented Feb 18, 2023

I'm aware of the channel mismatch. But even after I change to the right channel counts, 0.13.12 still crashes. Not sure if it's specific to my machine: it briefly shows Could not locate cudnn_ops_infer64_8.dll, and Julia (the VS Code extension) crashes. I tried the same code (correct channels) on 0.13.4, and it ran without error. Not sure if there's another issue behind this.

@ToucheSir
Member

That seems like a bigger error, and not one I've seen before. If I had to guess, something about your CUDA setup is broken. Can you post the version of CUDA.jl and the output of CUDA.versioninfo()?

@yl4070
Author

yl4070 commented Feb 18, 2023

Here's the CUDA version info:

CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 528.24.0

Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+528.24

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 3070 Ti Laptop GPU (sm_86, 7.665 GiB / 8.000 GiB available)  

On the setup where the code works, the CUDA.jl version is 3.13.1; on the one that crashes, it's 4.0.1.

@mcabbott
Member

Cannot reproduce, at least if I understood correctly and the claim is that nn(x2) fails:

julia> using Flux, CUDA

julia> x = rand(Float32, 32, 32, 4, 1) |> gpu;

julia> nn = ConvTranspose((4,4), 2 => 2) |> gpu;

julia> nn(x) |> summary  # correctly fails
ERROR: DimensionMismatch: layer ConvTranspose((4, 4), 2 => 2) expects size(input, 3) == 2, but got 32×32×4×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}

julia> x2 = rand(Float32, 32, 32, 2, 1) |> gpu;

julia> nn(x2) |> summary
"35×35×2×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}"

(@v1.10) pkg> st Flux CUDA
Status `~/.julia/environments/v1.10/Project.toml`
  [052768ef] CUDA v4.0.1
  [587475ba] Flux v0.13.13

julia> CUDA.device()
CuDevice(0): Tesla V100-PCIE-16GB

julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.6
NVIDIA driver 510.47.3

Libraries: 
- CUBLAS: 11.9.2
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+510.47.3

Toolchain:
- Julia: 1.10.0-DEV.220
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

6 devices:
  0: Tesla V100-PCIE-16GB (sm_70, 14.361 GiB / 16.000 GiB available)

@yl4070
Author

yl4070 commented Feb 18, 2023

Not sure of the cause; maybe it depends on hardware. I'm able to reproduce nn(x2) failing on my desktop as well with 0.13.13 and CUDA 4.0.1, and it also runs without error there on 0.13.4.

CUDA.versioninfo:

CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.7
NVIDIA driver 516.59.0

Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0   
- NVML: 11.0.0+516.59

Toolchain:
- Julia: 1.8.5       
- LLVM: 13.0.1       
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 2070 SUPER (sm_75, 7.293 GiB / 8.000 GiB available)

@yl4070
Author

yl4070 commented Feb 19, 2023

The problem seems to be related to CUDA 4.0.1: after downgrading to 3.13, the code runs without error, but the error reappears after upgrading back to 4.0.1.
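For anyone trying the same workaround, the downgrade can be done from the Pkg REPL (pinning to the versions mentioned in this thread):

```julia
# Press ] at the julia> prompt to enter Pkg mode, then:
# (@v1.8) pkg> add CUDA@3.13    # downgrade to the working version
# (@v1.8) pkg> add CUDA@4.0.1   # later, to return to the crashing version
```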

@ToucheSir
Member

It's odd, because our CI should test this and it's been green as well. Can you share the output of (non-CUDA.jl) versioninfo(), cuDNN.version(), and cuDNN.cuda_version()? (You may need to ] add cuDNN.)
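A REPL sketch of those checks, assuming the cuDNN wrapper package is registered under that name:

```julia
using Pkg
Pkg.add("cuDNN")        # if not already installed

versioninfo()           # base Julia build and platform info

using cuDNN
cuDNN.version()         # version of the cuDNN library in use
cuDNN.cuda_version()    # CUDA version cuDNN was built against
```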

@yl4070
Author

yl4070 commented Feb 19, 2023

Here's the versioninfo:

Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

It looks like cuDNN couldn't be set up properly. When I add cuDNN on my laptop, there are some path-redefinition warnings and an error (I can't remember the details). There's no error when I add cuDNN on my desktop, but both machines crash when I try to execute cuDNN.version(). The error message is the same as before: Could not locate cudnn_ops_infer64_8.dll.

@ToucheSir
Member

If you can reproduce this with just the cuDNN Julia package loaded, do you mind filing an issue over on https://github.com/JuliaGPU/CUDA.jl?

@yl4070
Author

yl4070 commented Feb 20, 2023

Yeah, I'll file an issue there; I think we can close this one then.
