
ConvTranspose can cause Julia crash on GPU #2193

Closed
yl4070 opened this issue Feb 18, 2023 · 11 comments
@yl4070

yl4070 commented Feb 18, 2023

Initially, I thought it only caused an error when the channel counts were mismatched. But on 0.13.12, it crashes Julia even when the channels match.

ConvTranspose seems to work regardless of the channel count on CPU, but on GPU it returns an opaque CUDA error (CUDNN_STATUS_BAD_PARAM (code 3)) on older versions (0.13.4), and crashes Julia on 0.13.12.

A minimal example:

using Flux

x = rand(Float32, 32, 32, 4, 1) |> gpu
nn = ConvTranspose((4,4), 2 => 2) |> gpu
nn(x)

This returns the CUDA error mentioned above, or just crashes, but runs normally on CPU. On 0.13.12, even after I changed the ConvTranspose parameter from 2 => 2 to 4 => 2, it still crashes Julia; it may require more parameter validity checks.
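For reference, the matching-channel variant (which still crashes for me on 0.13.12) looks like this; the expected output shape comes from a setup where it does run:

```julia
using Flux

# Input with 2 channels, matching the layer's 2 => 2 channel spec
x2 = rand(Float32, 32, 32, 2, 1) |> gpu
nn = ConvTranspose((4,4), 2 => 2) |> gpu
nn(x2)  # on a working setup, a 35×35×2×1 output
```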

@ToucheSir
Member

Your input has 4 channels but the conv layer only expects to take 2. CUDNN_STATUS_BAD_PARAM is warning about that, but those errors are not very useful. Luckily, we recently merged some functionality which should catch these size mismatches early and generate much better errors. I've just tagged a new version with this change.

@yl4070
Author

yl4070 commented Feb 18, 2023

I'm aware of the channel mismatch. But even after I change to the right channel counts, 0.13.12 still crashes. Not sure if it's specific to my machine: it briefly shows Could not locate cudnn_ops_infer64_8.dll, and Julia (the VS Code extension) crashes. I tried the same code (correct channels) on 0.13.4, and it ran without error. Not sure if there's another issue behind this.

@ToucheSir
Member

That seems like a bigger error, and not one I've seen before. If I had to guess, something about your CUDA setup is broken. Can you post the version of CUDA.jl and the output of CUDA.versioninfo()?

@yl4070
Author

yl4070 commented Feb 18, 2023

Here's the CUDA version info:

CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 528.24.0

Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+528.24

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 3070 Ti Laptop GPU (sm_86, 7.665 GiB / 8.000 GiB available)  

On the setup where the code works, the CUDA.jl version is 3.13.1; on the one that crashes, it's 4.0.1.

@mcabbott
Member

Cannot reproduce, at least if I understood correctly and the claim is that nn(x2) fails:

julia> using Flux, CUDA

julia> x = rand(Float32, 32, 32, 4, 1) |> gpu;

julia> nn = ConvTranspose((4,4), 2 => 2) |> gpu;

julia> nn(x) |> summary  # correctly fails
ERROR: DimensionMismatch: layer ConvTranspose((4, 4), 2 => 2) expects size(input, 3) == 2, but got 32×32×4×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}

julia> x2 = rand(Float32, 32, 32, 2, 1) |> gpu;

julia> nn(x2) |> summary
"35×35×2×1 CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}"

(@v1.10) pkg> st Flux CUDA
Status `~/.julia/environments/v1.10/Project.toml`
  [052768ef] CUDA v4.0.1
  [587475ba] Flux v0.13.13

julia> CUDA.device()
CuDevice(0): Tesla V100-PCIE-16GB

julia> CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.6
NVIDIA driver 510.47.3

Libraries: 
- CUBLAS: 11.9.2
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+510.47.3

Toolchain:
- Julia: 1.10.0-DEV.220
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

6 devices:
  0: Tesla V100-PCIE-16GB (sm_70, 14.361 GiB / 16.000 GiB available)

@yl4070
Author

yl4070 commented Feb 18, 2023

Not sure of the cause; maybe it depends on hardware. I'm able to reproduce nn(x2) failing on my desktop as well with 0.13.13 and CUDA 4.0.1, and it also runs without error there on 0.13.4.

CUDA.versioninfo:

CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.7
NVIDIA driver 516.59.0

Libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0   
- NVML: 11.0.0+516.59

Toolchain:
- Julia: 1.8.5       
- LLVM: 13.0.1       
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA GeForce RTX 2070 SUPER (sm_75, 7.293 GiB / 8.000 GiB available)

@yl4070
Author

yl4070 commented Feb 19, 2023

The problem seems to be related to CUDA 4.0.1: after downgrading to 3.13, the code runs without error, but the error reappears after upgrading back to 4.0.1.
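For anyone trying the same workaround, the downgrade can be done from the Pkg REPL (pinning to the versions mentioned in this thread):

```julia
# Press ] at the julia> prompt to enter Pkg mode, then:
# (@v1.8) pkg> add CUDA@3.13    # downgrade to the working version
# (@v1.8) pkg> add CUDA@4.0.1   # later, to return to the crashing version
```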

@ToucheSir
Member

It's odd, because our CI should test this and it's been green as well. Can you share the output of (non-CUDA.jl) versioninfo(), cuDNN.version(), and cuDNN.cuda_version()? (You may need to ] add cuDNN.)
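A REPL sketch of those checks, assuming the cuDNN wrapper package is registered under that name:

```julia
using Pkg
Pkg.add("cuDNN")        # if not already installed

versioninfo()           # base Julia build and platform info

using cuDNN
cuDNN.version()         # version of the cuDNN library in use
cuDNN.cuda_version()    # CUDA version cuDNN was built against
```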

@yl4070
Author

yl4070 commented Feb 19, 2023

Here's the versioninfo:

Julia Version 1.8.5
Commit 17cfb8e65e (2023-01-08 06:45 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 20 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

It looks like cuDNN couldn't be set up properly. When I add cuDNN on my laptop, there are some path-redefinition warnings and an error (I can't remember the details). There's no error when I add cuDNN on my desktop, but both machines crash when I try to execute cuDNN.version(). The error message is the same as before: Could not locate cudnn_ops_infer64_8.dll.

@ToucheSir
Member

If you can reproduce this with just the cuDNN Julia package loaded, do you mind filing an issue over on https://github.com/JuliaGPU/CUDA.jl?

@yl4070
Author

yl4070 commented Feb 20, 2023

Yeah, I'll file an issue there; I think we can close this one then.
