"Could not create cudnn handle" error #79

Closed
gregszumel opened this issue Mar 15, 2024 · 3 comments

@gregszumel

Hi - I'm not 100% sure this is an EXLA error, but it's my best guess. I'm running into an issue when trying to do ops on tensors on CUDA (see below). Do you know what might be causing this? I've tried a few things (playing with :preallocate and :memory_fraction, reinstalling cuDNN, downgrading CUDA, etc.), but nothing has worked so far. I have verified that cuDNN was installed properly through here.

# running Nx -> 0.7.1, Exla -> 0.7.1, xla -> 0.6.0
iex(1)> t = Nx.tensor([1], backend: EXLA.Backend)

08:25:16.940 [info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

08:25:16.942 [info] XLA service <service> initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

08:25:16.942 [info]   StreamExecutor device (0): NVIDIA RTX A6000, Compute Capability 8.6

08:25:16.942 [info] Using BFC allocator.

08:25:16.942 [info] XLA backend allocating 45932072140 bytes on device 0 for BFCAllocator.
#Nx.Tensor<
  s64[1]
  EXLA.Backend<cuda:0, 0.2762047049.2204500040.162304>
  [1]
>

iex(2)> Nx.add(t, t)

08:23:37.926 [error] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

08:23:37.926 [error] Memory usage: 4734255104 bytes free, 51035635712 bytes total.
** (RuntimeError) DNN library initialization failed. Look at the errors above for more details.
    (exla 0.7.1) lib/exla/mlir/module.ex:127: EXLA.MLIR.Module.unwrap!/1
    (exla 0.7.1) lib/exla/mlir/module.ex:113: EXLA.MLIR.Module.compile/5
    (stdlib 5.2.1) timer.erl:270: :timer.tc/2
    (exla 0.7.1) lib/exla/defn.ex:599: anonymous fn/12 in EXLA.Defn.compile/8
    (exla 0.7.1) lib/exla/mlir/context_pool.ex:10: anonymous fn/3 in EXLA.MLIR.ContextPool.checkout/1
    (nimble_pool 1.0.0) lib/nimble_pool.ex:349: NimblePool.checkout!/4
    (exla 0.7.1) lib/exla/defn/locked_cache.ex:36: EXLA.Defn.LockedCache.run/2
    iex:1: (file)
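
For reference, a minimal sketch of how the :preallocate and :memory_fraction client options mentioned above can be set in config/config.exs (the values here are illustrative, not a recommendation):

# config/config.exs
import Config

# EXLA client options for the CUDA device; :preallocate and :memory_fraction
# control how much GPU memory the BFC allocator reserves up front.
config :exla, :clients,
  cuda: [platform: :cuda, preallocate: false, memory_fraction: 0.5]

# Route Nx tensors/ops to the EXLA backend on that client.
config :nx, default_backend: {EXLA.Backend, client: :cuda}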

Versions

  • OS: Ubuntu 22.04
  • Nvidia driver version: 545.29.06
  • CUDA version
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
  • cuDNN: installed via here and verified it works using the verification steps in the install guide
@polvalente

What's your cuDNN version? IIRC we require cuDNN 8, not cuDNN 9.
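
A quick way to check, as a sketch from iex — this assumes cuDNN was installed from the Ubuntu packages so the version header sits at /usr/include/cudnn_version.h (older installs define these in cudnn.h instead):

# Read the cuDNN major version straight from the header file.
header = File.read!("/usr/include/cudnn_version.h")
[_, major] = Regex.run(~r/#define CUDNN_MAJOR (\d+)/, header)
IO.puts("cuDNN major version: #{major}")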

@gregszumel

It is 9! I'll downgrade and report back

@gregszumel

Fixed, thanks for the speedy reply!
