Force conda version of cutensor #765

Merged on Jan 24, 2023 (2 commits)

marcinz
Collaborator

@marcinz marcinz commented Jan 24, 2023

No description provided.

@marcinz marcinz added the category:bug-fix PR is a bug fix and will be classified as such in release notes label Jan 24, 2023
@marcinz
Collaborator Author

marcinz commented Jan 24, 2023

@manopapad I was able to force conda's cutensor. It is a little fragile: it currently depends on the difference in build string format between the packages. In any case, with CUDA 11.8 there are errors, which you can see in this CI pipeline. Let me know what you think about these errors, and whether I should try going back to CUDA 11.5.
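
A minimal sketch of why a build-string filter can separate the two channels. It assumes (the thread does not confirm this) that conda-forge builds carry a hash-style build string containing an underscore, while the nvidia channel builds use a bare `0`, as in the `conda search` output quoted in this conversation. The pattern `*_*` and the sample build string `h0123abc_0` are hypothetical, and Python's `fnmatch` only approximates conda's own glob matching.

```python
from fnmatch import fnmatchcase

# Hypothetical glob: accept only build strings that contain an underscore.
BUILD_PATTERN = "*_*"


def matches_build(build: str, pattern: str = BUILD_PATTERN) -> bool:
    """Return True when the build string satisfies the glob pattern."""
    return fnmatchcase(build, pattern)


candidates = {
    "nvidia": "0",                # build string shown by `conda search` in this thread
    "conda-forge": "h0123abc_0",  # made-up example of a hash-style build string
}

# Keep only the channels whose build string passes the filter.
selected = [channel for channel, build in candidates.items() if matches_build(build)]
print(selected)
```

A filter like this stops working as soon as either channel changes its build-string format, which is the fragility mentioned above.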

@marcinz
Collaborator Author

marcinz commented Jan 24, 2023

@m3vaz Is there a better way to select cutensor from the conda forge channel over the one from nvidia?

@manopapad
Contributor

This actually fixes the 1-GPU failures. The other failures are expected, and will be handled by a different workaround. Thanks for taking care of this!

@manopapad
Contributor

I think the issue here is that the cutensor package on the nvidia conda channel is only compatible with CUDA 12, and there are no constraints on the packages to signal that the package is incompatible with the currently installed CTK conda package.

As a matter of fact, there are no (recursive or otherwise) dependencies on a CTK package, or a runtime constraint on the driver version.

There are also no build IDs, but I don't know if that matters.

prm-login:~> conda search -i -c nvidia cutensor=1.6.2.3
Loading channels: done
cutensor 1.6.2.3 0
------------------
file name   : cutensor-1.6.2.3-0.tar.bz2
name        : cutensor
version     : 1.6.2.3
build       : 0
build number: 0
size        : 1 KB
subdir      : linux-64
url         : https://conda.anaconda.org/nvidia/linux-64/cutensor-1.6.2.3-0.tar.bz2
md5         : 6d704d5a4fa923f296dc914bc0b3bcf9
timestamp   : 2023-01-17 17:33:28 UTC
dependencies:
  - cutensor-cuda-12 >=1.6.2.3


prm-login:~> conda search -i -c nvidia cutensor-cuda-12=1.6.2.3
Loading channels: done
cutensor-cuda-12 1.6.2.3 0
--------------------------
file name   : cutensor-cuda-12-1.6.2.3-0.tar.bz2
name        : cutensor-cuda-12
version     : 1.6.2.3
build       : 0
build number: 0
size        : 1 KB
subdir      : linux-64
url         : https://conda.anaconda.org/nvidia/linux-64/cutensor-cuda-12-1.6.2.3-0.tar.bz2
md5         : 01b638476343e979200b4a93a2401c47
timestamp   : 2023-01-17 17:33:18 UTC
dependencies:
  - libcutensor-cuda-12 >=1.6.2.3
  - libcutensor-dev-cuda-12 >=1.6.2.3


prm-login:~> conda search -i -c nvidia libcutensor-cuda-12=1.6.2.3
Loading channels: done
libcutensor-cuda-12 1.6.2.3 0
-----------------------------
file name   : libcutensor-cuda-12-1.6.2.3-0.tar.bz2
name        : libcutensor-cuda-12
version     : 1.6.2.3
build       : 0
build number: 0
size        : 114.7 MB
subdir      : linux-64
url         : https://conda.anaconda.org/nvidia/linux-64/libcutensor-cuda-12-1.6.2.3-0.tar.bz2
md5         : 68f8a0668ba1cd19bae9296e60a3b15b
timestamp   : 2023-01-11 18:16:39 UTC
dependencies: []


prm-login:~> conda search -i -c nvidia libcutensor-dev-cuda-12=1.6.2.3
Loading channels: done
libcutensor-dev-cuda-12 1.6.2.3 0
---------------------------------
file name   : libcutensor-dev-cuda-12-1.6.2.3-0.tar.bz2
name        : libcutensor-dev-cuda-12
version     : 1.6.2.3
build       : 0
build number: 0
size        : 128.6 MB
subdir      : linux-64
url         : https://conda.anaconda.org/nvidia/linux-64/libcutensor-dev-cuda-12-1.6.2.3-0.tar.bz2
md5         : 5e2e1baf2c127abf8341a6bd6a84c277
timestamp   : 2023-01-11 18:17:25 UTC
dependencies:
  - libcublas 12.*
  - libcutensor-cuda-12 >=1.6.2.3


prm-login:~> conda search -i -c nvidia libcublas=12
Loading channels: done
libcublas 12.0.1.189 0
----------------------
file name   : libcublas-12.0.1.189-0.tar.bz2
name        : libcublas
version     : 12.0.1.189
build       : 0
build number: 0
size        : 323.3 MB
subdir      : linux-64
url         : https://conda.anaconda.org/nvidia/linux-64/libcublas-12.0.1.189-0.tar.bz2
md5         : 023e67819d5bcb7fb1f1b0801ec1d0dc
timestamp   : 2022-12-03 02:37:04 UTC
dependencies: []
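
The manual walk through the `conda search -i` output above can be sketched as a small script. This is a sketch under stated assumptions: the dependency table is transcribed by hand from the output in this comment, and `CTK_PACKAGE_NAMES` is a hypothetical set of names that would count as a CUDA Toolkit constraint, not an authoritative list.

```python
# Dependencies transcribed from the `conda search -i -c nvidia ...` output above.
DEPENDENCIES = {
    "cutensor": ["cutensor-cuda-12 >=1.6.2.3"],
    "cutensor-cuda-12": [
        "libcutensor-cuda-12 >=1.6.2.3",
        "libcutensor-dev-cuda-12 >=1.6.2.3",
    ],
    "libcutensor-cuda-12": [],
    "libcutensor-dev-cuda-12": ["libcublas 12.*", "libcutensor-cuda-12 >=1.6.2.3"],
    "libcublas": [],
}

# Hypothetical names treated as "a CTK package" for this check.
CTK_PACKAGE_NAMES = {"cudatoolkit", "cuda-toolkit", "cuda-version"}


def transitive_dependencies(root, table):
    """Collect the names of all packages reachable from root (root excluded)."""
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        for spec in table.get(name, []):
            dep = spec.split()[0]  # drop the version constraint
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


deps = transitive_dependencies("cutensor", DEPENDENCIES)
ctk_constraints = deps & CTK_PACKAGE_NAMES
print(sorted(deps))
print("CTK constraint found:", bool(ctk_constraints))
```

With the table above, no name from the assumed CTK set is reachable from `cutensor`, which matches the observation that the solver has nothing to reject the package on.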

@leofang

leofang commented Jan 24, 2023

btw we cannot yet support cutensor's CUDA 12 flavor on conda-forge. We need to wait for the first wave of tasks outlined in conda-forge/staged-recipes#21382 to be done.

@manopapad
Contributor

Merging this for now, to make CI pass. We will have to fix this properly, before the Nvidia channel packages change their build IDs, breaking this workaround.

@manopapad manopapad merged commit 47d65d9 into nv-legate:branch-22.12 Jan 24, 2023
@manopapad
Contributor

FYI, when running with CUTENSOR_LOG_LEVEL=5 we get definitive evidence of the issue:

[2023-01-24 19:37:30][cuTENSOR][26][Api][cutensorInit] handle=0X7FA95C0120C0
[2023-01-24 19:37:30][cuTENSOR][26][Error][cutensorInit] Initial CUDA call failed with CUDA driver version is insufficient for CUDA runtime version
[2023-01-24 19:37:30][cuTENSOR][26][Api][cutensorGetErrorString] error=18
Internal Legate CUTENSOR failure with error CUTENSOR_STATUS_CUDA_ERROR (18) in file /opt/conda/conda-bld/cunumeric_1674562451773/work/src/cunumeric/cudalibs.cu at line 218
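
To pull the error entries out of a log like the one above, a small filter can help. The sketch below assumes the bracketed layout seen in these lines (timestamp, library tag, a numeric field whose meaning is not stated here, level, function); the regular expression and the `parse_line` helper are illustrative and not part of cuTENSOR.

```python
import re

# Layout inferred from the sample lines above; the third field's meaning is unknown.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[cuTENSOR\]\[(?P<field>\d+)\]"
    r"\[(?P<level>[^\]]+)\]\[(?P<function>[^\]]+)\]\s*(?P<message>.*)$"
)


def parse_line(line):
    """Return the named fields of a cuTENSOR log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


sample = """\
[2023-01-24 19:37:30][cuTENSOR][26][Api][cutensorInit] handle=0X7FA95C0120C0
[2023-01-24 19:37:30][cuTENSOR][26][Error][cutensorInit] Initial CUDA call failed with CUDA driver version is insufficient for CUDA runtime version
[2023-01-24 19:37:30][cuTENSOR][26][Api][cutensorGetErrorString] error=18
"""

# Keep only the entries logged at the Error level.
errors = [
    entry
    for entry in map(parse_line, sample.splitlines())
    if entry is not None and entry["level"] == "Error"
]
for entry in errors:
    print(entry["function"], "->", entry["message"])
```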
