
[Build] Add a reasonable default for CMAKE_CUDA_COMPILER in *nix #17293

Merged: 5 commits into apache:master, Feb 25, 2020

Conversation

@larroy (Contributor) commented Jan 13, 2020

Description

After recent CMake changes, CMAKE_CUDA_COMPILER is no longer picked up automatically when nvcc is not on the PATH. This sets a reasonable default, /usr/local/cuda/bin/nvcc, which follows the convention used by NVIDIA tooling of symlinking /usr/local/cuda to the default CUDA version.
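
In rough terms, the change does something like the following (a sketch only; the exact condition is in the diff shown further down in the review thread):

  if(NOT MSVC AND NOT DEFINED CMAKE_CUDA_COMPILER AND EXISTS "/usr/local/cuda/bin/nvcc")
    # fall back to the conventional symlink to the default CUDA version
    set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
  endif()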

@larroy requested a review from szha as a code owner, January 13, 2020 23:47
@larroy (Contributor, Author) commented Jan 13, 2020

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 added the pr-awaiting-review (PR is waiting for code review) label, Jan 13, 2020
@larroy (Contributor, Author) commented Jan 13, 2020

@mxnet-label-bot add [Build]

@larroy (Contributor, Author) commented Jan 13, 2020

@leezu

@leezu (Contributor) commented Jan 14, 2020

Why would nvcc not be on the PATH? Could you provide an example system for reference that comes with this setup?

@larroy (Contributor, Author) commented Jan 14, 2020

nvcc is normally not on the PATH on Ubuntu; the NVIDIA packages install it under the path you can see in this PR.

@larroy (Contributor, Author) commented Jan 14, 2020

/usr/local/cuda/bin/nvcc is usually NOT on the PATH.

@leezu (Contributor) commented Jan 15, 2020

To compile software, the compiler must be available: either on $PATH or specified manually.

For C and C++ compilers, this is what the CC and CXX environment variables are for.
For example, CC=gcc-9 CXX=g++-9 cmake .. will prepare the build with GCC 9.

Likewise, if users want to use a non-standard nvcc (i.e. an nvcc that is not on the PATH), they can set CUDACXX: CUDACXX=/usr/local/cuda/bin/nvcc cmake ..

I think it's better to follow standard practice instead of making additional assumptions. For example, we may want to use clang to compile the CUDA files instead of nvcc. If nvcc is not on the PATH, users may reasonably expect that clang will be used.

What do you think?

CMakeLists.txt Outdated
@@ -84,6 +84,10 @@ message(STATUS "CMake version '${CMAKE_VERSION}' using generator '${CMAKE_GENERA
project(mxnet C CXX)
if(USE_CUDA)
cmake_minimum_required(VERSION 3.13.2) # CUDA 10 (Turing) detection available starting 3.13.2
if (NOT MSVC AND (NOT DEFINED CMAKE_CUDA_COMPILER OR "${CMAKE_CUDA_COMPILER}" STREQUAL "CMAKE_CUDA_COMPILER-NOTFOUND"))
@leezu (Contributor) Jan 15, 2020

Would it be required to check CUDACXX?

@larroy (Contributor, Author)

It worked fine with this; could you expand on your concern/question?

@leezu (Contributor)

Your current logic may prevent users from setting the CUDACXX environment variable. That's the standard way of defining the CUDA compiler, and it's good to follow standards to avoid technical debt.
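
For illustration, one way to avoid overriding an explicit CUDACXX would be a guard along these lines (a sketch only, using the fallback path proposed in this PR, not the code under review):

  # only guess a default when neither CMake nor the user has specified a CUDA compiler
  if(NOT MSVC AND NOT DEFINED CMAKE_CUDA_COMPILER AND NOT DEFINED ENV{CUDACXX}
     AND EXISTS "/usr/local/cuda/bin/nvcc")
    set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
  endif()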

CMakeLists.txt (resolved review thread)
@apeforest (Contributor) commented

@larroy In which environment did you see this problem? Could you paste the diagnose.py result here?

@larroy (Contributor, Author) commented Jan 15, 2020

@apeforest Ubuntu 18.04 with the NVIDIA machine learning APT repositories, which is pretty standard. I will update with the diagnose.py output as requested, thanks.

@leezu (Contributor) commented Jan 15, 2020

@larroy when adding the APT repository, users should add the respective folder to the PATH. Is that not documented on the NVIDIA page?

@larroy (Contributor, Author) commented Jan 16, 2020

We never had to do that before; this is happening due to the CMake changes. I applied your suggestion. I would still suggest applying my proposed patch, which makes things smoother for users in 99% of cases.

@larroy (Contributor, Author) commented Jan 16, 2020

----------Python Info----------
('Version      :', '2.7.17')
('Compiler     :', 'GCC 7.4.0')
('Build        :', ('default', 'Nov  7 2019 10:07:09'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
No corresponding pip install for current python.
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
('Platform     :', 'Linux-4.15.0-1054-aws-x86_64-with-Ubuntu-18.04-bionic')
('system       :', 'Linux')
('node         :', '34-222-129-72')
('release      :', '4.15.0-1054-aws')
('version      :', '#56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             1455.803
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.12
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0029 sec, LOAD: 0.4997 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0168 sec, LOAD: 0.3225 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0236 sec, LOAD: 0.1133 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0100 sec, LOAD: 0.0518 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1954 sec, LOAD: 0.2625 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.3784 sec, LOAD: 0.1460 sec.
----------Environment----------

@larroy (Contributor, Author) commented Jan 16, 2020

piotr@34-222-129-72:0:~/mxnet (cmake_cuda_compiler)+$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.3 LTS
Release:        18.04
Codename:       bionic

@larroy (Contributor, Author) commented Jan 16, 2020

#17031

@larroy (Contributor, Author) commented Jan 16, 2020

@mxnet-label-bot add [breaking]

@larroy (Contributor, Author) commented Jan 16, 2020

This fixes #15492

@samskalicky (Contributor) left a comment

LGTM

@ChaiBapchya (Contributor) left a comment

Awesome!

@leezu (Contributor) left a comment

Please clarify whether this breaks the CUDACXX env variable.

Further, this only helps users who didn't install CUDA correctly. See the mandatory actions in the CUDA installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#mandatory-post

I don't think we should add logic to our build system to handle systems in a broken state.

@larroy (Contributor, Author) commented Jan 16, 2020

What do you suggest with respect to https://cmake.org/cmake/help/v3.13/envvar/CUDACXX.html? Should we check whether it is set? I understand your concern. Please suggest a better approach; I'm not a CMake expert, but this was working before without needing to set any paths, even though you are right about the NVIDIA documentation.

I can compile PyTorch just fine without making any additional changes to PATH or the environment. When there's a single CUDA version installed, or symlinked to /usr/local/cuda, we should just pick that one up unless told otherwise.

Please propose changes or alternatives.

@larroy (Contributor, Author) commented Jan 16, 2020

This is the output from the PyTorch build:


-- Found CUDA: /usr/local/cuda (found version "10.2")
-- Caffe2: CUDA detected: 10.2
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 10.2
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- Found cuDNN: v7.6.5  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70
-- Autodetected CUDA architecture(s):  7.0

@leezu (Contributor) commented Jan 16, 2020

@larroy, yes, it's required to check the environment variables (both CUDACXX and PATH) before "falling back" to some default path. But it may be hard to check both environment variables correctly, so let's just rely on CMake to figure out whether it can find nvcc by standard means, and only fall back to the default path if that fails.

Thus I suggest the following approach instead:

  include(CheckLanguage)  # check_language() is provided by the CheckLanguage module
  check_language(CUDA)
  if (NOT CMAKE_CUDA_COMPILER_LOADED AND UNIX AND EXISTS "/usr/local/cuda/bin/nvcc")
    set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
    message(WARNING "CMAKE_CUDA_COMPILER guessed: ${CMAKE_CUDA_COMPILER}")
  endif()

It should be placed at the same position as the changes done in this PR.

My concern is that if nvcc is not on the PATH, users may also have forgotten to set LD_LIBRARY_PATH, which will lead to issues when attempting to load mxnet later.
Thus I think it's preferable to educate users on how to set up their system correctly, instead of attempting to work with broken systems (as we would never be able to catch all the various ways in which a system may be broken).

@larroy (Contributor, Author) commented Jan 16, 2020

Thanks for the clarifications, that makes sense. I don't think LD_LIBRARY_PATH will be an issue in this case, as the mxnet .so points to the right library; I can show that this is the case. I disagree with you regarding "broken system": /usr/local/cuda is the convention for the default CUDA installation, even though it's not really in the NVIDIA documentation. I see your point, but we should just work by default, as before.

https://docs.roguewave.com/en/totalview/2018/html/index.html#page/User_Guides/totalviewug-about-cuda.32.3.html

@leezu (Contributor) commented Jan 16, 2020

LD_LIBRARY_PATH will not cause issues in this case, but it's a related source of problems: it will cause trouble if users install CUDA via the runfile and forget to set LD_LIBRARY_PATH.

In any case, it's just an example of why we can't handle all kinds of broken systems.

Given the updated strategy using check_language(CUDA), it should be fine to merge this PR.

@larroy (Contributor, Author) commented Jan 28, 2020

@leezu then please approve.

@leezu (Contributor) commented Jan 28, 2020

@larroy why not use the approach in #17293 (comment)?

I don't think the PR handles the case described correctly yet.

@leezu (Contributor) commented Feb 24, 2020

@larroy why close this PR?

You can just copy the suggested code change and push, then it can be merged:

  include(CheckLanguage)  # check_language() is provided by the CheckLanguage module
  check_language(CUDA)
  if (NOT CMAKE_CUDA_COMPILER_LOADED AND UNIX AND EXISTS "/usr/local/cuda/bin/nvcc")
    set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
    message(WARNING "CMAKE_CUDA_COMPILER guessed: ${CMAKE_CUDA_COMPILER}")
  endif()

@leezu reopened this Feb 24, 2020
@larroy (Contributor, Author) commented Feb 24, 2020

I don't have much bandwidth left, but if the change is this small I can finish the PR. The Linux GPU pipeline seems to be timing out often, though.

@leezu (Contributor) left a comment

Thanks

@leezu merged commit 8bdf068 into apache:master on Feb 25, 2020
leezu added a commit that referenced this pull request Mar 5, 2020
Fixes a bug in #17293 causing an infinite loop on some systems.
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
Fixes a bug in apache#17293 causing an infinite loop on some systems.
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
* [Build] Add a reasonable default for CMAKE_CUDA_COMPILER in *nix

* CR

* CR

* Update as per CR comments

* include(CheckLanguage)

Co-authored-by: Leonard Lausen <leonard@lausen.nl>
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
Fixes a bug in apache#17293 causing an infinite loop on some systems.
Labels
Breaking, Build, pr-awaiting-review (PR is waiting for code review)

7 participants