
Power8/P100 node pytorch compilation from source with cuda 10.1: bus error - out of memory #31438

Closed
den-run-ai opened this issue Dec 18, 2019 · 18 comments
Labels
high priority module: build Build system issues module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@den-run-ai
Contributor

den-run-ai commented Dec 18, 2019

🐛 Bug

CMake Error: Generator: execution of make failed. Make command was: /denfromufa/anaconda3/envs/pytorch.1.3.ompi4/bin/ninja -j 160 install 
Bus error

More details in this full traceback:

pytorch.openmpi.cuda.build.error.txt

To Reproduce

Steps to reproduce the behavior:

  conda create -n pytorch.1.3.ompi4 python=3.6
  conda activate pytorch.1.3.ompi4
  conda install numpy ninja pyyaml setuptools cmake cffi
  conda install magma
  git clone --recursive https://github.com/pytorch/pytorch

  module load openmpi/4.0.1
  module load gcc/7.3.0 openmpi/4.0.1
  CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
  which mpirun
  which mpicxx
  mpicxx
  python setup.py install   # this build works without CUDA
  module load cuda/10.1
  python setup.py clean
  python setup.py install   # this build fails with CUDA 10.1

Environment

Please copy and paste the output from our environment collection script:

python torch/utils/collect_env.py
Collecting environment information...
PyTorch version: 1.4.0a0+47766e6
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Red Hat Enterprise Linux Server 7.4 (Maipo)
GCC version: (GCC) 7.3.0
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.105
GPU models and configuration: 
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB

Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.4
[pip] torch==1.4.0a0+47766e6
[pip] torchvision==0.2.2.post3
[conda] magma                     2.5.1             1583.g04741a4    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
[conda] nomkl                     3.0                           0  
[conda] torch                     1.4.0a0+47766e6          pypi_0    pypi
[conda] torchvision               0.2.2.post3              pypi_0    pypi

cc @ezyang @gchanan @zou3519 @ngimel

@den-run-ai
Contributor Author

Same problem if I switch from openmpi to spectrum-mpi.

@den-run-ai
Contributor Author

Same problem if I remove MPI completely and just build with CUDA from the module (installed as a distro package).

When I switched to the cudatoolkit-dev package from the powerai conda channel, it killed the development node with 4 GPUs 😮 See the attached traceback:

node.killed.pytorch.compile.source.txt

@cpuhrsch
Contributor

We're happy to accept a PR to resolve this issue.

@den-run-ai
Contributor Author

@cpuhrsch I don't know the cause of this issue yet, still troubleshooting.

@ezyang
Contributor

ezyang commented Jan 2, 2020

Typically a "bus error" means you ran out of memory. Try reducing parallelism, e.g. with -j1.

@hartb
Contributor

hartb commented Jan 2, 2020

In the "node killed" case, do you mean that the system crashed / rebooted? There's nothing in the pytorch build that should be able to cause that, so if so I'd suspect some issue with the system environment more generally.

Do you see anything interesting (e.g. any warnings, "BUG", oops, or "EEH" notifications) in the system log / dmesg? If the problem is easily recreatable, could you capture the console during an event?

I see above that you're running RHEL 7.4 with the 418.39 GPU driver. If you can easily update to latest RHEL 7 and 418 GPU driver, that would at least rule out any known kernel or driver issues.

@den-run-ai
Contributor Author

@ezyang this may seem obvious to you, but how do I pass -j1 to cmake via setup.py?

@hartb
Contributor

hartb commented Jan 16, 2020

If I may... I haven't tried it, but it looks like setting MAX_JOBS in the environment will do it:

https://github.com/pytorch/pytorch/blob/master/setup.py#L11
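For context, here is a minimal Python sketch of how a setup.py-style build can honor that variable. MAX_JOBS is the environment variable PyTorch's setup.py reads; the helper name below is hypothetical.

```python
import multiprocessing
import os


def max_build_jobs() -> int:
    """Hypothetical helper mirroring the MAX_JOBS convention:
    an explicit MAX_JOBS in the environment caps build parallelism,
    otherwise fall back to the detected CPU count."""
    max_jobs = os.environ.get("MAX_JOBS")
    if max_jobs is not None:
        return int(max_jobs)
    return multiprocessing.cpu_count()
```

Running with MAX_JOBS=1 would then invoke ninja as `-j 1`, trading build time for a much smaller peak memory footprint.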

@den-run-ai
Contributor Author

@hartb I think you are right, and now I have to wait for ages!


@den-run-ai
Contributor Author

OK, I'm at [2384/2887], so the issue above is resolved!

@den-run-ai
Contributor Author

OK, the build eventually failed due to this unresolved build error:

#32083

@hartb
Contributor

hartb commented Jan 17, 2020

With CUDA 10.1, you may need to make sure your tree has: 83cf947

@den-run-ai
Contributor Author

den-run-ai commented Jan 22, 2020

@hartb it seems it is not that simple:

#32083

@hartb
Contributor

hartb commented Jan 22, 2020

Ah; this is my fault. I pointed you to the wrong fix for this. Sorry!

You'll want to revert 83cf947

Then ensure the guard code mentioned in #32083 is present (or absent) depending on the exact CUDA 10.1 version you have. If you do need the guard, you'll also need to tweak the version check in it. I've added more details over in #32083.
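The decision hartb describes can be sketched as a version comparison. Purely illustrative: the real guard is a compile-time preprocessor check in SparseCUDABlas.cu, and the "10.1.243" cutoff here is an assumed value, not taken from the source.

```python
def needs_error_string_fallback(cuda_version: str,
                                api_added_in: str = "10.1.243") -> bool:
    """Hypothetical check: True if this CUDA toolkit predates the point
    release assumed to ship cusparseGetErrorString(), meaning a local
    fallback (the guard code) would still be needed."""
    def parse(version: str) -> tuple:
        # "10.1.105" -> (10, 1, 105), so tuples compare numerically
        return tuple(int(part) for part in version.split("."))
    return parse(cuda_version) < parse(api_added_in)
```

Under that assumed cutoff, needs_error_string_fallback("10.1.105") returns True while needs_error_string_fallback("10.2.89") returns False, consistent with hartb's note that CUDA 10.2 works as is.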

@den-run-ai
Contributor Author

den-run-ai commented Jan 23, 2020

@hartb which CUDA package versions would you recommend for building pytorch master or 1.3/1.4 from source? Or should I use the system-installed CUDA 10.1?

https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/
https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/

$ conda search -f cudatoolkit
Loading channels: done
# Name                       Version           Build  Channel             
cudatoolkit                      8.0               0  pkgs/free           
cudatoolkit                      9.0               0  pkgs/main           
cudatoolkit                 10.1.105     446.8cc2201  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit                 10.1.168    533.g8d035fd  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit                 10.1.243    616.gc122b8b  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit                 10.1.243    635.g08e787d  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
cudatoolkit                  10.2.89    654.g0f7a43a  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
$ conda search -f cudatoolkit-dev
Loading channels: done
# Name                       Version           Build  Channel             
cudatoolkit-dev             10.1.105     446.8cc2201  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-dev             10.1.168    533.g8d035fd  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-dev             10.1.243    616.gc122b8b  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-dev             10.1.243    635.g08e787d  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
cudatoolkit-dev              10.2.89    654.g0f7a43a  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
$ conda search -f cudnn
Loading channels: done
# Name                       Version           Build  Channel             
cudnn                         6.0.21               0  pkgs/free           
cudnn                          7.1.4       cuda9.0_0  pkgs/main           
cudnn                     7.5.0+10.1     421.cdd5ce1  ibmdl/export/pub/software/server/ibm-ai/conda
cudnn                     7.5.1_10.1    507.gcdf2330  ibmdl/export/pub/software/server/ibm-ai/conda
cudnn                     7.6.3_10.1    590.g5627c5e  ibmdl/export/pub/software/server/ibm-ai/conda
cudnn                     7.6.3_10.1    607.g5627c5e  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
cudnn                     7.6.5_10.2    624.g338a052  ibmdl/export/pub/software/server/ibm-ai/conda-early-access

@hartb
Contributor

hartb commented Jan 23, 2020

Our next release of WML CE will include PyTorch 1.3.1 built against CUDA 10.2 (and NCCL 2.5.6 / cuDNN 7.6.5). (That PyTorch 1.3.1 package should be available in our Early Access channel in a day or two, but it's still built against Spectrum MPI on Power, so I think it isn't what you're after.)

CUDA 10.2 is convenient because the existing cusparseGetErrorString() quirk in aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu will be correct as is. And I don't recall hitting any other difficulties when we switched from 10.1 to 10.2.

@den-run-ai
Contributor Author

@hartb I finally compiled pytorch with all updated cuda 10.1 packages in powerai channels! The testing seems fine so far.

@hartb
Contributor

hartb commented Jan 23, 2020

Ah; nice--glad to hear it!
