
Power8/P100 node pytorch compilation from source with cuda 10.1: bus error - out of memory #31438

Closed
den-run-ai opened this issue Dec 18, 2019 · 18 comments
Labels
high priority module: build Build system issues module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@den-run-ai
Contributor

den-run-ai commented Dec 18, 2019

🐛 Bug

CMake Error: Generator: execution of make failed. Make command was: /denfromufa/anaconda3/envs/pytorch.1.3.ompi4/bin/ninja -j 160 install 
Bus error

More details in this full traceback:

pytorch.openmpi.cuda.build.error.txt

To Reproduce

Steps to reproduce the behavior:

  conda create -n pytorch.1.3.ompi4 python=3.6
  conda activate pytorch.1.3.ompi4
  conda install numpy ninja pyyaml setuptools cmake cffi
  conda install magma
  git clone --recursive https://github.com/pytorch/pytorch

  module load openmpi/4.0.1
  module load gcc/7.3.0 openmpi/4.0.1
  CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
  which mpirun
  which mpicxx
  mpicxx
  python setup.py install   # this build works without CUDA
  module load cuda/10.1
  python setup.py clean
  python setup.py install   # this build fails with CUDA 10.1

Environment

Please copy and paste the output from our environment collection script:

python torch/utils/collect_env.py
Collecting environment information...
PyTorch version: 1.4.0a0+47766e6
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Red Hat Enterprise Linux Server 7.4 (Maipo)
GCC version: (GCC) 7.3.0
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.105
GPU models and configuration: 
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB

Nvidia driver version: 418.39
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.4
[pip] torch==1.4.0a0+47766e6
[pip] torchvision==0.2.2.post3
[conda] magma                     2.5.1             1583.g04741a4    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
[conda] nomkl                     3.0                           0  
[conda] torch                     1.4.0a0+47766e6          pypi_0    pypi
[conda] torchvision               0.2.2.post3              pypi_0    pypi

cc @ezyang @gchanan @zou3519 @ngimel

@den-run-ai
Contributor Author

Same problem if I switch from openmpi to spectrum-mpi.

@den-run-ai
Contributor Author

Same problem if I remove MPI completely and just build with CUDA from the module (installed as a distro package).

When I switched to the cudatoolkit-dev package from the powerai conda channel, it killed the development node with 4 GPUs 😮 See the attached traceback:

node.killed.pytorch.compile.source.txt

@cpuhrsch
Contributor

We're happy to accept a PR to resolve this issue.

@den-run-ai
Contributor Author

@cpuhrsch I don't know the cause of this issue yet, still troubleshooting.

@ezyang
Contributor

ezyang commented Jan 2, 2020

Typically a "bus error" means you ran out of memory. Try reducing parallelism, e.g. with -j1.

@hartb
Contributor

hartb commented Jan 2, 2020

In the "node killed" case, do you mean that the system crashed / rebooted? There's nothing in the pytorch build that should be able to cause that, so if so I'd suspect some issue with the system environment more generally.

Do you see anything interesting (e.g. any warnings, "BUG", oops, or "EEH" notifications) in the system log / dmesg? If the problem is easily recreatable, could you capture the console during an event?

I see above that you're running RHEL 7.4 with the 418.39 GPU driver. If you can easily update to latest RHEL 7 and 418 GPU driver, that would at least rule out any known kernel or driver issues.

@den-run-ai
Contributor Author

@ezyang this may seem obvious to you, but how do I pass -j1 to cmake via setup.py?

@hartb
Contributor

hartb commented Jan 16, 2020

If I may... I haven't tried it, but it looks like setting MAX_JOBS in the environment will do it:

https://github.com/pytorch/pytorch/blob/master/setup.py#L11
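For context, here is a minimal Python sketch of how a setup.py-style build can honor that variable. MAX_JOBS is the environment variable PyTorch's setup.py reads; the helper name below is hypothetical.

```python
import multiprocessing
import os


def max_build_jobs() -> int:
    """Hypothetical helper mirroring the MAX_JOBS convention:
    an explicit MAX_JOBS in the environment caps build parallelism,
    otherwise fall back to the detected CPU count."""
    max_jobs = os.environ.get("MAX_JOBS")
    if max_jobs is not None:
        return int(max_jobs)
    return multiprocessing.cpu_count()
```

Running with MAX_JOBS=1 would then invoke ninja as `-j 1`, trading build time for a much smaller peak memory footprint.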

@den-run-ai
Contributor Author

@hartb I think you are right, and now I have to wait for ages!


@den-run-ai
Contributor Author

OK, I'm at [2384/2887], so the issue above is resolved!

@den-run-ai
Contributor Author

OK, the build eventually failed due to this unresolved build error:

#32083

@hartb
Contributor

hartb commented Jan 17, 2020

With CUDA 10.1, you may need to make sure your tree has: 83cf947

@den-run-ai
Contributor Author

den-run-ai commented Jan 22, 2020

@hartb it seems it is not that simple:

#32083

@hartb
Contributor

hartb commented Jan 22, 2020

Ah; this is my fault. I pointed you to the wrong fix for this. Sorry!

You'll want to revert 83cf947

Then ensure the guard code mentioned in #32083 is present (or absent) depending on the exact CUDA 10.1 version you have. If you do need the guard, you'll also need to tweak the version check in it. I've added more details over in #32083.
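The decision hartb describes can be sketched as a version comparison. Purely illustrative: the real guard is a compile-time preprocessor check in SparseCUDABlas.cu, and the "10.1.243" cutoff here is an assumed value, not taken from the source.

```python
def needs_error_string_fallback(cuda_version: str,
                                api_added_in: str = "10.1.243") -> bool:
    """Hypothetical check: True if this CUDA toolkit predates the point
    release assumed to ship cusparseGetErrorString(), meaning a local
    fallback (the guard code) would still be needed."""
    def parse(version: str) -> tuple:
        # "10.1.105" -> (10, 1, 105), so tuples compare numerically
        return tuple(int(part) for part in version.split("."))
    return parse(cuda_version) < parse(api_added_in)
```

Under that assumed cutoff, needs_error_string_fallback("10.1.105") returns True while needs_error_string_fallback("10.2.89") returns False, consistent with hartb's note that CUDA 10.2 works as is.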

@den-run-ai
Contributor Author

den-run-ai commented Jan 23, 2020

@hartb which CUDA package versions would you recommend for building pytorch master or 1.3/1.4 from source? Or should I use the system-installed CUDA 10.1?

https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/
https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/

$ conda search -f cudatoolkit
Loading channels: done
# Name                       Version           Build  Channel             
cudatoolkit                      8.0               0  pkgs/free           
cudatoolkit                      9.0               0  pkgs/main           
cudatoolkit                 10.1.105     446.8cc2201  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit                 10.1.168    533.g8d035fd  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit                 10.1.243    616.gc122b8b  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit                 10.1.243    635.g08e787d  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
cudatoolkit                  10.2.89    654.g0f7a43a  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
$ conda search -f cudatoolkit-dev
Loading channels: done
# Name                       Version           Build  Channel             
cudatoolkit-dev             10.1.105     446.8cc2201  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-dev             10.1.168    533.g8d035fd  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-dev             10.1.243    616.gc122b8b  ibmdl/export/pub/software/server/ibm-ai/conda
cudatoolkit-dev             10.1.243    635.g08e787d  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
cudatoolkit-dev              10.2.89    654.g0f7a43a  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
$ conda search -f cudnn
Loading channels: done
# Name                       Version           Build  Channel             
cudnn                         6.0.21               0  pkgs/free           
cudnn                          7.1.4       cuda9.0_0  pkgs/main           
cudnn                     7.5.0+10.1     421.cdd5ce1  ibmdl/export/pub/software/server/ibm-ai/conda
cudnn                     7.5.1_10.1    507.gcdf2330  ibmdl/export/pub/software/server/ibm-ai/conda
cudnn                     7.6.3_10.1    590.g5627c5e  ibmdl/export/pub/software/server/ibm-ai/conda
cudnn                     7.6.3_10.1    607.g5627c5e  ibmdl/export/pub/software/server/ibm-ai/conda-early-access
cudnn                     7.6.5_10.2    624.g338a052  ibmdl/export/pub/software/server/ibm-ai/conda-early-access

@hartb
Contributor

hartb commented Jan 23, 2020

Our next release of WML CE will include PyTorch 1.3.1 built against CUDA 10.2 (and NCCL 2.5.6 / cuDNN 7.6.5). (That PyTorch 1.3.1 package should be available in our Early Access channel in a day or two, but it's still built against Spectrum MPI on Power, so I think it isn't what you're after.)

CUDA 10.2 is convenient because the existing cusparseGetErrorString() quirk in aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu will be correct as is. And I don't recall hitting any other difficulties when we switched from 10.1 to 10.2.

@den-run-ai
Contributor Author

@hartb I finally compiled pytorch with all updated cuda 10.1 packages in powerai channels! The testing seems fine so far.

@hartb
Contributor

hartb commented Jan 23, 2020

Ah; nice--glad to hear it!
