Power8/P100 node pytorch compilation from source with cuda 10.1: bus error - out of memory #31438
Comments
Same problem if I switch from openmpi to spectrum-mpi.
Same problem if I remove MPI completely and just build with CUDA from module (installed as a distro package). When I switched to
We're happy to accept a PR to resolve this issue.
@cpuhrsch I don't know the cause of this issue yet, still troubleshooting.
Typically a "bus error" means you ran out of memory. Try reducing parallelism with, e.g.,
In the "node killed" case, do you mean that the system crashed or rebooted? There's nothing in the pytorch build that should be able to cause that, so if so, I'd suspect some issue with the system environment more generally. Do you see anything interesting (e.g. any warnings, "BUG", oops, or "EEH" notifications) in the system log / dmesg? If the problem is easy to recreate, could you capture the console during an event? I see above that you're running RHEL 7.4 with the 418.39 GPU driver. If you can easily update to the latest RHEL 7 and 418 GPU driver, that would at least rule out any known kernel or driver issues.
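To make the log check suggested above concrete, here is a minimal sketch that filters kernel log text for the markers named in that comment ("BUG", oops, "EEH", warnings). The function name `find_kernel_alerts` and the exact pattern are illustrative assumptions, not something from this thread, and real kernel messages may be worded differently on a given system.

```python
import re

# Markers taken from the comment above: warnings, "BUG", oops, "EEH".
# Word boundaries keep "debugfs" and similar words from matching "BUG".
ALERT_PATTERN = re.compile(r"\b(BUG|Oops|EEH|WARNING)\b", re.IGNORECASE)


def find_kernel_alerts(log_text):
    """Return the lines of log_text that contain one of the alert markers."""
    return [line for line in log_text.splitlines() if ALERT_PATTERN.search(line)]
```

One way to use it, assuming `dmesg` is readable by the current user, is to pass in `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` and print whatever lines come back.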
@ezyang this may seem obvious to you, but how do I pass -j1 to cmake via setup.py?
If I may... Haven't tried, but looks like setting https://github.com/pytorch/pytorch/blob/master/setup.py#L11
@hartb I think you are right, and now I have to wait for ages!
OK, I'm at [2384/2887], so the issue above is resolved!
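The fix above amounted to limiting build parallelism so the compile jobs fit in memory. A rough sketch of how a job count could be chosen instead of dropping straight to 1 follows. Several things here are assumptions rather than facts from this thread: the comment pointing at setup.py is truncated and does not name the variable, so `MAX_JOBS` is assumed from the environment variables documented at the top of PyTorch's setup.py; the 4 GiB-per-job figure is a placeholder, not a measured value; and the helper names are hypothetical.

```python
def safe_job_count(mem_available_kib, cpu_count, gib_per_job=4.0):
    """Pick a job count bounded by both the CPU count and available memory.

    gib_per_job is an assumed per-compile-job budget, not a measured figure.
    """
    mem_gib = mem_available_kib / (1024 * 1024)
    by_memory = int(mem_gib // gib_per_job)
    # Always allow at least one job, even on a very small memory budget.
    return max(1, min(cpu_count, by_memory))


def read_mem_available_kib(path="/proc/meminfo"):
    """Return the MemAvailable value (in KiB) from a Linux meminfo file."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found in " + path)


def build_command(jobs):
    """Format a build command line; MAX_JOBS is the assumed variable name."""
    return f"MAX_JOBS={jobs} python setup.py install"
```

For example, `build_command(safe_job_count(read_mem_available_kib(), os.cpu_count() or 1))` (after `import os`) would produce a command line to paste into a shell. Setting the count to 1, as was apparently done here, is the most conservative choice and also the slowest.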
OK, the build failed eventually due to this unresolved build error:
With CUDA 10.1, you may need to make sure your tree has: 83cf947
Ah; this is my fault. I pointed you to the wrong fix for this. Sorry! You'll want to revert 83cf947, and then ensure you have (or don't have) the guard code mentioned in #32083, depending on the exact version of CUDA 10.1 you have. If you need the guard, you'll need to tweak the version check in it. I added more details over in #32083.
@hartb which CUDA package versions would you recommend for building PyTorch master or 1.3/1.4 from source? Or should I use the system-installed CUDA 10.1? https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/#/
Our next release of WML CE will include PyTorch 1.3.1 built against CUDA 10.2 (and NCCL 2.5.6 / cuDNN 7.6.5). (That PyTorch 1.3.1 package should be available in our Early Access channel in a day or two, but is still built against Spectrum MPI on Power, so I think it isn't what you're after.) CUDA 10.2 is convenient because the existing
@hartb I finally compiled pytorch with all updated cuda 10.1 packages in powerai channels! The testing seems fine so far.
Ah; nice--glad to hear it!
🐛 Bug
More details in this full traceback:
pytorch.openmpi.cuda.build.error.txt
To Reproduce
Steps to reproduce the behavior:
Environment
Please copy and paste the output from our
cc @ezyang @gchanan @zou3519 @ngimel