Multi-GPU support #42
Comments below describe the technical details of the changes made. If you just want to use multi-GPU, you can stop reading now.

- cutorch.setDevice currently resets the random seed as well. This needs to be separated out so that you can use multiple GPUs as needed.
- GPU-to-GPU copies currently have to go through a host bridge. This needs to be changed to a P2P GPU copy. It is really trivial to implement: in the cutorch initialization function, we just have to enable P2P for each pair of detected GPUs (via cudaDeviceEnablePeerAccess). After that, UVA takes care of everything else, and copying tensors from one GPU to another is as simple as the sketch below.

Internally, Clément and we already have multi-GPU support, and we will get the changes back into cutorch slowly (it will take time to isolate the commits, get approval, etc.), but if you are really adventurous, this is a couple of hours of work.
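For illustration, a minimal sketch of what such a cross-GPU copy looks like from Lua once P2P/UVA is enabled (device indices and tensor sizes are arbitrary, not taken from the original post):

```lua
require 'cutorch'

cutorch.setDevice(1)
local a = torch.CudaTensor(1000):fill(1)   -- lives on GPU 1

cutorch.setDevice(2)
local b = torch.CudaTensor(1000)           -- lives on GPU 2

b:copy(a)   -- with UVA/P2P enabled, this copies directly from GPU 1 to GPU 2
```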
And I am guessing we should use this for our D2D memory copies: http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY_g046702971bc5a66d9bc6000682a6d844.html#g046702971bc5a66d9bc6000682a6d844
This means that if I have a kernel sequence A->B->C (on device 1), followed by a device-to-device memcopy D, followed by kernels E->F->G (on device 2), then eventually A->B->C of iteration t should run in parallel with E->F->G of the previous iteration (t-1), right? Otherwise, I don't see how this can be useful, other than allowing the use of more GPU memory. I mean, ideally you want those GPUs to work on different kernels concurrently. Say A->B->C are the first 3 modules of an nn.Sequential, and E->F->G are the last 3.
@nicholas-leonard you don't need to do cudaMemcpyPeer explicitly anymore; UVA takes care of it.
Wow. So you are right, it would be super easy to implement. We check for the device UVA flag, then call cudaDeviceEnablePeerAccess for every combination of such devices. Easy.
Still, would the two sequences in the above example be able to run concurrently?
Yes, they would run concurrently starting from the next iteration, as long as you have no blocking calls anywhere.
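To make the scenario above concrete, here is a sketch of splitting an nn.Sequential across two GPUs (the layer sizes, batch size, and variable names are illustrative assumptions, not code from the thread):

```lua
require 'cutorch'
require 'nn'

-- modules A->B->C live on GPU 1
cutorch.setDevice(1)
local firstHalf = nn.Sequential():add(nn.Linear(1024, 512)):add(nn.Tanh()):cuda()
local input = torch.CudaTensor(128, 1024):uniform()

-- modules E->F->G live on GPU 2, with an input buffer resident there
cutorch.setDevice(2)
local secondHalf = nn.Sequential():add(nn.Linear(512, 256)):add(nn.Tanh()):cuda()
local bridge = torch.CudaTensor(128, 512)

-- one pipeline step: run the first half on GPU 1, copy device-to-device,
-- run the second half on GPU 2; kernel launches are asynchronous, so across
-- iterations the two halves can overlap as long as nothing blocks
cutorch.setDevice(1)
local mid = firstHalf:forward(input)

cutorch.setDevice(2)
bridge:copy(mid)                        -- D2D copy (over P2P/UVA)
local out = secondHalf:forward(bridge)
```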
@soumith I did the UVA init for GPUs and got rid of the "cuda runtime error: an illegal memory access was encountered" errors while copying directly from one tensor to another. I'm not sure, however, that the network calls are non-blocking everywhere. How do we test it?
@szagoruyko profile it (with sys.tic()/sys.toc(), or just with os.clock()).
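A rough way to test this (my own sketch, assuming `model` is an already-built CUDA network): time the kernel launches alone, then time an explicit synchronize. If almost all of the time is spent before the synchronize, something in the call chain is blocking.

```lua
require 'cutorch'
require 'sys'

local input = torch.CudaTensor(128, 3, 224, 224):uniform()

sys.tic()
model:forward(input)          -- should return almost immediately if nothing blocks
local launchTime = sys.toc()

sys.tic()
cutorch.synchronize()         -- wait for all queued kernels to finish
local syncTime = sys.toc()

print(string.format('launch: %.4fs, work finished during sync: %.4fs', launchTime, syncTime))
```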
@szagoruyko SpatialConvolutionMM seems to block randomly sometimes, due to for-looped cublas calls. Use CuDNN; there is no blocking at all.
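For reference, swapping the nn convolution for its cudnn counterpart is a per-layer one-liner; a sketch with made-up layer sizes:

```lua
require 'cudnn'

-- instead of: nn.SpatialConvolutionMM(3, 64, 5, 5, 1, 1, 2, 2)
local conv = cudnn.SpatialConvolution(3, 64, 5, 5, 1, 1, 2, 2):cuda()
```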
Yeah, this is really annoying; I could never figure out why these gemm calls block.
@clementfarabet I've even tried moving to cublasv2, and that didn't help. Maybe CuBLAS has a queue that gets filled? It's only conjecture, as we don't have the source code.
It probably does, in which case we would need to use streams.
@soumith forward with cudnn itself is not blocking, but when I add nn.Reshape and nn.Linear it blocks. Backward is not blocking at all, though.
@szagoruyko use nn.View instead.
I think the call to new():fill() blocks here: https://github.com/torch/nn/blob/master/Linear.lua#L48
I don't know anymore what the public cutorch is like. It is possible that that line is blocking; that line is not needed there, it can be a temporary buffer that is reused:

if not self.addBuffer or (self.addBuffer:size(1) ~= nframe) then

Shall I patch it, or does someone else want to do the honours? The same goes for lines 89 and 92; the same addBuffer can be reused.
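A sketch of what that reuse could look like in the batched branch of Linear:updateOutput (a reconstruction of the idea under discussion, not necessarily the exact patch that was merged):

```lua
-- inside nn.Linear:updateOutput(input), for 2D (batched) input
local nframe = input:size(1)
local nunit = self.bias:size(1)
self.output:resize(nframe, nunit)

-- reuse a persistent vector of ones instead of calling
-- input.new(nframe):fill(1) on every forward pass (that fill()
-- is the blocking allocation discussed above)
if not self.addBuffer or self.addBuffer:nElement() ~= nframe then
   self.addBuffer = input.new(nframe):fill(1)
end

self.output:addmm(0, self.output, 1, input, self.weight:t())  -- output = input * weight^T
self.output:addr(1, self.addBuffer, self.bias)                -- add the bias to every row
```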
@soumith @nicholas-leonard it doesn't go there actually; nunit is 1 in my case.
Ah, for nunit = 1, this line is blocking.
@soumith cool! By the way, ccn2 is blocking too: MM and ccn2 are both blocking, and ccn2 in backward is not fully blocked. Should I share the test script somewhere?
@soumith thanks.
@szagoruyko yes, that would be helpful for everyone.
And the pull request is here: #44
By the way, is it possible to have shared modules on different GPUs?
@szagoruyko just added P2P access. Does anyone else want to take on the task of not resetting the random seed every time setDevice is called? All you have to do is move the random-seed initialization to the CUDA initialization (per device).
@szagoruyko not directly, but if you want to do data-parallel training, your training loop can run like this (see the sketch below):
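As an illustration, a hedged sketch of one data-parallel training step over two GPUs (the model, criterion, learning rate, tensor sizes, and the `trainStep` helper are assumptions made for the sketch, not the original example):

```lua
require 'cutorch'
require 'nn'

local nInputs, nOutputs, lr = 1024, 10, 0.01

-- one model replica (and flattened parameters) per GPU
cutorch.setDevice(1)
local model1 = nn.Sequential():add(nn.Linear(nInputs, nOutputs)):cuda()
local crit1  = nn.MSECriterion():cuda()
local params1, grads1 = model1:getParameters()
local gradBuffer = grads1:clone()    -- GPU-1 buffer to receive GPU-2 gradients

cutorch.setDevice(2)
local model2 = nn.Sequential():add(nn.Linear(nInputs, nOutputs)):cuda()
local crit2  = nn.MSECriterion():cuda()
local params2, grads2 = model2:getParameters()
params2:copy(params1)                -- start both replicas from the same weights (cross-GPU copy)

-- inputs/targets are host tensors with samples along dimension 1
local function trainStep(inputs, targets)
   local half = inputs:size(1) / 2

   -- 1. ship half of the mini-batch to each GPU and run forward/backward there;
   --    kernel launches are asynchronous, so the two GPUs overlap
   cutorch.setDevice(1)
   grads1:zero()
   local in1, t1 = inputs:narrow(1, 1, half):cuda(), targets:narrow(1, 1, half):cuda()
   local out1 = model1:forward(in1)
   local loss1 = crit1:forward(out1, t1)
   model1:backward(in1, crit1:backward(out1, t1))

   cutorch.setDevice(2)
   grads2:zero()
   local in2, t2 = inputs:narrow(1, half + 1, half):cuda(), targets:narrow(1, half + 1, half):cuda()
   local out2 = model2:forward(in2)
   local loss2 = crit2:forward(out2, t2)
   model2:backward(in2, crit2:backward(out2, t2))

   -- 2. accumulate GPU-2 gradients on GPU 1 (direct GPU-to-GPU copy) and update there
   cutorch.setDevice(1)
   gradBuffer:copy(grads2)
   grads1:add(gradBuffer)
   params1:add(-lr, grads1)

   -- 3. broadcast the updated parameters back to GPU 2
   cutorch.setDevice(2)
   params2:copy(params1)
   cutorch.synchronize()

   return loss1 + loss2
end
```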
@soumith cool, looks like we can do it efficiently now. Thanks!
I've created a pull request for moving the random seed initialization: #45
Awesome! Now that this is in, basic multi-GPU support is essentially done. So, @jonathantompson, to answer your earlier question: Torch has multi-GPU support ;)
An error happens when I run the example; it seems the math operations are not "seamless" as declared:

th> cutorch.setDevice(1)
@Algred the only operation that is allowed cross-GPU is the copy operation. All other operations are guarded with assertions, so mixing tensors from different GPUs fails instead of silently misbehaving. To get good performance, you'll have to copy matrix1 onto matrix2's GPU and then do the mathematical operation. If you don't like this behaviour, you can simply disable these assertions by adding the define DISABLE_CHECK_GPU and reinstalling cutorch: https://github.com/torch/cutorch/blob/master/lib/THC/THCTensor.c#L761
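For example (a sketch reusing the matrix1/matrix2 names from the comment above; the sizes are illustrative): instead of multiplying tensors that live on different GPUs, copy one of them over first and do the math on a single device:

```lua
cutorch.setDevice(1)
local matrix1 = torch.CudaTensor(512, 512):uniform()    -- lives on GPU 1

cutorch.setDevice(2)
local matrix2 = torch.CudaTensor(512, 512):uniform()    -- lives on GPU 2

-- bring matrix1 over to GPU 2, then multiply there
local matrix1on2 = torch.CudaTensor(512, 512):copy(matrix1)
local result = torch.mm(matrix1on2, matrix2)             -- runs entirely on GPU 2
```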
@soumith Thank you very much for replying!
Does this PR support GPU Direct RDMA? Or are additional lower-level modifications necessary to run on a multi-GPU Mellanox/GTX Titan X cluster?
If I do data-parallel training, how is the mini-batch data split over the multiple GPUs? Is it split evenly by the scheduler, or do I have to split the mini-batch manually?
Evenly.
@darksigma try looking at nccl.torch for that.
@soumith For mini-batch splitting, is there any shuffle beforehand? Or is the batch just split evenly according to the original order of the samples in the batch?
No shuffle; it is split evenly.
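In other words, the split is just contiguous, equally sized slices of the batch in its original sample order; a sketch (assuming `batch` is a host tensor with samples along the first dimension and a batch size that divides evenly):

```lua
local nGPU = cutorch.getDeviceCount()
local perGPU = batch:size(1) / nGPU
local shards = {}
for i = 1, nGPU do
   cutorch.setDevice(i)
   -- GPU i gets samples [(i-1)*perGPU + 1, i*perGPU], in the original order
   shards[i] = batch:narrow(1, (i - 1) * perGPU + 1, perGPU):cuda()
end
```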
Multi-GPU support has been implemented in cutorch (and, by extension, in all Torch CUDA libraries such as cunn, cudnn, etc.).
Example usage for tensors:
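A hedged illustration of basic multi-GPU tensor usage (device indices and sizes are arbitrary; this is a sketch, not the original example):

```lua
require 'cutorch'

print('GPUs visible: ' .. cutorch.getDeviceCount())

cutorch.setDevice(1)
local x = torch.CudaTensor(100, 100):uniform()   -- allocated and operated on GPU 1
x:mul(2)

cutorch.setDevice(2)
local y = torch.CudaTensor(100, 100)             -- allocated on GPU 2
y:copy(x)            -- copy is the one operation allowed across devices
print('current device: ' .. cutorch.getDevice())
```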
If you want to do data-parallel training of neural nets (including convnets), your training loop can run like this:
For each mini-batch:
Loop back to 1 for the next mini-batch.
Also, to train ConvNets using multiple GPUs, I recommend using CuDNN for the convolution layers, as I've tested that they are completely asynchronous (meaning that the processing runs in parallel on multiple GPUs).