Training MobileNet with multiple GPUs #61
Comments
Have you done this before training? |
@nttstar Yes, attached is the log (gpu num: 4). When I use a single GPU, the speed is also around 800 samples/s. |
The system memory is 125GB, and each GPU has 12GB of memory. |
Do you have the same problem if you use other DL frameworks? |
@CrazyAlan I see |
Hi @nttstar, thanks a lot for the help! I tried changing kvstore='local', and the performance is still the same. What I found is that I am using MobileNet, which is not well implemented in MXNet (compared to TensorFlow), so most of the time the GPUs sit idle :(. Do you know of any high-efficiency depthwise conv operation for MXNet? I think that might help with the speed problem. I am new to MXNet, but in Caffe I was using this (https://github.com/yonghenglh6/DepthwiseConvolution), and it seems to be much faster. |
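For reference, depthwise convolution in MXNet's symbolic API of that era is usually written as a grouped convolution with num_group equal to the number of channels; since there was no specialized depthwise kernel, this form can be much slower than the dedicated Caffe/TensorFlow implementations. A minimal sketch (layer name, channel count, and input shape are illustrative, not taken from this repo):

```python
import mxnet as mx

def depthwise_conv(data, channels, stride=1, name='conv_dw'):
    # Depthwise 3x3 convolution expressed as a grouped convolution with
    # num_group == channels (one filter per input channel). MXNet had no
    # specialized depthwise kernel at the time, so this grouped form can be
    # much slower than dedicated depthwise implementations.
    return mx.sym.Convolution(data=data, num_filter=channels,
                              kernel=(3, 3), stride=(stride, stride),
                              pad=(1, 1), num_group=channels,
                              no_bias=True, name=name)

data = mx.sym.Variable('data')                    # e.g. (batch, 32, 112, 112), assumed shape
dw = depthwise_conv(data, channels=32, name='conv_2_dw')
```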
Ah, I hadn't realized this before. I'm not sure if there's a better implementation; I will update the symbol if there is. Thank you very much! |
@CrazyAlan I just did a simple experiment on my test server (with Tesla M40 GPUs): about 700 samples/s using 4 GPUs versus only 220 samples/s with a single GPU, so there's no problem in my test. Can you check your training parameters again? |
@nttstar I am using Tesla P100s, and with my training command the speed with 4 GPUs or with 1 GPU is almost the same (~562.99 samples/sec). Is it possible that the file reading speed causes the problem? When I change per-batch-size to 256, the speed increases to ~700 samples/sec (not sure if it would harm the accuracy). I also tried vgg_face; the speed is roughly 800 samples/sec. Can you share your speed with the command I have? I don't change the script. |
How did you choose the number of GPUs? You need to set
You can post your log of single GPU training. |
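(The reply above appears to be cut off after "You need to set".) For context, this is how a list of GPU contexts is typically handed to an MXNet Module; the device count and the placeholder symbol below are assumptions for illustration and may not match this repo's training script:

```python
import mxnet as mx

num_gpus = 4                                   # assumed; set this to the GPUs you want to use
ctx = [mx.gpu(i) for i in range(num_gpus)]     # one context per device

# Placeholder network; in practice this would be the MobileNet symbol.
net = mx.sym.FullyConnected(data=mx.sym.Variable('data'), num_hidden=10, name='fc')

# Passing a list of contexts makes the Module split every batch evenly
# across the devices (data parallelism).
mod = mx.mod.Module(symbol=net, context=ctx, data_names=['data'], label_names=None)
```

With a list of contexts, MXNet splits each batch evenly across the listed devices, which is why throughput is expected to scale with the number of GPUs.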
Here is the log with AM-Softmax; I lost the log for the softmax run, but the speed is very similar. It's a server, so I can choose how many GPUs to use. The speed of ResNet is linear in the number of GPUs. |
I guess it was affected by IO. Did you use SSD? |
Hmmm, might be. It's more of a distributed system, so the network storage should be much slower than an SSD. When I use a larger batch size, the number of samples processed per second increases. Did you guys run any experiments with per-batch-size set to 256 or even 512? And do you know how that will affect the accuracy? |
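One way to test the IO hypothesis is to time the record iterator on its own, with no forward or backward pass; if the iterator alone cannot reach the target samples/s, storage or the network filesystem is the bottleneck. A rough sketch, assuming an ImageRecordIter over a hypothetical train.rec and an assumed 112x112 input size:

```python
import time
import mxnet as mx

batch_size = 512                                    # 128 per GPU * 4 GPUs, as discussed above
train_iter = mx.io.ImageRecordIter(
    path_imgrec='train.rec',                        # hypothetical record file path
    data_shape=(3, 112, 112),                       # assumed input size
    batch_size=batch_size,
    shuffle=True)

start, n = time.time(), 0
for batch in train_iter:                            # only reads/decodes data, no forward/backward
    n += batch_size
    if n >= 50 * batch_size:
        break
print('data pipeline alone: %.1f samples/s' % (n / (time.time() - start)))
```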
It is very common to obtain better performance when using a larger batch size.
You can try a larger learning rate when using a larger batch size, but I haven't done any experiment other than batch size 512 (128*4). |
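A common rule of thumb when increasing the batch size is to scale the learning rate roughly linearly with it; this is a general heuristic, not something verified for this model. A sketch with assumed numbers:

```python
# Linear learning-rate scaling sketch; all numbers are assumptions.
base_lr = 0.1                 # learning rate assumed to be tuned for per-batch-size 128 on 4 GPUs
base_batch = 128 * 4          # 512 total, the setting mentioned above
new_batch = 256 * 4           # 1024 total, the larger setting being considered
scaled_lr = base_lr * new_batch / float(base_batch)
print(scaled_lr)              # 0.2 under the linear scaling rule
```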
Ok, I will try that. Anyway, thanks for the help, much appreciated!! |
@CrazyAlan |
@CrazyAlan Have you resolved this? I have the same problem. |
When I am using 4 GPUs to train the network, there seems to be no speed increase compared to a single GPU.
When training runs in 4-GPU mode, every GPU's "GPU-Util" drops to 20%~30%, while in single-GPU mode the "GPU-Util" is above 60%. Do you have any idea what the problem is?
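For reference, the per-GPU utilization figures quoted here can be sampled with nvidia-smi in machine-readable form; a small Python wrapper (the query flags are standard nvidia-smi options):

```python
import subprocess

def gpu_utilization():
    # Sample per-device GPU-Util by querying nvidia-smi in CSV form.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu',
         '--format=csv,noheader,nounits'])
    return [int(line) for line in out.decode().strip().splitlines()]

print(gpu_utilization())      # e.g. [25, 22, 28, 24] during the 4-GPU run
```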