Mixed precision training slower than FP32 training #297
Comments
If your network or batch size is small, you may be underutilizing the device, in which case there's not much for Amp to accelerate. What kind of GPUs are you using? Also, is Amp slower than normal training within a single process as well? |
I ran the tests with 2x GTX 1080 Ti and a batch size of 128 (so 64 per device). I haven't tested with a single device yet; I'll let you know. |
The GTX 1080 Ti has low-rate FP16 performance. If you want better performance with FP16, you need a Volta-architecture GPU or one from the RTX series. See this topic: https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/ |
Alright, I'll test it on a few V100 |
Yes, the 1080Ti was intended for gaming, so it has really low compute throughput for FP16 math. You need a Tensor Core-enabled GPU (Turing or Volta) to get best results with mixed precision. |
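As a rough illustration of the point above, the check below flags GPUs whose compute capability is at least 7.0, the level at which Tensor Cores first appeared (Volta). The function name `likely_has_tensor_cores` is hypothetical, and the check is only a heuristic: some Turing cards (the GTX 16 series) report 7.5 but have no Tensor Cores. The compute capabilities listed are those of the cards mentioned in this thread.

```python
def likely_has_tensor_cores(major: int, minor: int) -> bool:
    """Heuristic only: compute capability >= 7.0 (Volta or newer).

    Hypothetical helper, not part of apex or PyTorch. GTX 16-series cards
    report 7.5 without having Tensor Cores, so treat True as "probably".
    """
    return (major, minor) >= (7, 0)


# Compute capabilities of the GPUs discussed in this thread.
cards = {
    "GTX 1080 Ti": (6, 1),
    "V100": (7, 0),
    "RTX 2080": (7, 5),
}

for name, capability in cards.items():
    print(name, capability, likely_has_tensor_cores(*capability))
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the current device, which can be passed to the helper directly.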
Hello, I ran into the same problem when running experiments on 1x RTX 2080: performance with O1 is worse than with O0, with more time and more memory consumed. The compute capability of the RTX 2080 is 7.5, so I think it should work with Amp (see the docs: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions ). Does anyone know why? |
Here is my code; my environment is an RTX 2080 with CUDA 10.1: `import os def main(): def train(args): if __name__ == '__main__':` |
I noticed there was an "ImportError", so I reinstalled apex (with another PyTorch version, 1.4) and hit another problem, a "version mismatch". Following #323, I deleted some of the code that checks for matching versions, and the install finally completed with no warnings. However, when I ran my test code, the training time was still longer with O1 than with O0, while the memory cost did decrease slightly. Is that normal?

| mode | memory | time |
| --- | --- | --- |
| O0 | 3855M | 26s/epoch |
| O1 | 3557M | 33s/epoch |
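To put the O0 and O1 figures reported above in relative terms, here is a small arithmetic sketch. The helper `relative_change` is a hypothetical name introduced for illustration; the inputs are the numbers from the comment.

```python
def relative_change(baseline: float, new: float) -> float:
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# Figures reported in the comment: O0 is the baseline, O1 is mixed precision.
time_change = relative_change(26, 33)        # seconds per epoch
memory_change = relative_change(3855, 3557)  # MB

print(f"epoch time: {time_change:+.1%}")   # +26.9%
print(f"memory:     {memory_change:+.1%}")  # -7.7%
```

So O1 saved under 8% of memory while costing roughly 27% more time per epoch in that run.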
I've been doing some experiments on CIFAR10 with ResNets and decided to give APEX AMP a try.
However, I ran into some performance issues:
- `torch.nn.parallel.DistributedDataParallel` was extremely slow.
- `apex.parallel.DistributedDataParallel` was slower than the default training with `torch.nn.parallel.DistributedDataParallel` (no apex involved). For reference, normal training took about 15 minutes, while apex AMP training took 21 minutes (90 epochs on CIFAR-10 with ResNet20).

I followed the installation instructions, but I couldn't install the C++ extensions because of my GCC/CUDA version. Does this explain the slowdown?
You can see the code here:
https://github.com/braincreators/octconv/blob/34440209c4b37fb5198f75e4e8c052e92e80e85d/benchmarks/train.py#L1-L498
And run it (2 GPUs):
Without APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1
With APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1 --mixed-precision
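For readers unfamiliar with what the `--mixed-precision` run involves, the following is a minimal sketch of how apex Amp is typically wired into a training loop. It is not the code from the linked `train.py`; the function names `setup_amp` and `amp_step` are hypothetical, and only `amp.initialize` and `amp.scale_loss` are apex calls.

```python
def setup_amp(model, optimizer, opt_level="O1"):
    """Wrap model and optimizer with Amp. Call once, before any DDP wrapper."""
    from apex import amp  # imported lazily so this sketch loads without apex
    return amp.initialize(model, optimizer, opt_level=opt_level)


def amp_step(model, optimizer, loss_fn, inputs, targets):
    """Run one optimizer step with Amp loss scaling (illustrative only)."""
    from apex import amp
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    # scale_loss multiplies the loss by a scale factor so that small FP16
    # gradients do not underflow; Amp unscales them before optimizer.step().
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return loss
```

Passing `opt_level="O0"` keeps everything in FP32, which gives a like-for-like baseline when comparing timings against `"O1"`.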