
Nvidia Apex for FP16 calculations #36

Merged 2 commits on Jul 24, 2019
Conversation

YacobBY (Contributor) commented on Jul 23, 2019

Added compatibility with Nvidia's Apex library, which can do floating-point 16 (FP16) calculations. This gives a significant speedup in training. The code has been tested on a single RTX 2070. If the Nvidia Apex library is not found, the code should run as normal. A minimal sketch of the usual integration pattern is included after the examples below.

To install Apex: https://github.com/NVIDIA/apex#quick-start

Known bugs:
- Does not work with the adam parameter
- Gradient overflow keeps happening at the start, but Apex automatically reduces the loss scale to 8192, after which the notification disappears

Examples:
Loading: https://i.imgur.com/3nZROJz.png
Training: https://i.imgur.com/Q2w52m7.png
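
For context, the typical Apex amp integration boils down to two calls, `amp.initialize` and `amp.scale_loss`. The following is a minimal sketch of that pattern under assumptions (placeholder model, optimizer, and data; opt level "O1"); it is not the exact diff of this pull request, but it shows how training can fall back to plain FP32 when Apex is missing.

```python
# Minimal sketch (assumed names, not the exact PR code): use Apex amp for FP16
# training when the library is installed, otherwise fall back to plain FP32.
import torch
import torch.nn as nn

try:
    from apex import amp
    APEX_AVAILABLE = True
except ImportError:
    APEX_AVAILABLE = False

device = "cuda"
model = nn.Linear(128, 10).to(device)                      # placeholder model
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
criterion = nn.CrossEntropyLoss()

if APEX_AVAILABLE:
    # "O1" mixed precision; amp also does dynamic loss scaling, which is why
    # the loss scale drops (e.g. to 8192) after the initial overflow messages.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for _ in range(10):                                         # placeholder training loop
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    if APEX_AVAILABLE:
        # Scale the loss before backward so FP16 gradients do not underflow.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()
    optimizer.step()
```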

ku21fan merged commit 5d4ed38 into clovaai:master on Jul 24, 2019

ku21fan (Contributor) commented on Jul 24, 2019

@YacobBY Thank you for the pull request!
I have not tried floating-point 16 calculation yet, but it seems to work, so I merged it :)


ku21fan (Contributor) commented on Jul 24, 2019

@YacobBY
I am sorry, but I will revert to the previous version.
Instead, I will refer to this pull request in the README.
The reasons are below.

  1. Using floating-point 16 calculation is not the default option in our paper, so not everyone needs to install Apex or know about it.
  2. I know that the code runs as normal if the Apex library is not found,
    but I feel the code became slightly more complex after merging this pull request.
    I hope to keep this code simple.
  3. The known bugs you mentioned :'(

Best.


YacobBY (Contributor, Author) commented on Jul 24, 2019

@ku21fan Hello JeongHun,

I understand. The Apex compatibility code has indeed added a lot of lines, and FP16 is very new, so not many people have the hardware and library to run it yet.

Currently I'm trying out some newer deep-learning tools such as PyTorch Lightning and Nvidia Apex. I might be able to use the Apex FusedAdam optimizer instead of the default Adam option when Apex is available. This should fix the Adam bug, but it still adds quite a few extra lines of code.
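
For illustration, here is a rough sketch of what that conditional swap could look like (placeholder model and hyperparameters, assumed names; not code from this pull request). Apex's `FusedAdam` lives in `apex.optimizers`, and the fallback is the stock `torch.optim.Adam`.

```python
# Rough sketch (assumption, not this PR's code): use Apex's fused Adam kernel
# when the library is available, otherwise the stock PyTorch Adam.
import torch
import torch.nn as nn

try:
    from apex.optimizers import FusedAdam
    APEX_AVAILABLE = True
except ImportError:
    APEX_AVAILABLE = False

model = nn.Linear(128, 10).cuda()   # placeholder model

if APEX_AVAILABLE:
    # Fused CUDA implementation of Adam; pairs well with amp FP16 training.
    optimizer = FusedAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
else:
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```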

If I can get the other Apex functionality working, I'll try to get back to you with a neater and more modular version.

In any case, thanks for your open-source code! It's really helpful and I've learned a lot from it.
