Apex AMP performance bad on GTX 1650, good on RTX 2080 and Volta w/ Tensor Cores. Is this normal? #806

Open
CDahmsCellarEye opened this issue Apr 24, 2020 · 2 comments


@CDahmsCellarEye

I've found an odd result when using Apex AMP (Automatic Mixed Precision). My boss and I have run the same PyTorch computer vision program. My boss trained a PyTorch graph with and without Apex AMP. Other than using or not using Apex AMP, the training process was the same, i.e. same image set, same parameters, etc. The Apex AMP graph was trained using the O1 opt level. I'm 100% certain we're using the same program and graph, and we've compared our settings and the command-prompt messages pertaining to the Apex configuration shown on start-up, and everything seems to be the same.
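
For reference, the O1 setup follows the standard Apex pattern; here is a minimal, self-contained sketch with a toy model and data (placeholders, not our actual program):

```python
# Minimal sketch of the standard Apex AMP O1 training pattern.
# The toy model and data below are placeholders, not the program from this issue.
import torch
import torch.nn as nn
from apex import amp

device = torch.device("cuda")
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# O1 patches whitelisted ops to run in FP16 and keeps FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

images = torch.randn(8, 3, 64, 64, device=device)
targets = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
loss = criterion(model(images), targets)
# Loss scaling keeps small FP16 gradients from underflowing to zero.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```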

My boss tested on a desktop with an RTX 2080 and also on a Jetson Xavier (512-core Volta GPU with Tensor Cores). He found a significant speed improvement in both cases when switching to the Apex AMP enabled graph.

I tested on a GTX 1650 and found the opposite: the Apex-AMP-enabled graph ran substantially slower than the non-AMP graph.
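
In case it helps anyone reproduce the comparison, I'm measuring along the lines of the generic sketch below (not my exact program; the model and batch are whatever you want to benchmark):

```python
# Generic timing sketch: average per-iteration latency of a CUDA model,
# with explicit synchronization so host-side timers don't under-count
# asynchronous GPU work.
import time
import torch

@torch.no_grad()
def avg_latency_ms(model, batch, iters=100, warmup=10):
    for _ in range(warmup):        # warm up kernels / cuDNN autotuning
        model(batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()       # wait for all queued GPU work to finish
    return (time.perf_counter() - start) / iters * 1000.0
```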

Upon some searching I found these posts:

#297
#325

Issue #297 in particular seems to imply that Apex AMP is not expected to work well with GTX 1xxx-series GPUs.

Can anybody confirm whether this is the expected result? Are other people finding the same thing? Is there a setting I can change before or during the Apex install so it works better with GTX 1xxx-series hardware? Or is there a setting that can or should be changed when training the graph?

Something else I should mention: when I installed Apex I received various warnings, ending with `Given no hashes to check 137 links for project 'pip': discarding no candidates`, similar to what is described in #690, where many other people have reported the same message. Is it possible this has something to do with Apex not working well on my machine?

@aabzaliev

I tested on a V100 with the same warnings you mentioned during installation. I can confirm it's slower than without AMP.

@sk0g

sk0g commented May 5, 2020

Tensor Cores were introduced in Volta, weren't they? So without hardware support for mixed-precision training in the cards preceding that, you'd just be adding overhead to the training process (casting, loss scaling, and the scaled backward step).

It would be handy if the library detected the availability of Tensor Cores and operated in a pass-through mode otherwise, along the lines of the sketch below.
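
A rough version of that check could use the compute capability PyTorch reports (Tensor Cores arrived with Volta, CC 7.0). This is only a heuristic sketch, not Apex's actual behavior; note that some Turing parts such as the GTX 16xx series report CC 7.5 yet ship without Tensor Cores:

```python
# Heuristic sketch: gate AMP on reported compute capability.
# Caveat: approximate only -- e.g. GTX 16xx Turing parts report CC 7.5
# but lack Tensor Cores, so a capability check alone can still enable
# AMP on hardware that won't benefit from it.
import torch

def probably_has_tensor_cores() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (7, 0)  # Tensor Cores arrived with Volta (7.0)

# Hypothetical pass-through: O0 is Apex's pure-FP32 opt level.
opt_level = "O1" if probably_has_tensor_cores() else "O0"
```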
