Refactor GPUStatsMonitor to improve training speed #3257

Merged
merged 12 commits into from
Sep 4, 2020

Conversation

rohitgr7
Contributor

@rohitgr7 rohitgr7 commented Aug 29, 2020

What does this PR do?

Calls nvidia-smi once with all the required stats to monitor.
Follow up of #3008.
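
As a rough illustration of the single-call idea (not the code merged in this PR): one `nvidia-smi` invocation can return several fields per GPU when it is run with `--query-gpu` and `--format=csv,nounits,noheader`. The helper names and the selection of fields below are assumptions made for this sketch.

```python
import subprocess

# Fields requested in one call; this particular selection is an assumption.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.free", "temperature.gpu"]


def build_command(fields):
    """Build one nvidia-smi command line that queries every field at once."""
    return ["nvidia-smi", f"--query-gpu={','.join(fields)}", "--format=csv,nounits,noheader"]


def _to_float(text):
    """Convert one CSV cell to float; unsupported values such as '[N/A]' become NaN."""
    try:
        return float(text)
    except ValueError:
        return float("nan")


def parse_output(output, fields):
    """Turn CSV output (one line per GPU) into a list of {field: float} dicts."""
    stats = []
    for line in output.strip().splitlines():
        values = [_to_float(cell.strip()) for cell in line.split(",")]
        stats.append(dict(zip(fields, values)))
    return stats


def query_gpus(fields=QUERY_FIELDS):
    """Run nvidia-smi once and return per-GPU stats (needs nvidia-smi on PATH)."""
    result = subprocess.run(build_command(fields), capture_output=True, text=True, check=True)
    return parse_output(result.stdout, fields)
```

Keeping the parsing separate from the subprocess call makes it possible to check the parsing logic on a machine without a GPU.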

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Aug 29, 2020

Codecov Report

Merging #3257 into master will increase coverage by 3%.
The diff coverage is 86%.

@@           Coverage Diff           @@
##           master   #3257    +/-   ##
=======================================
+ Coverage      86%     89%    +3%     
=======================================
  Files          91      92     +1     
  Lines        8152    8311   +159     
=======================================
+ Hits         6989    7396   +407     
+ Misses       1163     915   -248     

@rohitgr7
Contributor Author

Questions:

  • Should only the GPUs that are actually used in the Trainer be monitored? E.g., a user may have 10 GPUs but specify only 2 in the Trainer.
  • Should row_log_interval and log_save_interval be considered here while logging?

@rohitgr7 rohitgr7 marked this pull request as ready for review August 29, 2020 15:27
@mergify mergify bot requested a review from a team August 29, 2020 15:27
@Borda
Member

Borda commented Aug 29, 2020

  • Should only the GPUs that are actually used in the Trainer be monitored? E.g., a user may have 10 GPUs but specify only 2 in the Trainer.

Yes

  • Should row_log_interval and log_save_interval be considered here while logging?

Preferably
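
A minimal sketch of how these two answers might translate into code, assuming the callback knows the selected GPU indices and a logging interval. Both function names are hypothetical; `--id` is the `nvidia-smi` option for targeting specific devices, and the interval check is only in the spirit of `row_log_interval`, not taken from the Trainer.

```python
def gpu_id_argument(gpu_ids):
    """Format selected GPU indices for nvidia-smi's --id option, e.g. [0, 1] -> '--id=0,1'."""
    return "--id=" + ",".join(str(i) for i in gpu_ids)


def should_log(step, interval):
    """Return True on every `interval`-th step, where `step` is a 0-based index."""
    return (step + 1) % interval == 0
```

With an interval of 50, `should_log` is true for steps 49, 99, and so on, so the GPU query would run only on those steps.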

@Borda
Member

Borda commented Aug 29, 2020

@rohitgr7 we have also renamed the files and class in #3008, so we need to maintain backward compatibility if it was already released in 0.9... check #3251 (comment)

@rohitgr7 rohitgr7 changed the title Refactor GPUMonitor to improve training speed [WIP] Refactor GPUMonitor to improve training speed Aug 29, 2020
@pep8speaks

pep8speaks commented Aug 29, 2020

Hello @rohitgr7! Thanks for updating this PR.

Line 136:1: W293 blank line contains whitespace

Comment last updated at 2020-09-03 23:19:42 UTC

@Borda Borda added the feature Is an improvement or enhancement label Sep 2, 2020
@Borda
Member

Borda commented Sep 2, 2020

@rohitgr7 is it still WIP?

@rohitgr7
Contributor Author

rohitgr7 commented Sep 2, 2020

@Borda I need to update this once #3251 gets merged.

@Borda Borda changed the title [WIP] Refactor GPUMonitor to improve training speed [blocked by #3251] Refactor GPUMonitor to improve training speed Sep 2, 2020
@mergify
Contributor

mergify bot commented Sep 3, 2020

This pull request is now in conflict... :(

@Borda Borda changed the title [blocked by #3251] Refactor GPUMonitor to improve training speed Refactor GPUMonitor to improve training speed Sep 3, 2020
@rohitgr7 rohitgr7 changed the title Refactor GPUMonitor to improve training speed Refactor GPUStatsMonitor to improve training speed Sep 3, 2020
Contributor

@awaelchli awaelchli left a comment

The main speedup comes from fewer logging calls, is that correct?

pytorch_lightning/callbacks/gpu_stats_monitor.py (outdated, resolved)
pytorch_lightning/callbacks/gpu_stats_monitor.py (outdated, resolved)
@mergify mergify bot requested a review from a team September 3, 2020 21:47
@rohitgr7
Copy link
Contributor Author

rohitgr7 commented Sep 3, 2020

The main speedup comes from fewer logging calls, is that correct?

Yep, and maybe also from the single nvidia-smi call.
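
To make the "fewer logging calls" point concrete, here is a sketch under stated assumptions: all per-GPU stats are merged into one flat dict and handed to the logger in a single call, instead of one call per stat. `RecordingLogger` is a hypothetical stand-in that only records calls, and the key format is an assumption.

```python
def flatten_stats(per_gpu_stats, gpu_ids):
    """Merge per-GPU stat dicts into one flat dict keyed like 'gpu_id: 0/memory.used'."""
    metrics = {}
    for gpu_id, stats in zip(gpu_ids, per_gpu_stats):
        for name, value in stats.items():
            metrics[f"gpu_id: {gpu_id}/{name}"] = value
    return metrics


class RecordingLogger:
    """Hypothetical stand-in for an experiment logger; it only records calls."""

    def __init__(self):
        self.calls = []

    def log_metrics(self, metrics, step):
        self.calls.append((step, metrics))


# Usage: two GPUs with two stats each still result in a single logging call.
logger = RecordingLogger()
stats = [
    {"memory.used": 1024.0, "utilization.gpu": 35.0},
    {"memory.used": 10.0, "utilization.gpu": 0.0},
]
logger.log_metrics(flatten_stats(stats, gpu_ids=[0, 1]), step=0)
```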

Contributor

@awaelchli awaelchli left a comment

Good that the deprecation/backward compatibility is included now!

@mergify mergify bot requested a review from a team September 3, 2020 23:42
Member

@SkafteNicki SkafteNicki left a comment

LGTM

@mergify mergify bot requested a review from a team September 4, 2020 09:39
@williamFalcon williamFalcon merged commit 24809b0 into master Sep 4, 2020
@Borda Borda deleted the refactor/gpu_stats_monitor branch September 4, 2020 10:12
@@ -69,7 +83,7 @@ def __init__(
     ):
         super().__init__()

-        if shutil.which("nvidia-smi") is None:
+        if shutil.which('nvidia-smi') is None:
Contributor

It will not work on Windows

Contributor

Have you considered using pynvml instead?
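
For context, a minimal sketch of what reading the same stats through pynvml could look like. This is not part of the PR; it assumes the third-party `pynvml` package is installed, and the function name and returned keys are hypothetical. The function returns an empty list when pynvml or the NVIDIA driver is unavailable.

```python
def read_gpu_stats():
    """Return one dict per GPU, or [] if pynvml or the NVIDIA driver is unavailable."""
    try:
        import pynvml  # third-party package, assumed to be installed separately
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []
    try:
        stats = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            stats.append({
                "index": index,
                "utilization.gpu": utilization.gpu,      # percent
                "memory.used": memory.used / 1024 ** 2,  # bytes converted to MiB
            })
        return stats
    except pynvml.NVMLError:
        # Some devices do not support every query; give up quietly in this sketch.
        return []
    finally:
        pynvml.nvmlShutdown()
```

This approach does not depend on `nvidia-smi` being on PATH, which is the Windows concern raised in this thread, but it adds a dependency.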

Member

I can get it to work on my Windows machine. I just had to add nvidia-smi to the PATH variable and restart. On my computer it is stored in: C:\Program Files\NVIDIA Corporation\NVSMI

Contributor

Yep, but this is something the user has to do manually.
Maybe in that case we could also add something like this?

if all([shutil.which('C:/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi') is None,
        shutil.which('nvidia-smi') is None]):

Member

That would work in most cases, as long as the user has installed it in the default location.
Maybe the docs should just clarify that the user needs to be able to call nvidia-smi from the terminal, with a hint about what Windows users should do?

Contributor

Or we can add it to PATH manually (os.environ["PATH"]), but I don't know if that is valid behavior.
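
One way to sketch the fallback discussed in this thread without mutating `os.environ["PATH"]`: look on PATH first, then search extra directories via the `path` argument of `shutil.which`. The function names are hypothetical, and the Windows directory is the one reported above, which may not match every installation.

```python
import os
import shutil

# Install location reported in this thread; other setups may differ.
WINDOWS_NVSMI_DIR = r"C:\Program Files\NVIDIA Corporation\NVSMI"


def find_executable(name, extra_dirs=()):
    """Look up `name` on PATH first, then in `extra_dirs`; return its path or None."""
    found = shutil.which(name)
    if found is not None:
        return found
    if not extra_dirs:
        return None
    # Search only the extra directories, leaving the process environment untouched.
    return shutil.which(name, path=os.pathsep.join(extra_dirs))


def find_nvidia_smi():
    """Locate nvidia-smi, falling back to the Windows directory named above."""
    return find_executable("nvidia-smi", extra_dirs=[WINDOWS_NVSMI_DIR])
```

A caller could raise a configuration error when `find_nvidia_smi()` returns `None`, and otherwise pass the returned path to the subprocess call.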

Labels
feature Is an improvement or enhancement

7 participants