feat: determine slow gpu in multinode training #708

Merged
1 commit merged on Sep 2, 2020


YukioOobuchi
Contributor

This feature contains:

  • Measure the time spent around forward() and backward().
  • Exchange the measured times among nodes.
  • The host node is responsible for checking whether any node is unusually slow.
  • If any node is slower than the level 1 threshold, we output a warning.
  • If any node is slower than the level 2 threshold, we raise an exception.
  • nnabla.conf contains the settings:
    • how frequently the measurement is taken
    • how many times slower than a normal node a node must be, which defines the level 1 and level 2 thresholds.
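
The host-side check described above can be sketched roughly as follows. This is only an illustration in plain Python, not the code from this PR: the names `check_slow_nodes` and `SlowNodeError`, the default ratios, and the use of the median as the "normal" time are all assumptions, and the exchange of times among nodes is assumed to have already happened.

```python
import statistics
import warnings


class SlowNodeError(RuntimeError):
    """Raised when a node exceeds the level 2 threshold (hypothetical name)."""


def check_slow_nodes(times, level1=1.5, level2=3.0):
    """Compare each node's measured time against the median of all nodes.

    times  : seconds measured around forward()/backward(), one entry per node
    level1 : slowdown ratio that triggers a warning (assumed default)
    level2 : slowdown ratio that triggers an exception (assumed default)
    Returns the indices of the nodes that triggered a warning.
    """
    # Assumption: the median stands in for the time of a "normal" node.
    baseline = statistics.median(times)
    warned = []
    for rank, t in enumerate(times):
        ratio = t / baseline
        if ratio > level2:
            raise SlowNodeError(f"node {rank} is {ratio:.1f}x slower than normal")
        if ratio > level1:
            warnings.warn(f"node {rank} is {ratio:.1f}x slower than normal")
            warned.append(rank)
    return warned
```

In this sketch the two ratios play the role of the level 1 and level 2 settings that the PR reads from nnabla.conf; how often the check runs would be controlled by the caller.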

@YukioOobuchi YukioOobuchi added the release-note-utility Auto-release; Utilities label Sep 2, 2020
@YukioOobuchi YukioOobuchi self-assigned this Sep 2, 2020
@YukioOobuchi YukioOobuchi merged commit f8bf872 into master Sep 2, 2020
@YukioOobuchi YukioOobuchi deleted the feature/20200813-determine-slow-node branch September 2, 2020 09:11