3D Training Slow -> New Priority: Speeding-up ivadomed #45

Open
uzaymacar opened this issue Jun 17, 2021 · 1 comment

@uzaymacar
Contributor

uzaymacar commented Jun 17, 2021

A motivating scenario: 30 epochs with config/softseg_unet3D_unbalanced.json take ~50 hours, and the full 200 epochs are expected to take ~320 hours. To experiment more freely, we need multi-GPU support and mixed-precision (MP) training in ivadomed, especially for 3D training. By my best estimate, these features could give us a > 4x speed-up.

PRs #42 and #44 introduce i) model-parallel training, ii) data-parallel training, and iii) mixed-precision (MP) training for this project. In the meantime, an alternative is to train baseline models with modeling/train.py instead of ivadomed/training.py. However, ivadomed has many features that we'd like to use, so we have a new priority: introducing these features in ivadomed ASAP.

After guidance from @andreanne-lemay, I have decided to open three issues and subsequent PRs in ivadomed:

  • Incorporation of num_workers > 0 in data-loading in ivadomed.
  • Multi-GPU support (model- and data-parallel)
  • Mixed-precision (MP) training.
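
As a rough sketch of what data-parallel and mixed-precision training look like in plain PyTorch (this is not ivadomed code; the toy `Conv3d` model, the random data, and the helper names `build_model`, `train_one_epoch`, and `run_demo` are assumptions made up for illustration), something along these lines could serve as a starting point. Note that `torch.cuda.amp` was added in PyTorch 1.6, so this sketch would not run on 1.5.0. Model-parallel training is not shown.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def build_model(device):
    # Toy stand-in for a 3D U-Net: a single 3D convolution (illustrative only).
    model = nn.Conv3d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
    # Data-parallel: replicate the model and split each batch across GPUs.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model.to(device)


def train_one_epoch(model, loader, optimizer, device, use_amp):
    loss_fn = nn.MSELoss()
    # GradScaler rescales the loss so float16 gradients do not underflow;
    # with enabled=False it passes everything through unchanged.
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    running_loss = 0.0
    for volumes, targets in loader:
        volumes, targets = volumes.to(device), targets.to(device)
        optimizer.zero_grad()
        # autocast runs eligible ops in reduced precision when enabled.
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = loss_fn(model(volumes), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        running_loss += loss.item()
    return running_loss / len(loader)


def run_demo(num_workers=0):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    use_amp = device.type == "cuda"  # mixed precision only on GPU
    data = TensorDataset(torch.randn(4, 1, 8, 8, 8), torch.randn(4, 1, 8, 8, 8))
    loader = DataLoader(data, batch_size=2, num_workers=num_workers)
    model = build_model(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    return train_one_epoch(model, loader, optimizer, device, use_amp)
```

On a machine without a GPU, the sketch falls back to ordinary full-precision training on the CPU, so the same loop can be tested anywhere.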

As discussed with @naga-karthik today, some of these might not be compatible with torch=1.5.0, which ivadomed currently uses. Therefore, the branches that implement these features will likely stay separate from the main branch for a while, and integrating them later will take a lot more effort.

I will follow up with my plan tomorrow, and the ETA is ~2 days. Any feedback or comments are appreciated!

@uzaymacar uzaymacar self-assigned this Jun 17, 2021
@naga-karthik
Member

I vaguely remember that setting the num_workers parameter is not as straightforward as it sounds. It seems that there's still no general "formula" for setting this parameter's value. There are many discussions that talk about this issue:

  1. This link says that setting num_workers > 0 raises errors on Windows machines due to their restricted multiprocessing capabilities.
  2. This issue says that DataLoader becomes extremely slow with num_workers > 0.
  3. This discussion lists some guidelines (again, highly subjective) for setting the num_workers value.

Therefore, in my opinion, it would be good to let the user set the num_workers value in the .json file, because we don't know under which OS ivadomed will be used. Personally, I never had any issues with num_workers on Linux, but judging from these discussions, if we set this parameter for the users, it'll be difficult for them to debug.
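
A minimal sketch of that idea, assuming the value lives under a `num_workers` key inside a `loader_parameters` section of the config (both key names, and the helper `resolve_num_workers`, are hypothetical here, not ivadomed's actual schema). It defaults to 0, i.e. loading in the main process, which is the safest choice across operating systems, and caps the value at the CPU count as one possible guard:

```python
import json
import os


def resolve_num_workers(config_text, default=0):
    """Read num_workers from a JSON config string.

    Falls back to `default` (0 = load data in the main process) when the
    key is absent, so users opt in to multiprocessing explicitly.
    """
    config = json.loads(config_text)
    # Hypothetical layout: {"loader_parameters": {"num_workers": 4}}
    value = config.get("loader_parameters", {}).get("num_workers", default)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError(f"num_workers must be a non-negative integer, got {value!r}")
    # Assumed design choice: never ask for more workers than CPUs reported.
    cpu_count = os.cpu_count() or 1
    return min(value, cpu_count)
```

The returned value could then be passed straight to `torch.utils.data.DataLoader(..., num_workers=...)`.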
