3D Training Slow -> New Priority: Speeding-up `ivadomed` #45

uzaymacar · 2021-06-17T01:07:04Z

A motivating scenario: training for 30 epochs with config/softseg_unet3D_unbalanced.json takes ~50 hours, and the full 200 epochs are expected to take ~320 hours. To experiment more freely, we need multi-GPU support and mixed-precision (MP) training in ivadomed, especially for 3D training. My best estimate is that these features could give us a > 4x speed-up.

PRs #42 and #44 introduce i) model-parallel, ii) data-parallel, and iii) mixed-precision (MP) training for the current project. In the meantime, an alternative is to train baseline models with modeling/train.py instead of ivadomed/training.py. However, ivadomed has many features that we'd like to use, so we have a new priority: introducing these features in ivadomed ASAP.

After guidance from @andreanne-lemay, I have decided to open three issues and subsequent PRs in ivadomed:

Support for num_workers > 0 in data loading.
Multi-GPU support (model- and data-parallel).
Mixed-precision (MP) training.
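
To make the three items concrete, here is a minimal sketch of how they usually look in plain PyTorch. It is not the code from PRs #42/#44 or from ivadomed: the tiny `Conv3d` stack, the random tensors, and the function names `build_model` and `train` are hypothetical stand-ins. It assumes PyTorch 1.6 or later, since `torch.cuda.amp` is not available in 1.5.0, and it only covers data-parallel (not model-parallel) training. On a machine without a GPU it falls back to ordinary full-precision CPU training.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def build_model(device):
    # Tiny stand-in for a 3D U-Net (hypothetical, for illustration only).
    model = nn.Sequential(
        nn.Conv3d(1, 4, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv3d(4, 1, kernel_size=3, padding=1),
    )
    if torch.cuda.device_count() > 1:
        # Data-parallel: each batch is split across the visible GPUs.
        model = nn.DataParallel(model)
    return model.to(device)


def train(num_epochs=1, num_workers=0):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    use_amp = device.type == "cuda"  # mixed precision only makes sense on GPU here

    model = build_model(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    # With enabled=False the scaler is a pass-through wrapper.
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

    # Random volumes standing in for real 3D data.
    data = TensorDataset(torch.randn(4, 1, 8, 8, 8), torch.randn(4, 1, 8, 8, 8))
    # num_workers > 0 loads batches in background worker processes.
    loader = DataLoader(data, batch_size=2, num_workers=num_workers)

    losses = []
    for _ in range(num_epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            # Forward pass and loss run in reduced precision where it is safe.
            with torch.cuda.amp.autocast(enabled=use_amp):
                loss = loss_fn(model(x), y)
            # Scale the loss to avoid fp16 gradient underflow, then step.
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            losses.append(loss.item())
    return losses
```

Calling `train(num_epochs=1, num_workers=0)` runs one pass over the toy data. When trying `num_workers > 0`, the call should sit under an `if __name__ == "__main__":` guard, because worker processes re-import the main module on platforms that spawn rather than fork.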

As discussed with @naga-karthik today, some of these features might not be compatible with torch=1.5.0, the version currently used in ivadomed. Therefore, the branches that implement them will likely stay separate from the main branch for a while, and integrating them later will take considerably more effort.

I will start on this plan tomorrow; the ETA is ~2 days. Any feedback or comments are appreciated!


naga-karthik · 2021-06-17T22:45:09Z

I vaguely remember that setting the num_workers parameter is not as straightforward as it sounds. There still seems to be no general "formula" for choosing its value, and there are many discussions about it:

This link says that setting num_workers > 0 raises errors on Windows machines due to their restricted multiprocessing capabilities.
This issue says that DataLoader becomes extremely slow with num_workers > 0.
This discussion lists some guidelines (again, highly subjective) for setting the num_workers value.

Therefore, in my opinion, it would be good to let the user set the num_workers value in the .json file, because we don't know which OS ivadomed will be used on. Personally, I have never had any issues with num_workers on Linux, but judging from these discussions, if we set this parameter for the users, it will be difficult for them to debug.
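
One way to picture this suggestion: read num_workers from the parsed .json config and fall back to 0 when the user has not set it. The sketch below is only an illustration; the key path `training_parameters` / `num_workers`, the helper name `resolve_num_workers`, and the default of 0 are assumptions, not the existing ivadomed config schema.

```python
import json

# Assumption: 0 means "load data in the main process", which avoids
# multiprocessing altogether and is therefore the safest cross-OS default.
DEFAULT_NUM_WORKERS = 0


def resolve_num_workers(config):
    """Return the user's num_workers setting, or the default if unset."""
    # Hypothetical key path; the real config layout may differ.
    params = config.get("training_parameters", {})
    value = params.get("num_workers", DEFAULT_NUM_WORKERS)
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError(f"num_workers must be a non-negative integer, got {value!r}")
    return value


# Example config as it might appear in the user's .json file.
config = json.loads('{"training_parameters": {"batch_size": 2, "num_workers": 4}}')
num_workers = resolve_num_workers(config)
# The result would then be passed on as DataLoader(..., num_workers=num_workers).
```

Keeping the default at 0 means users who never touch the setting get the single-process behaviour, and anyone who opts in to a higher value knows which knob to turn back if loading breaks or slows down on their OS.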

uzaymacar self-assigned this Jun 17, 2021
