A motivating scenario: 30 epochs with config/softseg_unet3D_unbalanced.json take ~50 hours, and the full 200 epochs are expected to take ~320 hours. For us to experiment more freely, multi-GPU support and mixed-precision (MP) training are necessary in ivadomed, especially for 3D training. By my best estimate, these features could give a > 4x speed-up.
PRs #42 and #44 introduce i) model-parallel, ii) data-parallel, and iii) mixed-precision (MP) training for the current project. In the meantime, an alternative is to train baseline models with modeling/train.py instead of ivadomed/training.py. However, ivadomed has many features that we'd like to use, so we have a new priority: introducing these features in ivadomed ASAP.
After guidance from @andreanne-lemay, I have decided to open three issues and subsequent PRs in ivadomed:
Support for num_workers > 0 in data loading in ivadomed.
Multi-GPU support (model- and data-parallel).
Mixed-precision (MP) training.
As discussed with @naga-karthik today, some of these might not be compatible with torch=1.5.0, which ivadomed currently uses. Therefore, the branches that implement these features will likely stay separate from the main branch for a while, and integrating them into the main branch will come later and take considerably more effort.
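To make the compatibility concern concrete, here is a minimal sketch of a version gate. It assumes that the mixed-precision work would rely on PyTorch's native `torch.cuda.amp` module, which to my knowledge first shipped in PyTorch 1.6; the helper names `parse_version` and `supports_native_amp` are hypothetical and not part of ivadomed.

```python
def parse_version(version):
    """Turn a version string such as '1.5.0+cu101' into a tuple of ints."""
    core = version.split("+")[0]  # drop a local build suffix like '+cu101'
    parts = []
    for piece in core.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at pre-release markers such as 'a0' or 'rc1'
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def supports_native_amp(torch_version):
    """Assumption: native AMP (torch.cuda.amp) requires PyTorch >= 1.6."""
    return parse_version(torch_version) >= (1, 6)


if __name__ == "__main__":
    for v in ("1.5.0", "1.6.0", "1.10.2+cu113"):
        print(v, supports_native_amp(v))
```

In practice the string passed in would be `torch.__version__`; with the pinned torch=1.5.0 the gate returns `False`, which is why the MP branch may need to diverge from the main branch for a while.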
I will start on this plan tomorrow, and the ETA is ~2 days. Any feedback or comments are appreciated!
I vaguely remember that setting the num_workers parameter is not as straightforward as it sounds. It seems that there's still no general "formula" for setting this parameter's value. There are many discussions that talk about this issue:
This link says that setting num_workers > 0 raises errors on Windows machines due to their restricted multiprocessing capabilities.
This issue says that DataLoader becomes extremely slow with num_workers > 0.
This discussion lists some guidelines (again, highly subjective) for setting the num_workers value.
Therefore, in my opinion, it would be good to let the user set the num_workers value in the .json file, because we don't know under which OS ivadomed will be used. Personally, I have never had any issues with num_workers on Linux, but judging from these discussions, if we hard-code this parameter for the users, it will be difficult for them to debug.
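As a rough illustration of that suggestion, the sketch below reads a worker count from a parsed .json config. The top-level `"num_workers"` key and the `resolve_num_workers` helper are hypothetical, not an existing ivadomed setting or function; the default of 0 keeps data loading in the main process, which is the safe choice on platforms where worker processes cause trouble.

```python
import json
import os


def resolve_num_workers(config, default=0):
    """Return the worker count requested in a parsed config dict.

    Assumption: the user sets a top-level "num_workers" key (hypothetical).
    A missing key falls back to `default`; 0 means no worker processes.
    """
    value = config.get("num_workers", default)
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError("num_workers must be a non-negative integer")
    # Never request more workers than there are CPU cores
    return min(value, os.cpu_count() or 1)


if __name__ == "__main__":
    user_config = json.loads('{"num_workers": 1}')
    print(resolve_num_workers(user_config))
    print(resolve_num_workers({}))
```

The returned value would then be passed as the `num_workers` argument of `torch.utils.data.DataLoader`, so users on any OS can fall back to 0 without touching the code.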