Custom Modeling Groundwork #42
Conversation
modeling/train.py
Outdated
optimizer.zero_grad()

x1, x2, y = batch
x1, x2, y = x1.to(device), x2, y.to(device)
Why isn't x2 being transferred (i.e. x2.to(device))?
Ah great catch 🙂 , thanks for spotting this!
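For reference, a minimal sketch of the fixed transfer (tensor shapes here are made up for illustration; the real batch comes from the MSSeg2Dataset loader):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in batch: two input volumes and a ground-truth mask
x1 = torch.randn(2, 1, 8, 8, 8)
x2 = torch.randn(2, 1, 8, 8, 8)
y = torch.randint(0, 2, (2, 1, 8, 8, 8)).float()

# Transfer every tensor in the batch, not just x1 and y
x1, x2, y = x1.to(device), x2.to(device), y.to(device)
```

Leaving x2 on the CPU would raise a device-mismatch error the first time it is combined with x1 inside the model.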
modeling/train.py
Outdated
# NOTE: Setting seeds requires cuda.deterministic = True, which slows things down considerably
# set_seed(seed=args.seed)
Could you provide the source for where it says cuda.deterministic = True when setting a seed? I was under the impression that just torch.manual_seed(0) should do the job.
Here is the most relevant thing I can find with a quick search (scroll down for the answer). Pretty much, this is an interesting question with torch, and what people suggest is that you seed every aspect of torch. There was a longer discussion in the official PyTorch discussion forums a year or two ago. The entire procedure (including for np and random) looks like:
import random
import numpy as np
import torch

SEED = 42
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
We don't need seeds in the model IMO, just to make things a little bit easier / faster (we do need them in data splits though). I will just remove that line in my next commit(s).
modeling/train.py
Outdated
center_crop_size=(320, 384, 512))

train_dataset, val_dataset = split_dataset(dataset=dataset, val_size=0.3, seed=args.seed)
I think the entire idea of synthesizing with GANs must occur between these lines. Here's what I have in mind:
- MSSeg2Dataset returns input volume tensors (x1, x2) and GTs (y) --> pass these as input to the GAN
- Get synthesized (new) MR volumes
- Append these additional volumes to the original dataset; call it, say, dataset_appended
- Pass dataset_appended into the split_dataset function and let the training proceed as usual

Thoughts?
Yes, this sounds great 🥳! Only minor clarification / check: we only want to append to the train_dataset, right? After split_dataset? My thinking is: we don't want to validate on synthesized images.
Ah, yes! That makes sense. It would be better to validate on the original data.
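The split-first-then-append flow agreed on here can be sketched as below. Everything is a stand-in: random_split stands in for the repo's split_dataset, random tensors stand in for both MSSeg2Dataset and the GAN's synthesized volumes, and dataset_appended is the hypothetical name from the discussion.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset, random_split

# Stand-in for MSSeg2Dataset: 10 samples of (x1, x2, y) volumes
dataset = TensorDataset(torch.randn(10, 1, 4, 4, 4),
                        torch.randn(10, 1, 4, 4, 4),
                        torch.randint(0, 2, (10, 1, 4, 4, 4)).float())

# 1) Split first, so validation contains only real data
val_size = 0.3
n_val = int(len(dataset) * val_size)
train_dataset, val_dataset = random_split(
    dataset, [len(dataset) - n_val, n_val],
    generator=torch.Generator().manual_seed(42))

# 2) Synthesize new volumes (random noise standing in for GAN output)
synth_dataset = TensorDataset(torch.randn(5, 1, 4, 4, 4),
                              torch.randn(5, 1, 4, 4, 4),
                              torch.randint(0, 2, (5, 1, 4, 4, 4)).float())

# 3) Append to the *train* split only
train_dataset_appended = ConcatDataset([train_dataset, synth_dataset])
```

Only train_dataset_appended would then be handed to the training DataLoader; val_dataset stays untouched.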
With the new commits, we are able to successfully run training for a test model on a single GPU with the following command:
Next step is to work on multi-GPU training (both model- and data-parallel). I am unable to utilize this feature currently.
Turns out model-parallel and data-parallel training works completely fine on
However, the same commands result in a freeze in
We can see this on [IO_PAGE_FAULT]
Potentially the same:
Luckily, it seems like you're not alone: pytorch/pytorch#1637 (comment)
Just adding this here in case it is helpful in a future deeper debug session: it always completes the forward pass and backprop for a single batch, and then freezes at the start of the second batch.
I've applied the
Can you please try training again @uzaymacar? After this, I would like to try
self.upconv1 = nn.ConvTranspose3d(in_channels=16, out_channels=8, kernel_size=(3, 3, 3))
self.upconv2 = nn.ConvTranspose3d(in_channels=8, out_channels=1, kernel_size=(3, 3, 3))
Hold on! There's a famous issue with using ConvTranspose3d in that it results in checkerboard artefacts in the upsampling stage. The better alternative would be to use the "resize-conv" method, which uses nn.Upsample and nn.Conv consecutively. More on this can be found here.
I wasn't aware of this, thanks for pointing it out! Here is another discussion of transpose vs. upsampling, but for the 2D version. Ideally, I would have expected the transpose to work better or be equivalent; interesting. It might also be an application-dependent issue. The good thing is this is just a test model (not even a baseline model), to see if the training loop works as expected. So all good, no need to change this 🙃.
Ah okay, it doesn't matter if it's just a test model. But from what I know, the checkerboard issue is quite problematic, especially with 3D data, so I think it's an important point to consider. Maybe we could try with and without Upsample-Conv to see how it really is in our case.
Uuu I see, if it's 3D-specific then I wouldn't have any idea 😄. On another note, I am implementing my multi-channel TransUNet by extending from here which does use the up-sample + conv combination as you suggest!
@kousu It worked 🥳. Lessons: the Internet and Nick never fail 🙂!
Oh no, sometimes I definitely fail. I did this last night: https://github.com/neuropoly/computers/issues/124 But thank you! Here I'm trying with the other suggestion,
Can you please try a second time @uzaymacar?
@kousu This time it didn't work. We entertained an additional hypothesis: what if only
I've done a writeup about the parallelization problems at https://github.com/neuropoly/computers/pull/87#issuecomment-860779816. They are fixed now by going with
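For future readers, the basic shape of wrapping a model for single-node data-parallel training is sketched below. This is a generic illustration, not the fix described in the writeup above; the toy model is made up, and when nn.DataParallel hangs (as reported in this thread), torch.nn.parallel.DistributedDataParallel is the commonly recommended alternative at the cost of a process-group / launcher setup.

```python
import torch
import torch.nn as nn

# Toy 3D model standing in for the real test model in modeling/train.py
model = nn.Sequential(nn.Conv3d(1, 4, kernel_size=3, padding=1), nn.ReLU())

# Data-parallel wrapping: nn.DataParallel splits each batch across visible
# GPUs; with 0 or 1 GPUs it degrades gracefully to a plain forward pass.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

out = model(torch.randn(2, 1, 4, 4, 4))  # batch of 2 is what gets split
```

Model-parallel training (placing different layers on different devices) requires explicit .to(device) calls per submodule instead and is not shown here.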
This PR introduces the groundwork for custom modeling for the challenge. Custom refers to the fact that this is not ivadomed-dependent, at least not directly (we still can / should make use of utilities like transformations etc.). The reason for going this route is to experiment with custom models like TransUNet (#37) and GANs (@naga-karthik) currently not supported in ivadomed, and also to be able to change / customize the data-loading process quickly (e.g. add all experts' GTs in the batch, change the multi-channel logic, etc.).

For now, we can focus the review on modeling/datasets.py and modeling/utils.py, which handle data-loading. The remaining modeling/train.py handles multi-GPU parallel training, but hasn't been tested yet.

Short-Term Goals:
- Integrate RandomAffine into MSSeg2Dataset.__getitem__(). ✅
- Make sure modeling/train.py works as expected. ✅
- Understand the ivadomed codebase better (e.g. can we directly use the MRI3DSubVolumeSegmentationDataset here?).

Long-Term Goals:
- … ivadomed.
- … ivadomed.