Custom Modeling Groundwork #42

Merged — 8 commits merged into main from um/custom_modeling on Jun 14, 2021

Conversation

@uzaymacar uzaymacar commented Jun 13, 2021

This PR introduces the groundwork for custom modeling for the challenge. Custom refers to the fact that this is not ivadomed-dependent, at least not directly (we can / should still make use of utilities like transformations etc.). The reason for going this route is to experiment with custom models like TransUNet (#37) and GANs (@naga-karthik) that are currently not supported in ivadomed, and also to be able to change / customize the data-loading process quickly (e.g. add all experts' GT in the batch, change multi-channel logic, etc.).

For now, we can focus the review on modeling/datasets.py and modeling/utils.py, which handle data-loading. The remaining modeling/train.py handles multi-GPU parallel training, but hasn't been tested yet.

Short-Term Goals:

  • Add (& understand) RandomAffine in MSSeg2Dataset.__getitem__() (see the sketch after this list). ✅
  • Explore elastic-transform data augmentation.
  • Add independent data augmentations between two sessions as suggested in the latest meeting.
  • Verify that modeling/train.py works as expected. ✅
  • Think about how we can utilize the ivadomed codebase better (e.g. can we directly use the MRI3DSubVolumeSegmentationDataset here?).
  • Continue working on TransUNet in a new, future branch / PR.
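
For reference, a minimal sketch of what the affine augmentation inside __getitem__ could look like. This is an illustration only, assuming torchio's RandomAffine as the 3D transform; the actual MSSeg2Dataset internals and transform library may differ:

import torch
import torchio as tio

class MSSeg2DatasetSketch(torch.utils.data.Dataset):
    """Hedged sketch, not the PR's actual code. Volumes are 4D tensors (C, X, Y, Z)."""
    def __init__(self, x1_volumes, x2_volumes, gts):
        self.x1_volumes, self.x2_volumes, self.gts = x1_volumes, x2_volumes, gts
        self.affine = tio.RandomAffine(scales=0.1, degrees=10)

    def __len__(self):
        return len(self.gts)

    def __getitem__(self, idx):
        # Wrap both sessions and the GT in one Subject so the SAME affine is
        # applied to all three and they stay spatially aligned. Independent
        # per-session augmentations (a goal above) would instead sample a
        # separate transform for each session.
        subject = tio.Subject(
            x1=tio.ScalarImage(tensor=self.x1_volumes[idx]),
            x2=tio.ScalarImage(tensor=self.x2_volumes[idx]),
            y=tio.LabelMap(tensor=self.gts[idx]),
        )
        subject = self.affine(subject)
        return subject['x1'].data, subject['x2'].data, subject['y'].data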

Long-Term Goals:

  • Incorporate features from here (e.g. multi-GPU training, any new augmentations, etc.) into ivadomed.
  • Incorporate successful (if applicable) models into ivadomed.

optimizer.zero_grad()

x1, x2, y = batch
x1, x2, y = x1.to(device), x2, y.to(device)
Member:
Why isn't x2 being transferred (i.e. x2.to(device))?

Contributor Author:

Ah, great catch 🙂, thanks for spotting this!
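
The corrected line, for the record (moving all three tensors to the device):

x1, x2, y = batch
x1, x2, y = x1.to(device), x2.to(device), y.to(device)  # x2 now transferred too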

Comment on lines 251 to 252
# NOTE: Setting seeds requires cuda.deterministic = True, which slows things down considerably
# set_seed(seed=args.seed)
Member:

Could you provide the source for where it says cuda.deterministic=True when setting a seed? I was under the impression that just torch.manual_seed(0) should do the job.

Contributor Author:

Here is the most relevant thing I could find with a quick search (scroll down for the answer). This is an interesting question with torch, and what people generally suggest is that you seed every aspect of it. There was a longer discussion in the official PyTorch discussion forums a year or two ago. The entire procedure (including for np and random) looks like:

import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)                          # Python's built-in RNG
np.random.seed(SEED)                       # NumPy RNG
torch.manual_seed(SEED)                    # PyTorch CPU RNG
torch.cuda.manual_seed(SEED)               # current GPU
torch.cuda.manual_seed_all(SEED)           # all GPUs
torch.backends.cudnn.benchmark = False     # disable the cuDNN autotuner
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels

Contributor Author (@uzaymacar), Jun 14, 2021:

We don't need seeds in the model IMO; skipping them keeps things a little easier / faster (we do need them in the data splits though). I will just remove that line in my next commit(s).

Comment on lines 118 to 120
center_crop_size=(320, 384, 512))

train_dataset, val_dataset = split_dataset(dataset=dataset, val_size=0.3, seed=args.seed)
Member (@naga-karthik), Jun 14, 2021:

I think the entire idea of synthesizing with GANs must occur between these lines. Here's what I have in mind:

  1. MSSeg2Dataset returns input volume tensors (x1, x2) and GTs (y) --> Pass these as input to the GAN
  2. Get synthesized (new) MR volumes
  3. Append these additional volumes to the original dataset, call it, say, dataset_appended
  4. Pass dataset_appended into the split_dataset function and let the training proceed as usual

Thoughts?

Contributor Author:

Yes, this sounds great 🥳! Only minor clarification / check: we only want to append to the train_dataset right? After split_dataset? My thinking is: we don't want to validate on synthesized images.

Member:

Ah, oui! That makes sense. It would be better to validate on the original data.
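
To make the agreed flow concrete, a minimal sketch. The gan object and its synthesize() method are hypothetical placeholders for whatever GAN we end up with; split_dataset is the existing helper, and only the training split is augmented:

import torch
from torch.utils.data import ConcatDataset, TensorDataset

# Split first, so synthesized volumes never leak into validation
train_dataset, val_dataset = split_dataset(dataset=dataset, val_size=0.3, seed=args.seed)

synthesized = []
for x1, x2, y in train_dataset:
    x1_syn, x2_syn = gan.synthesize(x1, x2)  # hypothetical GAN call
    synthesized.append((x1_syn.unsqueeze(0), x2_syn.unsqueeze(0), y.unsqueeze(0)))

x1_all, x2_all, y_all = (torch.cat(tensors) for tensors in zip(*synthesized))
train_dataset = ConcatDataset([train_dataset, TensorDataset(x1_all, x2_all, y_all)])
# val_dataset stays untouched: validation runs on original images only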

uzaymacar commented Jun 14, 2021

With the new commits, we are able to successfully run training for a test model on a single GPU with the following command:

export CUDA_VISIBLE_DEVICES=<GPU_ID>
python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -fd 0.05 -loc -1

The next step is to work on multi-GPU training (both model- and data-parallel); I am currently unable to use this feature. A sketch of the data-parallel wrapping follows.
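
A minimal sketch of the data-parallel path, assuming a standard torch.nn.DataParallel wrap (TestModel is a placeholder name; the actual train.py wiring may differ):

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TestModel()                    # placeholder for the actual model class
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)     # replicate across all visible GPUs
model = model.to(device)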

uzaymacar commented Jun 14, 2021

Turns out model-parallel and data-parallel training work completely fine on rosenberg, with the following commands respectively:

export CUDA_VISIBLE_DEVICES=<[GPU_IDs]>
python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -loc 0 -fd 0.05 -bs 8 -nw 0
export CUDA_VISIBLE_DEVICES=<[GPU_IDs]>
python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -loc -1 -fd 0.05 -bs 32

However, the same commands result in a freeze on romane. This means romane needs some more work regarding its GPU / CUDA setup. We need to enable these features on romane ASAP, as rosenberg is busy and we need romane's 50GB GPUs for the TransUNet model.

kousu commented Jun 14, 2021

> Turns out model-parallel and data-parallel training work completely fine on rosenberg, with the following commands respectively:
>
> export CUDA_VISIBLE_DEVICES=<[GPU_IDs]>
> python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -loc 0 -fd 0.05 -bs 8 -nw 0
> export CUDA_VISIBLE_DEVICES=<[GPU_IDs]>
> python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -loc -1 -fd 0.05 -bs 32
>
> However, the same commands result in a freeze on romane. This means romane needs some more work regarding its GPU / CUDA setup. We need to enable these features on romane ASAP, as rosenberg is busy and we need romane's 50GB GPUs for the TransUNet model.

We can see this by running ssh root@romane dmesg | grep nvidia:

[916290.190294] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916290.190907] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916290.191419] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916290.191907] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916290.428077] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916351.674646] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916351.675217] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916351.675707] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916351.676169] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916351.777972] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916414.338377] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916414.338954] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916414.339413] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916414.339894] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916414.444979] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916511.150622] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916511.151155] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916511.151598] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916511.152036] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916511.414317] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916600.700131] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916600.700854] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916600.701403] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916600.701922] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916600.876550] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916836.982157] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916836.983054] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916836.983907] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916836.984759] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916837.287868] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916906.216136] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916906.216989] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916906.217766] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916906.218563] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916906.306424] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916948.617754] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916948.618234] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916948.618585] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916948.618915] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916948.748545] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916991.220900] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916991.221658] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916991.222338] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916991.223006] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916991.363535] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[917249.468854] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917249.469549] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917249.470206] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917249.470213] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917249.588774] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[917304.369804] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917304.370453] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917304.371057] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917304.371652] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917304.463293] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[917949.560687] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917949.561063] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917949.561356] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917949.561634] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917949.579373] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[919452.984436] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[919452.985028] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000600 flags=0x0030]
[919452.985419] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000b80 flags=0x0030]
[919452.985692] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030001000 flags=0x0030]
[919452.985959] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[919452.986217] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030001580 flags=0x0030]
[919452.986464] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030001b00 flags=0x0030]
[919452.986708] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030002000 flags=0x0030]
[919452.986947] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030002580 flags=0x0030]
[919452.987182] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030002b00 flags=0x0030]
[927663.886635] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0xb0139068 flags=0x0020]
[927663.887232] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060007000 flags=0x0020]
[927663.887778] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060002880 flags=0x0020]
[927663.888314] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x2806000e480 flags=0x0020]
[927663.888838] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060003480 flags=0x0020]
[927663.889349] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060006c00 flags=0x0020]
[927663.889847] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060004400 flags=0x0020]
[927663.890348] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060001400 flags=0x0020]
[927663.890835] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060008080 flags=0x0020]
[927663.891326] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060005880 flags=0x0020]
[927670.820450] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xde139068 flags=0x0020]
[927670.820986] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0000000 flags=0x0020]
[927670.821467] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0000f80 flags=0x0020]
[927670.821933] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0001000 flags=0x0020]
[927670.822391] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0001f00 flags=0x0020]
[927670.822840] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0002000 flags=0x0020]
[927670.823299] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0002e80 flags=0x0020]
[927670.823740] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0003000 flags=0x0020]
[927670.824167] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0003e00 flags=0x0020]
[927670.824599] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0004000 flags=0x0020]

Potentially the same:

Luckily, it seems like you're not alone: pytorch/pytorch#1637 (comment)

[screenshot: 2021-06-14-103124_803x799_scrot]

@uzaymacar

Just adding this here in case it is helpful in a future, deeper debug session: it always completes the forward pass and backprop for the first batch, and then freezes at the start of the second batch.

kousu commented Jun 14, 2021

I've applied the iommu=soft workaround:

root@romane:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-74-generic root=UUID=edfd4dc4-eedd-40b3-ad86-b790b63c27bc ro
root@romane:~# vi /etc/default/grub  # edit it in
root@romane:~# cat /etc/default/grub # see the result
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="iommu=soft"

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"
root@romane:~# update-grub
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.4.0-74-generic
Found initrd image: /boot/initrd.img-5.4.0-74-generic
Found linux image: /boot/vmlinuz-5.4.0-73-generic
Found initrd image: /boot/initrd.img-5.4.0-73-generic
Found Ubuntu 18.04.5 LTS (18.04) on /dev/sda1
done
root@romane:~# reboot
Connection to romane.neuro.polymtl.ca closed by remote host.
Connection to romane.neuro.polymtl.ca closed.
$ ssh root@romane
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly': 
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-74-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

1 update can be applied immediately.
To see these additional updates run: apt list --upgradable


Last login: Mon Jun 14 10:17:01 2021 from 10.10.0.214
root@romane:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-74-generic root=UUID=edfd4dc4-eedd-40b3-ad86-b790b63c27bc ro iommu=soft
root@romane:~# 

Can you please try training again, @uzaymacar?

After this, I would like to try iommu=amd since romane is an AMD machine, and then iommu=amd iommu=soft together (if that's possible). Whichever one seems best, I will make permanent in https://github.com/neuropoly/computers/pull/87

modeling/datasets.py — outdated review thread (resolved)
Comment on lines +13 to +14
self.upconv1 = nn.ConvTranspose3d(in_channels=16, out_channels=8, kernel_size=(3, 3, 3))
self.upconv2 = nn.ConvTranspose3d(in_channels=8, out_channels=1, kernel_size=(3, 3, 3))
Member (@naga-karthik), Jun 14, 2021:

Hold on! There's a well-known issue with ConvTranspose3d: it produces checkerboard artefacts in the upsampling stage. The better alternative would be the "resize-conv" method, which uses nn.Upsample and nn.Conv consecutively. More on this can be found here.

Contributor Author (@uzaymacar), Jun 14, 2021:

I wasn't aware of this, thanks for pointing it out! Here is another discussion of transpose vs. upsampling, but for the 2D version. Interestingly, I would have expected the transpose to work better or at least equivalently; it might also be an application-dependent issue. The good thing is that this is just a test model (not even a baseline model), meant to check that the training loop works as expected. So all good, no need to change this 🙃.

Member:

Ah okay, it doesn't matter if it's just a test model. But from what I know, the checkerboard issue is quite problematic, especially with 3D data, so I think it's an important point to consider. Maybe we could try both with and without Upsample + Conv to see how it plays out in our case.

Contributor Author:

Uuu I see, if it's 3D-specific then I wouldn't have any idea 😄. On another note, I am implementing my multi-channel TransUNet by extending from here, which does use the upsample + conv combination as you suggest!
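
For concreteness, a minimal sketch of the resize-conv alternative discussed above. Channel sizes mirror the test model's two ConvTranspose3d layers; this is an illustration, not the PR's implementation:

import torch.nn as nn

# "Resize-conv" upsampling: nn.Upsample followed by a regular convolution,
# instead of nn.ConvTranspose3d, to avoid checkerboard artefacts.
# padding=1 preserves spatial dimensions after each 3x3x3 convolution.
resize_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
    nn.Conv3d(in_channels=16, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
    nn.Conv3d(in_channels=8, out_channels=1, kernel_size=3, padding=1),
)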

@uzaymacar

@kousu It worked 🥳. Lessons: The Internet and Nick never fails 🙂!

kousu commented Jun 14, 2021

> @kousu It worked 🥳. Lessons: The Internet and Nick never fails 🙂!

Oh no, sometimes I definitely fail. I did this last night: https://github.com/neuropoly/computers/issues/124

But thank you!

Here I'm trying with the other suggestion, iommu=amd:

root@romane:~# vi /etc/default/grub
root@romane:~# cat /etc/default/grub | grep CMDLINE
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="iommu=amd"
root@romane:~# update-grub
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.4.0-74-generic
Found initrd image: /boot/initrd.img-5.4.0-74-generic
Found linux image: /boot/vmlinuz-5.4.0-73-generic
Found initrd image: /boot/initrd.img-5.4.0-73-generic
Found Ubuntu 18.04.5 LTS (18.04) on /dev/sda1
done
root@romane:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-74-generic root=UUID=edfd4dc4-eedd-40b3-ad86-b790b63c27bc ro iommu=soft
root@romane:~# reboot
Connection to romane.neuro.polymtl.ca closed by remote host.
Connection to romane.neuro.polymtl.ca closed.
$ ssh root@romane
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-74-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

1 update can be applied immediately.
To see these additional updates run: apt list --upgradable


Last login: Mon Jun 14 10:44:05 2021 from 10.10.0.214
root@romane:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-74-generic root=UUID=edfd4dc4-eedd-40b3-ad86-b790b63c27bc ro iommu=amd

Can you please try a second time, @uzaymacar?


uzaymacar commented Jun 14, 2021

@kousu This time it didn't work. We entertained an additional hypothesis: what if only stdout froze but the process continued? It seemed plausible because the GPU memory was still allocated and the processes appeared busy. However, after a quick test, I don't think this is the case.

kousu commented Jun 14, 2021

I've done a writeup about the parallelization problems at https://github.com/neuropoly/computers/pull/87#issuecomment-860779816. They are fixed now by going with iommu=soft and you should be able to merge your DataParallel code just fine when you're ready with it.

@uzaymacar uzaymacar merged commit a479560 into main Jun 14, 2021
@uzaymacar uzaymacar deleted the um/custom_modeling branch June 14, 2021 22:29