Custom Modeling Groundwork #42

Merged — 8 commits merged into main from um/custom_modeling on Jun 14, 2021

Conversation

@uzaymacar uzaymacar commented Jun 13, 2021

This PR introduces the groundwork for custom modeling for the challenge. Custom refers to the fact that this is not ivadomed-dependent, at least not directly (we can / should still make use of utilities like transformations etc.). The reason for going this route is to experiment with custom models like TransUNet (#37) and GANs (@naga-karthik) that are currently not supported in ivadomed, and also to be able to change / customize the data-loading process quickly (e.g. add all experts' GT in the batch, change multi-channel logic, etc.).

For now, we can focus the review on modeling/datasets.py and modeling/utils.py, which handle data-loading. The remaining modeling/train.py handles multi-GPU parallel training, but hasn't been tested yet.

Short-Term Goals:

  • Add (& understand) RandomAffine in MSSeg2Dataset.__getitem__() (see the sketch after this list). ✅
  • Explore elastic-transform data augmentation.
  • Add independent data augmentations between two sessions as suggested in the latest meeting.
  • Verify that modeling/train.py works as expected. ✅
  • Think about how we can utilize the ivadomed codebase better (e.g. can we directly use the MRI3DSubVolumeSegmentationDataset here?).
  • Continue working on TransUNet in a new, future branch / PR.
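
For reference, a minimal sketch of what the affine augmentation inside __getitem__ could look like. This is an illustration only, assuming torchio's RandomAffine as the 3D transform; the actual MSSeg2Dataset internals and transform library may differ:

import torch
import torchio as tio

class MSSeg2DatasetSketch(torch.utils.data.Dataset):
    """Hedged sketch, not the PR's actual code. Volumes are 4D tensors (C, X, Y, Z)."""
    def __init__(self, x1_volumes, x2_volumes, gts):
        self.x1_volumes, self.x2_volumes, self.gts = x1_volumes, x2_volumes, gts
        self.affine = tio.RandomAffine(scales=0.1, degrees=10)

    def __len__(self):
        return len(self.gts)

    def __getitem__(self, idx):
        # Wrap both sessions and the GT in one Subject so the SAME affine is
        # applied to all three and they stay spatially aligned. Independent
        # per-session augmentations (a goal above) would instead sample a
        # separate transform for each session.
        subject = tio.Subject(
            x1=tio.ScalarImage(tensor=self.x1_volumes[idx]),
            x2=tio.ScalarImage(tensor=self.x2_volumes[idx]),
            y=tio.LabelMap(tensor=self.gts[idx]),
        )
        subject = self.affine(subject)
        return subject['x1'].data, subject['x2'].data, subject['y'].data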

Long-Term Goals:

  • Incorporate features from here (e.g. multi-GPU training, any new augmentations, etc.) into ivadomed.
  • Incorporate successful (if applicable) models into ivadomed.

optimizer.zero_grad()

x1, x2, y = batch
x1, x2, y = x1.to(device), x2, y.to(device)
Member:
Why isn't x2 being transferred (i.e. x2.to(device))?

Contributor Author:

Ah, great catch 🙂, thanks for spotting this!
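
The corrected line, for the record (moving all three tensors to the device):

x1, x2, y = batch
x1, x2, y = x1.to(device), x2.to(device), y.to(device)  # x2 now transferred too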

Comment on lines 251 to 252
# NOTE: Setting seeds requires cuda.deterministic = True, which slows things down considerably
# set_seed(seed=args.seed)
Member:

Could you provide the source for where it says cuda.deterministic=True when setting a seed? I was under the impression that just torch.manual_seed(0) should do the job.

Contributor Author:

Here is the most relevant thing I could find with a quick search (scroll down for the answer). This is an interesting question with torch, and what people generally suggest is that you seed every aspect of it. There was a longer discussion in the official PyTorch discussion forums a year or two ago. The entire procedure (including for np and random) looks like:

import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)                          # Python's built-in RNG
np.random.seed(SEED)                       # NumPy RNG
torch.manual_seed(SEED)                    # PyTorch CPU RNG
torch.cuda.manual_seed(SEED)               # current GPU
torch.cuda.manual_seed_all(SEED)           # all GPUs
torch.backends.cudnn.benchmark = False     # disable the cuDNN autotuner
torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels

Contributor Author (@uzaymacar), Jun 14, 2021:

We don't need seeds in the model IMO; skipping them keeps things a little easier / faster (we do need them in the data splits though). I will just remove that line in my next commit(s).

Comment on lines 118 to 120
center_crop_size=(320, 384, 512))

train_dataset, val_dataset = split_dataset(dataset=dataset, val_size=0.3, seed=args.seed)
Member (@naga-karthik), Jun 14, 2021:

I think the entire idea of synthesizing with GANs must occur between these lines. Here's what I have in mind:

  1. MSSeg2Dataset returns input volume tensors (x1, x2) and GTs (y) --> Pass these as input to the GAN
  2. Get synthesized (new) MR volumes
  3. Append these additional volumes to the original dataset, call it, say, dataset_appended
  4. Pass dataset_appended into the split_dataset function and let the training proceed as usual

Thoughts?

Contributor Author:

Yes, this sounds great 🥳! Only minor clarification / check: we only want to append to the train_dataset right? After split_dataset? My thinking is: we don't want to validate on synthesized images.

Member:

Ah, oui! That makes sense. It would be better to validate on the original data.
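
To make the agreed flow concrete, a minimal sketch. The gan object and its synthesize() method are hypothetical placeholders for whatever GAN we end up with; split_dataset is the existing helper, and only the training split is augmented:

import torch
from torch.utils.data import ConcatDataset, TensorDataset

# Split first, so synthesized volumes never leak into validation
train_dataset, val_dataset = split_dataset(dataset=dataset, val_size=0.3, seed=args.seed)

synthesized = []
for x1, x2, y in train_dataset:
    x1_syn, x2_syn = gan.synthesize(x1, x2)  # hypothetical GAN call
    synthesized.append((x1_syn.unsqueeze(0), x2_syn.unsqueeze(0), y.unsqueeze(0)))

x1_all, x2_all, y_all = (torch.cat(tensors) for tensors in zip(*synthesized))
train_dataset = ConcatDataset([train_dataset, TensorDataset(x1_all, x2_all, y_all)])
# val_dataset stays untouched: validation runs on original images only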

uzaymacar commented Jun 14, 2021

With the new commits, we are able to successfully run training for a test model on a single GPU with the following command:

export CUDA_VISIBLE_DEVICES=<GPU_ID>
python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -fd 0.05 -loc -1

The next step is to work on multi-GPU training (both model- and data-parallel); I am currently unable to use this feature. A sketch of the data-parallel wrapping follows.
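
A minimal sketch of the data-parallel path, assuming a standard torch.nn.DataParallel wrap (TestModel is a placeholder name; the actual train.py wiring may differ):

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TestModel()                    # placeholder for the actual model class
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)     # replicate across all visible GPUs
model = model.to(device)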

uzaymacar commented Jun 14, 2021

Turns out model-parallel and data-parallel training work completely fine on rosenberg, with the following commands respectively:

export CUDA_VISIBLE_DEVICES=<[GPU_IDs]>
python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -loc 0 -fd 0.05 -bs 8 -nw 0
export CUDA_VISIBLE_DEVICES=<[GPU_IDs]>
python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -loc -1 -fd 0.05 -bs 32

However, the same commands result in a freeze on romane. This means romane needs some more work regarding its GPU / CUDA setup. We need to enable these features on romane ASAP, as rosenberg is busy and we need romane's 50GB GPUs for the TransUNet model.

kousu commented Jun 14, 2021

> Turns out model-parallel and data-parallel training work completely fine on rosenberg, with the following commands respectively:
>
> export CUDA_VISIBLE_DEVICES=<[GPU_IDs]>
> python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -loc 0 -fd 0.05 -bs 8 -nw 0
> export CUDA_VISIBLE_DEVICES=<[GPU_IDs]>
> python train.py -id testmodel -dr ~/duke/projects/ivadomed/tmp_ms_challenge_2021_preprocessed/ -loc -1 -fd 0.05 -bs 32
>
> However, the same commands result in a freeze on romane. This means romane needs some more work regarding its GPU / CUDA setup. We need to enable these features on romane ASAP, as rosenberg is busy and we need romane's 50GB GPUs for the TransUNet model.

We can see this by running ssh root@romane dmesg | grep nvidia:

[916290.190294] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916290.190907] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916290.191419] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916290.191907] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916290.428077] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916351.674646] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916351.675217] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916351.675707] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916351.676169] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916351.777972] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916414.338377] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916414.338954] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916414.339413] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916414.339894] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916414.444979] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916511.150622] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916511.151155] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916511.151598] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916511.152036] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916511.414317] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916600.700131] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916600.700854] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916600.701403] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916600.701922] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916600.876550] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916836.982157] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916836.983054] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916836.983907] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916836.984759] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916837.287868] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916906.216136] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916906.216989] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916906.217766] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916906.218563] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916906.306424] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916948.617754] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916948.618234] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916948.618585] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916948.618915] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916948.748545] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[916991.220900] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916991.221658] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916991.222338] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[916991.223006] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[916991.363535] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[917249.468854] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917249.469549] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917249.470206] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917249.470213] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917249.588774] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[917304.369804] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917304.370453] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917304.371057] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917304.371652] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917304.463293] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[917949.560687] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917949.561063] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917949.561356] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[917949.561634] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[917949.579373] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139070 flags=0x0020]
[919452.984436] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000000 flags=0x0030]
[919452.985028] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000600 flags=0x0030]
[919452.985419] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030000b80 flags=0x0030]
[919452.985692] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030001000 flags=0x0030]
[919452.985959] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xf0139068 flags=0x0020]
[919452.986217] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030001580 flags=0x0030]
[919452.986464] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030001b00 flags=0x0030]
[919452.986708] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030002000 flags=0x0030]
[919452.986947] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030002580 flags=0x0030]
[919452.987182] nvidia 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x20030002b00 flags=0x0030]
[927663.886635] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0xb0139068 flags=0x0020]
[927663.887232] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060007000 flags=0x0020]
[927663.887778] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060002880 flags=0x0020]
[927663.888314] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x2806000e480 flags=0x0020]
[927663.888838] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060003480 flags=0x0020]
[927663.889349] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060006c00 flags=0x0020]
[927663.889847] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060004400 flags=0x0020]
[927663.890348] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060001400 flags=0x0020]
[927663.890835] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060008080 flags=0x0020]
[927663.891326] nvidia 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0046 address=0x28060005880 flags=0x0020]
[927670.820450] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xde139068 flags=0x0020]
[927670.820986] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0000000 flags=0x0020]
[927670.821467] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0000f80 flags=0x0020]
[927670.821933] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0001000 flags=0x0020]
[927670.822391] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0001f00 flags=0x0020]
[927670.822840] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0002000 flags=0x0020]
[927670.823299] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0002e80 flags=0x0020]
[927670.823740] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0003000 flags=0x0020]
[927670.824167] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0003e00 flags=0x0020]
[927670.824599] nvidia 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0030 address=0xc0004000 flags=0x0020]

Potentially the same:

Luckily, it seems like you're not alone: pytorch/pytorch#1637 (comment)

[screenshot: 2021-06-14-103124_803x799_scrot]

@uzaymacar

Just adding this here in case it is helpful in a future, deeper debug session: it always completes the forward pass and backprop for the first batch, and then freezes at the start of the second batch.

kousu commented Jun 14, 2021

I've applied the iommu=soft workaround:

root@romane:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-74-generic root=UUID=edfd4dc4-eedd-40b3-ad86-b790b63c27bc ro
root@romane:~# vi /etc/default/grub  # edit it in
root@romane:~# cat /etc/default/grub # see the result
# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="iommu=soft"

# Uncomment to enable BadRAM filtering, modify to suit your needs
# This works with Linux (no patch required) and with any kernel that obtains
# the memory map information from GRUB (GNU Mach, kernel of FreeBSD ...)
#GRUB_BADRAM="0x01234567,0xfefefefe,0x89abcdef,0xefefefef"

# Uncomment to disable graphical terminal (grub-pc only)
#GRUB_TERMINAL=console

# The resolution used on graphical terminal
# note that you can use only modes which your graphic card supports via VBE
# you can see them in real GRUB with the command `vbeinfo'
#GRUB_GFXMODE=640x480

# Uncomment if you don't want GRUB to pass "root=UUID=xxx" parameter to Linux
#GRUB_DISABLE_LINUX_UUID=true

# Uncomment to disable generation of recovery mode menu entries
#GRUB_DISABLE_RECOVERY="true"

# Uncomment to get a beep at grub start
#GRUB_INIT_TUNE="480 440 1"
root@romane:~# update-grub
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.4.0-74-generic
Found initrd image: /boot/initrd.img-5.4.0-74-generic
Found linux image: /boot/vmlinuz-5.4.0-73-generic
Found initrd image: /boot/initrd.img-5.4.0-73-generic
Found Ubuntu 18.04.5 LTS (18.04) on /dev/sda1
done
root@romane:~# reboot
Connection to romane.neuro.polymtl.ca closed by remote host.
Connection to romane.neuro.polymtl.ca closed.
$ ssh root@romane
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly': 
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-74-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

1 update can be applied immediately.
To see these additional updates run: apt list --upgradable


Last login: Mon Jun 14 10:17:01 2021 from 10.10.0.214
root@romane:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-74-generic root=UUID=edfd4dc4-eedd-40b3-ad86-b790b63c27bc ro iommu=soft
root@romane:~# 

Can you please try training again, @uzaymacar?

After this, I would like to try iommu=amd since romane is an AMD machine, and then iommu=amd iommu=soft together (if that's possible). Whichever one seems best, I will make permanent in https://github.com/neuropoly/computers/pull/87

modeling/datasets.py — outdated review thread (resolved)
Comment on lines +13 to +14
self.upconv1 = nn.ConvTranspose3d(in_channels=16, out_channels=8, kernel_size=(3, 3, 3))
self.upconv2 = nn.ConvTranspose3d(in_channels=8, out_channels=1, kernel_size=(3, 3, 3))
Member (@naga-karthik), Jun 14, 2021:

Hold on! There's a well-known issue with ConvTranspose3d: it produces checkerboard artefacts in the upsampling stage. The better alternative would be the "resize-conv" method, which uses nn.Upsample and nn.Conv consecutively. More on this can be found here.

Contributor Author (@uzaymacar), Jun 14, 2021:

I wasn't aware of this, thanks for pointing it out! Here is another discussion of transpose vs. upsampling, but for the 2D version. Interestingly, I would have expected the transpose to work better or at least equivalently; it might also be an application-dependent issue. The good thing is that this is just a test model (not even a baseline model), meant to check that the training loop works as expected. So all good, no need to change this 🙃.

Member:

Ah okay, it doesn't matter if it's just a test model. But from what I know, the checkerboard issue is quite problematic, especially with 3D data, so I think it's an important point to consider. Maybe we could try both with and without Upsample + Conv to see how it plays out in our case.

Contributor Author:

Uuu I see, if it's 3D-specific then I wouldn't have any idea 😄. On another note, I am implementing my multi-channel TransUNet by extending from here, which does use the upsample + conv combination as you suggest!
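
For concreteness, a minimal sketch of the resize-conv alternative discussed above. Channel sizes mirror the test model's two ConvTranspose3d layers; this is an illustration, not the PR's implementation:

import torch.nn as nn

# "Resize-conv" upsampling: nn.Upsample followed by a regular convolution,
# instead of nn.ConvTranspose3d, to avoid checkerboard artefacts.
# padding=1 preserves spatial dimensions after each 3x3x3 convolution.
resize_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
    nn.Conv3d(in_channels=16, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
    nn.Conv3d(in_channels=8, out_channels=1, kernel_size=3, padding=1),
)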

@uzaymacar

@kousu It worked 🥳. Lessons: The Internet and Nick never fails 🙂!

kousu commented Jun 14, 2021

> @kousu It worked 🥳. Lessons: The Internet and Nick never fails 🙂!

Oh no, sometimes I definitely fail. I did this last night: https://github.com/neuropoly/computers/issues/124

But thank you!

Here I'm trying with the other suggestion, iommu=amd:

root@romane:~# vi /etc/default/grub
root@romane:~# cat /etc/default/grub | grep CMDLINE
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="iommu=amd"
root@romane:~# update-grub
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.4.0-74-generic
Found initrd image: /boot/initrd.img-5.4.0-74-generic
Found linux image: /boot/vmlinuz-5.4.0-73-generic
Found initrd image: /boot/initrd.img-5.4.0-73-generic
Found Ubuntu 18.04.5 LTS (18.04) on /dev/sda1
done
root@romane:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-74-generic root=UUID=edfd4dc4-eedd-40b3-ad86-b790b63c27bc ro iommu=soft
root@romane:~# reboot
Connection to romane.neuro.polymtl.ca closed by remote host.
Connection to romane.neuro.polymtl.ca closed.
$ ssh root@romane
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-74-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

1 update can be applied immediately.
To see these additional updates run: apt list --upgradable


Last login: Mon Jun 14 10:44:05 2021 from 10.10.0.214
root@romane:~# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-5.4.0-74-generic root=UUID=edfd4dc4-eedd-40b3-ad86-b790b63c27bc ro iommu=amd

Can you please try a second time, @uzaymacar?


uzaymacar commented Jun 14, 2021

@kousu This time it didn't work. We entertained an additional hypothesis: what if only stdout froze but the process continued? It seemed plausible because the GPU memory was still allocated and the processes appeared busy. However, after a quick test, I don't think this is the case.

kousu commented Jun 14, 2021

I've done a writeup about the parallelization problems at https://github.com/neuropoly/computers/pull/87#issuecomment-860779816. They are fixed now by going with iommu=soft and you should be able to merge your DataParallel code just fine when you're ready with it.

@uzaymacar uzaymacar merged commit a479560 into main Jun 14, 2021
@uzaymacar uzaymacar deleted the um/custom_modeling branch June 14, 2021 22:29