Understanding progress bar #4225

Closed · Limtle opened this issue Oct 19, 2020 · 5 comments
Labels: 3rd party (Related to a 3rd-party), question (Further information is requested)

Comments

Limtle commented Oct 19, 2020

When training or validating on 2 nodes (8 GPUs per node), Lightning shows 16 copies of the same progress bar, each with a different loss, such as:
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=10.122, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=9.858, v_num=193413]
Epoch 0: 9%|▊ | 3/35 [00:14<02:29, 4.68s/it, loss=10.225, v_num=193413]
...
In other words, the output contains 16 progress bars when training on 16 GPUs. I suppose the samples used for training are different on each GPU, which is why the bars report different losses. The same situation also shows up during validation, so I am wondering whether different samples are distributed to each GPU during validation as well.
I use a map-style Dataset for training and an IterableDataset for validation.
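
A minimal sketch of the assumed behavior for the map-style training set: under DDP, Lightning's default replace_sampler_ddp=True attaches a torch.utils.data.distributed.DistributedSampler, so each rank iterates a disjoint shard and therefore reports a different loss. ToyDataset below is a hypothetical stand-in for Train_Dataset, not code from this issue.

import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

class ToyDataset(Dataset):
    # Hypothetical stand-in for Train_Dataset: 32 integer examples.
    def __len__(self):
        return 32
    def __getitem__(self, idx):
        return torch.tensor(idx)

# Simulate rank 0 of 16 DDP processes. With shuffle=False, rank 0 keeps
# indices 0 and 16, i.e. a disjoint 1/16 slice; the other ranks get the
# remaining slices, which is why every process reports a different loss.
sampler = DistributedSampler(ToyDataset(), num_replicas=16, rank=0, shuffle=False)
loader = DataLoader(ToyDataset(), sampler=sampler, batch_size=1)
print([int(b) for b in loader])  # [0, 16]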

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/35 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/35 [00:00<?, ?it/s] 
Epoch 0:   3%|▎         | 1/35 [00:08<04:50,  8.54s/it]
Epoch 0:   3%|▎         | 1/35 [00:08<04:50,  8.54s/it, loss=10.457, v_num=193413]
Epoch 0:   6%|▌         | 2/35 [00:11<03:06,  5.66s/it, loss=10.457, v_num=193413]
Epoch 0:   6%|▌         | 2/35 [00:11<03:06,  5.66s/it, loss=10.122, v_num=193413]
Epoch 0:   9%|▊         | 3/35 [00:14<02:29,  4.68s/it, loss=10.122, v_num=193413]
Epoch 0:   9%|▊         | 3/35 [00:14<02:29,  4.68s/it, loss=9.858, v_num=193413] 
Epoch 0:  11%|█▏        | 4/35 [00:16<02:09,  4.19s/it, loss=9.858, v_num=193413]
Epoch 0:  11%|█▏        | 4/35 [00:16<02:09,  4.19s/it, loss=9.626, v_num=193413]
Epoch 0:  14%|█▍        | 5/35 [00:19<01:57,  3.90s/it, loss=9.626, v_num=193413]
Epoch 0:  14%|█▍        | 5/35 [00:19<01:57,  3.90s/it, loss=9.432, v_num=193413]
Epoch 0:  17%|█▋        | 6/35 [00:22<01:47,  3.71s/it, loss=9.43
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/35 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/35 [00:00<?, ?it/s] 
Epoch 0:   3%|▎         | 1/35 [00:08<04:50,  8.54s/it]
Epoch 0:   3%|▎         | 1/35 [00:08<04:50,  8.54s/it, loss=10.468, v_num=193413]
Epoch 0:   6%|▌         | 2/35 [00:11<03:06,  5.66s/it, loss=10.468, v_num=193413]
Epoch 0:   6%|▌         | 2/35 [00:11<03:06,  5.66s/it, loss=10.122, v_num=193413]
Epoch 0:   9%|▊         | 3/35 [00:14<02:29,  4.68s/it, loss=10.122, v_num=193413]
Epoch 0:   9%|▊         | 3/35 [00:14<02:29,  4.68s/it, loss=9.846, v_num=193413] 
Epoch 0:  11%|█▏        | 4/35 [00:16<02:09,  4.19s/it, loss=9.846, v_num=193413]
Epoch 0:  11%|█▏        | 4/35 [00:16<02:09,  4.19s/it, loss=9.638, v_num=193413]
Epoch 0:  14%|█▍        | 5/35 [00:19<01:57,  3.90s/it, loss=9.638, v_num=193413]
Epoch 0:  14%|█▍        | 5/35 [00:19<01:57,  3.90s/it, loss=9.465, v_num=193413]
Epoch 0:  17%|█▋        | 6/35 [00:22<01:47,  3.71s/it, loss=9.46
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/35 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/35 [00:00<?, ?it/s] 
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating: 0it [00:00, ?it/s]
Validating:   0%|          | 1/30738 [00:01<16:11:25,  1.90s/it]
Epoch 0: : 36it [02:15,  3.76s/it, loss=7.437, v_num=193413]
Validating:   0%|          | 1/30738 [00:01<16:08:01,  1.89s/it]
Epoch 0: : 36it [02:15,  3.76s/it, loss=7.469, v_num=193413]
Validating:   0%|          | 1/30738 [00:01<16:08:42,  1.89s/it]
Epoch 0: : 36it [02:15,  3.76s/it, loss=7.447, v_num=193413]
Validating:   0%|          | 1/30738 [00:01<16:08:19,  1.89s/it]
Epoch 0: : 36it [02:15,  3.76s/it, loss=7.470, v_num=193413]
Validating:   0%|          | 1/30738 [00:01<16:12:26,  1.90s/it]
Epoch 0: : 36it [02:15,  3.76s/it, loss=7.443, v_num=193413]
Validating:   0%|          | 1/30738 [00:01<16:15:56,  1.91s/it]
Epoch 0: : 36it [02:15,  3.76s/it, loss=7.438, v_num=193413]
Validating:   0%|          | 1/30738 [00:01<16:14:04,  1.90s/it]
Epoch 0: : 36it [02:15,  3.76s/it, loss=7.433, v_num=193413]

import torch
from torch.utils.data import Dataset, IterableDataset
from tqdm import tqdm

class Train_Dataset(Dataset):
    def __init__(self, filepath):
        # Load every example into memory once.
        self.examples = torch.load(filepath)
        self.len = len(self.examples)

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        return self.examples[idx]


class Val_Dataset(IterableDataset):
    def __init__(self, filename_list):
        self.filename_list = filename_list
        # Pre-compute the total number of examples across all files.
        self.len = 0
        for file_path in tqdm(self.filename_list):
            self.len += len(torch.load(file_path))

    def __len__(self):
        return self.len

    def __iter__(self):
        # Stream examples file by file.
        for file_path in self.filename_list:
            for x in torch.load(file_path):
                yield x
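
A note and minimal sketch, assuming standard DDP semantics: a DistributedSampler cannot be attached to an IterableDataset, so unless the dataset shards itself, every rank iterates the full validation stream, which would match each rank's bar above showing 0/30738. ShardedValDataset is a hypothetical variant of Val_Dataset that skips examples by rank.

import torch
import torch.distributed as dist
from torch.utils.data import IterableDataset

class ShardedValDataset(IterableDataset):
    def __init__(self, filename_list):
        self.filename_list = filename_list

    def __iter__(self):
        # Shard the stream manually: each rank keeps every world_size-th example.
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        i = 0
        for file_path in self.filename_list:
            for x in torch.load(file_path):
                if i % world_size == rank:
                    yield x
                i += 1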
Limtle added the question (Further information is requested) label Oct 19, 2020
Borda added the 3rd party (Related to a 3rd-party) label Oct 19, 2020
Borda (Member) commented Oct 19, 2020

@Limtle could you please share a complete example to reproduce? Does it happen only with DDP? Are all 16 bars updated in parallel, or just the last one?

Limtle (Author) commented Oct 20, 2020

Sorry, the code can't be shared, but the following is my log file.
I will share a simple example as soon as possible.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
Multi-processing is handled by Slurm.
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 14, MEMBER: 15/16
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 10, MEMBER: 11/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 11, MEMBER: 12/16
initializing ddp: GLOBAL_RANK: 15, MEMBER: 16/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 7, MEMBER: 8/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/16
Multi-processing is handled by Slurm.
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 6, MEMBER: 7/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 13, MEMBER: 14/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 12, MEMBER: 13/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 9, MEMBER: 10/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/16
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
Using native 16bit precision.
initializing ddp: GLOBAL_RANK: 8, MEMBER: 9/16
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
get train_dataloader
read file : 0
get train_dataloader
get train_dataloader
get train_dataloader
read file : 0
get train_dataloader
get train_dataloader
get train_dataloader
get train_dataloader
get train_dataloader
get train_dataloader
read file : 0
get train_dataloader
read file : 0
get train_dataloader
get train_dataloader
read file : 0
get train_dataloader
get train_dataloader
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0
get train_dataloader
read file : 0
read file : 0
read file : 0
read file : 0
read file : 0

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.478, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.478, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.227, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.227, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.072, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.072, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.947, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.947, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.850, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.473, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.473, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.224, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.224, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.047, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.047, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.926, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.926, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.847, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.502, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.502, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.244, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.244, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.073, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.073, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.958, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.958, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.854, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.478, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.478, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.221, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.221, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.049, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.049, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.914, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.914, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.835, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.493, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.493, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.223, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.223, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.060, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.060, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.923, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.923, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.833, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.496, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.496, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.218, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.218, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.035, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.035, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.917, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.917, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.831, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.505, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.505, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.231, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.231, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.049, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.049, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.908, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.908, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.822, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.500, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.500, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.201, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.201, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.035, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.035, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.914, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.914, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.813, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.480, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.480, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.213, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.213, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.042, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.042, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.911, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.911, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.828, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.498, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.498, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.233, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.233, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.059, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.059, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.941, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.941, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.845, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/18 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/18 [00:00<?, ?it/s] 
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it]
Epoch 0:   6%|▌         | 1/18 [00:08<02:29,  8.77s/it, loss=10.482, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.482, v_num=193315]
Epoch 0:  11%|█         | 2/18 [00:11<01:33,  5.82s/it, loss=10.222, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.222, v_num=193315]
Epoch 0:  17%|█▋        | 3/18 [00:14<01:13,  4.87s/it, loss=10.047, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=10.047, v_num=193315]
Epoch 0:  22%|██▏       | 4/18 [00:17<01:02,  4.47s/it, loss=9.923, v_num=193315] 
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.923, v_num=193315]
Epoch 0:  28%|██▊       | 5/18 [00:20<00:53,  4.14s/it, loss=9.823, v_num=193315]
Epoch 0:  33%|███▎      | 6/18 [00:23<00:47,  3
.....

Limtle (Author) commented Oct 21, 2020

@Borda I have another example, based on the pytorch_lightning GAN template. Using 1 node (8 GPUs) shows the same situation.

Environment
OS: Ubuntu
Installed packages:
pytorch-lightning 1.0.1
torch 1.6.0
torchvision 0.7.0

Code

To run this template just do:
python generative_adversarial_net.py

After a few epochs, launch TensorBoard to see the images being generated at every batch:

tensorboard --logdir default

import os
from argparse import ArgumentParser, Namespace
from collections import OrderedDict

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST

from pytorch_lightning.core import LightningModule
from pytorch_lightning.trainer import Trainer


class Generator(nn.Module):
    def __init__(self, latent_dim, img_shape):
        super().__init__()
        self.img_shape = img_shape

        def block(in_feat, out_feat, normalize=True):
            layers = [nn.Linear(in_feat, out_feat)]
            if normalize:
                layers.append(nn.BatchNorm1d(out_feat, 0.8))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.model = nn.Sequential(
            *block(latent_dim, 128, normalize=False),
            *block(128, 256),
            *block(256, 512),
            *block(512, 1024),
            nn.Linear(1024, int(np.prod(img_shape))),
            nn.Tanh()
        )

    def forward(self, z):
        img = self.model(z)
        img = img.view(img.size(0), *self.img_shape)
        return img


class Discriminator(nn.Module):
    def __init__(self, img_shape):
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(int(np.prod(img_shape)), 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, img):
        img_flat = img.view(img.size(0), -1)
        validity = self.model(img_flat)

        return validity


class GAN(LightningModule):

    def __init__(self,
                 latent_dim: int = 100,
                 lr: float = 0.0002,
                 b1: float = 0.5,
                 b2: float = 0.999,
                 batch_size: int = 64, **kwargs):
        super().__init__()

        self.latent_dim = latent_dim
        self.lr = lr
        self.b1 = b1
        self.b2 = b2
        self.batch_size = batch_size

        # networks
        mnist_shape = (1, 28, 28)
        self.generator = Generator(latent_dim=self.latent_dim, img_shape=mnist_shape)
        self.discriminator = Discriminator(img_shape=mnist_shape)

        self.validation_z = torch.randn(8, self.latent_dim)

        # Use the instance attribute rather than the global hparams namespace.
        self.example_input_array = torch.zeros(2, self.latent_dim)

    def forward(self, z):
        return self.generator(z)

    def adversarial_loss(self, y_hat, y):
        return F.binary_cross_entropy(y_hat, y)

    def training_step(self, batch, batch_idx, optimizer_idx):
        imgs, _ = batch

        # sample noise
        z = torch.randn(imgs.shape[0], self.latent_dim)
        z = z.type_as(imgs)

        # train generator
        if optimizer_idx == 0:

            # generate images
            self.generated_imgs = self(z)

            # log sampled images
            sample_imgs = self.generated_imgs[:6]
            grid = torchvision.utils.make_grid(sample_imgs)
            self.logger.experiment.add_image('generated_images', grid, 0)

            # ground truth result (ie: all fake)
            # put on GPU because we created this tensor inside training_loop
            valid = torch.ones(imgs.size(0), 1)
            valid = valid.type_as(imgs)

            # adversarial loss is binary cross-entropy
            g_loss = self.adversarial_loss(self.discriminator(self(z)), valid)
            return {'loss': g_loss}

        # train discriminator
        if optimizer_idx == 1:
            # Measure discriminator's ability to classify real from generated samples

            # how well can it label as real?
            valid = torch.ones(imgs.size(0), 1)
            valid = valid.type_as(imgs)

            real_loss = self.adversarial_loss(self.discriminator(imgs), valid)

            # how well can it label as fake?
            fake = torch.zeros(imgs.size(0), 1)
            fake = fake.type_as(imgs)

            fake_loss = self.adversarial_loss(
                self.discriminator(self(z).detach()), fake)

            # discriminator loss is the average of these
            d_loss = (real_loss + fake_loss) / 2
            return {'loss': d_loss}

    def configure_optimizers(self):
        lr = self.lr
        b1 = self.b1
        b2 = self.b2

        opt_g = torch.optim.Adam(self.generator.parameters(), lr=lr, betas=(b1, b2))
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=lr, betas=(b1, b2))
        return [opt_g, opt_d], []

    def train_dataloader(self):
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize([0.5], [0.5])])
        dataset = MNIST(os.getcwd(), train=True, download=True, transform=transform)
        return DataLoader(dataset, batch_size=self.batch_size)

    def on_epoch_end(self):
        pass


def main(args: Namespace) -> None:
    # ------------------------
    # 1 INIT LIGHTNING MODEL
    # ------------------------
    model = GAN(**vars(args))

    # ------------------------
    # 2 INIT TRAINER
    # ------------------------
    # For distributed training, PyTorch recommends DistributedDataParallel.
    # See: https://pytorch.org/docs/stable/nn.html#torch.nn.DataParallel
    trainer = Trainer(gpus=8, num_nodes=1, distributed_backend='ddp', profiler=True, max_epochs=10)

    # ------------------------
    # 3 START TRAINING
    # ------------------------
    trainer.fit(model)


if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=64, help="size of the batches")
    parser.add_argument("--lr", type=float, default=0.0002, help="adam: learning rate")
    parser.add_argument("--b1", type=float, default=0.5,
                        help="adam: decay of first order momentum of gradient")
    parser.add_argument("--b2", type=float, default=0.999,
                        help="adam: decay of first order momentum of gradient")
    parser.add_argument("--latent_dim", type=int, default=100,
                        help="dimensionality of the latent space")

    hparams = parser.parse_args()

    main(hparams)

Slurm

#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --mem=0

conda activate test_env

srun python gan.py
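
One way to check whether all ranks update their bar in parallel, rather than only the last one: by default every srun task writes to the same stdout, so all bars interleave in a single log. A hedged sketch, splitting output per task (rank_%t.log is a made-up filename pattern; %t is srun's task-id placeholder):

srun --output=rank_%t.log python gan.py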

Log

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
initializing ddp: GLOBAL_RANK: 6, MEMBER: 7/8
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/8
Multi-processing is handled by Slurm.
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
initializing ddp: GLOBAL_RANK: 7, MEMBER: 8/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Multi-processing is handled by Slurm.
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/8
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.
Set SLURM handle signals.

  | Name          | Type          | Params | In sizes | Out sizes     
----------------------------------------------------------------------------
0 | generator     | Generator     | 1 M    | [2, 100] | [2, 1, 28, 28]
1 | discriminator | Discriminator | 533 K  | ?        | ?             

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/118 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/118 [00:00<?, ?it/s] 
Epoch 0:   1%|          | 1/118 [00:00<01:10,  1.65it/s]
Epoch 0:   1%|          | 1/118 [00:00<01:10,  1.65it/s, loss=0.710, v_num=194638]
Epoch 0:   2%|▏         | 2/118 [00:00<00:36,  3.17it/s, loss=0.686, v_num=194638]
Epoch 0:   3%|▎         | 3/118 [00:00<00:24,  4.61it/s, loss=0.664, v_num=194638]
Epoch 0:   3%|▎         | 4/118 [00:00<00:19,  5.97it/s, loss=0.646, v_num=194638]
Epoch 0:   4%|▍         | 5/118 [00:00<00:15,  7.25it/s, loss=0.630, v_num=194638]
Epoch 0:   5%|▌         | 6/118 [00:00<00:13,  8.47it/s, loss=0.630, v_num=194638]
Epoch 0:   5%|▌         | 6/118 [00:00<00:13,  8.47it/s, loss=0.617, v_num=194638]
Epoch 0:   6%|▌         | 7/118 [00:00<00:11,  9.62it/s, loss=0.606, v_num=194638]
Epoch 0:   7%|▋         | 8/118 [00:00<00:10, 10.71it/s, loss=0.597, v_num=194638]
Epoch 0:   8%|▊         | 9/118 [00:00<00:09, 11.74it/s, loss=0.589, v_n
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/118 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/118 [00:00<?, ?it/s] 
Epoch 0:   1%|          | 1/118 [00:00<01:10,  1.65it/s]
Epoch 0:   1%|          | 1/118 [00:00<01:11,  1.65it/s, loss=0.710, v_num=194638]
Epoch 0:   2%|▏         | 2/118 [00:00<00:36,  3.17it/s, loss=0.686, v_num=194638]
Epoch 0:   3%|▎         | 3/118 [00:00<00:24,  4.61it/s, loss=0.665, v_num=194638]
Epoch 0:   3%|▎         | 4/118 [00:00<00:19,  5.97it/s, loss=0.647, v_num=194638]
Epoch 0:   4%|▍         | 5/118 [00:00<00:15,  7.25it/s, loss=0.631, v_num=194638]
Epoch 0:   5%|▌         | 6/118 [00:00<00:13,  8.47it/s, loss=0.631, v_num=194638]
Epoch 0:   5%|▌         | 6/118 [00:00<00:13,  8.46it/s, loss=0.618, v_num=194638]
Epoch 0:   6%|▌         | 7/118 [00:00<00:11,  9.61it/s, loss=0.606, v_num=194638]
Epoch 0:   7%|▋         | 8/118 [00:00<00:10, 10.70it/s, loss=0.597, v_num=194638]
Epoch 0:   8%|▊         | 9/118 [00:00<00:09, 11.74it/s, loss=0.590, v_n
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/118 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/118 [00:00<?, ?it/s] 
Epoch 0:   1%|          | 1/118 [00:00<01:10,  1.65it/s]
Epoch 0:   1%|          | 1/118 [00:00<01:10,  1.65it/s, loss=0.711, v_num=194638]
Epoch 0:   2%|▏         | 2/118 [00:00<00:36,  3.17it/s, loss=0.686, v_num=194638]
Epoch 0:   3%|▎         | 3/118 [00:00<00:24,  4.61it/s, loss=0.664, v_num=194638]
Epoch 0:   3%|▎         | 4/118 [00:00<00:19,  5.97it/s, loss=0.646, v_num=194638]
Epoch 0:   4%|▍         | 5/118 [00:00<00:15,  7.25it/s, loss=0.631, v_num=194638]
Epoch 0:   5%|▌         | 6/118 [00:00<00:13,  8.47it/s, loss=0.631, v_num=194638]
Epoch 0:   5%|▌         | 6/118 [00:00<00:13,  8.46it/s, loss=0.618, v_num=194638]
Epoch 0:   6%|▌         | 7/118 [00:00<00:11,  9.61it/s, loss=0.607, v_num=194638]
Epoch 0:   7%|▋         | 8/118 [00:00<00:10, 10.71it/s, loss=0.598, v_num=194638]
Epoch 0:   8%|▊         | 9/118 [00:00<00:09, 11.74it/s, loss=0.590, v_n
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/118 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/118 [00:00<?, ?it/s] 
Epoch 0:   1%|          | 1/118 [00:00<01:10,  1.65it/s]
Epoch 0:   1%|          | 1/118 [00:00<01:10,  1.65it/s, loss=0.710, v_num=194638]
Epoch 0:   2%|▏         | 2/118 [00:00<00:36,  3.17it/s, loss=0.685, v_num=194638]
Epoch 0:   3%|▎         | 3/118 [00:00<00:24,  4.61it/s, loss=0.665, v_num=194638]
Epoch 0:   3%|▎         | 4/118 [00:00<00:19,  5.97it/s, loss=0.647, v_num=194638]
Epoch 0:   4%|▍         | 5/118 [00:00<00:15,  7.25it/s, loss=0.631, v_num=194638]
Epoch 0:   5%|▌         | 6/118 [00:00<00:13,  8.47it/s, loss=0.631, v_num=194638]
Epoch 0:   5%|▌         | 6/118 [00:00<00:13,  8.46it/s, loss=0.618, v_num=194638]
Epoch 0:   6%|▌         | 7/118 [00:00<00:11,  9.61it/s, loss=0.607, v_num=194638]
Epoch 0:   7%|▋         | 8/118 [00:00<00:10, 10.71it/s, loss=0.598, v_num=194638]
Epoch 0:   8%|▊         | 9/118 [00:00<00:09, 11.74it/s, loss=0.590, v_n
Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/118 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/118 [00:00<?, ?it/s] 
....

Limtle (Author) commented Oct 26, 2020

@Borda How can I check whether all 16 are updated in parallel or just the last one?

Limtle (Author) commented Nov 10, 2020

Closed; see #4437.

Limtle closed this as completed Nov 10, 2020