The loss doesn't decrease when using multi nodes. #30

Open
lailvlong opened this issue Mar 18, 2021 · 10 comments

@lailvlong

lailvlong commented Mar 18, 2021

When I use one node, the code runs well. However, when I use 2 nodes and set the batch_size to 64, the loss stays around 5.545 and doesn't decrease. Since 5.545 is roughly ln(256), the chance-level value of the loss here, it seems that the network never learns anything during training. I have checked that the parameters are not frozen. I think there may be something wrong with the GatherLayer, but I cannot find the problem. Have you met this problem?
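
As a side note for later readers: these stuck values are essentially the chance-level NT-Xent loss, i.e. what you get when the softmax over all candidates is uniform. The candidate counts below are inferred from the batch sizes mentioned in this thread, not stated explicitly:

```python
import math

# If the representations collapse so that all similarities are equal, the
# softmax over K candidates is uniform and the loss is -log(1/K) = log(K).
# K = 2 views * 64 per rank * world size (assumed, not stated in the thread).
print(math.log(2 * 64 * 2))  # ~5.545, the value reported above
print(math.log(2 * 64 * 4))  # ~6.238, close to the ~6.23 reported below
```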

@Attila94

Attila94 commented Mar 26, 2021

Same issue here, the loss is stuck around 6.23. Everything works fine when training on a single node.

@Sdhir

Sdhir commented Apr 1, 2021

I have the same multi-node loss issue. Any solution for this problem?

@Pexure

Pexure commented May 22, 2021

Hi guys, do you only have this problem with multiple nodes? I get the same issue even on a single node with multiple processes (ranks). Any suggestions?

@MAGI003769

> Hi guys, do you only have this problem with multiple nodes? I get the same issue even on a single node with multiple processes (ranks). Any suggestions?

Same issue here. Could you please tell me whether you figured it out or found any solution?

@huangdi95

There may be a bug in class NT_Xent(nn.Module) when using multiple GPUs. I think the mask and the positive/negative pairs are wrong.

@dltkddn0525

> There may be a bug in class NT_Xent(nn.Module) when using multiple GPUs. I think the mask and the positive/negative pairs are wrong.

I agree. To make the implementation work with multiple nodes or multiple processes, I think the GatherLayer should be applied to z_i and z_j independently, before the concatenation z = torch.cat((z_i, z_j), dim=0). In my case this slight modification solved the problem, and the loss started to decrease during training.
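
A minimal sketch of that change inside NT_Xent.forward, assuming the repo's GatherLayer (the autograd-aware all-gather) and the existing self.batch_size / self.world_size attributes; the rest of the loss computation stays as it is:

```python
def forward(self, z_i, z_j):
    N = 2 * self.batch_size * self.world_size
    if self.world_size > 1:
        # Gather each view across ranks separately, so the final order is
        # [z_i from all ranks, then z_j from all ranks] instead of interleaved.
        z_i = torch.cat(GatherLayer.apply(z_i), dim=0)
        z_j = torch.cat(GatherLayer.apply(z_j), dim=0)
    z = torch.cat((z_i, z_j), dim=0)
    # ... similarity matrix, mask and positive/negative selection unchanged ...
```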

@wooozihui

wooozihui commented Dec 6, 2021

Hey guys, I have adjusted some code of the forward function in class NT_Xent and now it works, but I just found the multi-GPU performance is much worse than using only one GPU. Do you know the reason?

```python
# Requires: import torch, import torch.distributed as dist, import diffdist
def forward(self, z_i, z_j):
    N = 2 * self.batch_size * self.world_size
    # One gather buffer per rank for the differentiable all_gather.
    z_list_i = [torch.zeros_like(z_i) for _ in range(dist.get_world_size())]
    z_list_j = [torch.zeros_like(z_j) for _ in range(dist.get_world_size())]
    if self.world_size > 1:
        # Gather each view separately; diffdist keeps the operation differentiable.
        z_list_i = diffdist.functional.all_gather(z_list_i, z_i)
        z_list_j = diffdist.functional.all_gather(z_list_j, z_j)
        z_i = torch.cat(z_list_i, dim=0)
        z_j = torch.cat(z_list_j, dim=0)
    z = torch.cat((z_i, z_j), dim=0)
    # ... rest of the NT-Xent computation unchanged ...
```

@wooozihui

wooozihui commented Dec 6, 2021

> Hey guys, I have adjusted some code of the forward function in class NT_Xent and now it works, but I just found the multi-GPU performance is much worse than using only one GPU. Do you know the reason?

OK, I think this question has been solved. By mistake, the DDP-wrapped model had not replaced the original model, so training did not work well. After properly setting up the training model, this forward function works fine for multi-GPU training with DDP.
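
For anyone hitting the same problem, a sketch of the pitfall being described (the names build_model, criterion, loader and the simplified forward signature are illustrative, not the repo's exact code): the DDP wrapper has to be the object the forward pass actually goes through.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = build_model().cuda(local_rank)        # hypothetical model constructor
model = DDP(model, device_ids=[local_rank])   # reassign: keep the wrapped model

for x_i, x_j in loader:                       # two augmented views per batch
    z_i, z_j = model(x_i, x_j)                # forward through the DDP wrapper
    loss = criterion(z_i, z_j)                # e.g. NT_Xent with the gather fix
    optimizer.zero_grad()
    loss.backward()                           # DDP all-reduces gradients here
    optimizer.step()
```

If the forward pass instead goes through the original, unwrapped module, DDP's gradient synchronization is skipped and each rank effectively trains on its own.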

@jwjohnson314

@wooozihui, can you elaborate on where/how you replaced or properly set up the training model?

@lxysl
Copy link
Contributor

lxysl commented Sep 2, 2023

Hello guys, I've been plagued by the inexplicable code in nt_xent.py for a long time too. I finally found this issue.

I agree with @dltkddn0525's opinion. The all_gather result of z used to build the mask should be ordered like [z_i1, z_i2, ..., z_iw, z_j1, z_j2, ..., z_jw] instead of [z_i1, z_j1, z_i2, z_j2, ..., z_iw, z_jw], where w is the world_size.
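
A small sketch of why that ordering matters, loosely mirroring the diagonal-based positive-pair indexing in nt_xent.py (the shapes below are made up for illustration):

```python
import torch

batch_size, world_size, dim = 4, 2, 8
N = 2 * batch_size * world_size

# Pretend z is the gathered tensor in the correct order:
# [z_i from all ranks, then z_j from all ranks].
z = torch.randn(N, dim)

sim = torch.nn.functional.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=2)
# With that order, the positive of row k sits exactly batch_size * world_size
# columns away, which is what the diagonal-based indexing assumes:
sim_i_j = torch.diag(sim, batch_size * world_size)
sim_j_i = torch.diag(sim, -batch_size * world_size)
positives = torch.cat((sim_i_j, sim_j_i), dim=0).reshape(N, 1)

# If the gather interleaves ranks instead ([z_i1, z_j1, z_i2, z_j2, ...]),
# these diagonals no longer point at true positives and the loss stays at chance.
```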

I've made a pull request for this issue, hope it helps!
