Question about network updating #25

Closed · zen-d opened this issue May 19, 2023 · 12 comments
Labels: question (Further information is requested)

zen-d commented May 19, 2023

@YTEP-ZHI Hello, thanks for your great work. When I set https://github.com/OpenDriveLab/UniAD/blob/main/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py#L545-L546 to

prev_img, prev_img_metas = None, None

I find that memory_bank and query_interact do not receive gradients. This is a bit hard for me to understand; could you please explain it? What confuses me more is that MUTR3D also does not use temporal feature fusion, yet it runs without such problems.

YTEP-ZHI self-assigned this May 19, 2023
YTEP-ZHI added the "question" (Further information is requested) label May 19, 2023
YTEP-ZHI (Collaborator) commented:

Hi, @zen-d. Thanks for your question. We will check on this soon. By the way, do they (mem_bank and query_interact) receive gradients and function normally when you revert this modification?

YTEP-ZHI (Collaborator) commented:

Hi @zen-d, could you please tell me which config you are experimenting with? The 1-track-map config or the 2-e2e config?

zen-d (Author) commented May 19, 2023

@YTEP-ZHI I use the 1-track-map config.

Sorry, let me update my findings: "memory_bank and query_interact do not receive gradients" does not depend on whether temporal fusion is used. That is to say, even the original repo, without any modification, might have this problem.

YTEP-ZHI (Collaborator) commented May 20, 2023

Hi @zen-d, thanks for pointing out this question. Let me explain the error in detail. The "not receiving gradients" error is tied to the setting find_unused_parameters=False. This is the default for PyTorch's DistributedDataParallel, and it raises an error when modules that are expected to receive gradients (i.e. their parameters have requires_grad=True) end up contributing to no loss and receiving no gradients after the forward pass. If you change find_unused_parameters to True, as we do, the error will not occur.
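
For reference, this flag lives in two places. A minimal sketch with placeholder names (wrap_model and local_rank are illustrative, not the actual UniAD launcher code):

    # Plain PyTorch: the flag is passed to the DDP wrapper.
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_model(model: nn.Module, local_rank: int) -> DDP:
        # find_unused_parameters=True lets DDP tolerate parameters that
        # receive no gradient in a given iteration instead of raising an error.
        return DDP(model.cuda(local_rank),
                   device_ids=[local_rank],
                   find_unused_parameters=True)

    # In mmcv/mmdet-style configs, the same behaviour is typically enabled
    # with a single config entry:
    #   find_unused_parameters = True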

So where did those unused parameters come from in UniAD?

  1. To save GPU memory, we do not update the image backbone and image neck in either of UniAD's two training stages. The freeze of the backbone/neck is implemented in the forward pass by preventing them from producing any gradient here:

     with torch.no_grad():
         img_feats = self.extract_img_feat(img, len_queue=len_queue)

     So if you set find_unused_parameters=False, this stop-gradient operation in the image backbone and neck will trigger the aforementioned error. A better way to freeze modules would be to set requires_grad=False on their parameters when the modules are initialized, instead of what we did in the forward pass (a minimal sketch of this alternative is given after this list).

  2. UniAD aggregates multi-frame BEV features to enrich the BEV representation at the current timestep. However, to save GPU memory, we do not calculate any gradients when generating the historical BEVs. This is also implemented in the forward pass, similar to the stop-gradient in the backbone/neck:

     with torch.no_grad():
         prev_bev = None
         bs, len_queue, num_cams, C, H, W = imgs_queue.shape
         imgs_queue = imgs_queue.reshape(bs * len_queue, num_cams, C, H, W)
         img_feats_list = self.extract_feat(img=imgs_queue, len_queue=len_queue)
         for i in range(len_queue):
             img_metas = [each[i] for each in img_metas_list]
             img_feats = [each_scale[:, i] for each_scale in img_feats_list]
             prev_bev, _ = self.pts_bbox_head.get_bev_features(
                 mlvl_feats=img_feats,
                 img_metas=img_metas,
                 prev_bev=prev_bev)

     This will also potentially trigger the error you mentioned when find_unused_parameters=False is set.
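
As noted in point 1, here is a minimal sketch of the freeze-at-init alternative; the attribute names img_backbone and img_neck are placeholders, not necessarily the exact UniAD attributes:

    import torch.nn as nn

    def freeze_module(module: nn.Module) -> None:
        # Disable gradients at init time so DDP never expects these
        # parameters to receive gradients in the backward pass.
        module.eval()  # also fixes BatchNorm running statistics
        for param in module.parameters():
            param.requires_grad = False

    # Hypothetical usage inside the detector's __init__:
    #   freeze_module(self.img_backbone)
    #   freeze_module(self.img_neck)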

zen-d (Author) commented May 20, 2023

@YTEP-ZHI Thanks for your detailed answer. But I have already tried the following two operations, which I believe sidestep the two points in your last reply, respectively:

  1. set freeze_img_modules=False in the cfg file
  2. set https://github.com/OpenDriveLab/UniAD/blob/main/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py#L545-L546 to
     prev_img, prev_img_metas = None, None

Still, PyTorch DDP training reports the error that mem_bank and query_interact receive no gradients.

zen-d (Author) commented May 20, 2023

> (quoted: YTEP-ZHI's explanation above, in full)

To complement the above: the second point is inherited from BEVFormer, but to the best of my knowledge, BEVFormer does not have this issue. Correct me if I am missing anything.

YTEP-ZHI (Collaborator) commented:

Yes, it's inherited from BEVFormer. It's strange that many people have used this repo to train models without encountering the gradient problem. I'll check on this. If you have any updated information, please report it in this thread. Thanks.

zen-d (Author) commented May 20, 2023

@YTEP-ZHI

  1. I think the cfg setting in this line somehow bypasses the gradient problem (just without an error explicitly printed), which is why "a lot of people do not encounter it". However, it surfaces once I set find_unused_parameters=False.
  2. Based on my latest ablation experiments, stated in this comment (#25 (comment)), I conjecture it is not simply a problem of frozen network parts.
  3. Relatedly, I have also run into the reproduction issue "Could you share the training log of stage one?" (#21), which is still pending. Could it be somehow related to the gradient issue we are discussing now?

Overall, I look forward to your in-depth investigation and a potential fix soon. Thanks.

zen-d changed the title from "Question about without temporal fusion" to "Question about network updating" on May 20, 2023
YTEP-ZHI (Collaborator) commented May 20, 2023

@zen-d
Thanks so much for providing this valuable information. I'm checking on the reproduction issue and the gradient issue, and will notify you once they are fixed.

YTEP-ZHI (Collaborator) commented Jun 14, 2023

Hi, @zen-d. The gradient problem, i.e. that memory_bank (sometimes) doesn't receive gradients, is triggered when the memory bank is empty (at the beginning of track training, or when starting a new scene). When it is empty (len(embed) == 0), the interactions in memory_bank are not performed. Hence, the neural branches in memory_bank are never forwarded and thus never updated, which causes this problem.

However, I think it's acceptable to set find_unused_parameters=True to bypass this issue.
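
To illustrate the pattern, here is a simplified sketch, not the actual UniAD memory_bank implementation; the class, module, and argument names are placeholders:

    import torch
    import torch.nn as nn

    class MemoryBankSketch(nn.Module):
        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads)
            self.norm = nn.LayerNorm(dim)

        def forward(self, query: torch.Tensor, memory_embed: torch.Tensor) -> torch.Tensor:
            # When the bank is empty (e.g. the first frame of a scene), the
            # whole interaction is skipped, so self.attn / self.norm never run
            # and their parameters receive no gradient in this iteration.
            if len(memory_embed) == 0:
                return query
            out, _ = self.attn(query, memory_embed, memory_embed)
            return self.norm(query + out)

With find_unused_parameters=False, DDP raises an error on the skipped parameters; with True, it simply marks them as ready and continues.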

Moreover, our trackformer is modified from MOTR. You can find that they also set find_unused_parameters=True in their training code:
https://github.com/megvii-research/MOTR/blob/8690da3392159635ca37c31975126acf40220724/main.py#L266

YTEP-ZHI (Collaborator) commented Jun 14, 2023

> (quoted: zen-d's three points above, in full)

The third point is resolved as mentioned here: #21 (comment); the performance of the stage-1 model can be reproduced when training from scratch. It actually has nothing to do with the gradient issue. Still, thanks for your feedback.

YTEP-ZHI (Collaborator) commented:

I'm closing this issue as it's resolved; feel free to reopen it if needed, @zen-d.
