Question about network updating #25

Closed · zen-d opened this issue May 19, 2023 · 12 comments
Labels: question (Further information is requested)

zen-d commented May 19, 2023

@YTEP-ZHI Hello, thanks for your great work. When I set https://github.com/OpenDriveLab/UniAD/blob/main/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py#L545-L546 to

prev_img, prev_img_metas = None, None

I find that memory_bank and query_interact do not receive gradients. This is a bit hard for me to understand; could you please explain it? What confuses me more is that MUTR3D also does not use temporal feature fusion, yet it runs without such problems.

YTEP-ZHI self-assigned this May 19, 2023
YTEP-ZHI added the "question" (Further information is requested) label May 19, 2023
YTEP-ZHI (Collaborator) commented:

Hi, @zen-d. Thanks for your question. We will check on this soon. By the way, do they (mem_bank and query_interact) receive gradients and function normally when you revert this modification?

YTEP-ZHI (Collaborator) commented:

Hi @zen-d, could you please tell me which config you are experimenting with? The 1-track-map config or the 2-e2e config?

zen-d (Author) commented May 19, 2023

@YTEP-ZHI I use the 1-track-map config.

Sorry, let me update my findings: "memory_bank and query_interact do not receive gradients" does not depend on whether temporal fusion is used. That is to say, even the original repo, without any modification, might have this problem.

YTEP-ZHI (Collaborator) commented May 20, 2023

Hi @zen-d, thanks for pointing out this question. Let me explain the error in detail. The "not receiving gradients" error is tied to the setting find_unused_parameters=False. This is the default for PyTorch's DistributedDataParallel, and it raises an error when modules that are expected to receive gradients (i.e. their parameters have requires_grad=True) end up contributing to no loss and receiving no gradients after the forward pass. If you change find_unused_parameters to True, as we do, the error will not occur.
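
For reference, this flag lives in two places. A minimal sketch with placeholder names (wrap_model and local_rank are illustrative, not the actual UniAD launcher code):

    # Plain PyTorch: the flag is passed to the DDP wrapper.
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_model(model: nn.Module, local_rank: int) -> DDP:
        # find_unused_parameters=True lets DDP tolerate parameters that
        # receive no gradient in a given iteration instead of raising an error.
        return DDP(model.cuda(local_rank),
                   device_ids=[local_rank],
                   find_unused_parameters=True)

    # In mmcv/mmdet-style configs, the same behaviour is typically enabled
    # with a single config entry:
    #   find_unused_parameters = True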

So where did those unused parameters come from in UniAD?

  1. To save GPU memory, we do not update the image backbone and image neck in either of UniAD's two training stages. The freeze of the backbone/neck is implemented in the forward pass by preventing them from producing any gradient here:

     with torch.no_grad():
         img_feats = self.extract_img_feat(img, len_queue=len_queue)

     So if you set find_unused_parameters=False, this stop-gradient operation in the image backbone and neck will trigger the aforementioned error. A better way to freeze modules would be to set requires_grad=False on their parameters when the modules are initialized, instead of what we did in the forward pass (a minimal sketch of this alternative is given after this list).

  2. UniAD aggregates multi-frame BEV features to enrich the BEV representation at the current timestep. However, to save GPU memory, we do not calculate any gradients when generating the historical BEVs. This is also implemented in the forward pass, similar to the stop-gradient in the backbone/neck:

     with torch.no_grad():
         prev_bev = None
         bs, len_queue, num_cams, C, H, W = imgs_queue.shape
         imgs_queue = imgs_queue.reshape(bs * len_queue, num_cams, C, H, W)
         img_feats_list = self.extract_feat(img=imgs_queue, len_queue=len_queue)
         for i in range(len_queue):
             img_metas = [each[i] for each in img_metas_list]
             img_feats = [each_scale[:, i] for each_scale in img_feats_list]
             prev_bev, _ = self.pts_bbox_head.get_bev_features(
                 mlvl_feats=img_feats,
                 img_metas=img_metas,
                 prev_bev=prev_bev)

     This will also potentially trigger the error you mentioned when find_unused_parameters=False is set.
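
As noted in point 1, here is a minimal sketch of the freeze-at-init alternative; the attribute names img_backbone and img_neck are placeholders, not necessarily the exact UniAD attributes:

    import torch.nn as nn

    def freeze_module(module: nn.Module) -> None:
        # Disable gradients at init time so DDP never expects these
        # parameters to receive gradients in the backward pass.
        module.eval()  # also fixes BatchNorm running statistics
        for param in module.parameters():
            param.requires_grad = False

    # Hypothetical usage inside the detector's __init__:
    #   freeze_module(self.img_backbone)
    #   freeze_module(self.img_neck)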

zen-d (Author) commented May 20, 2023

@YTEP-ZHI Thanks for your detailed answer. But I have already tried the following two operations, which I believe sidestep the two points in your last reply, respectively:

  1. set freeze_img_modules=False in the cfg file
  2. set https://github.com/OpenDriveLab/UniAD/blob/main/projects/mmdet3d_plugin/uniad/detectors/uniad_track.py#L545-L546 to
     prev_img, prev_img_metas = None, None

Still, PyTorch DDP training reports the error that mem_bank and query_interact receive no gradients.

zen-d (Author) commented May 20, 2023

> (quoted: YTEP-ZHI's explanation above, in full)

To complement the above: the second point is inherited from BEVFormer, but to the best of my knowledge, BEVFormer does not have this issue. Correct me if I am missing anything.

YTEP-ZHI (Collaborator) commented:

Yes, it's inherited from BEVFormer. It's strange that many people have used this repo to train models without encountering the gradient problem. I'll check on this. If you have any updated information, please report it in this thread. Thanks.

zen-d (Author) commented May 20, 2023

@YTEP-ZHI

  1. I think the cfg setting in this line somehow bypasses the gradient problem (just without an error explicitly printed), which is why "a lot of people do not encounter it". However, it surfaces once I set find_unused_parameters=False.
  2. Based on my latest ablation experiments, stated in this comment (#25 (comment)), I conjecture it is not simply a problem of frozen network parts.
  3. Relatedly, I have also run into the reproduction issue "Could you share the training log of stage one?" (#21), which is still pending. Could it be somehow related to the gradient issue we are discussing now?

Overall, I look forward to your in-depth investigation and a potential fix soon. Thanks.

zen-d changed the title from "Question about without temporal fusion" to "Question about network updating" on May 20, 2023
YTEP-ZHI (Collaborator) commented May 20, 2023

@zen-d
Thanks so much for providing this valuable information. I'm checking on the reproduction issue and the gradient issue, and will notify you once they are fixed.

YTEP-ZHI (Collaborator) commented Jun 14, 2023

Hi, @zen-d. The gradient problem, i.e. that memory_bank (sometimes) doesn't receive gradients, is triggered when the memory bank is empty (at the beginning of track training, or when starting a new scene). When it is empty (len(embed) == 0), the interactions in memory_bank are not performed. Hence, the neural branches in memory_bank are never forwarded and thus never updated, which causes this problem.

However, I think it's acceptable to set find_unused_parameters=True to bypass this issue.
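
To illustrate the pattern, here is a simplified sketch, not the actual UniAD memory_bank implementation; the class, module, and argument names are placeholders:

    import torch
    import torch.nn as nn

    class MemoryBankSketch(nn.Module):
        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads)
            self.norm = nn.LayerNorm(dim)

        def forward(self, query: torch.Tensor, memory_embed: torch.Tensor) -> torch.Tensor:
            # When the bank is empty (e.g. the first frame of a scene), the
            # whole interaction is skipped, so self.attn / self.norm never run
            # and their parameters receive no gradient in this iteration.
            if len(memory_embed) == 0:
                return query
            out, _ = self.attn(query, memory_embed, memory_embed)
            return self.norm(query + out)

With find_unused_parameters=False, DDP raises an error on the skipped parameters; with True, it simply marks them as ready and continues.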

Moreover, our trackformer is modified from MOTR. You can find that they also set find_unused_parameters=True in their training code:
https://github.com/megvii-research/MOTR/blob/8690da3392159635ca37c31975126acf40220724/main.py#L266

YTEP-ZHI (Collaborator) commented Jun 14, 2023

> (quoted: zen-d's three points above, in full)

The third point is resolved as mentioned here: #21 (comment); the performance of the stage-1 model can be reproduced when training from scratch. It actually has nothing to do with the gradient issue. Still, thanks for your feedback.

YTEP-ZHI (Collaborator) commented:

I'm closing this issue as it's resolved; feel free to reopen it if needed, @zen-d.
