
Question about Pixel Decoder and MSDeformableAttn #13

Open
pieris98 opened this issue Jun 10, 2024 · 0 comments

Comments


pieris98 commented Jun 10, 2024

Hey Nazir,
I have a question about the modified part of the Pixel Decoder. I'm trying to navigate the code to understand what you changed in the Mask2Former architecture so that only the $f_3$ feature from the Pixel Decoder is fed into the Transformer Decoder.

From what I've seen in mask2former/modeling/pixel_decoder/msdeformattn.py:
1st code block:

    # append `out` with extra FPN levels
    # Reverse feature maps into top-down order (from low to high resolution)
    for idx, f in enumerate(self.in_features[:self.num_fpn_levels][::-1]):
        x = features[f].float()
        lateral_conv = self.lateral_convs[idx]
        output_conv = self.output_convs[idx]
        cur_fpn = lateral_conv(x)
        # Following FPN implementation, we use nearest upsampling here
        y = cur_fpn + F.interpolate(out[-1], size=cur_fpn.shape[-2:], mode="bilinear", align_corners=False)
        y = output_conv(y)
        out.append(y)

    for o in out:
        if num_cur_levels < self.maskformer_num_feature_levels:
            multi_scale_features.append(o)
            num_cur_levels += 1

2nd code block:

367        return self.mask_features(out[-1]), out[0], multi_scale_features 

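
Before the questions, here is a toy sketch of how I'm currently reading the data flow around the 1st code block. This is my own simplification with made-up tensor shapes (not code from the repo), assuming the default Mask2Former setup where the deformable transformer runs on res3–res5 and res2 is the extra FPN level, and using the notation where out[0] is the coarsest map:

    import torch
    import torch.nn.functional as F

    # My assumption: before the quoted FPN loop, `out` already holds the three
    # deformable-transformer outputs, coarsest first (f3, f2, f1 in my notation).
    f3 = torch.randn(1, 256, 16, 16)   # res5-level output (coarsest)
    f2 = torch.randn(1, 256, 32, 32)   # res4-level output
    f1 = torch.randn(1, 256, 64, 64)   # res3-level output (finest of the three)
    out = [f3, f2, f1]

    # One FPN step as in the quoted loop, with a made-up stand-in for
    # lateral_conv(features["res2"]); the real code also passes the sum through
    # output_conv (3x3) before appending, which I skip here.
    cur_fpn = torch.randn(1, 256, 128, 128)
    y = cur_fpn + F.interpolate(out[-1], size=cur_fpn.shape[-2:],
                                mode="bilinear", align_corners=False)
    out.append(y)

    # The second quoted loop then keeps only the first
    # maskformer_num_feature_levels entries, i.e. the transformer outputs.
    maskformer_num_feature_levels = 3
    multi_scale_features = out[:maskformer_num_feature_levels]  # -> [f3, f2, f1]

Please correct me if this reading of out / multi_scale_features is wrong.
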
My questions are:

  1. Do you still need to calculate $f_2$ and $f_1$ to obtain $f_3$, i.e., are they dependent on each other? I'm still confused about this part. I put a breakpoint in the 1st code block and it runs 3 times. This equation from the paper:
    $F = \text{Conv}_{3\times 3}(\text{Conv}_{1\times 1}(x_4) + \text{Upsample}(f_1))$ (Eq. 7)
    matches the 1st block (lateral_conv is the 1x1 conv and output_conv is the 3x3 conv). However, this runs for all feature levels in the for loop, and the comment mentions extra FPN levels. Could you please explain why this happens? Also, there is an additional conv layer, self.mask_features, which I assume produces the per-pixel embedding $F$. Could you explain what this extra conv layer does in relation to Equation 7?
  2. I saw in this issue from the original Mask2Former repo that the backbone (Swin) features, i.e. $x_1 \dots x_4$, are mixed through MSDeformableAttn to obtain $f_1 \dots f_3$. Did you change the MSDeformableAttn num_levels parameter, or did you leave it at the default value of 4? I believe this matters for the mixing of backbone features, but I didn't find a parameter in the config file to change it (see the quick-check snippet I pasted after this list).
  3. In the line quoted in the 2nd code block above (line 367 of the pixel decoder file), if I understand correctly, self.mask_features(out[-1]) is the per-pixel embedding $F$, and out[0] is $f_3$. However, you also return multi_scale_features. Does that mean you also return $f_1$ and $f_2$?
  4. If the answer to 3. is yes, does the RbA code then only modify the Transformer Decoder part (i.e., are the input features changed to only accept $f_3$)? If so, where does this happen, i.e., is it via the config file or a change in the Transformer Decoder class?
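
For question 2, this is the quick-check snippet I'm using to see what each deformable-attention module was actually built with. It is my own code, not from the repo; I'm assuming the attribute is named n_levels as in the Deformable DETR reference implementation, and model stands for an already-built RbA / Mask2Former model:

    # Throwaway check (my own snippet): print the number of feature levels each
    # MSDeformAttn module was constructed with, to see whether it is 3 (one per
    # transformer feature level) or the Deformable DETR default of 4.
    def print_deformable_attn_levels(model):
        for name, module in model.named_modules():
            if type(module).__name__ == "MSDeformAttn":
                # `n_levels` is the attribute name in the Deformable DETR
                # reference implementation; print None if it differs here.
                print(name, getattr(module, "n_levels", None))
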

I'm just trying to understand whether there is any redundancy in calculating $f_1$ and $f_2$, and where the architectural changes of RbA relative to Mask2Former are in the code.

Thanks so much again for all your support in understanding the paper and code!
Best,
Pieris.
