
Question about Pixel Decoder and MSDeformableAttn #13

Open
pieris98 opened this issue Jun 10, 2024 · 0 comments

Comments


pieris98 commented Jun 10, 2024

Hey Nazir,
I have a question about the modified part of the Pixel Decoder. I'm trying to navigate the code to understand what you changed in the Mask2Former architecture so that only the $f_3$ feature from the Pixel Decoder is fed into the Transformer Decoder.

From what I've seen in mask2former/modeling/pixel_decoder/msdeformattn.py:
1st code block:

    # append `out` with extra FPN levels
    # Reverse feature maps into top-down order (from low to high resolution)
    for idx, f in enumerate(self.in_features[:self.num_fpn_levels][::-1]):
        x = features[f].float()
        lateral_conv = self.lateral_convs[idx]
        output_conv = self.output_convs[idx]
        cur_fpn = lateral_conv(x)
        # Following FPN implementation, we use nearest upsampling here
        y = cur_fpn + F.interpolate(out[-1], size=cur_fpn.shape[-2:], mode="bilinear", align_corners=False)
        y = output_conv(y)
        out.append(y)

    for o in out:
        if num_cur_levels < self.maskformer_num_feature_levels:
            multi_scale_features.append(o)
            num_cur_levels += 1

2nd code block:

367        return self.mask_features(out[-1]), out[0], multi_scale_features 

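
Before the questions, here is a toy sketch of how I'm currently reading the data flow around the 1st code block. This is my own simplification with made-up tensor shapes (not code from the repo), assuming the default Mask2Former setup where the deformable transformer runs on res3–res5 and res2 is the extra FPN level, and using the notation where out[0] is the coarsest map:

    import torch
    import torch.nn.functional as F

    # My assumption: before the quoted FPN loop, `out` already holds the three
    # deformable-transformer outputs, coarsest first (f3, f2, f1 in my notation).
    f3 = torch.randn(1, 256, 16, 16)   # res5-level output (coarsest)
    f2 = torch.randn(1, 256, 32, 32)   # res4-level output
    f1 = torch.randn(1, 256, 64, 64)   # res3-level output (finest of the three)
    out = [f3, f2, f1]

    # One FPN step as in the quoted loop, with a made-up stand-in for
    # lateral_conv(features["res2"]); the real code also passes the sum through
    # output_conv (3x3) before appending, which I skip here.
    cur_fpn = torch.randn(1, 256, 128, 128)
    y = cur_fpn + F.interpolate(out[-1], size=cur_fpn.shape[-2:],
                                mode="bilinear", align_corners=False)
    out.append(y)

    # The second quoted loop then keeps only the first
    # maskformer_num_feature_levels entries, i.e. the transformer outputs.
    maskformer_num_feature_levels = 3
    multi_scale_features = out[:maskformer_num_feature_levels]  # -> [f3, f2, f1]

Please correct me if this reading of out / multi_scale_features is wrong.
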
My questions are:

  1. Do you still need to calculate $f_2$ and $f_1$ to obtain $f_3$, i.e., are they dependent on each other? I'm still confused about this part. I put a breakpoint in the 1st code block and it runs 3 times. This equation from the paper:
    $F = \text{Conv}_{3\times 3}(\text{Conv}_{1\times 1}(x_4) + \text{Upsample}(f_1))$ (Eq. 7)
    matches the 1st block (lateral_conv is the 1x1 conv and output_conv is the 3x3 conv). However, this runs for all feature levels in the for loop, and the comment mentions extra FPN levels. Could you please explain why this happens? Also, there is an additional conv layer, self.mask_features, which I assume produces the per-pixel embedding $F$. Could you explain what this extra conv layer does in relation to Equation 7?
  2. I saw in this issue from the original Mask2Former repo that the backbone (Swin) features, i.e. $x_1 \dots x_4$, are mixed through MSDeformableAttn to obtain $f_1 \dots f_3$. Did you change the MSDeformableAttn num_levels parameter, or did you leave it at the default value of 4? I believe this matters for the mixing of backbone features, but I didn't find a parameter in the config file to change it (see the quick-check snippet I pasted after this list).
  3. In the line quoted in the 2nd code block above (line 367 of the pixel decoder file), if I understand correctly, self.mask_features(out[-1]) is the per-pixel embedding $F$, and out[0] is $f_3$. However, you also return multi_scale_features. Does that mean you also return $f_1$ and $f_2$?
  4. If the answer to 3. is yes, does the RbA code then only modify the Transformer Decoder part (i.e., are the input features changed to only accept $f_3$)? If so, where does this happen, i.e., is it via the config file or a change in the Transformer Decoder class?
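
For question 2, this is the quick-check snippet I'm using to see what each deformable-attention module was actually built with. It is my own code, not from the repo; I'm assuming the attribute is named n_levels as in the Deformable DETR reference implementation, and model stands for an already-built RbA / Mask2Former model:

    # Throwaway check (my own snippet): print the number of feature levels each
    # MSDeformAttn module was constructed with, to see whether it is 3 (one per
    # transformer feature level) or the Deformable DETR default of 4.
    def print_deformable_attn_levels(model):
        for name, module in model.named_modules():
            if type(module).__name__ == "MSDeformAttn":
                # `n_levels` is the attribute name in the Deformable DETR
                # reference implementation; print None if it differs here.
                print(name, getattr(module, "n_levels", None))
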

I'm just trying to understand whether there is any redundancy in calculating $f_1$ and $f_2$, and where the architectural changes of RbA relative to Mask2Former are in the code.

Thanks so much again for all your support in understanding the paper and code!
Best,
Pieris.
