[FEATURE] Why doesn't use Conv2d directly in PatchMerging #951

WZMIAOMIAO · 2021-10-29T09:26:51Z

First of all, thank you for your great work.

Is your feature request related to a problem? Please describe.
I'm studying your Swin Transformer code and have a question about PatchMerging: https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/swin_transformer.py#L310-L347. Why not use Conv2d(k=2, s=2) directly to merge 2x2 patches? Is a 2x2 convolution too inefficient, or is the current approach meant to make it easier to use the official weights? I look forward to your reply.

rwightman (Maintainer) · 2021-10-29T16:57:49Z

@WZMIAOMIAO This isn't an issue, so I'm moving it to discussions. Regarding your question, I assume you mean: why not use F.conv2d with a manually crafted kernel to achieve the same result? As I didn't implement that, it's a better question for the original authors: https://github.com/microsoft/Swin-Transformer

It's not high on my priority list to try, but if you want to compare the two, I'll update if your approach is better.

1 reply

WZMIAOMIAO Oct 30, 2021
Author

Thank you.

cainmagi · 2023-12-15T17:01:35Z

I think the following issue is related to this question.
microsoft/Swin-Transformer/issues/256

My opinion is that:

PatchMerge(in_channels, out_channels, downscaling_factor) == nn.Conv2d(in_channels, out_channels, kernel_size=downscaling_factor, stride=downscaling_factor, padding=0)

By the way, my conclusion is drawn by reviewing another repository:

https://github.com/berniwal/swin-transformer-pytorch/blob/c921ebf914c6ea9734bb260ada395e3746c85402/swin_transformer_pytorch/swin_transformer.py#L154
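The equivalence claimed above can be checked numerically. The following is a minimal sketch, assuming an Unfold-plus-Linear patch merge of the kind the linked repository appears to use (with no norm layer in between). The helper name `unfold_merge` and all sizes are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, C, H, W, C_out, k = 2, 3, 8, 8, 5, 2  # illustrative sizes, not from the thread
x = torch.randn(B, C, H, W)

unfold = nn.Unfold(kernel_size=k, stride=k, padding=0)
linear = nn.Linear(C * k * k, C_out)


def unfold_merge(x):
    """Hypothetical Unfold + Linear merge (no norm between the two steps)."""
    b, c, h, w = x.shape
    patches = unfold(x)  # (B, C*k*k, L) with L = (H/k) * (W/k)
    patches = patches.view(b, -1, h // k, w // k).permute(0, 2, 3, 1)
    return linear(patches)  # (B, H/k, W/k, C_out)


# Unfold flattens each patch in (C, kh, kw) order, which is also how a conv
# weight is laid out, so the Linear weight can be reshaped into a conv kernel.
conv = nn.Conv2d(C, C_out, kernel_size=k, stride=k, padding=0)
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(C_out, C, k, k))
    conv.bias.copy_(linear.bias)

out_unfold = unfold_merge(x)
out_conv = conv(x).permute(0, 2, 3, 1)  # to channels-last for comparison
max_diff = (out_unfold - out_conv).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two outputs should agree up to floating-point error, but only because nothing sits between the patch gathering and the linear projection in this sketch.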

2 replies

rwightman Dec 15, 2023
Maintainer

@cainmagi That other repository is not compatible with the original Swin, though. Indeed, using a single conv2d to cover both the merging and the expansion is not compatible either, because the original model has a norm layer between the two. It's worth noting that implementing the merge with Unfold is no more efficient than view/reshape; in fact, I think it can often be a bit slower, and it is best saved for cases where you need overlap (which is not the case here).
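To illustrate why the norm layer matters, here is a minimal sketch of a Swin-style merge (gather 2x2 neighbours, LayerNorm, then a bias-free Linear). The function names and sizes are hypothetical. It first checks that a reshape-based gather matches the strided-slice gather, then shows that the full merge is not a linear function of its input, so no single bias-free conv2d can reproduce it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, H, W, C = 2, 8, 8, 4  # illustrative sizes
x = torch.randn(B, H, W, C)  # channels-last layout


def gather_slices(x):
    """Gather each 2x2 neighbourhood with strided slices, then concatenate."""
    x0 = x[:, 0::2, 0::2, :]
    x1 = x[:, 1::2, 0::2, :]
    x2 = x[:, 0::2, 1::2, :]
    x3 = x[:, 1::2, 1::2, :]
    return torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)


def gather_reshape(x):
    """Same gather expressed with reshape/permute, no Unfold needed."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // 2, 2, w // 2, 2, c)
    return x.permute(0, 1, 3, 4, 2, 5).flatten(3)  # (B, H/2, W/2, 4C)


same_gather = torch.equal(gather_slices(x), gather_reshape(x))

norm = nn.LayerNorm(4 * C)
reduction = nn.Linear(4 * C, 2 * C, bias=False)


def patch_merge(x):
    """Merge, normalize, then project: the norm sits between the two steps."""
    return reduction(norm(gather_reshape(x)))


# A bias-free conv is linear, so doubling its input doubles its output.
# With LayerNorm in the middle, doubling the input barely changes the output.
y1 = patch_merge(x)
y2 = patch_merge(2 * x)
scale_invariant = torch.allclose(y1, y2, atol=1e-3)
behaves_linearly = torch.allclose(2 * y1, y2, atol=1e-3)
print(same_gather, scale_invariant, behaves_linearly)
```

The gather step alone can be folded into a conv kernel, but once the per-patch normalization is applied before the projection, the weights of the two formulations are no longer interchangeable.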

I feel there are two alternative downsample methods that would work just fine for Swin; whether either is superior would need testing. MaxViT and CoAtNet, which are related to Swin, use pool + expansion, and a number of others, such as DaViT and ConvNeXt, use conv2d with k=s=2 and a pre-norm. Still others, like MetaFormer, use a more typical convnet downsample: conv2d with k=3, stride=2, padding=1.

Pool + expansion:

maxvit / coatnet -

pytorch-image-models/timm/models/maxxvit.py

Lines 303 to 339 in 7da34a9

    class Downsample2d(nn.Module):
        """ A downsample pooling module supporting several maxpool and avgpool modes
        * 'max' - MaxPool2d w/ kernel_size 3, stride 2, padding 1
        * 'max2' - MaxPool2d w/ kernel_size = stride = 2
        * 'avg' - AvgPool2d w/ kernel_size 3, stride 2, padding 1
        * 'avg2' - AvgPool2d w/ kernel_size = stride = 2
        """

        def __init__(
                self,
                dim: int,
                dim_out: int,
                pool_type: str = 'avg2',
                padding: str = '',
                bias: bool = True,
        ):
            super().__init__()
            assert pool_type in ('max', 'max2', 'avg', 'avg2')
            if pool_type == 'max':
                self.pool = create_pool2d('max', kernel_size=3, stride=2, padding=padding or 1)
            elif pool_type == 'max2':
                self.pool = create_pool2d('max', 2, padding=padding or 0)  # kernel_size == stride == 2
            elif pool_type == 'avg':
                self.pool = create_pool2d(
                    'avg', kernel_size=3, stride=2, count_include_pad=False, padding=padding or 1)
            else:
                self.pool = create_pool2d('avg', 2, padding=padding or 0)

            if dim != dim_out:
                self.expand = nn.Conv2d(dim, dim_out, 1, bias=bias)
            else:
                self.expand = nn.Identity()

        def forward(self, x):
            x = self.pool(x)  # spatial downsample
            x = self.expand(x)  # expand chs
            return x

Norm + Conv2d w/ k=s=2

davit -

pytorch-image-models/timm/models/davit.py

Lines 82 to 108 in 7da34a9

    class Downsample(nn.Module):
        def __init__(
                self,
                in_chs,
                out_chs,
                norm_layer=LayerNorm2d,
        ):
            super().__init__()
            self.in_chs = in_chs
            self.out_chs = out_chs
            self.norm = norm_layer(in_chs)
            self.conv = nn.Conv2d(
                in_chs,
                out_chs,
                kernel_size=2,
                stride=2,
                padding=0,
            )

        def forward(self, x: Tensor):
            B, C, H, W = x.shape
            x = self.norm(x)
            x = F.pad(x, (0, (2 - W % 2) % 2))
            x = F.pad(x, (0, 0, 0, (2 - H % 2) % 2))
            x = self.conv(x)
            return x
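The two F.pad calls in the listing above pad the width on the right and the height at the bottom so that both become even before the k=s=2 conv. A small sketch of just that arithmetic, with a hypothetical helper name `pad_to_even`:

```python
import torch
import torch.nn.functional as F


def pad_to_even(x):
    """Zero-pad an NCHW tensor on the right/bottom so H and W are even."""
    _, _, h, w = x.shape
    x = F.pad(x, (0, (2 - w % 2) % 2))        # (left, right) on the last dim, W
    x = F.pad(x, (0, 0, 0, (2 - h % 2) % 2))  # (left, right, top, bottom): bottom of H
    return x


padded_odd = pad_to_even(torch.ones(1, 1, 5, 7))   # odd H and W get one extra row/column
padded_even = pad_to_even(torch.ones(1, 1, 4, 6))  # even sizes are left unchanged
print(tuple(padded_odd.shape), tuple(padded_even.shape))
```

The expression `(2 - n % 2) % 2` evaluates to 1 for odd `n` and 0 for even `n`, so no border pixels are dropped by the stride-2 conv.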

convnext -

pytorch-image-models/timm/models/convnext.py

Lines 190 to 207 in 7da34a9

    if in_chs != out_chs or stride > 1 or dilation[0] != dilation[1]:
        ds_ks = 2 if stride > 1 or dilation[0] != dilation[1] else 1
        pad = 'same' if dilation[1] > 1 else 0  # same padding needed if dilation used
        self.downsample = nn.Sequential(
            norm_layer(in_chs),
            create_conv2d(
                in_chs,
                out_chs,
                kernel_size=ds_ks,
                stride=stride,
                dilation=dilation[0],
                padding=pad,
                bias=conv_bias,
            ),
        )
        in_chs = out_chs
    else:
        self.downsample = nn.Identity()
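The third style mentioned earlier in this reply (conv2d with k=3, stride=2, padding=1) has no listing, so here is a rough shape comparison of the three downsample styles using plain torch.nn layers. This is only a sketch: the channel counts and the odd 7x7 input are arbitrary, the norm layers are omitted for brevity, and none of these modules are the timm implementations.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 7, 7)  # odd spatial size chosen on purpose

# Pool + 1x1 expansion (MaxViT / CoAtNet style, norm omitted)
pool_expand = nn.Sequential(nn.AvgPool2d(2), nn.Conv2d(8, 16, kernel_size=1))
# Conv2d with kernel_size == stride == 2 (DaViT / ConvNeXt style, norm omitted)
conv_k2 = nn.Conv2d(8, 16, kernel_size=2, stride=2, padding=0)
# Typical convnet downsample with overlapping 3x3 windows
conv_k3 = nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1)

shapes = {
    "pool_expand": tuple(pool_expand(x).shape),
    "conv_k2": tuple(conv_k2(x).shape),
    "conv_k3": tuple(conv_k3(x).shape),
}
print(shapes)
```

On an odd input, the two non-overlapping variants drop the last row and column unless the input is padded first, while the k=3, padding=1 variant rounds the output size up.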

cainmagi Dec 15, 2023

Sorry for not noticing that! You are right. With the normalization, things will be different.
