The code is written based on timm and provides pretrained weights on ImageNet-1k. But many of the layers are customized and differ from timm's implementations, so I'm not sure how significant the adjustments to this code would need to be.
It looks interesting, but it doesn't seem like the paper has been released.
Yeah, noticed this one. It is timm-oriented, but as always it bakes in square image-size assumptions and puts the downsample at the end of the blocks, so it needs a decent amount of attention to fix and remap :(
I really don't understand the obsession with putting the downsample at the end of ViT/hybrid blocks :(
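For anyone attempting the remap: moving a trailing downsample into timm's usual convention (downsample at the start of the following stage) can often be done at the checkpoint level by renaming keys. A minimal sketch, assuming hypothetical key names like `levels.{i}.downsample.*` — these are illustrative, not FasterViT's actual parameter names:

```python
def remap_trailing_downsample(state_dict_keys):
    """Rename checkpoint keys so that each stage's trailing downsample
    becomes the *leading* downsample of the next stage.

    Assumes illustrative keys of the form 'levels.{i}.downsample.*' and
    'levels.{i}.blocks.*'; real checkpoints will differ.
    """
    remapped = []
    for k in state_dict_keys:
        if '.downsample.' in k:
            # levels.{i}.downsample.* -> levels.{i+1}.downsample.*
            parts = k.split('.')
            parts[1] = str(int(parts[1]) + 1)
            remapped.append('.'.join(parts))
        else:
            remapped.append(k)
    return remapped
```

Note this only fixes parameter naming; the module definitions themselves still need the downsample call moved so the forward pass matches.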
Another thing: I've never found GCViT (same authors) particularly easy to train or fine-tune (including reproducing the original results) compared to ViT, Swin, or ConvNeXt (all of which I've successfully reproduced and improved on). I wonder how this one compares... given the complexity of the model code, I found the throughput numbers surprising, as more code usually means more activations and slower speeds.
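The baked-in square-size assumption mentioned above usually shows up as recovering spatial dims from the token count via a square root, which silently breaks on rectangular inputs. A minimal sketch of the failure and the usual fix (hypothetical function names, not from the FasterViT code):

```python
import math

def tokens_to_2d_square(n_tokens):
    """Recover (H, W) from a token count assuming a square grid.
    Fails (here, loudly via assert) for any non-square input."""
    side = int(math.sqrt(n_tokens))
    assert side * side == n_tokens, "non-square token grid"
    return side, side

def tokens_to_2d(n_tokens, feat_size):
    """The usual fix: carry the true (H, W) through the forward pass
    instead of re-deriving it from the token count."""
    h, w = feat_size
    assert h * w == n_tokens
    return h, w
```

So e.g. a 224x224 input at stride 16 gives 196 tokens and round-trips fine, but a 224x320 input (280 tokens) only works when the true feature size is passed explicitly.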
"FasterViT: Fast Vision Transformers with Hierarchical Attention"
https://github.com/NVlabs/FasterViT