Add SSD512 with ResNet50 backbone #3760
Conversation
Thanks a lot for the PR, Vasilis!
I've left a few comments; let me know what you think.
```python
self.features = nn.Sequential(
    backbone.conv1,
    backbone.bn1,
    backbone.relu,
    backbone.maxpool,
    backbone.layer1,
    backbone.layer2,
    backbone.layer3,
    backbone.layer4,
)
```
Any particular reason why you didn't use the `IntermediateLayerGetter`?
Also, for the future, I think we will want to unify the way we extract features so that we rely on the FX-based feature extractor, which will be more generic.
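For reference, a minimal sketch of what that could look like (it relies on torchvision's private `IntermediateLayerGetter` helper, so treat this as an illustration rather than stable API):

```python
import torch
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter

backbone = resnet50(pretrained=True)
# Keep everything up to layer4 and expose its activation under the key 'out'.
features = IntermediateLayerGetter(backbone, return_layers={'layer4': 'out'})
out = features(torch.rand(1, 3, 512, 512))['out']  # shape: [1, 2048, 16, 16]
```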
```python
for m in self.features[-1][0].modules():
    if hasattr(m, 'stride'):
        m.stride = 1
```
Wouldn't we want to instead pass `replace_stride_with_dilation=[False, False, True]` when constructing the ResNet? Just replacing the stride of the last layer without adding dilation means that the features from the last block are not really acting the way they were initially trained to.
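A sketch of that suggestion, using the `replace_stride_with_dilation` argument of torchvision's ResNet constructor:

```python
import torch
from torchvision.models import resnet50

# Swap layer4's stride-2 convolutions for dilated ones at construction time,
# instead of patching the strides after the fact.
backbone = resnet50(pretrained=True,
                    replace_stride_with_dilation=[False, False, True])
features = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4,
)
# The output stride is now 16 instead of 32, and layer4 uses dilation 2,
# so its filters keep the receptive field they were trained with.
print(features(torch.rand(1, 3, 512, 512)).shape)  # torch.Size([1, 2048, 32, 32])
```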
```python
nn.Sequential(
    nn.Conv2d(backbone_out_channels, 256, kernel_size=1, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, kernel_size=3, padding=1, stride=2, bias=False),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
)
```
Nit for the future: might be good to refactor this into a `Block` class or something like that, which inherits from `Sequential`, so that we keep the same names for the modules. Something like:
```python
class ExtraBlock(nn.Sequential):
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__(nn.Conv2d(...), ...)
```
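Filling in the ellipsis with the conv/bn/relu pattern from the hunk above, a complete sketch might look like this (the exact parameterization is an assumption):

```python
from torch import nn

class ExtraBlock(nn.Sequential):
    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3,
                      padding=1, stride=2, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
```

Since the modules are passed positionally, every `ExtraBlock` gets the same child names (`0` through `5`), which keeps state-dict keys consistent across blocks.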
| Network                   | train time (s/it) | test time (s/it) | memory (GB) |
| ------------------------- | ----------------- | ---------------- | ----------- |
| SSD VGG16                 | 0.2093            | 0.0744           | 1.5         |
| SSDlite MobileNetV3-Large | 0.1773            | 0.0906           | 1.5         |
| SSD300 VGG16              | 0.2093            | 0.0744           | 1.5         |
| SSD512 ResNet-50          | 0.2316            | 0.0772           | 3.0         |
For the future: we need to change those tables as they are misleading for now -- the test time column for the SSD models is for a batch size of 4 per GPU, while for Faster R-CNN it was for a batch size of 2.
Maybe do a back-of-the-envelope estimation to bring them to comparable batch sizes?
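For instance, assuming the reported test times are seconds per iteration, the per-image numbers fall out directly:

```python
# Per-image test time = (seconds / iteration) / (images / iteration).
ssd300_vgg16 = 0.0744 / 4      # ~0.0186 s/image at the SSD batch size of 4
ssd512_resnet50 = 0.0772 / 4   # ~0.0193 s/image
# The Faster R-CNN rows (not in this hunk) would be divided by 2 instead.
```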
torchvision/models/detection/ssd.py
```python
def forward(self, x: Tensor) -> Dict[str, Tensor]:
    # Undo the 0-1 scaling of ToTensor. Necessary for some backbones.
    if self.rescaling:
```
Thanks for changing this!
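The hunk above is truncated by the diff view; a self-contained sketch of what the rescaling branch presumably does (the class name and the 255 multiplier are assumptions, for backbones that expect [0, 255] inputs):

```python
import torch
from torch import Tensor

class RescalingExtractor(torch.nn.Module):  # hypothetical wrapper
    def __init__(self, features: torch.nn.Module, rescaling: bool):
        super().__init__()
        self.features = features
        self.rescaling = rescaling

    def forward(self, x: Tensor) -> Tensor:
        # Undo the 0-1 scaling of ToTensor: some backbones (e.g. Caffe-style
        # VGG weights) were trained on images in the [0, 255] range.
        if self.rescaling:
            x = x * 255
        return self.features(x)
```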
We ran thorough benchmarks and discussed with @fmassa whether we should add this backbone to TorchVision. For the sake of transparency, I'm going to write here why we decided not to merge this.

Originally, the motivation behind adding ResNet50 as a backbone was that it's a newer backbone that is likely to be more useful than VGG for people in industry. After training the backbone we got an mAP of 30.2 (without hyperparameter tuning), which is 1.4 mAP higher than the SSD512+VGG16 reported in the SSD paper. Moreover, the model is about 45% faster on CPU than the VGG16 equivalent, so it looked like a great candidate.

On the other hand, one of the key drawbacks of this backbone is that it's not a canonical implementation of a model backed by a paper. Rather, it's inspired by the SSD paper and closely resembles other implementations (for example, check NVIDIA's). Though this should not necessarily be a blocker for adding a model to the library, we do require a new model to bring a significant improvement in at least one area. To answer the question of whether this model has a unique characteristic that none of the other models in the library has, we benchmarked it against the pre-trained FasterRCNN+ResNet50+FPN. To make the comparison fair, we restricted the input of all images to 512x512. Here are the results:
As we can see, both techniques have equivalent mAPs. We should note that we can definitely push the accuracy of both models higher by tuning them, and especially for FasterRCNN we can improve it just by training at the specific fixed size. Looking at CPU speed, the SSD model is clearly faster, but that's not the case when we look at GPU speed.

Given that SSD512+ResNet50 is not a model that one will run on mobile (SSDlite is a much better candidate for that) and that it's slower than the FasterRCNN equivalent on GPU, there is a much less compelling argument for including it. If someone stumbles upon this PR in the future and wants to give the ResNet50 backbone a try, they will be able to do so by copying the necessary classes. I will also leave the pre-trained weights online so that people can download them if they want to.

I will close this and cherry-pick the improvements made to the remaining files in a separate PR.
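For anyone repeating the comparison, this is roughly how the pre-trained Faster R-CNN can be restricted to 512x512 inputs (a sketch; `min_size`/`max_size` are forwarded to the model's internal resize transform):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Cap the internal resize so no image side exceeds 512 during evaluation.
model = fasterrcnn_resnet50_fpn(pretrained=True, min_size=512, max_size=512)
model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 480, 640)])
```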
This PR adds a newer backbone to the standard SSD algorithm, which might be more appropriate for real-world applications.
The proposed ResNet50 backbone follows a similar approach to the one in the paper and is 1.4 mAP more accurate than the equivalent SSD512-VGG16 described in the paper, while being 45% faster on CPU.
Trained using the code committed at 2f1f578. The current best pre-trained model was trained with:
Submitted batch job 40937540 (killed midway)
Submitted batch job 40985712 (resumed)
Accuracy metrics:
Validated with:
Speed benchmark:
1.23 sec per image on CPU