Add support of MViTv2 video variants #6373
Merged
Conversation
This was referenced Aug 5, 2022

datumbox commented Aug 8, 2022

datumbox commented Aug 9, 2022
jdsgomes reviewed Aug 10, 2022
LGTM, thanks
jdsgomes approved these changes Aug 10, 2022
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 11, 2022:
Addresses some [breakages](https://github.com/pytorch/pytorch/runs/7782559841?check_suite_focus=true) from #82560.

Context: the tests are breaking because a new architecture was added in TorchVision (see pytorch/vision#6373) that requires a different input size. This PR addresses it by using the right size for the `mvit_v2_s` architecture.

Pull Request resolved: #83242
Approved by: https://github.com/ezyang
facebook-github-bot pushed a commit that referenced this pull request Aug 23, 2022:
Summary:

- Extending to support MViTv2
- Fix docs, mypy and linter
- Refactor the relative positional code.
- Code refactoring.
- Rename vars.
- Update docs.
- Replace assert with exception.
- Update docs.
- Minor refactoring.
- Remove the square input limitation.
- Moving methods around.
- Modify the shortcut in the attention layer.
- Add ported weights.
- Introduce a `residual_cls` config on the attention layer.
- Make the patch_embed kernel/padding/stride configurable.
- Apply changes from code-review.
- Remove stale todo.

Reviewed By: datumbox

Differential Revision: D38824226

fbshipit-source-id: 2950997bb37e431d76a0480b5b938b15b1d5eeaf
This is the continuation of the work from #6198 to finalize the API of the MViT class. The PR extends TorchVision's existing MViT architecture to support v2 variants. The `mvit_v2_s` variant introduced is canonical and its weights are ported from the paper. This is based on the work of @lyttonhao, @haooooooqi and @feichtenhofer on SlowFast.
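A minimal usage sketch of the new variant (assuming a TorchVision build that includes this PR; the weights enum follows TorchVision's multi-weights convention):

```python
import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

# Build the model with the ported Kinetics-400 weights.
weights = MViT_V2_S_Weights.KINETICS400_V1
model = mvit_v2_s(weights=weights).eval()

# MViTv2-S consumes 16-frame clips at 224x224, laid out as
# (batch, channels, time, height, width).
clip = torch.rand(1, 3, 16, 224, 224)
with torch.inference_mode():
    scores = model(clip)

print(scores.shape)  # torch.Size([1, 400]) -- one logit per Kinetics class
```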
Verification process

Comparing outputs
To confirm that the implementation is compatible with the original from SlowFast, we create a weight converter, load the same weights into both implementations, and compare them against the same input:
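A sketch of that check; `build_slowfast_mvit` and `convert_checkpoint` are hypothetical stand-ins for the original SlowFast model builder and the weight converter, which are not reproduced here:

```python
import torch
from torchvision.models.video import mvit_v2_s

# Hypothetical helpers: build_slowfast_mvit() constructs the original
# SlowFast MViTv2-S, convert_checkpoint() remaps its state dict to
# TorchVision's parameter names.
sf_model = build_slowfast_mvit().eval()
tv_model = mvit_v2_s().eval()
tv_model.load_state_dict(convert_checkpoint(sf_model.state_dict()))

# Feed the same clip through both implementations; the outputs should
# match to within numerical noise.
x = torch.rand(1, 3, 16, 224, 224)
with torch.inference_mode():
    torch.testing.assert_close(tv_model(x), sf_model(x), rtol=1e-4, atol=1e-4)
```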
Benchmarks
To ensure that we don't introduce any speed regression, we test the speed as follows:
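A minimal timing sketch along those lines, using `torch.utils.benchmark` (the PR's actual benchmark script is not reproduced here; a CUDA device is assumed):

```python
import torch
import torch.utils.benchmark as benchmark
from torchvision.models.video import mvit_v2_s

model = mvit_v2_s().cuda().eval()
x = torch.rand(1, 3, 16, 224, 224, device="cuda")

# Timer performs warm-up and CUDA synchronization internally, so the
# reported numbers reflect the full forward pass.
with torch.inference_mode():
    timer = benchmark.Timer(stmt="model(x)", globals={"model": model, "x": x})
    print(timer.timeit(50))  # mean time per forward pass over 50 runs
```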
This was tested on an A100 and, as we see below, the implementation is 5% faster than the original:
Accuracy
To verify the accuracy of the model, we run the following:
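In sketch form, the evaluation looks like this; `val_loader` is an assumed DataLoader yielding Kinetics-400 `(clip, label)` batches, and the weights' bundled `transforms()` provide the preprocessing:

```python
import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

weights = MViT_V2_S_Weights.KINETICS400_V1
model = mvit_v2_s(weights=weights).cuda().eval()
preprocess = weights.transforms()  # resize, crop and normalize the clips

correct = total = 0
with torch.inference_mode():
    for clip, label in val_loader:  # assumed Kinetics-400 validation loader
        logits = model(preprocess(clip).cuda())
        correct += (logits.argmax(dim=1).cpu() == label).sum().item()
        total += label.numel()

print(f"Acc@1: {100.0 * correct / total:.2f}")
```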
Note that the reported Acc@1 is a bit lower than the one in the paper, but this is due to the version of the dataset we use to assess the model (some corrupted videos are removed). To ensure that the accuracy of TorchVision's implementation is not lagging, we test the same data and weights using SlowFast's reference scripts:
As we can see, the accuracies are practically the same, with minor differences caused by differences in the clip-sampling mechanism.