Add 0-dim Tensor overload to _foreach_mul #106677
Conversation
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106677
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 355b464.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
in fast path Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@pytorchbot label "module: mta"
Thanks @crcrpar for the speedy PR! :D
Looks good overall; I have some minor comments on error messages and testing. I also have a broader question: why does this overload only handle scalar (0-dim) tensors, vs. any tensor that could be broadcast into an element of the tensorlist? What is holding us back from supporting the more general case?
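For readers following along, here is a minimal usage sketch of the overload under discussion; the tensor shapes and the CUDA device are illustrative assumptions, not taken from the PR itself.

```python
import torch

# Multiply every tensor in a list, in place, by a 0-dim tensor.
# With only the Scalar overload, `scale` would have to be a Python number,
# which forces a .item() call (and a sync) if the value lives on the GPU.
params = [torch.randn(3, device="cuda") for _ in range(4)]
scale = torch.tensor(0.5, device="cuda")  # 0-dim tensor on the same device
torch._foreach_mul_(params, scale)        # dispatches to the new Tensor overload
```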
#define FOREACH_BINARY_OP_SCALAR_TENSOR(FUNCTION, NAME, OP, DIVISION_OP) \
void foreach_tensor_##NAME##_tensor_kernel_cuda_( \
    TensorList tensors, const Tensor& scalar) { \
  check_foreach_api_restrictions(tensors); \
Why do we have to keep making this check even in the functions we dispatch into? Like in foreach_tensor_##OP##tensor_kernel_slow?
We need this in the ones defined in ForeachOpsKernels.cpp, since CPU TensorList inputs don't come here. Does this answer your question?
Yea, I realized that. However, could we postpone this check to after we run the slow path (since that will check this anyway)? Or is the runtime for check_foreach_api_restrictions negligible? This also doesn't need to happen in this PR; I'm just curious why the precedent was set as such.
Hmm, I'm not sure about the benefit of that, because we call this check in the fast path at some point anyway, and what it does is basically compare the sizes of the tensor lists and scalar lists.
I once thought of merging these two checks into one for the fast path while keeping the first check for the slow path.
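For reference, a rough Python rendering of what the restriction check being discussed amounts to; the real check lives in ATen's C++ code, and the signature below is a simplification for illustration only.

```python
def check_foreach_api_restrictions(tensors, scalars=None):
    # The tensor list must be non-empty.
    assert len(tensors) > 0, "Tensor list must have at least one tensor."
    # For the ScalarList variants, the two lists must line up element-wise.
    if scalars is not None:
        assert len(tensors) == len(scalars), (
            f"Tensor list must match scalar list in length: "
            f"{len(tensors)} vs {len(scalars)}.")
```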
cc @janeyx99 there was some discussion here, where it seems like defining the behavior of "what does the user expect when the […]
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
CI failures look to just be error message string matching--should be chill to fix.
Looks good conditioned on green CI. Thanks for the change!
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
rel:
- pytorch#106427

Pull Request resolved: pytorch#106677
Approved by: https://github.com/janeyx99
This PR mostly follows the steps from #106677, except that we add one feature. Similar to fused_adam(w), for the CUDA dispatches: when the scalar tensor is on CPU, we .item() it and redispatch to the normal Scalar overload. Otherwise, the CUDA kernel would complain about a device mismatch between the scalar and the tensors.

Why add this feature? Our optimizers want to allow lr as a tensor, and lr could be a CPU tensor. lr is used with foreach_div_ in Adam, so our CI would break otherwise.

After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs. only a GPU tensor). They join the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same for foreach_add (the only other foreach op with a .Tensor overload) because there is no use case and it would be more involved.

Pull Request resolved: #113688
Approved by: https://github.com/mlazos, https://github.com/albanD
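A sketch of the behavior described above, assuming #113688 has landed; the variable names and shapes are illustrative.

```python
import torch

# An optimizer may hold lr as a 0-dim CPU tensor while its params live on the GPU.
grads = [torch.randn(3, device="cuda") for _ in range(2)]
lr = torch.tensor(2.0)  # 0-dim tensor on CPU

# After #113688, the CUDA dispatch .item()s the CPU scalar tensor and redispatches
# to the Scalar overload instead of raising a device-mismatch error.
torch._foreach_div_(grads, lr)
```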
rel: torch.nn.utils.clip_grad_norm_() causes H2D sync with foreach ops. #106427
cc @mcarilli @bdhirsh
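To connect the dots with the linked issue, a hedged sketch of how the Tensor overload helps clip_grad_norm_ avoid the H2D sync; max_norm, the gradients, and the epsilon below are illustrative, not the exact library code.

```python
import torch

grads = [torch.randn(5, device="cuda") for _ in range(3)]
max_norm = 1.0

# Compute the total norm entirely on the GPU.
total_norm = torch.linalg.vector_norm(
    torch.stack([torch.linalg.vector_norm(g) for g in grads]))
clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)  # 0-dim CUDA tensor

# With only the Scalar overload, clip_coef.item() would be needed here, forcing a
# device-to-host sync; the Tensor overload keeps everything on device.
torch._foreach_mul_(grads, clip_coef)
```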