Add 0-dim Tensor overload to _foreach_mul #106677
Conversation
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106677
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 355b464.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
in fast path Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@pytorchbot label "module: mta"
Thanks @crcrpar for the speedy PR! :D
Looks good overall; I have some minor comments on error messages and testing. I also have a broader question: why does this overload only handle scalar (0-dim) tensors, vs. any tensor that could be broadcast into an element of the tensorlist? What is holding us back from supporting the more general case?
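For readers following along, here is a minimal usage sketch of the overload under discussion; the tensor shapes and the CUDA device are illustrative assumptions, not taken from the PR itself.

```python
import torch

# Multiply every tensor in a list, in place, by a 0-dim tensor.
# With only the Scalar overload, `scale` would have to be a Python number,
# which forces a .item() call (and a sync) if the value lives on the GPU.
params = [torch.randn(3, device="cuda") for _ in range(4)]
scale = torch.tensor(0.5, device="cuda")  # 0-dim tensor on the same device
torch._foreach_mul_(params, scale)        # dispatches to the new Tensor overload
```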
#define FOREACH_BINARY_OP_SCALAR_TENSOR(FUNCTION, NAME, OP, DIVISION_OP) \
void foreach_tensor_##NAME##_tensor_kernel_cuda_( \
    TensorList tensors, const Tensor& scalar) { \
  check_foreach_api_restrictions(tensors); \
Why do we have to keep making this check even in the functions we dispatch into? Like in foreach_tensor_##OP##tensor_kernel_slow?
We need this in the ones defined in ForeachOpsKernels.cpp, since CPU TensorList inputs don't come here. Does this answer your question?
Yea, I realized that. However, could we postpone this check to after we run the slow path (since that will check this anyway)? Or is the runtime for check_foreach_api_restrictions negligible? This also doesn't need to happen in this PR; I'm just curious why the precedent was set as such.
Hmm, I'm not sure about the benefit of that, because we call this check in the fast path at some point anyway, and what it does is basically compare the sizes of the tensor lists and scalar lists.
I once thought of merging these two checks into one for the fast path while keeping the first check for the slow path.
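For reference, a rough Python rendering of what the restriction check being discussed amounts to; the real check lives in ATen's C++ code, and the signature below is a simplification for illustration only.

```python
def check_foreach_api_restrictions(tensors, scalars=None):
    # The tensor list must be non-empty.
    assert len(tensors) > 0, "Tensor list must have at least one tensor."
    # For the ScalarList variants, the two lists must line up element-wise.
    if scalars is not None:
        assert len(tensors) == len(scalars), (
            f"Tensor list must match scalar list in length: "
            f"{len(tensors)} vs {len(scalars)}.")
```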
cc @janeyx99 there was some discussion here, where it seems like defining the behavior of "what does the user expect when the […]
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
CI failures look to just be error message string matching--should be chill to fix.
Looks good conditioned on green CI. Thanks for the change!
Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
rel:
- pytorch#106427

Pull Request resolved: pytorch#106677
Approved by: https://github.com/janeyx99
This PR mostly follows the steps from #106677, except that we add one feature. Similar to fused_adam(w), for the CUDA dispatches: when the scalar tensor is on CPU, we .item() it and redispatch to the normal Scalar overload. Otherwise, the CUDA kernel would complain about a device mismatch between the scalar and the tensors.

Why add this feature? Our optimizers want to allow lr as a tensor, and lr could be a CPU tensor. lr is used with foreach_div_ in Adam, so our CI would break otherwise.

After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs. only a GPU tensor). They join the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same for foreach_add (the only other foreach op with a .Tensor overload) because there is no use case and it would be more involved.

Pull Request resolved: #113688
Approved by: https://github.com/mlazos, https://github.com/albanD
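A sketch of the behavior described above, assuming #113688 has landed; the variable names and shapes are illustrative.

```python
import torch

# An optimizer may hold lr as a 0-dim CPU tensor while its params live on the GPU.
grads = [torch.randn(3, device="cuda") for _ in range(2)]
lr = torch.tensor(2.0)  # 0-dim tensor on CPU

# After #113688, the CUDA dispatch .item()s the CPU scalar tensor and redispatches
# to the Scalar overload instead of raising a device-mismatch error.
torch._foreach_div_(grads, lr)
```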
rel: torch.nn.utils.clip_grad_norm_() causes H2D sync with foreach ops. #106427
cc @mcarilli @bdhirsh
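To connect the dots with the linked issue, a hedged sketch of how the Tensor overload helps clip_grad_norm_ avoid the H2D sync; max_norm, the gradients, and the epsilon below are illustrative, not the exact library code.

```python
import torch

grads = [torch.randn(5, device="cuda") for _ in range(3)]
max_norm = 1.0

# Compute the total norm entirely on the GPU.
total_norm = torch.linalg.vector_norm(
    torch.stack([torch.linalg.vector_norm(g) for g in grads]))
clip_coef = (max_norm / (total_norm + 1e-6)).clamp(max=1.0)  # 0-dim CUDA tensor

# With only the Scalar overload, clip_coef.item() would be needed here, forcing a
# device-to-host sync; the Tensor overload keeps everything on device.
torch._foreach_mul_(grads, clip_coef)
```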