-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TOPI] Make cumsum IR reusable, add thrust scan #7303
Conversation
7a0403d
to
2d58ef7
Compare
Scan is probably the most hand-optimized kernel in thrust, I'm thrilled to be within 10x for a cross-GPU kernel. Overall I'm happy with this, but I have 2 thoughts.
|
A related discussion point: Do you expect scan performance on non-innermost axis to be slower than the innermost case? If that's the case (which I believe yes), I think supporting non innermost scan and other ranks by
is a good solution. It is definitely preferred in terms of implementation simplicity, allowing scan implementation to focus on 1 or 2D + innermost axis. |
Yeah, scanning on the non-inner axis will have a cache locality performance hit, but I'm honestly not sure if that would be better or worse than the overhead from doing a pair of reshape/transpose ops. Reshape and transpose are heavily limited by memory bandwidth. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this, we have some feature/performance questions that I think can go into follow up PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. I am new to this concept and TVM-CUDA, so can't fully comment on the details. But high-level idea and design looks solid.
One request is to run this with empty tensor (zero in the input shape) to see if any corner case is missing.
hmm interesting, I've never created a test case with empty tensor, is that possible? Note that the IR is copied straight from #7303, so the same guard against empty tensor is here. tvm/python/tvm/topi/cuda/scan.py Line 59 in 4e13a3f
|
Yes, I eyeballed the changes wrt to empty tensor and they looked good. So, I am happy to approve this PR. Once it is merged, I can try on my end with TF models as well. For empty tensor test case - https://github.com/apache/tvm/blob/main/tests/python/relay/test_any.py#L879 |
Perf improvement is not expected, since it only improves @anijain2305 The term you want to search for is "gpu stream compaction". |
commit cf0d4fd Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 10:12:01 2020 +0900 get valid count test working commit eb142d3 Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 07:22:00 2020 +0900 integrate new cumsum change commit f89684d Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 06:56:46 2020 +0900 remove ceil_div from nms commit a2ad4de Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 20:36:34 2020 +0900 add api for returning reduction from ex scan output commit b7f4ef7 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 19:49:07 2020 +0900 move ceil_div to utils commit a9a57e3 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 19:38:15 2020 +0900 rename prefix_scan.py to scan.py commit 03ed43f Author: Masahiro Masuda <masahi129@gmail.com> Date: Sat Dec 19 06:12:55 2020 +0900 surpress cpplint commit abceac9 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:36:24 2020 +0900 support more data type commit 3e7d1f8 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:09:51 2020 +0900 1d thrust scan working commit ac13b40 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 19:49:25 2020 +0900 adding thrust scan support commit 65634e8 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 19:01:11 2020 +0900 add thrust scan python stub commit 9876c90 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:55:14 2020 +0900 introduce prefix_scan.py and move scan ir in nms.py commit 667bdd3 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 15:06:18 2020 +0900 make the scan loop exclusive commit 480787b Author: mbrookhart <mbrookhart@octoml.ai> Date: Thu Dec 17 10:01:11 2020 -0700 Parallelize cumsum in get_valid_counts
@anijain2305 I added an empty tensor test in a6c7403 OpenCL seems to have a problem with 0 size buffer, but otherwise both TIR scan and thrust scan seem to have no issue. Please take a look. |
Thanks :) LGTM :) |
Thanks @mbrookhart @anijain2305 |
* import changes from scan branch commit cf0d4fd Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 10:12:01 2020 +0900 get valid count test working commit eb142d3 Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 07:22:00 2020 +0900 integrate new cumsum change commit f89684d Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 06:56:46 2020 +0900 remove ceil_div from nms commit a2ad4de Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 20:36:34 2020 +0900 add api for returning reduction from ex scan output commit b7f4ef7 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 19:49:07 2020 +0900 move ceil_div to utils commit a9a57e3 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 19:38:15 2020 +0900 rename prefix_scan.py to scan.py commit 03ed43f Author: Masahiro Masuda <masahi129@gmail.com> Date: Sat Dec 19 06:12:55 2020 +0900 surpress cpplint commit abceac9 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:36:24 2020 +0900 support more data type commit 3e7d1f8 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:09:51 2020 +0900 1d thrust scan working commit ac13b40 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 19:49:25 2020 +0900 adding thrust scan support commit 65634e8 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 19:01:11 2020 +0900 add thrust scan python stub commit 9876c90 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:55:14 2020 +0900 introduce prefix_scan.py and move scan ir in nms.py commit 667bdd3 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 15:06:18 2020 +0900 make the scan loop exclusive commit 480787b Author: mbrookhart <mbrookhart@octoml.ai> Date: Thu Dec 17 10:01:11 2020 -0700 Parallelize cumsum in get_valid_counts * fix for 1d scan * rename * cast to out dtype * do not run return reduction for inclusive scan * remove another ceil_div definition * adding scan test * add scheduling for scan op, fixed scan 1d test * pylint fix * add doc string * add more thrust scan test * add dynamic get valid count test, including empty size tensor * fix hard coded gpu targets for cpu only env * try retunring early if scan_size is 0 * another change for empty tensor and thrust path Co-authored-by: masa <masa@pop-os.localdomain>
* import changes from scan branch commit cf0d4fd Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 10:12:01 2020 +0900 get valid count test working commit eb142d3 Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 07:22:00 2020 +0900 integrate new cumsum change commit f89684d Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 06:56:46 2020 +0900 remove ceil_div from nms commit a2ad4de Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 20:36:34 2020 +0900 add api for returning reduction from ex scan output commit b7f4ef7 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 19:49:07 2020 +0900 move ceil_div to utils commit a9a57e3 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 19:38:15 2020 +0900 rename prefix_scan.py to scan.py commit 03ed43f Author: Masahiro Masuda <masahi129@gmail.com> Date: Sat Dec 19 06:12:55 2020 +0900 surpress cpplint commit abceac9 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:36:24 2020 +0900 support more data type commit 3e7d1f8 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:09:51 2020 +0900 1d thrust scan working commit ac13b40 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 19:49:25 2020 +0900 adding thrust scan support commit 65634e8 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 19:01:11 2020 +0900 add thrust scan python stub commit 9876c90 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:55:14 2020 +0900 introduce prefix_scan.py and move scan ir in nms.py commit 667bdd3 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 15:06:18 2020 +0900 make the scan loop exclusive commit 480787b Author: mbrookhart <mbrookhart@octoml.ai> Date: Thu Dec 17 10:01:11 2020 -0700 Parallelize cumsum in get_valid_counts * fix for 1d scan * rename * cast to out dtype * do not run return reduction for inclusive scan * remove another ceil_div definition * adding scan test * add scheduling for scan op, fixed scan 1d test * pylint fix * add doc string * add more thrust scan test * add dynamic get valid count test, including empty size tensor * fix hard coded gpu targets for cpu only env * try retunring early if scan_size is 0 * another change for empty tensor and thrust path Co-authored-by: masa <masa@pop-os.localdomain>
* import changes from scan branch commit cf0d4fd Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 10:12:01 2020 +0900 get valid count test working commit eb142d3 Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 07:22:00 2020 +0900 integrate new cumsum change commit f89684d Author: Masahiro Masuda <masahi129@gmail.com> Date: Fri Dec 25 06:56:46 2020 +0900 remove ceil_div from nms commit a2ad4de Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 20:36:34 2020 +0900 add api for returning reduction from ex scan output commit b7f4ef7 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 19:49:07 2020 +0900 move ceil_div to utils commit a9a57e3 Author: Masahiro Masuda <masahi129@gmail.com> Date: Sun Dec 20 19:38:15 2020 +0900 rename prefix_scan.py to scan.py commit 03ed43f Author: Masahiro Masuda <masahi129@gmail.com> Date: Sat Dec 19 06:12:55 2020 +0900 surpress cpplint commit abceac9 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:36:24 2020 +0900 support more data type commit 3e7d1f8 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:09:51 2020 +0900 1d thrust scan working commit ac13b40 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 19:49:25 2020 +0900 adding thrust scan support commit 65634e8 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 19:01:11 2020 +0900 add thrust scan python stub commit 9876c90 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 20:55:14 2020 +0900 introduce prefix_scan.py and move scan ir in nms.py commit 667bdd3 Author: masa <masa@pop-os.localdomain> Date: Fri Dec 18 15:06:18 2020 +0900 make the scan loop exclusive commit 480787b Author: mbrookhart <mbrookhart@octoml.ai> Date: Thu Dec 17 10:01:11 2020 -0700 Parallelize cumsum in get_valid_counts * fix for 1d scan * rename * cast to out dtype * do not run return reduction for inclusive scan * remove another ceil_div definition * adding scan test * add scheduling for scan op, fixed scan 1d test * pylint fix * add doc string * add more thrust scan test * add dynamic get valid count test, including empty size tensor * fix hard coded gpu targets for cpu only env * try retunring early if scan_size is 0 * another change for empty tensor and thrust path Co-authored-by: masa <masa@pop-os.localdomain>
This PR generalizes the cumsum IR developed in #7123 as a reusable, exclusive scan primitive. Also enabled offloading to thrust
exclusive_scan
(andinclusive_scan
). This makesget_valid_counts
on CUDA faster, for example.get_valid_counts
performance using different scan implementation (numbers in milli sec)Currently thrust scan is about 10x faster than TIR scan. Thrust scan is so fast that other kernels in
get_valid_counts
become bottleneck. That's why there are only 2x difference in(16, 1000000, 5)
result, for example.To show the utility of
exclusive_scan
, I'll follow up with the following PRs.argwhere
added in [TOPI][OP] cuda for argwhere #6868 usingexclusive_scan
. This will enable removing atomic and sort, vastly simplifying the implementation. Also the performance ofargwhere
will be completely bounded by the performance ofexclusive_scan
(i.e., it should be much faster than the current implementation).cumsum
Relay/TOPI op. As expected, this will simply be a wrapper aroundexclusive_scan
.unique
Relay/TOPI op. This is sort + adjacent difference + exclusive scanFor now, only 1D and 2D inputs are supported, and scan is always done on the inner most axis. The API is preliminary, comments are welcome. On CUDA, thrust scan is always used when available, but it is also possible to use thrust scan only if the target is
cuda -libs=thrust
, for example.Actually, something called "scan" is already a thing in TVM from the very beginning, but it seems it has some limitations and I don't think people care about it, so I went ahead and added
topi/cuda/scan.py
. An alternativeprefix_sum.py
also seems good to me, if people prefer.ready for review @mbrookhart @Laurawly @zhiics @trevor-m @anijain2305 @antinucleon