Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG . Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. * [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (wrap) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for an input arrays larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. The results of one wavefront (wrap) will be overwritten by work items (threads) in another wavefront (wrap). Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. (#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Single kernel sort for small keys. (#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add a workaround for CUDA. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support Key only sorting. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Make single pass kernel non compile time switch. * Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. * [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com>
- Loading branch information