Commit
* Add gitignore to the repository Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 radix sort (#19) * [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. * [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp). Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. 
This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. (#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Single kernel sort for small keys. (#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. 
Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add a workaround for CUDA. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. 
Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support Key only sorting. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Make single pass kernel non compile time switch. * Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. * [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. 
* Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * fix script Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * fix Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. * Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. * Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. 
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <takahiroharada@gmail.com> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: NevesLucas <neves.lucas.m@gmail.com> Co-authored-by: PixelClear <pariku@amd.com> * remove space after -I (#33) * Feature/oro 0 gpuopen merge 2 (#32) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. 
* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <takahiroharada@gmail.com> * fix footnote markdown format (#39) * Feature/oro 0 amdadvtech merge (#43) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 radix sort (#19) * [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG . Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. 
Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. * [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp). Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. (#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Single kernel sort for small keys. 
(#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Calculate the number of WGs based on LDS and max-thread-per-WGP. 
(#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add a workaround for CUDA. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support Key only sorting. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Make single pass kernel non compile time switch. * Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. 
Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. * [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. * Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * fix script Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * fix Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. * Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. 
Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. * Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. 
* [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <takahiroharada@gmail.com> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: NevesLucas <neves.lucas.m@gmail.com> Co-authored-by: PixelClear <pariku@amd.com> Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com> Co-authored-by: Daniel Meister <daniel.meister@amd.com> Co-authored-by: NevesLucas <neves.lucas.m@gmail.com> Co-authored-by: PixelClear <pariku@amd.com> * [ORO-0] bitcode/cubin linking APIs (#40) * [ORO-0] Link apis. * [ORO-0] Forgot to add. * [ORO-0] Linking test. * [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize * [ORO-0] Update link unit tests with comments * [ORO-0] Change test for CUBIN instead of PTX * [ORO-0] Fix loadfile to use binary mode, remove printf in kernel * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Load amdhip first, then hiprtc. 
* [ORO-0] Remove assert from hiprtc library checks * [ORO-0] Add gfx1030 bitcode for navi21 * [MNN-0] Fix premake and add more link testcases * [ORO-0] Update a link_null_name testcase * [ORO-0] Make unit tests more stable on CUDA * [ORO-0] Update bitcode for gfx1030 * [ORO-0] Add bitcodes for navi1,2, vega * [ORO-0] Add hiprtc.dll and comgr dll * [ORO-0] Add gfx906 bitcodes * [ORO-0] Support unit tests on both HIP and CUDA * [ORO-0] Update dlls and bitcodes * [ORO-0] Update bitcodes and generation script * [ORO-0] Minor fixes in bundled bitcode unit tests * [ORO-0] Fix typo in options * [ORO-0] Fix getCUBIN/PTX signatures * [ORO-0] Fix unit tests and generate fatbin for CUDA * [ORO-0] Regenerate fatbin and fix script * [ORO-0] Cleanup * [ORO-0] Update bundled bitcodes to only contain navi21 for now * [ORO-0] Updated bundled bitcode * [ORO-0] add ORO_LAUNCH_PARAMS_* * [ORO-0] Add unit test for orortcLinkAddFile * [ORO-0] Add unittest scripts for TC * [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA * [ORO-0] Add bitcode+bundled bitcode link test * [ORO-0] Cleanup * [ORO-0] Fix typo in script * [ORO-0] Update linux TC script Co-authored-by: takahiroharada <takahiroharada@gmail.com> * [ORO-0] Get global memory size for CUDA (#44) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Add getLoweredName testcase * [ORO-0] Update unittest filter * [ORO-0] Update loweredName test * [ORO-0] Add missing test kernel * [ORO-0] Fix loweredName test * [ORO-0] Fix linux compilation * [ORO-0] Remove printf from test kernel (#37) * [ORO-0] Allow usage of libhiprtc64.so if exists * [ORO-0] Fix linux loading of libhiprtc.so Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: Chih-Chen Kao 
<ChihChen.Kao@amd.com> Co-authored-by: NevesLucas <neves.lucas.m@gmail.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com> Co-authored-by: Daniel Meister <daniel.meister@amd.com> Co-authored-by: PixelClear <pariku@amd.com> * Feature/oro 0 radix sort stream (#34) * Initial commit * Streams to the configuration * Mutex in OrochiUtils * Feature/oro 0 radix sort mutex baking (#36) * Locking other methods in OrochiUtils * Removing mutex from static methods * Making mutex and map static * Removing static from OrochiUtils * Removing static from OrochiUtils * Support Precompiled Kernels in Orochi (#37) * Add bitcode support: getFunctionFromPrecompiledBinary Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add bitcode and the script to generate it. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * rewrite OROASSERT. Fix include file order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Use string instead of const char* Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Rename the option from bitcode to precompiled Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Add bitcode script for nvidia fatbin * [ORO-0] CUDA - hipfb->fatbin rename Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> * Feature/oro 0 resource limits (#38) * Adding limit functions * Removing enum * Removing enum * Limit enum * char string Windows API (#39) * [ORO-0] Update precompiled radix sort kernels to use -ffast-math (#42) * [ORO-0] Update precompiled radix sort kernels to use -ffast-math * [ORO-0] Update RadixSort fatbin for NVIDIA and use fast math * [ORO-0] Function pointer test. (#40) * [ORO-0] Function pointer test. * [ORO-0] launch2d. * [ORO-0] Event, OroStopwatch. * Implement GpuMemory to handle device memory operations. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Sync with GPUOpen/LibrariesAndSDKs/Orochi (#44) * Fix oroGetDeviceProperties in cuda path. 
* Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <takahiroharada@gmail.com> * fix footnote markdown format (#39) * Feature/oro 0 amdadvtech merge (#43) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 radix sort (#19) * [ORO-0] Working 8 bit radix sort. * [ORO-0] Some optimization. * Create LICENSE * Update README.md (#15) * Feature/oro 0 raw get set (#19) * [ORO-0] Rename setter and getter. * [ORO-0] Fix when there is a dll but no device. * [ORO-0] Deletion function. * [ORO-0] Multi processor count. * [ORO-0] Extended the sort to more than 8 bits. Implemented tests. * [ORO-0] Moved temp buffer allocation out from the sort(). * [ORO-0] README. References. * [ORO-0] Debug flag. * Refactor the code to add the basic constructs to support selecting different scan algorithms. Add different implementation of the scan algorithm: CPU, single WG and all WG . 
Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * Optimization: Implement the single-pass kernel for GPU parallel scan. Fix a GPU memory bug. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 kernel cache (#4) * [ORO-0] Cache kernel. * [ORO-0] Support newer HIP builds on windows (#22) * [ORO-0] Unit test. (#23) * Fix LDS scan bug. The previous implementation would lead to an error when the wavefront (warp) size is not equal to the size of a workgroup (block). Since not all threads run simultaneously, for input arrays larger than the wavefront size, the previous algorithm will not work because it performs the scan in-place on the input array. The results of one wavefront (warp) will be overwritten by work items (threads) in another wavefront (warp). Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize the LDS scan algorithm. (#6) * Optimize the LDS scan algorithm. This version does not require a temp buffer and can support a LDS input size up to 2 times the workgroup size. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support an input array in LDS that is 2 times the WG size. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Feature/oro 0 clean up (#7) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * Feature/oro 0 clean up (#10) * Squashed commit of the following: commit 3f32bea2244653d59efb3c3eaa9433018dde5835 Author: takahiroharada <takahiroharada@gmail.com> Date: Wed Apr 13 10:48:35 2022 -0700 [ORO-0] Fix nvrtc. * [ORO-0] Clean up. * [ORO-0] SortKernel1. Less complex. 
(#8) SortKernel (occupancy: 8) - vgpr: 128 - lds: 6704 SortKernel1 (occupancy: 9) - vgpr: 106 - lds 7720 * [ORO-0] Kernel execution time check. * Fix the memory access pattern and change it to coalesced memory access. (#11) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Single kernel sort for small keys. (#12) * Optimize the Count kernel for less LDS usage to achieve full occupancy (#13) * Optimize the Count kernel to let it use less LDS and could achieve full occupancy. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Remove __threadfence_block() Removes the boundary check in the inner loop. The upper bound is set only once before going into the loop. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Introduce DRIVER and RTC APIs * Disable enum-variant * Improve paths * Add fields * Update Vulkan test * Define CUDA in terms of DRIVER and RTC * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> * Merging another merge (#18) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. 
Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15) * Calculate the number of WGs based on LDS and max-thread-per-WGP. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add a workaround for CUDA. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14) * Fix a minor issue in CountKernel to make it more robust. Implement a single-pass 8-bit local sort. Implement a single-pass 8-bit local sort with shared bins. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix nItemsPerWI and enable the version with shared LDS. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Print driver version. * [ORO-0] Repro case. * Fix SORT_WG_SIZE. Fix stable sort order. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Optimize sort kernel to remove inner boundary check. Adjust nItemsPerWI. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Implement key-value pair sorting (#17) * Add gitignore to the repository Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix missing CUDA properties. (#16) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add basic structure for key-value pair sorting. Fix an error in single pass sort Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add Value data in the test and sort it according to keys. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support Key only sorting. 
Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Make single pass kernel non compile time switch. * Support both Key-Only & Key-Value pair sort kernels Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Test change. * [ORO-0] A bug. * [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible. Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> * [ORO-0] Revert demo code. * Fix missing CUDA properties. (#26) * Update Orochi.cpp * [ORO-0] Clean up. * [ORO-0] OroUtils. (#27) * [ORO-0] OroUtils. * [ORO-0] Linux build fix. * [ORO-0] Forgot to add. * [ORO-0] Linux build fix. * [ORO-0] Clean up. Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com> * Add kernel path and include dir to the functions. (#20) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] BakeKernel. (#21) * [ORO-0] BakeKernel. * Update tools/genArgs.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/stringify.py commented code removal * Update tools/genArgs.py dead code removal * Update tools/stringify.py dead code removal * fix include Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * fix script Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * fix Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix Orochi CUDA API (#23) Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Linux build fix. (#22) * [ORO-0] Linux build fix. 
* Fix Orochi CUDA API Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Quick fix for old linux gcc which does not support std::exclusive_scan (#24) Quick fix for old linux gcc which does not support std::exclusive_scan Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Fix the kernel cache bug. (#25) Fix the kernel cache bug. The function should not return the oroFunctions that are created previously solely based on the names because they might be invalid. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] Remove static variables. (#26) * [ORO-0] Remove static variables. * [ORO-0] Applied the suggestions. * [ORO-0] Linux regression fix. * Fix OrochiUtils::getFunctionFromString API (#27) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Adding missing assert (#28) * Adding missing assert * Adding more asserts * Feature/oro 0 gpuopen merge (#31) * Fix oroGetDeviceProperties in cuda path. * Fix linux crash (#29) * [ORO-0] Added missing file. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Fix hipGetErrorString (#32) * [ORO-0] Fix hipGetErrorString It was incorrectly importing this API. Import the correct API in hipew. * [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31) * [ORO-0] Skip compilation of vulkan test on Linux * [ORO-0] Update kernelExec unit test - remove printf * [ORO-0] Remove cout * [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33) * Add missing path on Apple config. (#34) * [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38) * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. 
* [ORO-0] Add hiprtc.dll and comgr dll Co-authored-by: takahiroharada <takahiroharada@gmail.com> * fix footnote markdown format (#39) * Fix orochi utils issue in unit tests Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: NevesLucas <neves.lucas.m@gmail.com> Co-authored-by: PixelClear <pariku@amd.com> Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com> Co-authored-by: Daniel Meister <daniel.meister@amd.com> Co-authored-by: NevesLucas <neves.lucas.m@gmail.com> Co-authored-by: PixelClear <pariku@amd.com> * [ORO-0] bitcode/cubin linking APIs (#40) * [ORO-0] Link apis. * [ORO-0] Forgot to add. * [ORO-0] Linking test. * [ORO-0] Add orortcGetBitcode/orortcGetBitcodeSize * [ORO-0] Update link unit tests with comments * [ORO-0] Change test for CUBIN instead of PTX * [ORO-0] Fix loadfile to use binary mode, remove printf in kernel * [ORO-0] Adding hiprtc to workaround the regression in 22.7.1 driver released at 7/26/2022. * [ORO-0] Created win64 subdir. * [ORO-0] Load amdhip first, then hiprtc. 
* [ORO-0] Remove assert from hiprtc library checks * [ORO-0] Add gfx1030 bitcode for navi21 * [MNN-0] Fix premake and add more link testcases * [ORO-0] Update a link_null_name testcase * [ORO-0] Make unit tests more stable on CUDA * [ORO-0] Update bitcode for gfx1030 * [ORO-0] Add bitcodes for navi1,2, vega * [ORO-0] Add hiprtc.dll and comgr dll * [ORO-0] Add gfx906 bitcodes * [ORO-0] Support unit tests on both HIP and CUDA * [ORO-0] Update dlls and bitcodes * [ORO-0] Update bitcodes and generation script * [ORO-0] Minor fixes in bundled bitcode unit tests * [ORO-0] Fix typo in options * [ORO-0] Fix getCUBIN/PTX signatures * [ORO-0] Fix unit tests and generate fatbin for CUDA * [ORO-0] Regenerate fatbin and fix script * [ORO-0] Cleanup * [ORO-0] Update bundled bitcodes to only contain navi21 for now * [ORO-0] Updated bundled bitcode * [ORO-0] add ORO_LAUNCH_PARAMS_* * [ORO-0] Add unit test for orortcLinkAddFile * [ORO-0] Add unittest scripts for TC * [ORO-0] Set separate LAUNCH_PARAM_END for HIP/CUDA * [ORO-0] Add bitcode+bundled bitcode link test * [ORO-0] Cleanup * [ORO-0] Fix typo in script * [ORO-0] Update linux TC script Co-authored-by: takahiroharada <takahiroharada@gmail.com> * [ORO-0] Get global memory size for CUDA (#44) * [ORO-0] Update HIP dll's for bitcode+bundled bitcode linking support (#46) * [ORO-0] Update HIP dll's for bitcode linking support * [ORO-0] Add getLoweredName testcase * [ORO-0] Update unittest filter * [ORO-0] Update loweredName test * [ORO-0] Add missing test kernel * [ORO-0] Fix loweredName test * [ORO-0] Fix linux compilation * [ORO-0] Remove printf from test kernel (#37) * [ORO-0] Fix linux loading of libhiprtc.so (#49) * [ORO-0] Update test scripts (#50) * [ORO-0] Update scripts for linux (#51) * [ORO-0] Add new scripts (#52) * [ORO-0] Add new scripts * [ORO-0] Add execute permissions to scripts * Fix Unit Test: getErrorString (#54) Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Signed-off-by: Chih-Chen Kao 
<ChihChen.Kao@amd.com> * [ORO-0] Support hiprtc0504 (#55) * [ORO-0] Update hiprtc and orortc error codes (#57) * [ORO-0] Update test scripts to delete cache before running (#58) * [ORO-0] Update hiprtc dlls * [ORO-0] Support gfx1100,gfx1102 for radix sort kernel precompilation * Fix apt python installation (#63) Update checkout version Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] OrochiUtils update. (#61) * [ORO-0] Add WMMA test (#62) * [ORO-0] Add WMMA test * [ORO-0] Add a comment for WMMA * [ORO-0] Cleanup * [ORO-0] Add a couple more comments * [ORO-0] Remove hip_runtime include * [ORO-0] Cleanup * [ORO-0] Fix comment * [ORO-0] Add Copyright notice * [ORO-0] Load binary from the directory where DLL is. * [ORO-0] Fix for linux. --------- Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: NevesLucas <neves.lucas.m@gmail.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com> Co-authored-by: Daniel Meister <daniel.meister@amd.com> Co-authored-by: PixelClear <pariku@amd.com> * [ORO-0] Remove unnecessary template. * [ORO-0] Clean up. Added python script kernelCompile.py for compilation. (#46) * [ORO-0] Clean up. Added python script kernelCompile.py for compilation. * [ORO-0] hipsdk should be next to orochi dir. * Update ParallelPrimitives/RadixSortKernels.h Remove commented line --------- Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] add automatic arch selection (#47) * [ORO-0] add automatic arch selection * [ORO-0] Refactor and error output when it cannot find llc. --------- Co-authored-by: takahiroharada <takahiroharada@gmail.com> * Feature/oro 0 flexible rtc error handling cherrypick (#48) * add a handler for RTC load failure case on cuda. 
* [ORO-0] add a handler for RTC load failure case on hip. * [ORO-0] add cuda 12.0 sdk in nvrtc path * [ORO-0] Remove non bundled bitcode tests. Clean up. * [ORO-0] Clean up. * [ORO-0] Add hiprtcGetBitcodeSize back. * Update Orochi.cpp * Update Orochi.cpp * [ORO-0] Fix for multi-GPU/iGPU * [HIPSDK-0] compute-22.40-osdb/36/ * [ORO-0] compute-23.10-osdb/9/ * [ORO-0] Update dll names * [ORO-0] implement new test for managed memory, enable managed memory api, fix all warnings and cleanup * [ORO-0] fix compile issues * [ORO-0] fix declaration of oroManagedMalloc * [ORO-0] change streaming kernel * [ORO-0] enable it on windows too * [ORO-0] add more asserts * [ORO-0] update kernel * [ORO-0] add host copy times * [ORO-0] add malloc times * Refactor Count Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Refactor Radix Sort class: - Now the tmp buffer is allocated internally. - All GPU memory buffers are changed to the GpuMemory class - `configure` will now calculate the total number of GPU blocks for the count and the scan kernel - The client does not need to call configure explicitly - Refactor function parameters - Remove count reference kernel Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Add `const` Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * This commit does the following: - Support setting the number of threads per block (a.k.a. block size) dynamically - Refactor `exclusiveScanCpu` - Extend `printKernelInfo`. Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * The 1st working example for the radix sort optimization Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * Support configuring dynamic "NUM_WARPS_PER_BLOCK" in the sort kernel Compute the optimal number of inputs for each block to handle.
Refactor the usage of stopwatch Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> * [ORO-0] add hiprtc future dll names in hiprtc path * Add linux paths and dll names (#66) * [ORO-0] Change path and rtc dll names * [ORO-0] Make scripts executable * [ORO-0] Add hiprtc path * [ORO-0] Remove ParallelPrimitives, test/radix sort * [ORO-0] Edit premake --------- Signed-off-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Chih-Chen Kao <ChihChen.Kao@amd.com> Co-authored-by: Takahiro Harada <takahiroharada@users.noreply.github.com> Co-authored-by: Mehmet Oguz Derin <mehmetoguzderin@mehmetoguzderin.com> Co-authored-by: takahiroharada <takahiroharada@gmail.com> Co-authored-by: Daniel Meister <daniel.meister@amd.com> Co-authored-by: NevesLucas <neves.lucas.m@gmail.com> Co-authored-by: PixelClear <pariku@amd.com> Co-authored-by: Richard Geslot <richard.geslot@amd.com> Co-authored-by: Atsushi Yoshimura <51312299+AtsushiYoshimura0302@users.noreply.github.com> Co-authored-by: Atsushi.Yoshimura <Atsushi.Yoshimura@amd.com>
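
The "Fix LDS scan bug" entry in the message above describes an in-place scan that breaks once the workgroup is larger than one wavefront. The following is a minimal CPU sketch of that hazard, not the Orochi kernel itself: the function names `scanInPlace` and `scanDoubleBuffered`, the Hillis-Steele formulation, and the sequential "one wavefront after another" execution model are all assumptions made for illustration. Each wavefront is modeled as lanes that all read before any of them writes, and wavefronts run one after another. The double-buffered variant corresponds to a fix that uses a temporary buffer; the later "Optimize the LDS scan algorithm" entry states that the temp buffer was subsequently removed, which this sketch does not attempt to reproduce.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CPU model of an inclusive Hillis-Steele scan performed in place.
// Lanes of one wavefront read first and write afterwards (lock-step), but the
// wavefronts themselves run sequentially, so a later wavefront can read values
// that an earlier wavefront has already overwritten in the same step.
std::vector<int> scanInPlace(std::vector<int> data, std::size_t waveSize)
{
    assert(waveSize > 0);
    const std::size_t n = data.size();
    for (std::size_t offset = 1; offset < n; offset *= 2)
    {
        for (std::size_t begin = 0; begin < n; begin += waveSize)
        {
            const std::size_t end = std::min(begin + waveSize, n);
            std::vector<int> regs(end - begin); // per-lane registers
            for (std::size_t i = begin; i < end; ++i)
                regs[i - begin] = data[i] + (i >= offset ? data[i - offset] : 0);
            for (std::size_t i = begin; i < end; ++i)
                data[i] = regs[i - begin]; // overwrites the shared input
        }
    }
    return data;
}

// Same scan, but every step reads from one buffer and writes to another.
// The swap stands in for a workgroup barrier between steps.
std::vector<int> scanDoubleBuffered(std::vector<int> data, std::size_t waveSize)
{
    assert(waveSize > 0);
    const std::size_t n = data.size();
    std::vector<int> tmp(n);
    for (std::size_t offset = 1; offset < n; offset *= 2)
    {
        for (std::size_t begin = 0; begin < n; begin += waveSize)
        {
            const std::size_t end = std::min(begin + waveSize, n);
            for (std::size_t i = begin; i < end; ++i)
                tmp[i] = data[i] + (i >= offset ? data[i - offset] : 0);
        }
        std::swap(data, tmp);
    }
    return data;
}
```

In this model, eight ones scanned with a wavefront size of 8 come out correctly from both functions, while a wavefront size of 4 makes `scanInPlace` diverge from `scanDoubleBuffered` because the second wavefront reads partial sums already updated by the first.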