perf:update preprocess #29

manato · 2024-05-10T14:11:46Z

Description

This PR accelerates top-k selection to resolve issue #18.

Tests performed

build/main onnx/mtr_dynamic.onnx --dynamic

Effects on system behavior

Most modern GPUs limit the amount of shared memory that can be assigned to one thread block; the upper limit is 48 KB. The current implementation has each CUDA thread handle a fixed number of items, and that number is derived from the shared memory limit. For this reason, if the value of L exceeds 256*24=6144, the function polylinePreprocessWithTopkLauncher returns an invalid value error.
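The 6144 figure can be checked with a little arithmetic. The sketch below is only an illustration, not the code in this PR: it assumes each item is an 8-byte (index, value) pair, and the names `IndexAndValue`, `maxItemsPerBlock`, and `fitsInOneBlock` are hypothetical. Under that assumption, 256 threads times 24 items per thread fills the 48 KB limit exactly.

```cpp
#include <cstddef>

// Assumption for illustration: one item is an (index, value) pair, 4 + 4 bytes.
struct IndexAndValue {
  int index;
  float value;
};

constexpr std::size_t kSharedMemLimitBytes = 48 * 1024;  // 48 KB limit from the description
constexpr std::size_t kThreadsPerBlock = 256;            // first factor of 256*24
constexpr std::size_t kItemsPerThread = 24;              // second factor of 256*24

// Largest L that one thread block can hold under these assumptions.
constexpr std::size_t maxItemsPerBlock() {
  return kThreadsPerBlock * kItemsPerThread;
}

// Hypothetical guard: true if L items fit, false where the launcher would
// report an invalid value error.
constexpr bool fitsInOneBlock(std::size_t L) {
  return L <= maxItemsPerBlock();
}

// 6144 items * 8 bytes = 49152 bytes = 48 KB (holds when int and float are 4 bytes each).
static_assert(maxItemsPerBlock() * sizeof(IndexAndValue) == kSharedMemLimitBytes,
              "256*24 items of 8 bytes should fill the 48 KB limit exactly");
```

With these constants, `fitsInOneBlock(6144)` is true and `fitsInOneBlock(6145)` is false, which matches the threshold described above.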

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>

ktro2828 · 2024-05-13T04:24:17Z

@manato Thank you for your great PR! While building, I got the following error. Does this depend on my environment?

ktro2828@ktro2828-desktop ~/myWorkspace/TensorRT-MTR [perf/update_preprocess] $ cmake --build build -j${nproc}
Consolidate compiler generated dependencies of target test_intention_point
Consolidate compiler generated dependencies of target test_polyline
Consolidate compiler generated dependencies of target test_agent
[  4%] Building NVCC (Device) object CMakeFiles/custom_kernel.dir/lib/src/preprocess/custom_kernel_generated_polyline_preprocess_kernel.cu.o
Consolidate compiler generated dependencies of target custom_plugin
[ 12%] Built target test_intention_point
[ 20%] Built target test_agent
[ 29%] Built target test_polyline
[ 66%] Built target custom_plugin
/home/ktro2828/myWorkspace/TensorRT-MTR/lib/src/preprocess/polyline_preprocess_kernel.cu(48): error: function "<unnamed>::decomposer_t::operator()" returns incomplete type "cuda::std::__4::tuple<float &>"
    __attribute__((device)) ::cuda::std::tuple<float&> operator()(index_and_value_t& key) const
                                                       ^

/home/ktro2828/myWorkspace/TensorRT-MTR/lib/src/preprocess/polyline_preprocess_kernel.cu(52): error: list-initialization of an object type "cuda::std::__4::tuple<float &>" is not allowed because the type is incomplete
      return {key.value};
             ^

2 errors detected in the compilation of "/home/ktro2828/myWorkspace/TensorRT-MTR/lib/src/preprocess/polyline_preprocess_kernel.cu".
CMake Error at custom_kernel_generated_polyline_preprocess_kernel.cu.o.cmake:280 (message):
  Error generating file
  /home/ktro2828/myWorkspace/TensorRT-MTR/build/CMakeFiles/custom_kernel.dir/lib/src/preprocess/./custom_kernel_generated_polyline_preprocess_kernel.cu.o


gmake[2]: *** [CMakeFiles/custom_kernel.dir/build.make:949: CMakeFiles/custom_kernel.dir/lib/src/preprocess/custom_kernel_generated_polyline_preprocess_kernel.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:120: CMakeFiles/custom_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

My CUDA version is 12.1:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

manato · 2024-05-13T14:18:31Z

@ktro2828

While building, I got the following error.

Thank you very much for checking this PR and pointing this out. As far as I could find, CUB does not support sorting user-defined data types as of CUDA 12.1. Fortunately, a different overload that performs a "key-value sort" can be used here, so I switched the code to that overload (at 073a877), since it is supported by both CUDA 12.1 and CUDA 12.3.

I'd appreciate it if you could try it again. Thx!
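For readers unfamiliar with the idea, here is a rough host-side sketch of what a key-value sort does for top-k selection. It is a CPU stand-in written with the C++ standard library, not the CUB call used in 073a877. The names `sortPairs` and `topkIndices` are hypothetical, and it assumes that top-k means the k smallest keys (for example, distances), which the thread does not state.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side stand-in for a key-value sort: keys are sorted in ascending
// order and the same permutation is applied to the values.
void sortPairs(std::vector<float>& keys, std::vector<int>& values) {
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  std::vector<float> sorted_keys(keys.size());
  std::vector<int> sorted_values(values.size());
  for (std::size_t i = 0; i < order.size(); ++i) {
    sorted_keys[i] = keys[order[i]];
    sorted_values[i] = values[order[i]];
  }
  keys.swap(sorted_keys);
  values.swap(sorted_values);
}

// Returns the original indices of the k smallest keys (assumed meaning of top-k).
// The keys are plain floats and the indices travel as values, so no
// user-defined key type is needed.
std::vector<int> topkIndices(std::vector<float> distances, std::size_t k) {
  std::vector<int> indices(distances.size());
  std::iota(indices.begin(), indices.end(), 0);
  sortPairs(distances, indices);
  indices.resize(std::min(k, indices.size()));
  return indices;
}
```

For example, `topkIndices({3.0f, 1.0f, 2.0f, 0.5f}, 2)` yields the indices 3 and 1, in that order.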

ktro2828 · 2024-05-13T22:53:17Z

Thanks for the update! I confirmed that the build passes and the code performs well on CUDA 12.1 through 12.3!

ktro2828

I left some comments and questions, but the code looks good to me!

include/mtr/cuda_helper.hpp

lib/src/preprocess/polyline_preprocess_kernel.cu

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>

ktro2828 · 2024-05-21T02:45:38Z

@manato Thank you for your contribution! I will merge this PR.

manato added 2 commits May 10, 2024 23:12

perf: use radix sort

75cd53b

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>

feat: introduce multiple cuda stream to perform memcpy in parallel

bc94689

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>

manato force-pushed the perf/update_preprocess branch from 03016d3 to bc94689 May 10, 2024 14:12

ktro2828 linked an issue May 13, 2024 that may be closed by this pull request

[TODO] topK extraction is too slow #18

Closed

ktro2828 changed the title ~~Perf/update preprocess~~ perf:update preprocess May 13, 2024

ktro2828 self-requested a review May 13, 2024 06:52

feat: use key-value sort instead of user-defined type to support wide…

073a877

…r CUDA ver. Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>

ktro2828 approved these changes May 13, 2024

ktro2828 reviewed May 14, 2024

include/mtr/cuda_helper.hpp (outdated, resolved)

ktro2828 reviewed May 14, 2024

lib/src/preprocess/polyline_preprocess_kernel.cu (outdated, resolved)

manato added 2 commits May 18, 2024 12:35

style: use lower camel case for a member function

50d4158

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>

chore: use the same CUDA stream for the code structure consistency

f500b38

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>

ktro2828 merged commit 4ebb81b into ktro2828:main May 21, 2024
1 check failed

manato deleted the perf/update_preprocess branch May 23, 2024 00:07
