perf:update preprocess #29

Merged
merged 5 commits into from
May 21, 2024

Conversation

manato
Contributor

@manato manato commented May 10, 2024

Description

This PR accelerates top-k selection to solve the issue #18.
(Figure: elapsed_comparison)

Tests performed

build/main onnx/mtr_dynamic.onnx --dynamic

Effects on system behavior

Most modern GPUs limit the shared memory that can be assigned to one thread block, typically to 48 KB by default. The current implementation has each CUDA thread handle a fixed number of items, and that number is derived from the shared memory limit. As a result, if L exceeds 256 * 24 = 6144, polylinePreprocessWithTopkLauncher returns an invalid-value error.
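For reference, the 256 * 24 = 6144 figure can be reproduced with a small host-side calculation. This is only a sketch under stated assumptions: the 48 KB limit and the 256-thread block come from the description above, while the 8-byte item size (a 4-byte float key plus a 4-byte integer index) and the names `itemsPerThread`, `maxL`, and `isSupportedL` are hypothetical and not taken from the PR's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// From the PR description: 48 KB of shared memory per block, 256 threads per block.
constexpr std::size_t kSharedMemLimitBytes = 48 * 1024;
constexpr std::size_t kThreadsPerBlock = 256;

// ASSUMPTION: each item is one float key plus one 32-bit index (8 bytes in total).
constexpr std::size_t kBytesPerItem = sizeof(float) + sizeof(std::int32_t);

// Number of items each thread can hold so that the whole block fits in shared memory.
constexpr std::size_t itemsPerThread()
{
  return kSharedMemLimitBytes / (kThreadsPerBlock * kBytesPerItem);
}

// Largest L that a single thread block can handle under these assumptions.
constexpr std::size_t maxL()
{
  return kThreadsPerBlock * itemsPerThread();
}

// Hypothetical guard: a launcher could reject any L above the limit before launching.
inline bool isSupportedL(std::size_t L)
{
  return L <= maxL();
}
```

Under these assumptions `itemsPerThread()` evaluates to 24 and `maxL()` to 6144, so `isSupportedL(6145)` is false, which matches the invalid-value case described above.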

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>
Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>
@ktro2828 ktro2828 linked an issue May 13, 2024 that may be closed by this pull request
@ktro2828
Owner

ktro2828 commented May 13, 2024

@manato Thank you for your great PR! While building, I got the following error. Does this depend on my environment?

ktro2828@ktro2828-desktop ~/myWorkspace/TensorRT-MTR [perf/update_preprocess] $ cmake --build build -j${nproc}
Consolidate compiler generated dependencies of target test_intention_point
Consolidate compiler generated dependencies of target test_polyline
Consolidate compiler generated dependencies of target test_agent
[  4%] Building NVCC (Device) object CMakeFiles/custom_kernel.dir/lib/src/preprocess/custom_kernel_generated_polyline_preprocess_kernel.cu.o
Consolidate compiler generated dependencies of target custom_plugin
[ 12%] Built target test_intention_point
[ 20%] Built target test_agent
[ 29%] Built target test_polyline
[ 66%] Built target custom_plugin
/home/ktro2828/myWorkspace/TensorRT-MTR/lib/src/preprocess/polyline_preprocess_kernel.cu(48): error: function "<unnamed>::decomposer_t::operator()" returns incomplete type "cuda::std::__4::tuple<float &>"
    __attribute__((device)) ::cuda::std::tuple<float&> operator()(index_and_value_t& key) const
                                                       ^

/home/ktro2828/myWorkspace/TensorRT-MTR/lib/src/preprocess/polyline_preprocess_kernel.cu(52): error: list-initialization of an object type "cuda::std::__4::tuple<float &>" is not allowed because the type is incomplete
      return {key.value};
             ^

2 errors detected in the compilation of "/home/ktro2828/myWorkspace/TensorRT-MTR/lib/src/preprocess/polyline_preprocess_kernel.cu".
CMake Error at custom_kernel_generated_polyline_preprocess_kernel.cu.o.cmake:280 (message):
  Error generating file
  /home/ktro2828/myWorkspace/TensorRT-MTR/build/CMakeFiles/custom_kernel.dir/lib/src/preprocess/./custom_kernel_generated_polyline_preprocess_kernel.cu.o


gmake[2]: *** [CMakeFiles/custom_kernel.dir/build.make:949: CMakeFiles/custom_kernel.dir/lib/src/preprocess/custom_kernel_generated_polyline_preprocess_kernel.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:120: CMakeFiles/custom_kernel.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

My CUDA version is 12.1:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

@ktro2828 ktro2828 changed the title Perf/update preprocess perf:update preprocess May 13, 2024
@ktro2828 ktro2828 self-requested a review May 13, 2024 06:52
…r CUDA ver.

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>
@manato
Contributor Author

manato commented May 13, 2024

@ktro2828

While building, I got the following error.

Thank you very much for checking this PR and pointing this out. As far as I could find, CUB does not support sorting user-defined data types as of CUDA 12.1. Fortunately, a different overload that performs a key-value sort works for this case, so I switched the code to that overload (in 073a877); it is supported by both CUDA 12.1 and CUDA 12.3.

I'd appreciate it if you could try it again. Thx!
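To illustrate the key-value idea mentioned above, here is a minimal host-side sketch in plain C++. It is not the CUB code from 073a877: it only shows how keeping keys (here, float distances) and values (here, integer indices) in separate plain arrays avoids sorting a user-defined struct. The function name `topkIndices` is hypothetical, and selecting the k smallest keys is an assumption about what "top-k" means in this project.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side illustration only; the PR performs the sort on the GPU with CUB.
// Returns the indices (values) of the k smallest distances (keys), in ascending key order.
inline std::vector<int> topkIndices(const std::vector<float> & distances, std::size_t k)
{
  assert(k <= distances.size());

  // Values: one index per key, initially 0, 1, 2, ...
  std::vector<int> indices(distances.size());
  std::iota(indices.begin(), indices.end(), 0);

  // Order only the first k indices by their keys; the rest stay unsorted.
  const auto middle = indices.begin() + static_cast<std::ptrdiff_t>(k);
  std::partial_sort(indices.begin(), middle, indices.end(), [&distances](int a, int b) {
    return distances[static_cast<std::size_t>(a)] < distances[static_cast<std::size_t>(b)];
  });

  indices.resize(k);
  return indices;
}
```

For example, with distances `{3.0f, 1.0f, 2.0f, 0.5f}` and `k = 2`, the function returns indices `{3, 1}`.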

@ktro2828
Owner

Thanks for the update! I confirmed that the build passes and the code performs well on CUDA 12.1 through 12.3!

Owner

@ktro2828 ktro2828 left a comment

I left some comments and questions, but the code looks good to me!

Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>
Signed-off-by: Manato HIRABAYASHI <manato.hirabayashi@tier4.jp>
@ktro2828
Owner

@manato Thank you for your contribution! I will merge this PR.

@ktro2828 ktro2828 merged commit 4ebb81b into ktro2828:main May 21, 2024
1 check failed
@manato manato deleted the perf/update_preprocess branch May 23, 2024 00:07
Successfully merging this pull request may close these issues.

[TODO] topK extraction is too slow
2 participants