Segfault on main branch (problem in TopP layer) #2040

Closed
akhoroshev opened this issue Jul 28, 2024 · 4 comments
Labels
bug Something isn't working stale

Comments

@akhoroshev
Contributor

==== backtrace (tid:1508653) ====                                                                                                                                                                                                                    
 0  /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7f4c5a8dae4c]                                                                                                                                                                                      
 1  /lib64/libucs.so.0(+0x2c02c) [0x7f4c5a8db02c]                                                                                                                                                                                                    
 2  /lib64/libucs.so.0(+0x2c1fa) [0x7f4c5a8db1fa]                                                                                                                                                                                                    
 3  /lib64/libpthread.so.0(+0x12cf0) [0x7f4c5ca8dcf0]                                                                                                                                                                                                
 4  /lib64/libcuda.so.1(+0x18d25c) [0x7f4c5e1d625c]                                                                                                                                                                                                  
 5  /lib64/libcuda.so.1(+0xe3ee3) [0x7f4c5e12cee3]                                                                                                                                                                                                   
 6  /lib64/libcuda.so.1(+0x23100c) [0x7f4c5e27a00c]                                                                                                                                                                                                  
 7  /lib64/libcuda.so.1(+0x4ddc05) [0x7f4c5e526c05]                                                                                                                                                                                                  
 8  /lib64/libcuda.so.1(+0x13e746) [0x7f4c5e187746]                                                                                                                                                                                                  
 9  /lib64/libcuda.so.1(+0x13ec60) [0x7f4c5e187c60]                                                                                                                                                                                                  
10  /lib64/libcuda.so.1(+0x13f237) [0x7f4c5e188237]                                                                                                                                                                                                  
11  /lib64/libcuda.so.1(+0x2ea161) [0x7f4c5e333161]                                                                                                                                                                                                  
12  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x1d8e9c8) [0x7f4cb76459c8]                                                                                                                              
13  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x1d5ac82) [0x7f4cb7611c82]                                                                                                                              
14  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x1db864c) [0x7f4cb766f64c]                                                                                                                              
15  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZNK12tensorrt_llm7runtime13BufferManager4copyEPKvRNS0_7IBufferENS0_10MemoryTypeE+0xb5) [0x7f4cb72340e5]                                                  
16  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm6layers17TopPSamplingLayerI6__halfE5setupEiiSt10shared_ptrIKNS_7runtime7IBufferEERKS4_INS0_15BaseSetupParamsEE+0x63b) [0x7f4cb722144b]   
17  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm6layers13SamplingLayerI6__halfE5setupEiiSt10shared_ptrIKNS_7runtime7IBufferEERKS4_INS0_15BaseSetupParamsEE+0x3fd) [0x7f4cb71ffb6d]       
18  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm6layers13DecodingLayerI6__halfE5setupEiiSt10shared_ptrIKNS_7runtime7IBufferEERKS4_INS0_15BaseSetupParamsEE+0x10a) [0x7f4cb71b178a]       
19  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm6layers18DynamicDecodeLayerI6__halfE5setupEiiSt10shared_ptrIKNS_7runtime7IBufferEERKS4_INS0_15BaseSetupParamsEE+0x13c) [0x7f4cb71bd07c]  
20  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptDecoderI6__halfE5setupERKNS0_14SamplingConfigEmRKSt8optionalISt10shared_ptrIKNS0_7ITensorEEERKS7_INS0_14DecodingOutputEE+0x$
ec) [0x7f4cb72765bc]                                                                                                                                                                                                                                 
21  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime15GptDecoderBatch11newRequestsERKSt6vectorIiSaIiEERKS2_INS0_13decoder_batch7RequestESaIS8_EERKS2_INS0_14SamplingConfigESaISD_EE+$
x5f2) [0x7f4cb7286242]                                                                                                                                                                                                                               
22  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching16setupDecoderStepERKSt6vectorISt10shared_ptrINS0_10LlmRequestEESaIS5_EE+0x2de) [0x7f4cb75b779e]
23  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatching12forwardAsyncERKNSt7__cxx114listISt10shared_ptrINS0_10LlmRequestEESaIS6_EEE+0xcfb) [0x7f4cb75b939b]
24  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager10GptManager12forwardAsyncERNSt7__cxx114listISt10shared_ptrINS0_10LlmRequestEESaIS6_EEERSt13unordered_setImSt4hashImESt8equal_toImESaImEE+0x24) [0x7f4cb75599d4]
25  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager10GptManager24decoupled_execution_loopEv+0x143) [0x7f4cb75605b3]
26  /home/askhoroshev/github_rebase/tensorrt-llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/nvrtcWrapper/x86_64-linux-gnu/libtensorrt_llm_nvrtc_wrapper.so(+0x31e3470) [0x7f4ca54cd470]
27  /lib64/libpthread.so.0(+0x81ca) [0x7f4c5ca831ca]
28  /lib64/libc.so.6(clone+0x43) [0x7f4c5bdc0e73]
=================================

Fixed in PR #2039.
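
The frames above are printed with mangled C++ symbol names, which makes the call chain hard to read. As a minimal sketch (not part of TensorRT-LLM), the helper below demangles a symbol with `abi::__cxa_demangle` from `<cxxabi.h>`, assuming a GCC or Clang toolchain that follows the Itanium C++ ABI. The function name `demangle` is chosen here for illustration only.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <memory>
#include <string>

// Demangle an Itanium-ABI symbol name (illustrative helper, not a TensorRT-LLM API).
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled)
{
    int status = 0;
    // __cxa_demangle allocates the result with malloc, so it must be released with free.
    std::unique_ptr<char, decltype(&std::free)> out(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), &std::free);
    return (status == 0 && out) ? std::string(out.get()) : std::string(mangled);
}
```

Passing the symbols from frames 15 and 16 to this helper should show that the crash path goes from `TopPSamplingLayer<__half>::setup` into `BufferManager::copy`. The same result is available on the command line by piping the backtrace through `c++filt`.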

@hypdeb

hypdeb commented Jul 29, 2024

Hello @akhoroshev, I will be looking into that issue and the related PR. A priori, I think that you are correct in your analysis and your solution looks good.

Could you please tell me at which values for batch size and maximum batch size the issue appears?

@akhoroshev
Contributor Author

akhoroshev commented Jul 29, 2024

The maximum batch size is 256 (the default value).

The batch size varied because I was running load tests with different batch sizes.


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days."

@github-actions github-actions bot added the stale label Aug 29, 2024
@lfr-0531 lfr-0531 added the bug Something isn't working label Sep 2, 2024
@github-actions github-actions bot removed the stale label Sep 3, 2024

github-actions bot commented Oct 3, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days."
