
[BUG]: SID example crashes at Triton inference stage (24.03.01 runtime image) #1639

Closed
2 tasks done
pdmack opened this issue Apr 18, 2024 · 0 comments · Fixed by #1640
pdmack commented Apr 18, 2024

Version

24.03.01

Which installation method(s) does this occur on?

Docker, Kubernetes

Describe the bug.

A manual test based on a variant of examples/nlp_si_detection/README.md reveals a core dump in the SID pipeline. This is confirmed against both Triton 23.10 and 24.03. In fact, the request is never properly formed by the tritonclient library.

The validation test ./scripts/validation/sid/val-sid-all.sh succeeds; however, that test uses a CSV input file rather than jsonlines.

Minimum reproducible example

morpheus --log_level=DEBUG run \
  --num_threads=3 --edge_buffer_size=4 --use_cpp=True \
  --pipeline_batch_size=8196 --model_max_batch_size=32 \
  pipeline-nlp --model_seq_length=256 \
  from-file --filename=/common/data/pcap_dump.jsonlines \
  monitor --description 'FromFile Rate' --smoothing=0.001 \
  deserialize \
  preprocess --vocab_hash_file=data/bert-base-uncased-hash.txt --truncation=True --do_lower_case=True --add_special_tokens=False \
  monitor --description='Preprocessing rate' \
  inf-triton --force_convert_inputs=True --model_name=sid-minibert-onnx --server_url=ai-engine:8000 \
  monitor --description='Inference rate' --smoothing=0.001 --unit inf \
  add-class \
  serialize --exclude '^ts_' \
  to-file --filename=/common/data/output/sid-minibert-onnx-output.jsonlines --overwrite
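
For readers reproducing this through the Python API instead of the CLI, a rough equivalent of the pipeline above is sketched below. This is illustrative only: the stage module paths and keyword arguments follow the public Morpheus examples and may differ slightly from the 24.03.01 release, and the SID class-label setup is omitted.

```python
# Sketch of a Python-API equivalent of the CLI pipeline above (assumption:
# stage module paths and keyword arguments as in the Morpheus NLP SI example;
# verify against the installed release before running).
from morpheus.config import Config, CppConfig, PipelineModes
from morpheus.pipeline import LinearPipeline
from morpheus.stages.general.monitor_stage import MonitorStage
from morpheus.stages.inference.triton_inference_stage import TritonInferenceStage
from morpheus.stages.input.file_source_stage import FileSourceStage
from morpheus.stages.output.write_to_file_stage import WriteToFileStage
from morpheus.stages.postprocess.add_classifications_stage import AddClassificationsStage
from morpheus.stages.postprocess.serialize_stage import SerializeStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
from morpheus.stages.preprocess.preprocess_nlp_stage import PreprocessNLPStage

CppConfig.set_should_use_cpp(True)  # --use_cpp=True

config = Config()
config.mode = PipelineModes.NLP
config.num_threads = 3              # --num_threads=3 (the value that triggers the crash)
config.edge_buffer_size = 4
config.pipeline_batch_size = 8196
config.model_max_batch_size = 32
config.feature_length = 256         # --model_seq_length=256
# NOTE: config.class_labels must be populated with the SID model's labels
# (the CLI reads these from a labels file); omitted here for brevity.

pipeline = LinearPipeline(config)
pipeline.set_source(FileSourceStage(config, filename="/common/data/pcap_dump.jsonlines"))
pipeline.add_stage(MonitorStage(config, description="FromFile Rate", smoothing=0.001))
pipeline.add_stage(DeserializeStage(config))
pipeline.add_stage(PreprocessNLPStage(config,
                                      vocab_hash_file="data/bert-base-uncased-hash.txt",
                                      truncation=True,
                                      do_lower_case=True,
                                      add_special_tokens=False))
pipeline.add_stage(MonitorStage(config, description="Preprocessing rate"))
pipeline.add_stage(TritonInferenceStage(config,
                                        model_name="sid-minibert-onnx",
                                        server_url="ai-engine:8000",
                                        force_convert_inputs=True))
pipeline.add_stage(MonitorStage(config, description="Inference rate", smoothing=0.001, unit="inf"))
pipeline.add_stage(AddClassificationsStage(config))
pipeline.add_stage(SerializeStage(config, exclude=["^ts_"]))
pipeline.add_stage(WriteToFileStage(config,
                                    filename="/common/data/output/sid-minibert-onnx-output.jsonlines",
                                    overwrite=True))
pipeline.run()
```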

Relevant log output


====Building Segment Complete!====
FromFile Rate[Complete]: 93085 messages [00:00, 125077.06 messaFailed to update context stat: Timer not set correctly. Send time from 1713457558087234741 to 0.ocessing rate: 24588 messages [00:00, 14472.27 messages/s]
E20240418 16:25:58.087323 501 triton_inference.cpp:74] Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(469)
*** Aborted at 1713457558 (unix time) try "date -d @1713457558" if you are using GNU date ***
W20240418 16:25:58.090701 501 inference_client_stage.cpp:255] Exception while processing message for InferenceClientStage, attempting retry.
Failed to update context stat: Timer not set correctly. Send time from 1713457558091008494 to 0.
E20240418 16:25:58.091076 502 triton_inference.cpp:74] Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(469)
W20240418 16:25:58.093281 502 inference_client_stage.cpp:255] Exception while processing message for InferenceClientStage, attempting retry.
E20240418 16:25:58.093786 501 triton_inference.cpp:74] Triton Error while executing 'm_client.async_infer( [this, handle](triton::client::InferResult* result) { m_result.reset(result); handle(); }, m_options, m_inputs, m_outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(113)
E20240418 16:25:58.093787 502 triton_inference.cpp:74] Triton Error while executing 'm_client.async_infer( [this, handle](triton::client::InferResult* result) { m_result.reset(result); handle(); }, m_options, m_inputs, m_outputs)'. Error: failed to parse the request JSON buffer: The document is empty. at 0
../morpheus/_lib/src/stages/triton_inference.cpp(113)
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 485 (TID 0x7fbbd57fe640) from PID 0; stack trace: ***
@ 0x7fbd0f094197 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7fbd11c79520 (unknown)
@ 0x7fbd0dc73cc0 Curl_checkheaders
@ 0x7fbd0dc39824 Curl_http_host
@ 0x7fbd0dc3acbb Curl_http
@ 0x7fbd0dc56cdf multi_runsingle
@ 0x7fbd0dc57dc6 curl_multi_perform
@ 0x7fbd0dc28a5c curl_easy_perform
@ 0x7fbcb0a7974c triton::client::InferenceServerHttpClient::Infer()
@ 0x7fbcb09b13c3 morpheus::HttpTritonClient::async_infer()
@ 0x7fbcb09b3a42 (anonymous namespace)::TritonInferOperation::await_suspend()
@ 0x7fbcb09b69fd _ZN8morpheus28TritonInferenceClientSession5inferEPZNS0_5inferEOSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_12TensorObjectESt4lessIS7_ESaISt4pairIKS7_S8_EEEE166_ZN8morpheus28TritonInferenceClientSession5inferEOSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_12TensorObjectESt4lessIS7_ESaISt4pairIKS7_S8_EEE.frame.actor
@ 0x7fbcb0957c02 ZZN8pybind1112cpp_function10initializeIZN3mrc5pymrc16AsyncioScheduler6resumeENS3_14PyObjectHolderENSt7__n486116coroutine_handleIvEEEUlvE_vJEJEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESN
@ 0x7fbcb07bc743 pybind11::cpp_function::dispatcher()
@ 0x5576332e85a6 cfunction_call
@ 0x5576332e1a6b _PyObject_MakeTpCall.localalias
@ 0x5576332a1d90 context_run
@ 0x5576332e02a3 cfunction_vectorcall_FASTCALL_KEYWORDS
@ 0x5576332de205 _PyEval_EvalFrameDefault
@ 0x5576332e8a2c _PyFunction_Vectorcall
@ 0x5576332d8c5c _PyEval_EvalFrameDefault
@ 0x5576332e8a2c _PyFunction_Vectorcall
@ 0x5576332d8c5c _PyEval_EvalFrameDefault
@ 0x5576332e8a2c _PyFunction_Vectorcall
@ 0x5576332d8c5c _PyEval_EvalFrameDefault
@ 0x5576332f46d8 method_vectorcall
@ 0x7fbcb07fa286 pybind11::detail::simple_collector<>::call()
@ 0x7fbcb095fe0c mrc::pymrc::AsyncioRunnable<>::run()
@ 0x7fbcb07ab440 mrc::runnable::RunnableWithContext<>::main()
@ 0x7fbcf8ddf13e _ZNSt17_Function_handlerIFvvEZN3mrc8runnable6Runner7enqueueESt10shared_ptrINS2_8IEnginesEEOSt6vectorIS4_INS2_7ContextEESaIS9_EEEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fbcf8d028f5 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNK3mrc6system15ThreadResources11make_threadIN5boost6fibers13packaged_taskIFvvEEEEENS4_6ThreadENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS3_6CpuSetEOT_EUlvE_EEEEE6_M_runEv
@ 0x7fbd0f753e95 execute_native_thread_routine
Segmentation fault (core dumped)

Full env printout


[Paste the results of print_env.sh here, it will be hidden by default]

Other/Misc.

Reducing the thread count to 1 (i.e., passing --num_threads=1 instead of --num_threads=3 in the command above) prevents the crash.
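
For completeness, the same workaround expressed against the hypothetical Python sketch earlier in this report (illustrative only):

```python
# Workaround, not a fix: run the pipeline single-threaded so only one Triton
# inference request is in flight at a time.
config.num_threads = 1
```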

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@pdmack pdmack added the bug Something isn't working label Apr 18, 2024
@dagardner-nv dagardner-nv self-assigned this Apr 18, 2024
@dagardner-nv dagardner-nv moved this from Todo to In Progress in Morpheus Boards Apr 18, 2024
dagardner-nv added a commit to dagardner-nv/Morpheus that referenced this issue Apr 18, 2024
@jarmak-nv jarmak-nv moved this from In Progress to Review - Ready for Review in Morpheus Boards Apr 18, 2024
@jarmak-nv jarmak-nv moved this from Review - Ready for Review to In Progress in Morpheus Boards Apr 18, 2024
dagardner-nv added a commit to dagardner-nv/Morpheus that referenced this issue Apr 18, 2024
@jarmak-nv jarmak-nv moved this from In Progress to Review - Ready for Review in Morpheus Boards Apr 18, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 19, 2024
* Ensure that `pe_count` & `engines_per_pe` are both set to `1` for the C++ impl of the `TritonInferenceStage` (a sketch of this pattern follows this quoted description)
* Remove hard-coded `--num_threads=1` from validation scripts
* Disable hammah validation script until #1641 can be resolved
* Back-port of #1636

Closes #1639

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
  - David Gardner (https://github.com/dagardner-nv)
  - Eli Fajardo (https://github.com/efajardo-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #1640
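
For context on the first bullet, the sketch below illustrates the general MRC launch-options pattern the fix applies, but in Python rather than C++. It assumes the `mrc` Python bindings (`make_node`, `launch_options`) as used by Morpheus stages; the actual #1640 change is in the C++ `TritonInferenceStage`, so treat this purely as an illustration of pinning a node to a single progress engine.

```python
# Illustrative sketch: pin a node to a single progress engine via MRC launch
# options, mirroring what the fix does for the C++ TritonInferenceStage.
# Assumes the mrc Python bindings; exact signatures may vary between releases.
import mrc
import mrc.core.operators as ops


def segment_init(builder: mrc.Builder):
    source = builder.make_source("source", iter(range(8)))

    # Stand-in for the inference node; the real stage wraps the Triton client.
    node = builder.make_node("fake_inference", ops.map(lambda x: x * 2))

    # The essence of the fix: a single progress engine, so concurrent requests
    # cannot race inside the shared Triton HTTP client.
    node.launch_options.pe_count = 1
    node.launch_options.engines_per_pe = 1

    sink = builder.make_sink("sink", print, None, None)

    builder.make_edge(source, node)
    builder.make_edge(node, sink)


pipeline = mrc.Pipeline()
pipeline.make_segment("seg", segment_init)

executor = mrc.Executor(mrc.Options())
executor.register_pipeline(pipeline)
executor.start()
executor.join()
```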
@github-project-automation github-project-automation bot moved this from Review - Ready for Review to Done in Morpheus Boards Apr 19, 2024