[BUG]: TrainAEStage fails with a Segmentation fault #1641

Closed
2 tasks done
Tracked by #1715
dagardner-nv opened this issue Apr 18, 2024 · 1 comment · Fixed by #1903
Labels
bug Something isn't working

Comments

@dagardner-nv
Contributor

Version

24.03

Which installation method(s) does this occur on?

Source

Describe the bug.

The hammah validation script fails with a segmentation fault, even though the equivalent unit test passes.

Minimum reproducible example

./scripts/validation/hammah/val-hammah-all.sh

Relevant log output


====Building Segment Complete!====
Inference Rate: 0 inf [00:00, ? inf/s]
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x4) received by PID 1751190 (TID 0x7f6207fff6c0) from PID 4; stack trace: ***
@ 0x7f6335d6e197 google::(anonymous namespace)::FailureSignalHandler()
@ 0x7f6336b88050 (unknown)
@ 0x7f632b18da1e boost::fibers::wait_queue::notify_all()
@ 0x7f632b18b3c3 boost::fibers::condition_variable_any::notify_all()
@ 0x7f631ecd9431 ZN5boost6fibers6detail11task_objectIZN3mrc4core14FiberTaskQueue7enqueueIZNS3_7segment15SegmentInstanceC4ESt10shared_ptrIKNS7_17SegmentDefinitionEEtRNS3_8pipeline17PipelineResourcesEmEUlvE_JEEENS0_6futureINSt9result_ofIFT_DpT0_EE4typeEEEONS3_13FiberMetaDataEOSJ_DpOSK_EUlvE_SaINS0_13packaged_taskIFvvEEEEvJEE3runEv
@ 0x7f631ecea9c6 boost::fibers::worker_context<>::run()
@ 0x7f631ece86dc boost::context::detail::fiber_entry<>()
@ 0x7f632bffc11f make_fcontext

Full env printout


[Paste the results of print_env.sh here, it will be hidden by default]

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@dagardner-nv
Contributor Author

The same bug affects the pipelines documented in examples/digital_fingerprinting/starter/README.md; the problem appears to be in the TrainAEStage.

rapids-bot bot pushed a commit that referenced this issue Apr 19, 2024
* Ensure that `pe_count` and `engines_per_pe` are both set to `1` for the C++ implementation of the `TritonInferenceStage`
* Remove hard-coded `--num_threads=1` from validation scripts
* Disable hammah validation script until #1641 can be resolved
* Back-port of #1636

Closes #1639

Authors:
  - David Gardner (https://github.com/dagardner-nv)
  - Eli Fajardo (https://github.com/efajardo-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #1640
@dagardner-nv dagardner-nv changed the title [BUG]: hammah validation script failing with a Segmentation fault [BUG]: TrainAEStage fails with a Segmentation fault Apr 22, 2024
@dagardner-nv dagardner-nv mentioned this issue Jun 18, 2024
2 tasks
@morpheus-bot-test morpheus-bot-test bot moved this from Todo to Review - Ready for Review in Morpheus Boards Sep 20, 2024
@github-project-automation github-project-automation bot moved this from Review - Ready for Review to Done in Morpheus Boards Oct 24, 2024