-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MIGraphX EP] Fix CopyTensorAsync and add guards for stream sync CopyTensors #16787
[MIGraphX EP] Fix CopyTensorAsync and add guards for stream sync CopyTensors #16787
Conversation
…Tensors() Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for adding the following - remove excess hipStreamSyncronize to nullstream on CopyTensor calls - Add proper call for stream synchronized CopyTensorAsync for DeviceToHost case Without this change subsequent CopyTensorAsync() calls will fail for cards that don't use pinned memory thus causing hipMemcpy() calls to occur before certain kernel operations occur.
ping @cloudhan @PeixuanZuo . This is related to the issue I found earlier. #16774 |
/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed |
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline |
Azure Pipelines successfully started running 8 pipeline(s). |
1 similar comment
Azure Pipelines successfully started running 8 pipeline(s). |
This is already handled in the EP as end of run performs: OnRunEnd()->hipStreamQuery()->hipStreamSyncronize() as well as Sync()->hipStreamSync() Also after every hipMemCpy() we perform a hipStreamSyncronize(stream)
let me know if you need anything further for this or there are any additional concerns. |
/azp run Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed |
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,ONNX Runtime Web CI Pipeline |
Azure Pipelines successfully started running 8 pipeline(s). |
1 similar comment
Azure Pipelines successfully started running 8 pipeline(s). |
@cloudhan, looks like this test isn't building for some reason
|
…Tensors (microsoft#16787) Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for adding the following - remove excess hipStreamSyncronize to nullstream on CopyTensor calls - Add proper call for stream synchronized CopyTensorAsync for DeviceToHost case Without this change subsequent CopyTensorAsync() calls will fail for cards that don't use pinned memory thus causing hipMemcpy() calls to occur before certain kernel operations occur. ![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b) becomes ![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d) --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com>
…Tensors (microsoft#16787) Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for adding the following - remove excess hipStreamSyncronize to nullstream on CopyTensor calls - Add proper call for stream synchronized CopyTensorAsync for DeviceToHost case Without this change subsequent CopyTensorAsync() calls will fail for cards that don't use pinned memory thus causing hipMemcpy() calls to occur before certain kernel operations occur. ![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b) becomes ![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d) --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com>
…Tensors (microsoft#16787) (#13) Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for adding the following - remove excess hipStreamSyncronize to nullstream on CopyTensor calls - Add proper call for stream synchronized CopyTensorAsync for DeviceToHost case Without this change subsequent CopyTensorAsync() calls will fail for cards that don't use pinned memory thus causing hipMemcpy() calls to occur before certain kernel operations occur. ![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b) becomes ![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d) --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com>
…Tensors (microsoft#16787) Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for adding the following - remove excess hipStreamSyncronize to nullstream on CopyTensor calls - Add proper call for stream synchronized CopyTensorAsync for DeviceToHost case Without this change subsequent CopyTensorAsync() calls will fail for cards that don't use pinned memory thus causing hipMemcpy() calls to occur before certain kernel operations occur. ![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b) becomes ![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d) --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com>
…Tensors (#16787) Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for adding the following - remove excess hipStreamSyncronize to nullstream on CopyTensor calls - Add proper call for stream synchronized CopyTensorAsync for DeviceToHost case Without this change subsequent CopyTensorAsync() calls will fail for cards that don't use pinned memory thus causing hipMemcpy() calls to occur before certain kernel operations occur. ![image](https://github.com/microsoft/onnxruntime/assets/107195283/4915c18a-fb2d-40c9-a50e-a7c6613c324b) becomes ![image](https://github.com/microsoft/onnxruntime/assets/107195283/f661acf4-e2af-4c9a-b26a-30fca339cf1d) --------- Co-authored-by: Ted Themistokleous <tthemist@amd.com>
Add compile guards to gate functionality based on MIGRAPHX_STREAM_SYNC for adding the following
Without this change subsequent CopyTensorAsync() calls will fail for cards that don't use pinned memory thus causing hipMemcpy() calls to occur before certain kernel operations occur.
Description
Remove excess syncronization when stream sync is enabled but also block the DeviceTohost CopyTensorAsync() to the desired GPU stream.
Motivation and Context
Without this change we fail to properly wait for a kernel to compute and synchronize correctly if memory isn't specifically pinned for the task. This was observed when doing another run with test_parity_gelu and test_parity_layernorm tests.
becomes