-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GPU] Fix in-order queue synchronization issue related to OCL/OneDNN impls interaction with CPU impls #17976
[GPU] Fix in-order queue synchronization issue related to OCL/OneDNN impls interaction with CPU impls #17976
Conversation
78e171f
to
97bd3cd
Compare
// Enqueue marker with empty events wait list (which will trigger wait for all previously enqueued tasks) and | ||
// return it as oneDNN primitive's event as it is a single option for proper synchronization | ||
if (instance.is_output_event()) | ||
event = stream.enqueue_marker({}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it to ensure that all onednn kernels are done before CPU layer execution, not for network output? If so, could you describe that on comment? I think it is unclear why it is necessary..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in most cases we need it to ensure that onednn kernel is finished before CPU layer execution, but also event with marker will be created if onednn primitive will be network's output (it may be used in loop primitive for networks synchronization, and maybe in some other cases I'm not aware of...)
And I think we can try to use it for get_memory() or reset_network() methods for in_order queue - in such a case we will avoid clFinish() call at all which theoretically may improve performance a bit (if I'm not mistaken, clFinish() call flushes caches, and maybe has some other side-effects)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@isanghao, I have updated comment and mentioned cases when this event is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm basically I agree with removing clfinish as much as possible.
However I have a curiosity about "flush cache" you mentioned, becuase
previously I experienced cache inconsistency b/w CPU and dGPU on host memory even though I did clFinish.
…impls interaction with CPU impls
97bd3cd
to
9c331b9
Compare
Performance looks good: |
…impls interaction with CPU impls (openvinotoolkit#17976)
Details:
ocl_stream::enqueue_marker()
method with OpenCL specification (marker with empty dependencies vector waits for all previous tasks in the queue)network::execute_primitive()
method. Previously we stored events only for OOO queue and profiling mode, despite the fact that we need events for synchronization between GPU and CPU impls in in-order queueprimitive_inst::execute()
method for CPU implementations and optimized out implementation if it has CPU users (previously it was done only for OOO queue)onednn_primitive::execute()
method if the next operation require synchronizationTickets: