[Runtime] Flush L2 cache in time eval #15305
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment. Generated by tvm-bot
Also CC: @yzh119
I suppose we already have a and you can reuse the
I don't want it to be CUDA-only.
Yes, the major concern is that the L2 cache size is device-specific, and later architectures may have an L2 cache larger than 256 MB.
To make it general, how about we instead introduce an `l2_cache_flush_bytes` parameter, which defaults to 0, and use it to indicate the size of the array to allocate? This way it would generalize across GPUs as long as we set this argument correctly.
The implementation per se is not specific to L2 either. We could call it `cache_flush_bytes`.
`cache_flush_bytes` sounds good to me.
LGTM
@tvm-bot rerun
This PR adds an optional cache-flush feature to `time_evaluator`. It is implemented by allocating two large empty NDArrays on the device so that the L2 cache is flushed. This gives a more accurate measurement of a runtime function's performance.
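The idea can be illustrated with a minimal host-side sketch in plain Python. It is not TVM's implementation: `flush_cache` and `timed_runs` are hypothetical names, the buffers are ordinary `bytearray` objects rather than device NDArrays, and flushing once before each repeat is an assumption made for illustration. Only the `cache_flush_bytes` parameter name, its default of 0, and the use of two large scratch allocations come from the discussion above.

```python
import time


def flush_cache(cache_flush_bytes):
    """Allocate two scratch buffers and copy one into the other.

    Reading `src` and writing `dst` pulls fresh data through the cache,
    which tends to evict whatever the measured function left behind.
    Returns the total number of bytes touched (0 when flushing is off).
    """
    if cache_flush_bytes <= 0:
        return 0
    src = bytearray(cache_flush_bytes)
    dst = bytearray(cache_flush_bytes)
    dst[:] = src
    return len(src) + len(dst)


def timed_runs(func, number=5, repeat=3, cache_flush_bytes=0):
    """Return the mean per-call time in seconds for each repeat."""
    results = []
    for _ in range(repeat):
        # Assumption: flush once before each repeat, outside the timed region.
        flush_cache(cache_flush_bytes)
        start = time.perf_counter()
        for _ in range(number):
            func()
        results.append((time.perf_counter() - start) / number)
    return results


if __name__ == "__main__":
    work = lambda: sum(range(10_000))
    print(timed_runs(work, cache_flush_bytes=0))
    print(timed_runs(work, cache_flush_bytes=8 * 1024 * 1024))
```

Keeping the flush outside the timed region means the measurement reflects only the function itself, starting from a cold cache when `cache_flush_bytes` is large enough for the device.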
Follow-up to #15305: this PR adds an API to query the device L2 cache size in bytes. The currently supported devices are CUDA, OpenCL, and ROCm. Note that OpenCL's API does not return the accurate device L2 cache size. I could not find a Vulkan API that returns the L2 texture cache size, but the `vkCmdPipelineBarrier` call will flush the L2 texture cache automatically (https://zeux.io/2020/02/27/writing-an-efficient-vulkan-renderer/), so we return 0 by default.
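One way a caller might combine the queried size with `cache_flush_bytes` is sketched below. The helper `pick_cache_flush_bytes` and its `fallback_bytes` argument are hypothetical and not part of TVM; the sketch only assumes that the query returns a size in bytes and that 0 means the size is unavailable or no explicit flush is needed, as in the Vulkan case described above.

```python
def pick_cache_flush_bytes(reported_l2_bytes, fallback_bytes=0):
    """Choose a flush size from a queried L2 cache size (hypothetical helper).

    reported_l2_bytes: value returned by the device query; 0 means the
        backend reports no size (for example, the Vulkan default above).
    fallback_bytes: size to use when nothing is reported; 0 disables flushing.
    """
    if reported_l2_bytes < 0 or fallback_bytes < 0:
        raise ValueError("sizes must be non-negative")
    if reported_l2_bytes == 0:
        return fallback_bytes
    return reported_l2_bytes


if __name__ == "__main__":
    print(pick_cache_flush_bytes(40 * 1024 * 1024))  # size reported by the device
    print(pick_cache_flush_bytes(0))                 # nothing reported, flushing off
```

Because OpenCL's reported value may be inaccurate, a caller could also pass an explicit size instead of relying on the query.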