[Profiler] Allow user to flush L2 cache in `time_evaluator` function for profiling CUDA kernels #13726

yzh119 · 2023-01-08T12:45:09Z

Motivation

Currently, our default profiler (time_evaluator) does not flush the L2 cache between executions. This can lead to misleading time measurements, because input data from the previous run may still reside in the L2 cache and reduce the data-fetching time of the next run. Both Triton and nvbench account for this effect and therefore report more accurate measurements.

Solution

time_evaluator has an argument f_preproc through which the user can specify a preprocessing function to run before each execution of the kernel being evaluated. Currently, TVM supports cache_flush_cpu_non_first_arg, which flushes the CPU cache, but the equivalent functionality for GPUs is missing.

This PR borrows the design of nvbench's l2flush struct wholesale and allows the user to specify "l2_cache_flush_cuda" as a preprocessing function that flushes the NVIDIA GPU's L2 cache. l2_cache_flush_cuda is not the default value, so the behavior of existing programs is not affected.
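As a rough usage sketch of the option described above: the helper name `cold_cache_evaluator` and the default function name `"main"` are hypothetical and chosen only for illustration, and `rt_mod` is assumed to be an already built TVM runtime module with `dev` a CUDA device. Only the `f_preproc="l2_cache_flush_cuda"` string comes from this PR.

```python
def cold_cache_evaluator(rt_mod, dev, func_name="main", repeat=10):
    """Return a time evaluator that requests an L2 flush via f_preproc.

    Assumptions: rt_mod exposes time_evaluator() like tvm.runtime.Module,
    dev is a CUDA device, and func_name ("main" is a placeholder) names
    the function to profile.
    """
    return rt_mod.time_evaluator(
        func_name,
        dev,
        number=1,        # one kernel run per measurement
        repeat=repeat,   # number of independent measurements
        f_preproc="l2_cache_flush_cuda",  # preprocessing function added by this PR
    )
```

With `number=1`, each measurement contains exactly one kernel run, so the flush precedes every run regardless of whether the hook fires per repeat or per run. Calling the returned evaluator with the kernel's arguments should then yield cold-cache timings, assuming the CUDA runtime is enabled in the TVM build.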

Note that this PR also changes where `f_preproc` is triggered: previously `f_preproc` was triggered once per repeat, but that doesn't seem correct to me, because most users specify `repeat=1` and `f_preproc` needs to be triggered once per run.
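To make the difference between the two trigger points concrete, here is a simplified pure-Python model of a timing loop. It is not TVM's implementation; it only assumes that the evaluator takes `repeat` measurements of `number` kernel runs each, and it counts how often the preprocessing hook fires under each placement. The function name `count_preproc_calls` is hypothetical.

```python
def count_preproc_calls(number, repeat, per_run):
    """Count hook invocations in a simplified repeat/number timing loop."""
    calls = 0

    def f_preproc():
        # Stand-in for a cache flush; here it only counts invocations.
        nonlocal calls
        calls += 1

    def kernel():
        # Stand-in for the kernel being profiled.
        pass

    for _ in range(repeat):
        if not per_run:
            f_preproc()          # old placement: once per repeat
        for _ in range(number):
            if per_run:
                f_preproc()      # new placement: once per kernel run
            kernel()
    return calls


print(count_preproc_calls(number=10, repeat=1, per_run=False))  # 1
print(count_preproc_calls(number=10, repeat=1, per_run=True))   # 10
print(count_preproc_calls(number=1, repeat=3, per_run=False))   # 3
```

In this model, with `repeat=1` and `number=10` the per-repeat placement flushes only once, so nine of the ten runs see a warm cache. With `number=1` the two placements fire the same number of times.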

cc @masahi @tkonolige @junrushao @spectrometerHBH @tqchen

tvm-bot · 2023-01-08T12:45:11Z

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

cc @echuraev, @Icemist, @tkonolige (see #10317 for details)

Generated by tvm-bot

src/runtime/profiling.cc

tests/python/unittest/test_evaluator_flush_l2_cache.py

src/runtime/cuda/l2_cache_flush.cc

Icemist

LGTM, except for a couple of minor non-functional issues.

src/runtime/cuda/l2_cache_flush.cc

tkonolige

Thanks @yzh119! This is a great addition to benchmarking. Maybe we should consider flushing the cache by default instead of requiring the user to specify a preprocessor. Maybe something like a cold-start mode.

There's one change I need you to make: move this code to third party, as it is taken almost directly from nvbench.

src/runtime/cuda/l2_cache_flush.cc

flush_l2

177efd9

github-actions bot requested review from junrushao, masahi, spectrometerHBH, tkonolige and tqchen January 8, 2023 12:46

Icemist reviewed Jan 8, 2023

src/runtime/profiling.cc (outdated, resolved)

yzh119 added 5 commits January 8, 2023 05:05

not necessarily sm_86

43580b5

fix lint and test

dd32136

fix

f1237c4

revert profiling

530fabf

number=1

d650695

echuraev reviewed Jan 9, 2023

tests/python/unittest/test_evaluator_flush_l2_cache.py (outdated, resolved)

src/runtime/cuda/l2_cache_flush.cc (outdated, resolved)

use parametrize

d56c8b8

Icemist approved these changes Jan 9, 2023

src/runtime/cuda/l2_cache_flush.cc (outdated, resolved)

src/runtime/cuda/l2_cache_flush.cc (outdated, resolved)

yzh119 added 2 commits January 9, 2023 06:41

use (void**)

486288c

use reinterpret_cast for lint

5fd7717

tkonolige requested changes Jan 9, 2023

src/runtime/cuda/l2_cache_flush.cc (resolved)

yzh119 added 3 commits January 10, 2023 01:14

refactor and add license

fcb0f3f

empty line for lint

284b125

header order

71eb46d

tqchen approved these changes Jan 10, 2023

tkonolige approved these changes Jan 10, 2023

tkonolige merged commit 92da138 into apache:main Jan 10, 2023

ysh329 mentioned this pull request Apr 17, 2023

[Release] v0.12.0 Release Candidate Notes #14645

Closed

yzh119 mentioned this pull request Jul 13, 2023

[Runtime] Flush L2 cache in time eval #15305

Merged
