From 628e2660ed4c5ec10cc91e78cdb44f1d685e3ea5 Mon Sep 17 00:00:00 2001
From: zhuhaozhe
Date: Tue, 6 Jun 2023 13:57:27 +0800
Subject: [PATCH] fix format (#2)

---
 intermediate_source/inductor_debug_cpu.rst | 35 ++++++++++++++++------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/intermediate_source/inductor_debug_cpu.rst b/intermediate_source/inductor_debug_cpu.rst
index d2c13715f4e..5c2d0f46f5a 100644
--- a/intermediate_source/inductor_debug_cpu.rst
+++ b/intermediate_source/inductor_debug_cpu.rst
@@ -302,11 +302,13 @@ Note that there exists a debugging tool provided by PyTorch, called `Minifier
-We can enable kernel profile in inductor by:
+To deep dive into op-level performance, we can use `Pytorch Profiler <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_.
+
+To enable the kernel profile in inductor, we need to set ``enable_kernel_profile``:
+
 .. code-block:: python
+
     from torch._inductor import config
     config.cpp.enable_kernel_profile = True
 
-Following the steps in `Pytorch Profiler`
-we can get the profiling table and trace files.
+Following the steps in `Pytorch Profiler <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_,
+we are able to get the profiling table and trace files.
+
 .. code-block:: python
+
     from torch.profiler import profile, schedule, ProfilerActivity
     my_schedule = schedule(
         skip_first=10,
@@ -388,8 +397,10 @@ we can get the profiling table and trace files.
             p.step()
     print("latency: {} ms".format(1000*(total)/100))
 
-We can get following profile tables for eager model
+We will get the following profile table for the eager model:
+
 .. code-block:: shell
+
     -----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
     -----------------------  ------------  ------------  ------------  ------------  ------------  ------------
@@ -415,8 +426,11 @@ We can get following profile tables for eager model
                 aten::fill_         0.15%     613.000us         0.15%     613.000us      15.718us            39
     -----------------------  ------------  ------------  ------------  ------------  ------------  ------------
     Self CPU time total: 415.949ms
-And for inductor model
+
+And the following profile table for the inductor model:
+
 .. code-block:: shell
+
     -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                        Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
     -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
@@ -443,8 +457,10 @@ And for inductor model
     -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
     Self CPU time total: 474.360ms
 
-We can search the most time consuming `graph_0_cpp_fused__softmax_7` in `output_code.py` to see the generated code:
+We can search for the most time-consuming ``graph_0_cpp_fused__softmax_7`` in ``output_code.py`` to see the generated code:
+
 .. code-block:: python
+
     cpp_fused__softmax_7 = async_compile.cpp('''
     #include <ATen/record_function.h>
     #include "/tmp/torchinductor_root/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
@@ -584,8 +600,9 @@ We can search the most time consuming `graph_0_cpp_fused__softmax_7` in `output_
         }
     }
     ''')
-With the kernel name `cpp_fused__softmax_*` and considering the profile
-results together, we may suspect the generated code for 'softmax' is
+
+Taking the kernel name ``cpp_fused__softmax_*`` and the profile
+results together, we may suspect the generated code for ``softmax`` is
 inefficient.
 
-We encourage you to report an issue with all you findings above.
+We encourage you to report an issue with all your findings above.
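
Reviewer note (not part of the patch): the profiler example above is split across hunk boundaries, so only fragments of the measurement loop are visible in the diff. For convenience, here is a self-contained sketch of the flow the patch documents — enabling ``enable_kernel_profile`` plus a scheduled profiler run. The toy ``model``/``example_input`` workload and the warm-up count are stand-ins, not the tutorial's actual benchmark:

.. code-block:: python

    import time

    import torch
    from torch._inductor import config
    from torch.profiler import ProfilerActivity, profile, schedule

    # Emit per-kernel RECORD_FUNCTION events in Inductor's generated C++ code,
    # so kernels such as graph_0_cpp_fused__softmax_7 appear in the profile.
    config.cpp.enable_kernel_profile = True

    # Stand-in workload (hypothetical); replace with the model under test.
    model = torch.nn.Sequential(
        torch.nn.Linear(128, 128),
        torch.nn.Softmax(dim=-1),
    ).eval()
    example_input = torch.randn(32, 128)
    compiled = torch.compile(model)

    my_schedule = schedule(skip_first=10, wait=5, warmup=5, active=1, repeat=5)

    def trace_handler(p):
        # Print a table sorted by self CPU time and dump one Chrome trace
        # per profiling cycle.
        print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
        p.export_chrome_trace("/tmp/trace_" + str(p.step_num) + ".json")

    total = 0.0
    with torch.no_grad():
        # Warm up first so torch.compile's one-time compilation is not measured.
        for _ in range(10):
            compiled(example_input)
        with profile(activities=[ProfilerActivity.CPU],
                     schedule=my_schedule,
                     on_trace_ready=trace_handler) as p:
            for _ in range(100):
                begin = time.time()
                compiled(example_input)
                total += time.time() - begin
                p.step()
    print("latency: {} ms".format(1000 * total / 100))

Keeping the warm-up runs outside the timed loop is what makes the eager and inductor latency numbers comparable, since compilation cost is paid only once up front.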
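Likewise, to locate the ``output_code.py`` referenced in the last two hunks, one option is the ``TORCH_COMPILE_DEBUG`` debug dump; the grep helper below is illustrative only (the dump's directory layout varies across PyTorch versions):

.. code-block:: python

    # Run the repro as: TORCH_COMPILE_DEBUG=1 python repro.py
    # Inductor then dumps debug artifacts, including the generated
    # output_code.py, under ./torch_compile_debug/.
    import glob

    # Illustrative helper: find which dumped output_code.py contains the
    # hot kernel reported by the profiler.
    for path in glob.glob("torch_compile_debug/**/output_code.py", recursive=True):
        with open(path) as f:
            if "cpp_fused__softmax_7" in f.read():
                print("generated softmax kernel is in", path)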