Check precision only if on CUDA and not inductor. #6407

Merged 2 commits on Feb 1, 2024

ysiraichi
Collaborator

This PR skips the precision flag check for experiments where: (i) the device is not CUDA; or (ii) the dynamo backend is inductor.

(i) The flag being checked targets CUDA only: DEFAULT_CUDA_<test>_PRECISION

(ii) The PyTorch/benchmark scripts already take care of the model conversion for inductor.
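
The condition in the PR title ("check precision only if on CUDA and not inductor") can be sketched roughly as below. This is a minimal illustration, not the code from this PR: the function name `should_check_precision` and its plain-string parameters are hypothetical.

```python
from typing import Optional


def should_check_precision(device: str, dynamo_backend: Optional[str]) -> bool:
    """Return True only when the CUDA precision flag is worth consulting.

    Hypothetical helper: the names and the string-based arguments are
    assumptions made for illustration, not the benchmark runner's API.
    """
    # The DEFAULT_CUDA_<test>_PRECISION flags only apply to CUDA runs.
    on_cuda = device == "cuda"
    # For inductor, the PyTorch/benchmark scripts already convert the model.
    is_inductor = dynamo_backend == "inductor"
    return on_cuda and not is_inductor


# A CUDA run with a non-inductor backend (or no dynamo at all) is checked.
print(should_check_precision("cuda", "openxla"))
print(should_check_precision("cuda", None))
# Non-CUDA devices and inductor runs are skipped.
print(should_check_precision("cpu", "openxla"))
print(should_check_precision("cuda", "inductor"))
```

Folding both conditions into one predicate keeps the skip logic in a single place, so the caller only has to decide whether to read the flag at all.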

cc @miladm

Co-authored-by: Emilio Cota <ecg@google.com>
@golechwierowicz
Collaborator

golechwierowicz commented Jan 31, 2024

Can you paste the before and after kernels, just as before? hf_Bert alone will suffice.

Collaborator

@cota cota left a comment

Thanks!

Collaborator

@frgossen frgossen left a comment

Thanks!

@ysiraichi
Collaborator Author

hf_Bert (before)
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  Torch-Compiled Region        50.54%        7.888s        51.34%        8.013s        8.013s     356.242ms        96.76%     356.242ms     356.242ms             1
                                   triton_gemm_dot_3753         0.00%       0.000us         0.00%       0.000us       0.000us     187.306ms        50.88%     187.306ms       4.683ms            40
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      54.512ms        14.81%      54.512ms       2.942us         18531
                                        redzone_checker         0.00%       0.000us         0.00%       0.000us       0.000us      41.314ms        11.22%      41.314ms      45.300us           912
xla::gpu::buffer_comparator::(anonymous namespace)::...         0.00%       0.000us         0.00%       0.000us       0.000us      24.438ms         6.64%      24.438ms     259.979us            94
                                    triton_gemm_dot_966         0.00%       0.000us         0.00%       0.000us       0.000us      19.491ms         5.29%      19.491ms     442.977us            44
                                     triton_gemm_dot_75         0.00%       0.000us         0.00%       0.000us       0.000us      14.641ms         3.98%      14.641ms     332.750us            44
                                   triton_gemm_dot_3666         0.00%       0.000us         0.00%       0.000us       0.000us       5.925ms         1.61%       5.925ms     123.438us            48
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       4.573ms         1.24%       4.573ms      99.413us            46
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us       3.097ms         0.84%       3.097ms       1.548ms             2
                                                 fusion         0.00%       0.000us         0.00%       0.000us       0.000us       1.839ms         0.50%       1.839ms      61.300us            30
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.541ms         0.42%       1.541ms       1.541ms             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.101ms         0.30%       1.101ms       2.002us           550
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.053ms         0.29%       1.053ms       3.464us           304
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     944.000us         0.26%     944.000us      78.667us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     924.000us         0.25%     924.000us      66.000us            14
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     830.000us         0.23%     830.000us      59.286us            14
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     753.000us         0.20%     753.000us      62.750us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     737.000us         0.20%     737.000us      61.417us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         0.20%     732.000us      61.000us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     519.000us         0.14%     519.000us      32.438us            16
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     365.000us         0.10%     365.000us     365.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     273.000us         0.07%     273.000us      11.375us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     264.000us         0.07%     264.000us      12.000us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     169.000us         0.05%     169.000us      14.083us            12
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     163.000us         0.04%     163.000us      27.167us             6
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         0.04%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         0.04%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         0.03%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us     100.000us         0.03%     100.000us       8.333us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.01%      40.000us      40.000us             1
                                 wrapped_concatenate_36         0.00%       0.000us         0.00%       0.000us       0.000us      21.000us         0.01%      21.000us      10.500us             2
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.01%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.00%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.00%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.00%      12.000us      12.000us             1
                         Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         0.00%      10.000us       2.000us             5
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.00%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.00%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.00%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup         0.00%       2.000us         0.00%       2.000us       0.095us       0.000us         0.00%       0.000us       0.000us            21
         _compile.<locals>.compile_inner (dynamo_timed)        15.00%        2.342s        48.66%        7.595s        7.595s       0.000us         0.00%       0.000us       0.000us             1
                                  cudaStreamIsCapturing         0.02%       2.493ms         0.02%       2.493ms       1.095us       0.000us         0.00%       0.000us       0.000us          2276
                                            aten::clone         1.60%     249.862ms         2.36%     368.144ms     108.757us       0.000us         0.00%       0.000us       0.000us          3385
                                    aten::empty_strided         0.08%      12.428ms         0.08%      12.428ms       5.271us       0.000us         0.00%       0.000us       0.000us          2358
                                            aten::copy_         0.06%       8.656ms         0.06%       8.656ms       4.396us       0.000us         0.00%       0.000us       0.000us          1969
                                           aten::detach         3.33%     519.254ms         7.65%        1.194s     105.702us       0.000us         0.00%       0.000us       0.000us         11295
                                                 detach         0.56%      87.415ms         2.55%     397.311ms     385.739us       0.000us         0.00%       0.000us       0.000us          1030
                                       aten::empty_like         0.01%       1.573ms         0.02%       3.746ms       6.549us       0.000us         0.00%       0.000us       0.000us           572
                                            aten::empty         0.12%      19.331ms         0.12%      19.331ms       8.251us       0.000us         0.00%       0.000us       0.000us          2343
                                             aten::ones         0.04%       5.507ms         0.06%       9.450ms     945.000us       0.000us         0.00%       0.000us       0.000us            10
                                             aten::full         0.00%     275.000us         0.00%     415.000us     138.333us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::slice         0.06%       9.959ms         0.16%      25.495ms     455.268us       0.000us         0.00%       0.000us       0.000us            56
                                       aten::as_strided         0.44%      68.934ms         0.44%      69.128ms       8.615us       0.000us         0.00%       0.000us       0.000us          8024
                                           aten::expand         1.14%     178.716ms         2.40%     374.290ms     253.412us       0.000us         0.00%       0.000us       0.000us          1477
                                prims::broadcast_in_dim         0.26%      39.870ms         0.33%      50.963ms      45.220us       0.000us         0.00%       0.000us       0.000us          1127
                                        aten::unsqueeze         0.04%       5.746ms         0.08%      12.030ms     353.824us       0.000us         0.00%       0.000us       0.000us            34
                                               aten::to         0.23%      35.994ms         0.24%      36.690ms     107.912us       0.000us         0.00%       0.000us       0.000us           340
                                             aten::rsub         0.02%       2.617ms         0.04%       6.280ms     628.000us       0.000us         0.00%       0.000us       0.000us            10
                                              aten::sub         0.35%      55.061ms         0.80%     124.932ms     533.897us       0.000us         0.00%       0.000us       0.000us           234
                                             prims::sub         0.06%       9.778ms         0.08%      13.227ms     113.051us       0.000us         0.00%       0.000us       0.000us           117
                                   aten::empty_permuted         0.12%      19.246ms         0.24%      37.816ms      29.000us       0.000us         0.00%       0.000us       0.000us          1304
                                              aten::mul         0.45%      70.203ms         0.96%     149.206ms     339.877us       0.000us         0.00%       0.000us       0.000us           439
                                             prims::mul         0.15%      22.833ms         0.21%      32.678ms      92.311us       0.000us         0.00%       0.000us       0.000us           354
                                             aten::set_         0.21%      32.542ms         0.28%      43.132ms      34.589us       0.000us         0.00%       0.000us       0.000us          1247
                                        aten::embedding         0.05%       8.421ms         0.15%      23.059ms       1.098ms       0.000us         0.00%       0.000us       0.000us            21
                                            aten::index         0.03%       5.027ms         0.07%      10.653ms     591.833us       0.000us         0.00%       0.000us       0.000us            18
                                        aten::new_empty         0.08%      12.371ms         0.10%      16.329ms      52.337us       0.000us         0.00%       0.000us       0.000us           312
                                              aten::add         2.08%     324.497ms         4.15%     648.217ms     466.008us       0.000us         0.00%       0.000us       0.000us          1391
                                             prims::add         0.25%      38.593ms         0.35%      54.194ms     102.253us       0.000us         0.00%       0.000us       0.000us           530
                                             aten::add_         0.01%       1.954ms         0.03%       4.395ms     879.000us       0.000us         0.00%       0.000us       0.000us             5
                                       aten::layer_norm         0.02%       2.630ms         6.55%        1.023s      13.113ms       0.000us         0.00%       0.000us       0.000us            78
                                aten::native_layer_norm         0.39%      60.521ms         6.54%        1.020s      13.079ms       0.000us         0.00%       0.000us       0.000us            78
                                          aten::reshape         0.68%     105.980ms         4.30%     671.151ms       1.512ms       0.000us         0.00%       0.000us       0.000us           444
                                             aten::view         6.43%        1.004s        14.04%        2.192s     299.365us       0.000us         0.00%       0.000us       0.000us          7321
                                       prims::split_dim         0.28%      43.394ms         0.36%      56.511ms      38.919us       0.000us         0.00%       0.000us       0.000us          1452
                                   prims::collapse_view         0.16%      25.425ms         0.22%      34.070ms      39.616us       0.000us         0.00%       0.000us       0.000us           860
                                aten::native_batch_norm         0.04%       6.421ms         2.90%     453.340ms       5.812ms       0.000us         0.00%       0.000us       0.000us            78
                         aten::_native_batch_norm_legit         0.52%      81.055ms         3.69%     575.675ms       3.129ms       0.000us         0.00%       0.000us       0.000us           184
                                         aten::var_mean         0.31%      48.263ms         0.75%     117.448ms     752.872us       0.000us         0.00%       0.000us       0.000us           156
                                             prims::var         0.03%       5.251ms         0.04%       6.440ms      82.564us       0.000us         0.00%       0.000us       0.000us            78
                                             prims::sum         0.04%       6.329ms         0.05%       7.516ms      65.930us       0.000us         0.00%       0.000us       0.000us           114
                                             prims::div         0.06%       9.464ms         0.09%      13.942ms      92.947us       0.000us         0.00%       0.000us       0.000us           150
                                            aten::rsqrt         0.17%      26.370ms         0.30%      46.419ms     297.558us       0.000us         0.00%       0.000us       0.000us           156
                                           prims::rsqrt         0.03%       4.317ms         0.04%       6.742ms      86.436us       0.000us         0.00%       0.000us       0.000us            78
                                          aten::squeeze         0.31%      48.772ms         0.48%      74.621ms     239.170us       0.000us         0.00%       0.000us       0.000us           312
                                         prims::squeeze         0.04%       6.643ms         0.06%       9.083ms      29.112us       0.000us         0.00%       0.000us       0.000us           312
                                          aten::addcmul         0.43%      67.248ms         1.56%     244.102ms     938.854us       0.000us         0.00%       0.000us       0.000us           260
                                          aten::dropout         0.02%       2.783ms         0.64%     100.105ms     901.847us       0.000us         0.00%       0.000us       0.000us           111
                                           prims::clone         0.10%      14.844ms         0.12%      18.323ms      71.855us       0.000us         0.00%       0.000us       0.000us           255
                                           aten::linear         0.18%      28.865ms         8.68%        1.355s       6.104ms       0.000us         0.00%       0.000us       0.000us           222
                                                aten::t         1.02%     159.408ms         2.59%     404.746ms     321.738us       0.000us         0.00%       0.000us       0.000us          1258
                                        aten::transpose         0.25%      38.934ms         0.70%     109.841ms     137.991us       0.000us         0.00%       0.000us       0.000us           796
                                          aten::permute         0.92%     143.712ms         1.99%     310.967ms     249.572us       0.000us         0.00%       0.000us       0.000us          1246
                                       prims::transpose         0.15%      22.836ms         0.19%      29.694ms      44.319us       0.000us         0.00%       0.000us       0.000us           670
                                           aten::matmul         1.57%     244.399ms         9.90%        1.545s       5.256ms       0.000us         0.00%       0.000us       0.000us           294
                                               aten::mm         1.01%     157.519ms         1.46%     228.149ms     308.309us       0.000us         0.00%       0.000us       0.000us           740
                                     aten::_unsafe_view         0.41%      63.585ms         0.88%     137.275ms     544.742us       0.000us         0.00%       0.000us       0.000us           252
                                              aten::bmm         0.31%      49.013ms         0.46%      72.170ms     300.708us       0.000us         0.00%       0.000us       0.000us           240
                                              aten::div         0.30%      47.564ms         0.60%      93.477ms     486.859us       0.000us         0.00%       0.000us       0.000us           192
                                          aten::softmax         0.01%       1.439ms         0.92%     143.274ms       3.980ms       0.000us         0.00%       0.000us       0.000us            36
                                         aten::_softmax         0.15%      23.427ms         1.15%     180.235ms       2.146ms       0.000us         0.00%       0.000us       0.000us            84
                                             aten::amax         0.08%      11.944ms         0.16%      24.875ms     345.486us       0.000us         0.00%       0.000us       0.000us            72
                                            prims::amax         0.02%       2.999ms         0.02%       3.560ms      98.889us       0.000us         0.00%       0.000us       0.000us            36
                                              aten::exp         0.08%      12.658ms         0.14%      22.462ms     311.972us       0.000us         0.00%       0.000us       0.000us            72
                                             prims::exp         0.01%       2.172ms         0.02%       3.327ms      92.417us       0.000us         0.00%       0.000us       0.000us            36
                                              aten::sum         0.08%      12.520ms         0.16%      24.568ms     341.222us       0.000us         0.00%       0.000us       0.000us            72
                                       aten::contiguous         0.02%       3.176ms         0.21%      32.314ms     897.611us       0.000us         0.00%       0.000us       0.000us            36
                                             aten::gelu         0.20%      31.214ms         0.82%     127.678ms     982.138us       0.000us         0.00%       0.000us       0.000us           130
                                              aten::erf         0.02%       2.606ms         0.03%       5.017ms     128.641us       0.000us         0.00%       0.000us       0.000us            39
                                             prims::erf         0.01%       1.548ms         0.02%       2.411ms      61.821us       0.000us         0.00%       0.000us       0.000us            39
          OutputGraph.call_user_compiler (dynamo_timed)         0.04%       6.746ms        27.98%        4.367s        4.367s       0.000us         0.00%       0.000us       0.000us             1
          create_aot_dispatcher_function (dynamo_timed)         4.79%     747.907ms        27.93%        4.360s        4.360s       0.000us         0.00%       0.000us       0.000us             1
                                       aten::lift_fresh         0.03%       4.576ms         0.03%       5.411ms     225.458us       0.000us         0.00%       0.000us       0.000us            24
                                          aten::detach_         0.00%     302.000us         0.00%     304.000us      25.333us       0.000us         0.00%       0.000us       0.000us            12
                                                detach_         0.00%       2.000us         0.00%       2.000us       0.167us       0.000us         0.00%       0.000us       0.000us            12
                               aten::sym_storage_offset         0.06%       8.666ms         0.06%       8.666ms       8.238us       0.000us         0.00%       0.000us       0.000us          1052
                                        aten::sym_numel         0.07%      11.141ms         0.07%      11.141ms       9.930us       0.000us         0.00%       0.000us       0.000us          1122
                              aten::_propagate_xla_data         0.00%     380.000us         0.00%     380.000us      76.000us       0.000us         0.00%       0.000us       0.000us             5
                                          is_contiguous         0.00%       9.000us         0.00%       9.000us       0.019us       0.000us         0.00%       0.000us       0.000us           472
                                            aten::alias         0.13%      20.840ms         0.79%     123.462ms      72.625us       0.000us         0.00%       0.000us       0.000us          1700
                                         prims::view_of         0.18%      28.165ms         0.27%      42.865ms      25.215us       0.000us         0.00%       0.000us       0.000us          1700
                                            aten::fill_         0.00%     262.000us         0.00%     726.000us      80.667us       0.000us         0.00%       0.000us       0.000us             9
                                             aten::fill         0.00%      89.000us         0.00%     270.000us      90.000us       0.000us         0.00%       0.000us       0.000us             3
                                       aten::slice_copy         0.00%     393.000us         0.00%     393.000us      21.833us       0.000us         0.00%       0.000us       0.000us            18
                                      aten::expand_copy         0.01%       2.223ms         0.01%       2.223ms      15.122us       0.000us         0.00%       0.000us       0.000us           147
                                   aten::unsqueeze_copy         0.00%     255.000us         0.00%     255.000us      21.250us       0.000us         0.00%       0.000us       0.000us            12
                                      aten::result_type         0.00%     187.000us         0.00%     187.000us       0.263us       0.000us         0.00%       0.000us       0.000us           711
                                         aten::_to_copy         0.00%     303.000us         0.00%     728.000us      14.857us       0.000us         0.00%       0.000us       0.000us            49
                                             aten::item         0.00%     116.000us         0.00%     124.000us       3.179us       0.000us         0.00%       0.000us       0.000us            39
                              aten::_local_scalar_dense         0.00%      10.000us         0.00%      10.000us       0.256us       0.000us         0.00%       0.000us       0.000us            39
                                    aten::scalar_tensor         0.00%      64.000us         0.00%      64.000us       7.111us       0.000us         0.00%       0.000us       0.000us             9
                                        aten::view_copy         0.05%       7.635ms         0.05%       7.635ms       8.961us       0.000us         0.00%       0.000us       0.000us           852
                                           aten::t_copy         0.01%       2.137ms         0.01%       2.137ms       9.626us       0.000us         0.00%       0.000us       0.000us           222
                                     aten::permute_copy         0.01%       1.930ms         0.01%       1.930ms      13.403us       0.000us         0.00%       0.000us       0.000us           144
                                   aten::transpose_copy         0.00%     663.000us         0.00%     663.000us      18.417us       0.000us         0.00%       0.000us       0.000us            36
                                    cudaPeekAtLastError         0.00%      28.000us         0.00%      28.000us       0.096us       0.000us         0.00%       0.000us       0.000us           291
                                               cudaFree         0.00%      57.000us         0.00%      57.000us      57.000us       0.000us         0.00%       0.000us       0.000us             1
                                             cudaMalloc         0.00%     403.000us         0.00%     403.000us     134.333us       0.000us         0.00%       0.000us       0.000us             3
                                         cuLaunchKernel         0.04%       6.075ms         0.04%       6.075ms       4.480us       0.000us         0.00%       0.000us       0.000us          1356
                                   cudaFuncSetAttribute         0.22%      34.743ms         0.22%      34.743ms      75.528us       0.000us         0.00%       0.000us       0.000us           460
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.00%      78.000us         0.00%      78.000us       2.294us       0.000us         0.00%       0.000us       0.000us            34
                                       cudaLaunchKernel         0.00%     578.000us         0.00%     578.000us       6.283us       0.000us         0.00%       0.000us       0.000us            92
                                    cudaGetFuncBySymbol         0.00%     304.000us         0.00%     304.000us       3.234us       0.000us         0.00%       0.000us       0.000us            94
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.07%      10.684ms         0.07%      10.684ms     593.556us       0.000us         0.00%       0.000us       0.000us            18
                                        cudaMemsetAsync         0.00%      72.000us         0.00%      72.000us       5.143us       0.000us         0.00%       0.000us       0.000us            14
                                  cudaDeviceSynchronize         0.00%      21.000us         0.00%      21.000us      21.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 15.608s
Self CUDA time total: 368.163ms
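
For reading the table above, the `Self CUDA %` column can be reproduced from the totals. The sketch below assumes that each percentage is the row's Self CUDA time divided by `Self CUDA time total`; the helper `to_ms` is a hypothetical name introduced here, and the input values are copied from the table.

```python
def to_ms(value: str) -> float:
    """Convert a profiler duration such as '187.306ms', '944.000us' or '8.013s' to milliseconds."""
    factors = {"us": 1e-3, "ms": 1.0, "s": 1e3}
    # Check the two-letter suffixes first, since 'us' and 'ms' also end in 's'.
    for suffix in ("us", "ms", "s"):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factors[suffix]
    raise ValueError(f"unrecognized duration: {value!r}")


total_cuda_ms = to_ms("368.163ms")  # Self CUDA time total

# Share of total CUDA time for two rows of the table.
triton_share = to_ms("187.306ms") / total_cuda_ms * 100    # triton_gemm_dot_3753
compiled_share = to_ms("356.242ms") / total_cuda_ms * 100  # Torch-Compiled Region

print(f"triton_gemm_dot_3753: {triton_share:.2f}%")
print(f"Torch-Compiled Region: {compiled_share:.2f}%")
```

Both values round to the percentages printed in the table (50.88% and 96.76%).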
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.590ms        30.09%       3.590ms      99.722us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.544ms        12.94%       1.544ms       1.544ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     948.000us         7.95%     948.000us      79.000us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     792.000us         6.64%     792.000us      66.000us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     755.000us         6.33%     755.000us      62.917us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     736.000us         6.17%     736.000us      61.333us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.14%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     711.000us         5.96%     711.000us      59.250us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     389.000us         3.26%     389.000us      32.417us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.05%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     286.000us         2.40%     286.000us      11.917us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     264.000us         2.21%     264.000us      12.000us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     169.000us         1.42%     169.000us      14.083us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     157.000us         1.32%     157.000us      13.083us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     133.000us         1.11%     133.000us      11.083us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     121.000us         1.01%     121.000us      10.083us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.80%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        11.16%     241.000us        11.16%     241.000us     241.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        42.13%     910.000us        42.13%     910.000us     910.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         2.04%      44.000us         2.04%      44.000us       1.833us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        16.85%     364.000us        16.85%     364.000us       5.056us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        22.55%     487.000us        22.55%     487.000us       3.430us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.09%       2.000us         0.09%       2.000us       0.167us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         1.02%      22.000us         1.02%      22.000us       0.611us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         3.24%      70.000us         3.24%      70.000us       5.833us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.93%      20.000us         0.93%      20.000us      20.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.160ms
Self CUDA time total: 11.931ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.591ms        30.14%       3.591ms      99.750us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.540ms        12.92%       1.540ms       1.540ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     946.000us         7.94%     946.000us      78.833us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     797.000us         6.69%     797.000us      66.417us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     752.000us         6.31%     752.000us      62.667us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     734.000us         6.16%     734.000us      61.167us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     733.000us         6.15%     733.000us      61.083us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     712.000us         5.98%     712.000us      59.333us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     388.000us         3.26%     388.000us      32.333us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.05%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     275.000us         2.31%     275.000us      11.458us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     265.000us         2.22%     265.000us      12.045us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         1.31%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     133.000us         1.12%     133.000us      11.083us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     121.000us         1.02%     121.000us      10.083us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.81%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        12.04%     211.000us        12.04%     211.000us     211.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        39.70%     696.000us        39.70%     696.000us     696.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.29%       5.000us         0.29%       5.000us       0.208us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        18.43%     323.000us        18.43%     323.000us       4.486us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        26.18%     459.000us        26.18%     459.000us       3.232us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.06%       1.000us         0.06%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.23%       4.000us         0.23%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.11%      37.000us         2.11%      37.000us       3.083us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.97%      17.000us         0.97%      17.000us      17.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.753ms
Self CUDA time total: 11.915ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.590ms        30.10%       3.590ms      99.722us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.543ms        12.94%       1.543ms       1.543ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     944.000us         7.91%     944.000us      78.667us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     795.000us         6.67%     795.000us      66.250us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     755.000us         6.33%     755.000us      62.917us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     737.000us         6.18%     737.000us      61.417us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     733.000us         6.15%     733.000us      61.083us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     709.000us         5.94%     709.000us      59.083us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     389.000us         3.26%     389.000us      32.417us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.05%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     286.000us         2.40%     286.000us      11.917us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     264.000us         2.21%     264.000us      12.000us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     157.000us         1.32%     157.000us      13.083us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         1.01%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.80%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         0.11%      13.000us      13.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        12.10%     210.000us        12.10%     210.000us     210.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        39.40%     684.000us        39.40%     684.000us     684.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.23%       4.000us         0.23%       4.000us       0.167us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        18.49%     321.000us        18.49%     321.000us       4.458us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        26.50%     460.000us        26.50%     460.000us       3.239us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.06%       1.000us         0.06%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.23%       4.000us         0.23%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.02%      35.000us         2.02%      35.000us       2.917us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.98%      17.000us         0.98%      17.000us      17.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.736ms
Self CUDA time total: 11.927ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.589ms        30.12%       3.589ms      99.694us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.544ms        12.96%       1.544ms       1.544ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     945.000us         7.93%     945.000us      78.750us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     793.000us         6.66%     793.000us      66.083us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     752.000us         6.31%     752.000us      62.667us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     737.000us         6.19%     737.000us      61.417us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     733.000us         6.15%     733.000us      61.083us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     711.000us         5.97%     711.000us      59.250us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     390.000us         3.27%     390.000us      32.500us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.06%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     273.000us         2.29%     273.000us      11.375us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     264.000us         2.22%     264.000us      12.000us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     172.000us         1.44%     172.000us      14.333us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         1.31%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         1.01%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      95.000us         0.80%      95.000us       7.917us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        11.77%     215.000us        11.77%     215.000us     215.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        41.68%     761.000us        41.68%     761.000us     761.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.27%       5.000us         0.27%       5.000us       0.208us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        17.52%     320.000us        17.52%     320.000us       4.444us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        25.52%     466.000us        25.52%     466.000us       3.282us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.05%       1.000us         0.05%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.22%       4.000us         0.22%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         1.97%      36.000us         1.97%      36.000us       3.000us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.99%      18.000us         0.99%      18.000us      18.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.826ms
Self CUDA time total: 11.914ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.589ms        30.11%       3.589ms      99.694us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.542ms        12.94%       1.542ms       1.542ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     945.000us         7.93%     945.000us      78.750us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     790.000us         6.63%     790.000us      65.833us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     753.000us         6.32%     753.000us      62.750us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     737.000us         6.18%     737.000us      61.417us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.14%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     712.000us         5.97%     712.000us      59.333us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     387.000us         3.25%     387.000us      32.250us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.05%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     287.000us         2.41%     287.000us      11.958us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     265.000us         2.22%     265.000us      12.045us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     157.000us         1.32%     157.000us      13.083us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         1.01%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.81%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                               TorchDynamo Cache Lookup        11.46%     214.000us        11.46%     214.000us     214.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        41.67%     778.000us        41.67%     778.000us     778.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         1.23%      23.000us         1.23%      23.000us       0.958us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        17.51%     327.000us        17.51%     327.000us       4.542us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        24.64%     460.000us        24.64%     460.000us       3.239us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.05%       1.000us         0.05%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.54%      10.000us         0.54%      10.000us       0.278us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         1.93%      36.000us         1.93%      36.000us       3.000us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.96%      18.000us         0.96%      18.000us      18.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.867ms
Self CUDA time total: 11.921ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.592ms        30.14%       3.592ms      99.778us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.542ms        12.94%       1.542ms       1.542ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     943.000us         7.91%     943.000us      78.583us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     796.000us         6.68%     796.000us      66.333us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     754.000us         6.33%     754.000us      62.833us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     737.000us         6.18%     737.000us      61.417us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.14%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     712.000us         5.97%     712.000us      59.333us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     387.000us         3.25%     387.000us      32.250us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.05%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     274.000us         2.30%     274.000us      11.417us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     266.000us         2.23%     266.000us      12.091us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     169.000us         1.42%     169.000us      14.083us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         1.31%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     133.000us         1.12%     133.000us      11.083us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     122.000us         1.02%     122.000us      10.167us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.81%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup         9.76%     216.000us         9.76%     216.000us     216.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        33.35%     738.000us        33.35%     738.000us     738.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.54%      12.000us         0.54%      12.000us       0.500us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        21.19%     469.000us        21.19%     469.000us       6.514us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        30.68%     679.000us        30.68%     679.000us       4.782us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.59%      13.000us         0.59%      13.000us       1.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.72%      16.000us         0.72%      16.000us       0.444us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.30%      51.000us         2.30%      51.000us       4.250us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.86%      19.000us         0.86%      19.000us      19.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.213ms
Self CUDA time total: 11.919ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.589ms        30.12%       3.589ms      99.694us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.544ms        12.96%       1.544ms       1.544ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     943.000us         7.91%     943.000us      78.583us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     791.000us         6.64%     791.000us      65.917us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     755.000us         6.34%     755.000us      62.917us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     737.000us         6.18%     737.000us      61.417us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.14%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     709.000us         5.95%     709.000us      59.083us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     389.000us         3.26%     389.000us      32.417us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     363.000us         3.05%     363.000us     363.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     283.000us         2.37%     283.000us      11.792us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     264.000us         2.22%     264.000us      12.000us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     157.000us         1.32%     157.000us      13.083us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         1.01%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.81%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        11.95%     218.000us        11.95%     218.000us     218.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        41.37%     755.000us        41.37%     755.000us     755.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.27%       5.000us         0.27%       5.000us       0.208us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        17.70%     323.000us        17.70%     323.000us       4.486us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        25.48%     465.000us        25.48%     465.000us       3.275us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.05%       1.000us         0.05%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.22%       4.000us         0.22%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         1.97%      36.000us         1.97%      36.000us       3.000us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.99%      18.000us         0.99%      18.000us      18.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.825ms
Self CUDA time total: 11.916ms

@ysiraichi
Collaborator Author

hf_Bert (after)
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                  Torch-Compiled Region        50.81%        7.950s        51.67%        8.084s        8.084s     356.192ms        96.76%     356.192ms     356.192ms             1
                                   triton_gemm_dot_3753         0.00%       0.000us         0.00%       0.000us       0.000us     187.282ms        50.87%     187.282ms       4.682ms            40
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      54.530ms        14.81%      54.530ms       2.943us         18531
                                        redzone_checker         0.00%       0.000us         0.00%       0.000us       0.000us      41.258ms        11.21%      41.258ms      45.239us           912
xla::gpu::buffer_comparator::(anonymous namespace)::...         0.00%       0.000us         0.00%       0.000us       0.000us      24.435ms         6.64%      24.435ms     259.947us            94
                                    triton_gemm_dot_966         0.00%       0.000us         0.00%       0.000us       0.000us      19.489ms         5.29%      19.489ms     442.932us            44
                                     triton_gemm_dot_75         0.00%       0.000us         0.00%       0.000us       0.000us      14.644ms         3.98%      14.644ms     332.818us            44
                                   triton_gemm_dot_3666         0.00%       0.000us         0.00%       0.000us       0.000us       5.933ms         1.61%       5.933ms     123.604us            48
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       4.577ms         1.24%       4.577ms      99.500us            46
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us       3.094ms         0.84%       3.094ms       1.547ms             2
                                                 fusion         0.00%       0.000us         0.00%       0.000us       0.000us       1.839ms         0.50%       1.839ms      61.300us            30
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.543ms         0.42%       1.543ms       1.543ms             1
                       Memcpy DtoH (Device -> Pageable)         0.00%       0.000us         0.00%       0.000us       0.000us       1.101ms         0.30%       1.101ms       2.002us           550
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.061ms         0.29%       1.061ms       3.490us           304
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     947.000us         0.26%     947.000us      78.917us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     934.000us         0.25%     934.000us      66.714us            14
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     821.000us         0.22%     821.000us      58.643us            14
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     752.000us         0.20%     752.000us      62.667us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     734.000us         0.20%     734.000us      61.167us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         0.20%     732.000us      61.000us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     525.000us         0.14%     525.000us      32.812us            16
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         0.10%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     273.000us         0.07%     273.000us      11.375us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     266.000us         0.07%     266.000us      12.091us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         0.05%     168.000us      14.000us            12
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     163.000us         0.04%     163.000us      27.167us             6
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     157.000us         0.04%     157.000us      13.083us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         0.04%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         0.03%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      99.000us         0.03%      99.000us       8.250us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.01%      40.000us      40.000us             1
                                 wrapped_concatenate_36         0.00%       0.000us         0.00%       0.000us       0.000us      21.000us         0.01%      21.000us      10.500us             2
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.01%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.00%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.00%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.00%      12.000us      12.000us             1
                         Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         0.00%      10.000us       2.000us             5
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.00%       8.000us       8.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.00%       7.000us       7.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.00%       7.000us       7.000us             1
                               TorchDynamo Cache Lookup         0.00%       2.000us         0.00%       2.000us       0.095us       0.000us         0.00%       0.000us       0.000us            21
         _compile.<locals>.compile_inner (dynamo_timed)        14.96%        2.341s        48.33%        7.562s        7.562s       0.000us         0.00%       0.000us       0.000us             1
                                  cudaStreamIsCapturing         0.02%       2.422ms         0.02%       2.422ms       1.064us       0.000us         0.00%       0.000us       0.000us          2276
                                            aten::clone         1.63%     254.368ms         2.38%     371.723ms     109.815us       0.000us         0.00%       0.000us       0.000us          3385
                                    aten::empty_strided         0.08%      12.807ms         0.08%      12.807ms       5.431us       0.000us         0.00%       0.000us       0.000us          2358
                                            aten::copy_         0.06%       8.664ms         0.06%       8.664ms       4.400us       0.000us         0.00%       0.000us       0.000us          1969
                                           aten::detach         3.18%     497.288ms         7.31%        1.143s     101.221us       0.000us         0.00%       0.000us       0.000us         11295
                                                 detach         0.56%      86.956ms         2.39%     374.275ms     363.374us       0.000us         0.00%       0.000us       0.000us          1030
                                       aten::empty_like         0.01%       1.542ms         0.02%       3.655ms       6.390us       0.000us         0.00%       0.000us       0.000us           572
                                            aten::empty         0.18%      27.554ms         0.18%      27.554ms      11.760us       0.000us         0.00%       0.000us       0.000us          2343
                                             aten::ones         0.04%       5.487ms         0.06%       9.185ms     918.500us       0.000us         0.00%       0.000us       0.000us            10
                                             aten::full         0.00%     309.000us         0.00%     422.000us     140.667us       0.000us         0.00%       0.000us       0.000us             3
                                            aten::slice         0.06%       9.674ms         0.16%      24.532ms     438.071us       0.000us         0.00%       0.000us       0.000us            56
                                       aten::as_strided         0.43%      67.937ms         0.44%      68.149ms       8.493us       0.000us         0.00%       0.000us       0.000us          8024
                                           aten::expand         1.13%     177.267ms         2.37%     370.755ms     251.019us       0.000us         0.00%       0.000us       0.000us          1477
                                prims::broadcast_in_dim         0.25%      39.313ms         0.32%      50.508ms      44.816us       0.000us         0.00%       0.000us       0.000us          1127
                                        aten::unsqueeze         0.04%       5.661ms         0.08%      11.836ms     348.118us       0.000us         0.00%       0.000us       0.000us            34
                                               aten::to         0.23%      35.384ms         0.23%      36.036ms     105.988us       0.000us         0.00%       0.000us       0.000us           340
                                             aten::rsub         0.02%       2.640ms         0.04%       6.267ms     626.700us       0.000us         0.00%       0.000us       0.000us            10
                                              aten::sub         0.31%      47.741ms         0.75%     117.459ms     501.962us       0.000us         0.00%       0.000us       0.000us           234
                                             prims::sub         0.06%       9.837ms         0.08%      13.254ms     113.282us       0.000us         0.00%       0.000us       0.000us           117
                                   aten::empty_permuted         0.12%      19.047ms         0.29%      46.060ms      35.322us       0.000us         0.00%       0.000us       0.000us          1304
                                              aten::mul         0.45%      69.710ms         0.95%     148.875ms     339.123us       0.000us         0.00%       0.000us       0.000us           439
                                             prims::mul         0.15%      23.021ms         0.21%      32.752ms      92.520us       0.000us         0.00%       0.000us       0.000us           354
                                             aten::set_         0.20%      31.725ms         0.27%      41.809ms      33.528us       0.000us         0.00%       0.000us       0.000us          1247
                                        aten::embedding         0.04%       6.350ms         0.13%      20.751ms     988.143us       0.000us         0.00%       0.000us       0.000us            21
                                            aten::index         0.03%       4.763ms         0.07%      10.454ms     580.778us       0.000us         0.00%       0.000us       0.000us            18
                                        aten::new_empty         0.08%      12.456ms         0.10%      16.383ms      52.510us       0.000us         0.00%       0.000us       0.000us           312
                                              aten::add         2.19%     342.568ms         4.42%     691.435ms     497.078us       0.000us         0.00%       0.000us       0.000us          1391
                                             prims::add         0.25%      38.523ms         0.40%      62.475ms     117.877us       0.000us         0.00%       0.000us       0.000us           530
                                             aten::add_         0.01%       1.901ms         0.03%       4.334ms     866.800us       0.000us         0.00%       0.000us       0.000us             5
                                       aten::layer_norm         0.02%       2.580ms         6.47%        1.013s      12.983ms       0.000us         0.00%       0.000us       0.000us            78
                                aten::native_layer_norm         0.38%      59.306ms         6.46%        1.010s      12.950ms       0.000us         0.00%       0.000us       0.000us            78
                                          aten::reshape         0.67%     104.914ms         4.26%     666.355ms       1.501ms       0.000us         0.00%       0.000us       0.000us           444
                                             aten::view         6.26%     979.758ms        13.65%        2.135s     291.667us       0.000us         0.00%       0.000us       0.000us          7321
                                       prims::split_dim         0.27%      42.516ms         0.36%      55.562ms      38.266us       0.000us         0.00%       0.000us       0.000us          1452
                                   prims::collapse_view         0.16%      25.435ms         0.22%      34.085ms      39.634us       0.000us         0.00%       0.000us       0.000us           860
                                aten::native_batch_norm         0.04%       6.103ms         2.88%     450.074ms       5.770ms       0.000us         0.00%       0.000us       0.000us            78
                         aten::_native_batch_norm_legit         0.52%      80.605ms         3.65%     571.110ms       3.104ms       0.000us         0.00%       0.000us       0.000us           184
                                         aten::var_mean         0.31%      47.794ms         0.75%     116.880ms     749.231us       0.000us         0.00%       0.000us       0.000us           156
                                             prims::var         0.03%       5.113ms         0.04%       6.311ms      80.910us       0.000us         0.00%       0.000us       0.000us            78
                                             prims::sum         0.04%       6.298ms         0.05%       7.437ms      65.237us       0.000us         0.00%       0.000us       0.000us           114
                                             prims::div         0.06%       9.717ms         0.09%      14.120ms      94.133us       0.000us         0.00%       0.000us       0.000us           150
                                            aten::rsqrt         0.17%      26.231ms         0.30%      46.692ms     299.308us       0.000us         0.00%       0.000us       0.000us           156
                                           prims::rsqrt         0.03%       4.470ms         0.04%       6.907ms      88.551us       0.000us         0.00%       0.000us       0.000us            78
                                          aten::squeeze         0.31%      47.834ms         0.46%      72.636ms     232.808us       0.000us         0.00%       0.000us       0.000us           312
                                         prims::squeeze         0.04%       6.353ms         0.06%       8.697ms      27.875us       0.000us         0.00%       0.000us       0.000us           312
                                          aten::addcmul         0.51%      80.072ms         1.63%     255.773ms     983.742us       0.000us         0.00%       0.000us       0.000us           260
                                          aten::dropout         0.02%       2.949ms         0.67%     105.583ms     951.198us       0.000us         0.00%       0.000us       0.000us           111
                                           prims::clone         0.09%      14.726ms         0.12%      18.269ms      71.643us       0.000us         0.00%       0.000us       0.000us           255
                                           aten::linear         0.18%      28.175ms         8.60%        1.346s       6.063ms       0.000us         0.00%       0.000us       0.000us           222
                                                aten::t         1.02%     158.922ms         2.57%     402.641ms     320.064us       0.000us         0.00%       0.000us       0.000us          1258
                                        aten::transpose         0.25%      38.915ms         0.70%     109.166ms     137.143us       0.000us         0.00%       0.000us       0.000us           796
                                          aten::permute         0.97%     151.011ms         2.04%     319.612ms     256.510us       0.000us         0.00%       0.000us       0.000us          1246
                                       prims::transpose         0.14%      22.593ms         0.19%      29.424ms      43.916us       0.000us         0.00%       0.000us       0.000us           670
                                           aten::matmul         1.54%     240.867ms         9.60%        1.502s       5.110ms       0.000us         0.00%       0.000us       0.000us           294
                                               aten::mm         0.93%     146.287ms         1.38%     216.326ms     292.332us       0.000us         0.00%       0.000us       0.000us           740
                                     aten::_unsafe_view         0.40%      62.658ms         0.87%     135.706ms     538.516us       0.000us         0.00%       0.000us       0.000us           252
                                              aten::bmm         0.31%      48.369ms         0.45%      71.110ms     296.292us       0.000us         0.00%       0.000us       0.000us           240
                                              aten::div         0.30%      47.149ms         0.59%      92.832ms     483.500us       0.000us         0.00%       0.000us       0.000us           192
                                          aten::softmax         0.01%       1.431ms         0.88%     136.930ms       3.804ms       0.000us         0.00%       0.000us       0.000us            36
                                         aten::_softmax         0.16%      25.005ms         1.11%     173.204ms       2.062ms       0.000us         0.00%       0.000us       0.000us            84
                                             aten::amax         0.08%      11.933ms         0.16%      24.721ms     343.347us       0.000us         0.00%       0.000us       0.000us            72
                                            prims::amax         0.02%       2.940ms         0.02%       3.497ms      97.139us       0.000us         0.00%       0.000us       0.000us            36
                                              aten::exp         0.08%      12.557ms         0.14%      22.482ms     312.250us       0.000us         0.00%       0.000us       0.000us            72
                                             prims::exp         0.01%       2.158ms         0.02%       3.361ms      93.361us       0.000us         0.00%       0.000us       0.000us            36
                                              aten::sum         0.08%      12.467ms         0.16%      24.306ms     337.583us       0.000us         0.00%       0.000us       0.000us            72
                                       aten::contiguous         0.02%       3.069ms         0.21%      32.125ms     892.361us       0.000us         0.00%       0.000us       0.000us            36
                                             aten::gelu         0.20%      30.812ms         0.81%     127.404ms     980.031us       0.000us         0.00%       0.000us       0.000us           130
                                              aten::erf         0.02%       2.646ms         0.03%       5.130ms     131.538us       0.000us         0.00%       0.000us       0.000us            39
                                             prims::erf         0.01%       1.567ms         0.02%       2.484ms      63.692us       0.000us         0.00%       0.000us       0.000us            39
          OutputGraph.call_user_compiler (dynamo_timed)         0.04%       6.698ms        27.73%        4.339s        4.339s       0.000us         0.00%       0.000us       0.000us             1
          create_aot_dispatcher_function (dynamo_timed)         4.87%     761.555ms        27.69%        4.332s        4.332s       0.000us         0.00%       0.000us       0.000us             1
                                       aten::lift_fresh         0.03%       4.480ms         0.03%       5.296ms     220.667us       0.000us         0.00%       0.000us       0.000us            24
                                          aten::detach_         0.00%     278.000us         0.00%     280.000us      23.333us       0.000us         0.00%       0.000us       0.000us            12
                                                detach_         0.00%       2.000us         0.00%       2.000us       0.167us       0.000us         0.00%       0.000us       0.000us            12
                               aten::sym_storage_offset         0.05%       8.374ms         0.05%       8.374ms       7.960us       0.000us         0.00%       0.000us       0.000us          1052
                                        aten::sym_numel         0.07%      10.617ms         0.07%      10.617ms       9.463us       0.000us         0.00%       0.000us       0.000us          1122
                              aten::_propagate_xla_data         0.00%     365.000us         0.00%     365.000us      73.000us       0.000us         0.00%       0.000us       0.000us             5
                                          is_contiguous         0.00%       3.000us         0.00%       3.000us       0.006us       0.000us         0.00%       0.000us       0.000us           472
                                            aten::alias         0.13%      20.106ms         0.78%     121.706ms      71.592us       0.000us         0.00%       0.000us       0.000us          1700
                                         prims::view_of         0.17%      27.013ms         0.26%      41.097ms      24.175us       0.000us         0.00%       0.000us       0.000us          1700
                                            aten::fill_         0.00%     299.000us         0.00%     654.000us      72.667us       0.000us         0.00%       0.000us       0.000us             9
                                             aten::fill         0.00%      23.000us         0.00%     236.000us      78.667us       0.000us         0.00%       0.000us       0.000us             3
                                       aten::slice_copy         0.00%     374.000us         0.00%     374.000us      20.778us       0.000us         0.00%       0.000us       0.000us            18
                                      aten::expand_copy         0.01%       2.104ms         0.01%       2.104ms      14.313us       0.000us         0.00%       0.000us       0.000us           147
                                   aten::unsqueeze_copy         0.00%     203.000us         0.00%     203.000us      16.917us       0.000us         0.00%       0.000us       0.000us            12
                                      aten::result_type         0.00%     146.000us         0.00%     146.000us       0.205us       0.000us         0.00%       0.000us       0.000us           711
                                         aten::_to_copy         0.00%     335.000us         0.01%     798.000us      16.286us       0.000us         0.00%       0.000us       0.000us            49
                                             aten::item         0.00%      94.000us         0.00%     102.000us       2.615us       0.000us         0.00%       0.000us       0.000us            39
                              aten::_local_scalar_dense         0.00%      14.000us         0.00%      14.000us       0.359us       0.000us         0.00%       0.000us       0.000us            39
                                    aten::scalar_tensor         0.00%      38.000us         0.00%      38.000us       4.222us       0.000us         0.00%       0.000us       0.000us             9
                                        aten::view_copy         0.05%       7.431ms         0.05%       7.431ms       8.722us       0.000us         0.00%       0.000us       0.000us           852
                                           aten::t_copy         0.01%       1.889ms         0.01%       1.889ms       8.509us       0.000us         0.00%       0.000us       0.000us           222
                                     aten::permute_copy         0.01%       1.868ms         0.01%       1.868ms      12.972us       0.000us         0.00%       0.000us       0.000us           144
                                   aten::transpose_copy         0.00%     620.000us         0.00%     620.000us      17.222us       0.000us         0.00%       0.000us       0.000us            36
                                    cudaPeekAtLastError         0.00%      27.000us         0.00%      27.000us       0.093us       0.000us         0.00%       0.000us       0.000us           291
                                               cudaFree         0.00%      36.000us         0.00%      36.000us      36.000us       0.000us         0.00%       0.000us       0.000us             1
                                             cudaMalloc         0.00%     309.000us         0.00%     309.000us     103.000us       0.000us         0.00%       0.000us       0.000us             3
                                         cuLaunchKernel         0.04%       6.171ms         0.04%       6.171ms       4.551us       0.000us         0.00%       0.000us       0.000us          1356
                                   cudaFuncSetAttribute         0.22%      34.475ms         0.22%      34.475ms      74.946us       0.000us         0.00%       0.000us       0.000us           460
                                       cudaLaunchKernel         0.00%     536.000us         0.00%     536.000us       5.826us       0.000us         0.00%       0.000us       0.000us            92
                                    cudaGetFuncBySymbol         0.00%     290.000us         0.00%     290.000us       3.085us       0.000us         0.00%       0.000us       0.000us            94
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.00%     120.000us         0.00%     120.000us       3.529us       0.000us         0.00%       0.000us       0.000us            34
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.07%      11.125ms         0.07%      11.125ms     618.056us       0.000us         0.00%       0.000us       0.000us            18
                                        cudaMemsetAsync         0.00%     101.000us         0.00%     101.000us       7.214us       0.000us         0.00%       0.000us       0.000us            14
                                  cudaDeviceSynchronize         0.00%      19.000us         0.00%      19.000us      19.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 15.647s
Self CUDA time total: 368.125ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.589ms        30.15%       3.589ms      99.694us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.544ms        12.97%       1.544ms       1.544ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     945.000us         7.94%     945.000us      78.750us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     793.000us         6.66%     793.000us      66.083us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     755.000us         6.34%     755.000us      62.917us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     735.000us         6.17%     735.000us      61.250us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.15%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     698.000us         5.86%     698.000us      58.167us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     395.000us         3.32%     395.000us      32.917us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     363.000us         3.05%     363.000us     363.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     273.000us         2.29%     273.000us      11.375us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     265.000us         2.23%     265.000us      12.045us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         1.31%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     122.000us         1.02%     122.000us      10.167us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      95.000us         0.80%      95.000us       7.917us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        11.58%     245.000us        11.58%     245.000us     245.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        42.67%     903.000us        42.67%     903.000us     903.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         1.80%      38.000us         1.80%      38.000us       1.583us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        16.16%     342.000us        16.16%     342.000us       4.750us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        22.92%     485.000us        22.92%     485.000us       3.415us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.05%       1.000us         0.05%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         1.04%      22.000us         1.04%      22.000us       0.611us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.98%      63.000us         2.98%      63.000us       5.250us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.80%      17.000us         0.80%      17.000us      17.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 2.116ms
Self CUDA time total: 11.904ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.591ms        30.12%       3.591ms      99.750us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.546ms        12.97%       1.546ms       1.546ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     945.000us         7.93%     945.000us      78.750us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     799.000us         6.70%     799.000us      66.583us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     754.000us         6.32%     754.000us      62.833us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     733.000us         6.15%     733.000us      61.083us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.14%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     701.000us         5.88%     701.000us      58.417us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     396.000us         3.32%     396.000us      33.000us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.05%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     277.000us         2.32%     277.000us      11.542us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     265.000us         2.22%     265.000us      12.045us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     169.000us         1.42%     169.000us      14.083us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         1.31%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         1.01%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.81%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         0.11%      13.000us      13.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        12.44%     220.000us        12.44%     220.000us     220.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        39.25%     694.000us        39.25%     694.000us     694.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.17%       3.000us         0.17%       3.000us       0.125us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        17.65%     312.000us        17.65%     312.000us       4.333us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        27.21%     481.000us        27.21%     481.000us       3.387us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.06%       1.000us         0.06%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.17%       3.000us         0.17%       3.000us       0.083us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.15%      38.000us         2.15%      38.000us       3.167us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.90%      16.000us         0.90%      16.000us      16.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.768ms
Self CUDA time total: 11.921ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.589ms        30.15%       3.589ms      99.694us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.541ms        12.95%       1.541ms       1.541ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     943.000us         7.92%     943.000us      78.583us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     792.000us         6.65%     792.000us      66.000us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     760.000us         6.38%     760.000us      63.333us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     735.000us         6.17%     735.000us      61.250us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.15%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     697.000us         5.86%     697.000us      58.083us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     395.000us         3.32%     395.000us      32.917us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.06%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     273.000us         2.29%     273.000us      11.375us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     265.000us         2.23%     265.000us      12.045us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         1.31%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     121.000us         1.02%     121.000us      10.083us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      97.000us         0.81%      97.000us       8.083us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        13.98%     251.000us        13.98%     251.000us     251.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        37.83%     679.000us        37.83%     679.000us     679.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.22%       4.000us         0.22%       4.000us       0.167us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        18.44%     331.000us        18.44%     331.000us       4.597us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        26.41%     474.000us        26.41%     474.000us       3.338us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.06%       1.000us         0.06%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.22%       4.000us         0.22%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.06%      37.000us         2.06%      37.000us       3.083us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.78%      14.000us         0.78%      14.000us      14.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.795ms
Self CUDA time total: 11.904ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.590ms        30.16%       3.590ms      99.722us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.543ms        12.96%       1.543ms       1.543ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     943.000us         7.92%     943.000us      78.583us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     795.000us         6.68%     795.000us      66.250us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     754.000us         6.33%     754.000us      62.833us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.15%     732.000us      61.000us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.15%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     699.000us         5.87%     699.000us      58.250us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     396.000us         3.33%     396.000us      33.000us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.06%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     275.000us         2.31%     275.000us      11.458us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     265.000us         2.23%     265.000us      12.045us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     157.000us         1.32%     157.000us      13.083us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         1.01%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.81%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        12.59%     216.000us        12.59%     216.000us     216.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        37.53%     644.000us        37.53%     644.000us     644.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.23%       4.000us         0.23%       4.000us       0.167us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        18.88%     324.000us        18.88%     324.000us       4.500us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        27.56%     473.000us        27.56%     473.000us       3.331us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.06%       1.000us         0.06%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.17%       3.000us         0.17%       3.000us       0.083us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.16%      37.000us         2.16%      37.000us       3.083us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.82%      14.000us         0.82%      14.000us      14.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.716ms
Self CUDA time total: 11.905ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.588ms        30.14%       3.588ms      99.667us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.542ms        12.95%       1.542ms       1.542ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     945.000us         7.94%     945.000us      78.750us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     793.000us         6.66%     793.000us      66.083us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     755.000us         6.34%     755.000us      62.917us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     734.000us         6.17%     734.000us      61.167us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.15%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     698.000us         5.86%     698.000us      58.167us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     395.000us         3.32%     395.000us      32.917us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.06%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     274.000us         2.30%     274.000us      11.417us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     265.000us         2.23%     265.000us      12.045us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     157.000us         1.32%     157.000us      13.083us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     133.000us         1.12%     133.000us      11.083us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         1.01%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.81%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         0.11%      13.000us      13.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        11.69%     217.000us        11.69%     217.000us     217.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        42.30%     785.000us        42.30%     785.000us     785.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.32%       6.000us         0.32%       6.000us       0.250us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        18.00%     334.000us        18.00%     334.000us       4.639us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        24.52%     455.000us        24.52%     455.000us       3.204us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.05%       1.000us         0.05%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.22%       4.000us         0.22%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.05%      38.000us         2.05%      38.000us       3.167us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.86%      16.000us         0.86%      16.000us      16.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.856ms
Self CUDA time total: 11.904ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.589ms        30.14%       3.589ms      99.694us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.542ms        12.95%       1.542ms       1.542ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     944.000us         7.93%     944.000us      78.667us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     798.000us         6.70%     798.000us      66.500us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     755.000us         6.34%     755.000us      62.917us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.15%     732.000us      61.000us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.15%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     700.000us         5.88%     700.000us      58.333us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     396.000us         3.33%     396.000us      33.000us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     364.000us         3.06%     364.000us     364.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     274.000us         2.30%     274.000us      11.417us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     265.000us         2.23%     265.000us      12.045us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     169.000us         1.42%     169.000us      14.083us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         1.31%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     120.000us         1.01%     120.000us      10.000us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      96.000us         0.81%      96.000us       8.000us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         0.11%      13.000us      13.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        12.71%     219.000us        12.71%     219.000us     219.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        38.02%     655.000us        38.02%     655.000us     655.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         1.51%      26.000us         1.51%      26.000us       1.083us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        18.11%     312.000us        18.11%     312.000us       4.333us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        26.35%     454.000us        26.35%     454.000us       3.197us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.06%       1.000us         0.06%       1.000us       0.083us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.23%       4.000us         0.23%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.15%      37.000us         2.15%      37.000us       3.083us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.87%      15.000us         0.87%      15.000us      15.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.723ms
Self CUDA time total: 11.909ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_...         0.00%       0.000us         0.00%       0.000us       0.000us       3.588ms        30.13%       3.588ms      99.667us            36
                                      triton_gemm_dot_1         0.00%       0.000us         0.00%       0.000us       0.000us       1.542ms        12.95%       1.542ms       1.542ms             1
                                             fusion_218         0.00%       0.000us         0.00%       0.000us       0.000us     947.000us         7.95%     947.000us      78.917us            12
ampere_fp16_s16816gemm_fp16_64x128_sliced1x2_ldg8_f2...         0.00%       0.000us         0.00%       0.000us       0.000us     791.000us         6.64%     791.000us      65.917us            12
                                             fusion_228         0.00%       0.000us         0.00%       0.000us       0.000us     757.000us         6.36%     757.000us      63.083us            12
                                             fusion_211         0.00%       0.000us         0.00%       0.000us       0.000us     735.000us         6.17%     735.000us      61.250us            12
                                             fusion_229         0.00%       0.000us         0.00%       0.000us       0.000us     732.000us         6.15%     732.000us      61.000us            12
void cutlass::Kernel<cutlass_80_tensorop_f16_s16816g...         0.00%       0.000us         0.00%       0.000us       0.000us     701.000us         5.89%     701.000us      58.417us            12
sm80_xmma_gemm_f16f16_f16f32_f32_tn_n_tilesize128x12...         0.00%       0.000us         0.00%       0.000us       0.000us     395.000us         3.32%     395.000us      32.917us            12
                                             fusion_271         0.00%       0.000us         0.00%       0.000us       0.000us     363.000us         3.05%     363.000us     363.000us             1
                                             fusion_213         0.00%       0.000us         0.00%       0.000us       0.000us     274.000us         2.30%     274.000us      11.417us            24
                                             fusion_207         0.00%       0.000us         0.00%       0.000us       0.000us     266.000us         2.23%     266.000us      12.091us            22
                                             fusion_221         0.00%       0.000us         0.00%       0.000us       0.000us     168.000us         1.41%     168.000us      14.000us            12
                                             fusion_222         0.00%       0.000us         0.00%       0.000us       0.000us     156.000us         1.31%     156.000us      13.000us            12
                                             fusion_217         0.00%       0.000us         0.00%       0.000us       0.000us     132.000us         1.11%     132.000us      11.000us            12
                                             fusion_216         0.00%       0.000us         0.00%       0.000us       0.000us     121.000us         1.02%     121.000us      10.083us            12
                                  wrapped_concatenate_0         0.00%       0.000us         0.00%       0.000us       0.000us      95.000us         0.80%      95.000us       7.917us            12
                                      triton_gemm_dot_0         0.00%       0.000us         0.00%       0.000us       0.000us      40.000us         0.34%      40.000us      40.000us             1
                                        Memset (Device)         0.00%       0.000us         0.00%       0.000us       0.000us      24.000us         0.20%      24.000us       2.000us            12
                                             fusion_265         0.00%       0.000us         0.00%       0.000us       0.000us      19.000us         0.16%      19.000us      19.000us             1
                                             fusion_268         0.00%       0.000us         0.00%       0.000us       0.000us      16.000us         0.13%      16.000us      16.000us             1
                                             fusion_264         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_212         0.00%       0.000us         0.00%       0.000us       0.000us      12.000us         0.10%      12.000us      12.000us             1
                                             fusion_267         0.00%       0.000us         0.00%       0.000us       0.000us       8.000us         0.07%       8.000us       8.000us             1
                                             fusion_270         0.00%       0.000us         0.00%       0.000us       0.000us       7.000us         0.06%       7.000us       7.000us             1
                                             fusion_269         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.05%       6.000us       6.000us             1
                               TorchDynamo Cache Lookup        12.72%     222.000us        12.72%     222.000us     222.000us       0.000us         0.00%       0.000us       0.000us             1
                                  Torch-Compiled Region        39.66%     692.000us        39.66%     692.000us     692.000us       0.000us         0.00%       0.000us       0.000us             1
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.29%       5.000us         0.29%       5.000us       0.208us       0.000us         0.00%       0.000us       0.000us            24
                                       cudaLaunchKernel        17.36%     303.000us        17.36%     303.000us       4.208us       0.000us         0.00%       0.000us       0.000us            72
                                         cuLaunchKernel        26.42%     461.000us        26.42%     461.000us       3.246us       0.000us         0.00%       0.000us       0.000us           142
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.34%       6.000us         0.34%       6.000us       0.500us       0.000us         0.00%       0.000us       0.000us            12
                                   cudaFuncSetAttribute         0.23%       4.000us         0.23%       4.000us       0.111us       0.000us         0.00%       0.000us       0.000us            36
                                        cudaMemsetAsync         2.12%      37.000us         2.12%      37.000us       3.083us       0.000us         0.00%       0.000us       0.000us            12
                                  cudaDeviceSynchronize         0.86%      15.000us         0.86%      15.000us      15.000us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.745ms
Self CUDA time total: 11.907ms
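
For reviewers comparing the two tables above, the footer totals can be diffed mechanically. The following is a minimal sketch, not part of the benchmark scripts: it assumes the footer lines are copied verbatim from the profiler output, and the helper names `parse_totals` and `relative_change` are hypothetical.

```python
import re

# Footer lines copied verbatim from the two profiler tables above.
FIRST = """Self CPU time total: 1.723ms
Self CUDA time total: 11.909ms"""
SECOND = """Self CPU time total: 1.745ms
Self CUDA time total: 11.907ms"""

# Matches e.g. "Self CUDA time total: 11.909ms".
_TOTAL_RE = re.compile(r"Self (CPU|CUDA) time total: ([\d.]+)(us|ms|s)")
# Conversion factors from each unit suffix to milliseconds.
_TO_MS = {"us": 1e-3, "ms": 1.0, "s": 1e3}


def parse_totals(text):
    """Return {'CPU': ms, 'CUDA': ms} parsed from profiler footer lines."""
    return {
        m.group(1): float(m.group(2)) * _TO_MS[m.group(3)]
        for m in _TOTAL_RE.finditer(text)
    }


def relative_change(a, b):
    """Percentage change going from a to b."""
    return (b - a) / a * 100.0


first, second = parse_totals(FIRST), parse_totals(SECOND)
for device in ("CPU", "CUDA"):
    change = relative_change(first[device], second[device])
    print(f"{device}: {first[device]:.3f}ms -> {second[device]:.3f}ms ({change:+.2f}%)")
```

On these numbers the CUDA totals differ by 0.002ms and the CPU totals by 0.022ms, and the kernel rows visible in both tables (for example `fusion_269` and `fusion_270`) report the same CUDA times.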

@ysiraichi
Collaborator Author

@golechwierowicz Let me know what you think about the profiling numbers.

@golechwierowicz
Collaborator

Looks good!

@ysiraichi ysiraichi merged commit 30591ad into master Feb 1, 2024
18 checks passed
amithrm pushed a commit to amithrm/xla that referenced this pull request Mar 1, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
Co-authored-by: Emilio Cota <ecg@google.com>