
[perf] Reduce SNodeTree materialization time in LLVM backends #3127

Merged
merged 1 commit into from
Oct 9, 2021

Conversation

strongoier
Contributor

Related issue = #

Thanks to @qiao-bo for his experiments on memory allocators, which gives me insights. This PR is mainly based on the profiling results of the following code snippet:

import taichi as ti
import time

ti.init(arch=ti.cuda, device_memory_fraction=0.9)
begin = time.monotonic()
for i in range(300):
    # Materialize a fresh SNodeTree (a 2048x2048 dense f32 field) each iteration.
    fb = ti.FieldsBuilder()
    x = ti.field(ti.f32)
    fb.dense(ti.ij, (2048, 2048)).place(x)
    fb.finalize()
print(f'Total time: {time.monotonic() - begin}s')
ti.print_profile_info()

Before this PR, the profiling result is:

[Taichi] version 0.8.1, llvm 10.0.0, commit 14bd1022, linux, python 3.8.10
[Taichi] Starting on arch=cuda
Total time: 129.67297151125968s
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[Profiler thread 140095274080064]
     58.639 ms clone_runtime_module          [4 x  14.660 ms]
         17.108 ms 29.17%  module_from_bitcode_file [2 x   8.554 ms]
         29.905 ms 51.00%  clone module          [4 x   7.476 ms]
         11.528 ms 19.66%  link_module_with_cuda_libdevice [1 x  11.528 ms]
              9.105 ms 78.98%  module_from_bitcode_file [1 x   9.105 ms]
              2.423 ms 21.02%  [unaccounted]
      2.935 ms eliminate_unused_functions    [2 x   1.467 ms]
     69.864 ms global_optimize_module_cpu    [1 x  69.864 ms]
          1.748 ms  2.50%  llvm_function_pass    [1 x   1.748 ms]
         66.748 ms 95.54%  llvm_module_pass      [1 x  66.748 ms]
          1.368 ms  1.96%  [unaccounted]
     97.108 ms compile_module_to_ptx         [1 x  97.108 ms]
          1.440 ms  1.48%  llvm_function_pass    [1 x   1.440 ms]
         93.510 ms 96.30%  llvm_module_pass      [1 x  93.510 ms]
          2.158 ms  2.22%  [unaccounted]
      1.948  m run                           [600 x 194.783 ms]
          0.000  m  0.01%  generate_types        [1800 x   5.272 us]
          0.000  m  0.02%  generate_child_accessors [600 x  30.765 us]
              9.303 ms 50.40%  generate_refine_coordinates [600 x  15.506 us]
              8.733 ms 47.31%  generate_child_accessors [600 x  14.555 us]
                  4.970 ms 56.91%  generate_refine_coordinates [600 x   8.283 us]
                  1.404 ms 16.08%  generate_child_accessors [600 x   2.340 us]
                  2.359 ms 27.01%  [unaccounted]
              0.423 ms  2.29%  [unaccounted]
          0.131  m  6.71%  clone_struct_module   [600 x  13.061 ms]
          0.033  m  1.70%  eliminate_unused_functions [600 x   3.316 ms]
          0.367  m 18.83%  global_optimize_module_cpu [300 x  73.369 ms]
              0.525  s  2.38%  llvm_function_pass    [300 x   1.749 ms]
             21.065  s 95.70%  llvm_module_pass      [300 x  70.216 ms]
              0.421  s  1.91%  [unaccounted]
          0.493  m 25.32%  compile_module_to_ptx [300 x  98.644 ms]
              0.444  s  1.50%  llvm_function_pass    [300 x   1.482 ms]
             28.556  s 96.49%  llvm_module_pass      [300 x  95.187 ms]
              0.593  s  2.00%  [unaccounted]
          0.924  m 47.41%  [unaccounted]
      9.853  s clone_struct_module           [598 x  16.477 ms]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The bottleneck is triggered by the update_runtime_jit_module() call that this PR removes. In fact, we only need to feed our generated module, together with the latest SNodeTree types and the compiled kernels, into the LLVM passes when we are about to run kernels (and we already do that there). Therefore, this call, which happens immediately after every SNodeTree materialization, is redundant.
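The idea of deferring the runtime-module update until a kernel actually runs can be sketched as a generic dirty-flag pattern. This is an illustrative sketch only: the class and method names below are hypothetical stand-ins, not Taichi's actual internals.

```python
class RuntimeModule:
    """Illustrative stand-in for the LLVM runtime module (hypothetical name)."""

    def __init__(self):
        self.snode_trees = []
        self.dirty = False       # set when a new SNodeTree is materialized
        self.compile_count = 0   # counts expensive LLVM pipeline runs

    def materialize_snode_tree(self, tree):
        # Before the PR: an eager update ran the full LLVM pipeline here,
        # once per materialization (the 300x global_optimize_module_cpu /
        # compile_module_to_ptx entries in the first profile).
        self.snode_trees.append(tree)
        self.dirty = True        # after the PR: just mark the module stale

    def run_kernel(self, name):
        # The expensive rebuild happens lazily, only when a kernel runs.
        if self.dirty:
            self.compile_count += 1  # stands in for the LLVM module passes
            self.dirty = False
        return f"ran {name} with {len(self.snode_trees)} tree(s)"


m = RuntimeModule()
for i in range(300):
    m.materialize_snode_tree(f"tree{i}")
m.run_kernel("k")
print(m.compile_count)  # 1 rebuild instead of 300
```

With 300 materializations and one kernel launch, the LLVM pipeline runs once instead of 300 times, which matches the shape of the profiles above.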

After this PR, the profiling result is:

[Taichi] version 0.8.1, llvm 10.0.0, commit 14bd1022, linux, python 3.8.10
[Taichi] Starting on arch=cuda
Total time: 20.385696676094085s
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[Profiler thread 140383221622592]
     57.925 ms clone_runtime_module          [4 x  14.481 ms]
         16.679 ms 28.79%  module_from_bitcode_file [2 x   8.340 ms]
         29.570 ms 51.05%  clone module          [4 x   7.392 ms]
         11.577 ms 19.99%  link_module_with_cuda_libdevice [1 x  11.577 ms]
              9.162 ms 79.14%  module_from_bitcode_file [1 x   9.162 ms]
              2.415 ms 20.86%  [unaccounted]
      2.982 ms eliminate_unused_functions    [2 x   1.491 ms]
     70.268 ms global_optimize_module_cpu    [1 x  70.268 ms]
          1.712 ms  2.44%  llvm_function_pass    [1 x   1.712 ms]
         67.164 ms 95.58%  llvm_module_pass      [1 x  67.164 ms]
          1.392 ms  1.98%  [unaccounted]
     97.007 ms compile_module_to_ptx         [1 x  97.007 ms]
          1.469 ms  1.51%  llvm_function_pass    [1 x   1.469 ms]
         93.376 ms 96.26%  llvm_module_pass      [1 x  93.376 ms]
          2.162 ms  2.23%  [unaccounted]
     10.819  s run                           [600 x  18.031 ms]
          0.008  s  0.08%  generate_types        [1800 x   4.699 us]
          0.015  s  0.14%  generate_child_accessors [600 x  24.943 us]
              7.736 ms 51.69%  generate_refine_coordinates [600 x  12.893 us]
              6.842 ms 45.72%  generate_child_accessors [600 x  11.404 us]
                  3.723 ms 54.41%  generate_refine_coordinates [600 x   6.205 us]
                  1.059 ms 15.47%  generate_child_accessors [600 x   1.765 us]
                  2.061 ms 30.12%  [unaccounted]
              0.388 ms  2.59%  [unaccounted]
         10.795  s 99.78%  [unaccounted]
      7.841  s clone_struct_module           [598 x  13.113 ms]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

We can see the total running time decreases from ~130s to ~20s, roughly a 6.4x speedup.


Member

@k-ye k-ye left a comment


Awesome!

@strongoier strongoier merged commit a2d3e2d into taichi-dev:master Oct 9, 2021
Collaborator

@qiao-bo qiao-bo left a comment


Thx ;)
