
[perf] Reduce SNodeTree materialization time in LLVM backends #3127

Merged
merged 1 commit into from
Oct 9, 2021

Conversation

strongoier
Contributor

Related issue = #

Thanks to @qiao-bo for his experiments on memory allocators, which gives me insights. This PR is mainly based on the profiling results of the following code snippet:

import taichi as ti
import time

ti.init(arch=ti.cuda, device_memory_fraction=0.9)
begin = time.monotonic()
for i in range(300):
    # Materialize a fresh SNodeTree (a 2048x2048 dense f32 field) each iteration.
    fb = ti.FieldsBuilder()
    x = ti.field(ti.f32)
    fb.dense(ti.ij, (2048, 2048)).place(x)
    fb.finalize()
print(f'Total time: {time.monotonic() - begin}s')
ti.print_profile_info()

Before this PR, the profiling result is:

[Taichi] version 0.8.1, llvm 10.0.0, commit 14bd1022, linux, python 3.8.10
[Taichi] Starting on arch=cuda
Total time: 129.67297151125968s
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[Profiler thread 140095274080064]
     58.639 ms clone_runtime_module          [4 x  14.660 ms]
         17.108 ms 29.17%  module_from_bitcode_file [2 x   8.554 ms]
         29.905 ms 51.00%  clone module          [4 x   7.476 ms]
         11.528 ms 19.66%  link_module_with_cuda_libdevice [1 x  11.528 ms]
              9.105 ms 78.98%  module_from_bitcode_file [1 x   9.105 ms]
              2.423 ms 21.02%  [unaccounted]
      2.935 ms eliminate_unused_functions    [2 x   1.467 ms]
     69.864 ms global_optimize_module_cpu    [1 x  69.864 ms]
          1.748 ms  2.50%  llvm_function_pass    [1 x   1.748 ms]
         66.748 ms 95.54%  llvm_module_pass      [1 x  66.748 ms]
          1.368 ms  1.96%  [unaccounted]
     97.108 ms compile_module_to_ptx         [1 x  97.108 ms]
          1.440 ms  1.48%  llvm_function_pass    [1 x   1.440 ms]
         93.510 ms 96.30%  llvm_module_pass      [1 x  93.510 ms]
          2.158 ms  2.22%  [unaccounted]
      1.948  m run                           [600 x 194.783 ms]
          0.000  m  0.01%  generate_types        [1800 x   5.272 us]
          0.000  m  0.02%  generate_child_accessors [600 x  30.765 us]
              9.303 ms 50.40%  generate_refine_coordinates [600 x  15.506 us]
              8.733 ms 47.31%  generate_child_accessors [600 x  14.555 us]
                  4.970 ms 56.91%  generate_refine_coordinates [600 x   8.283 us]
                  1.404 ms 16.08%  generate_child_accessors [600 x   2.340 us]
                  2.359 ms 27.01%  [unaccounted]
              0.423 ms  2.29%  [unaccounted]
          0.131  m  6.71%  clone_struct_module   [600 x  13.061 ms]
          0.033  m  1.70%  eliminate_unused_functions [600 x   3.316 ms]
          0.367  m 18.83%  global_optimize_module_cpu [300 x  73.369 ms]
              0.525  s  2.38%  llvm_function_pass    [300 x   1.749 ms]
             21.065  s 95.70%  llvm_module_pass      [300 x  70.216 ms]
              0.421  s  1.91%  [unaccounted]
          0.493  m 25.32%  compile_module_to_ptx [300 x  98.644 ms]
              0.444  s  1.50%  llvm_function_pass    [300 x   1.482 ms]
             28.556  s 96.49%  llvm_module_pass      [300 x  95.187 ms]
              0.593  s  2.00%  [unaccounted]
          0.924  m 47.41%  [unaccounted]
      9.853  s clone_struct_module           [598 x  16.477 ms]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The bottleneck is triggered by the update_runtime_jit_module() call that this PR removes. In fact, we only need to feed our generated module, together with the latest SNodeTree types and the compiled kernels, into the LLVM passes when we are about to run kernels (and we already do that there). Therefore, this call, which happens immediately after every SNodeTree materialization, is redundant.
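The idea of deferring the runtime-module update until a kernel actually runs can be sketched as a generic dirty-flag pattern. This is an illustrative sketch only: the class and method names below are hypothetical stand-ins, not Taichi's actual internals.

```python
class RuntimeModule:
    """Illustrative stand-in for the LLVM runtime module (hypothetical name)."""

    def __init__(self):
        self.snode_trees = []
        self.dirty = False       # set when a new SNodeTree is materialized
        self.compile_count = 0   # counts expensive LLVM pipeline runs

    def materialize_snode_tree(self, tree):
        # Before the PR: an eager update ran the full LLVM pipeline here,
        # once per materialization (the 300x global_optimize_module_cpu /
        # compile_module_to_ptx entries in the first profile).
        self.snode_trees.append(tree)
        self.dirty = True        # after the PR: just mark the module stale

    def run_kernel(self, name):
        # The expensive rebuild happens lazily, only when a kernel runs.
        if self.dirty:
            self.compile_count += 1  # stands in for the LLVM module passes
            self.dirty = False
        return f"ran {name} with {len(self.snode_trees)} tree(s)"


m = RuntimeModule()
for i in range(300):
    m.materialize_snode_tree(f"tree{i}")
m.run_kernel("k")
print(m.compile_count)  # 1 rebuild instead of 300
```

With 300 materializations and one kernel launch, the LLVM pipeline runs once instead of 300 times, which matches the shape of the profiles above.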

After this PR, the profiling result is:

[Taichi] version 0.8.1, llvm 10.0.0, commit 14bd1022, linux, python 3.8.10
[Taichi] Starting on arch=cuda
Total time: 20.385696676094085s
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[Profiler thread 140383221622592]
     57.925 ms clone_runtime_module          [4 x  14.481 ms]
         16.679 ms 28.79%  module_from_bitcode_file [2 x   8.340 ms]
         29.570 ms 51.05%  clone module          [4 x   7.392 ms]
         11.577 ms 19.99%  link_module_with_cuda_libdevice [1 x  11.577 ms]
              9.162 ms 79.14%  module_from_bitcode_file [1 x   9.162 ms]
              2.415 ms 20.86%  [unaccounted]
      2.982 ms eliminate_unused_functions    [2 x   1.491 ms]
     70.268 ms global_optimize_module_cpu    [1 x  70.268 ms]
          1.712 ms  2.44%  llvm_function_pass    [1 x   1.712 ms]
         67.164 ms 95.58%  llvm_module_pass      [1 x  67.164 ms]
          1.392 ms  1.98%  [unaccounted]
     97.007 ms compile_module_to_ptx         [1 x  97.007 ms]
          1.469 ms  1.51%  llvm_function_pass    [1 x   1.469 ms]
         93.376 ms 96.26%  llvm_module_pass      [1 x  93.376 ms]
          2.162 ms  2.23%  [unaccounted]
     10.819  s run                           [600 x  18.031 ms]
          0.008  s  0.08%  generate_types        [1800 x   4.699 us]
          0.015  s  0.14%  generate_child_accessors [600 x  24.943 us]
              7.736 ms 51.69%  generate_refine_coordinates [600 x  12.893 us]
              6.842 ms 45.72%  generate_child_accessors [600 x  11.404 us]
                  3.723 ms 54.41%  generate_refine_coordinates [600 x   6.205 us]
                  1.059 ms 15.47%  generate_child_accessors [600 x   1.765 us]
                  2.061 ms 30.12%  [unaccounted]
              0.388 ms  2.59%  [unaccounted]
         10.795  s 99.78%  [unaccounted]
      7.841  s clone_struct_module           [598 x  13.113 ms]
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

We can see the total running time decreases from ~130s to ~20s, roughly a 6.4x speedup.


Member

@k-ye k-ye left a comment


Awesome!

@strongoier strongoier merged commit a2d3e2d into taichi-dev:master Oct 9, 2021
Collaborator

@qiao-bo qiao-bo left a comment


Thx ;)
