Some optimizations #393

Merged
merged 6 commits into Wilfred:master on Sep 29, 2022
Conversation

@QuarticCat (Contributor) commented Sep 27, 2022

First, enable thin-LTO. This brings a ~5% speedup.
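
A minimal sketch of the change, assuming it lands in the release profile of Cargo.toml (the exact diff isn't shown in this comment):

```toml
# Hypothetical sketch: enable thin LTO for release builds. Thin LTO gets
# most of the benefit of full ("fat") LTO at a fraction of the link time.
[profile.release]
lto = "thin"
```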

Before:

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

          1,033.66 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             6,127      page-faults:u             #    5.927 K/sec                  
     3,992,966,896      cycles:u                  #    3.863 GHz                    
       295,590,268      stalled-cycles-frontend:u #    7.40% frontend cycles idle   
     1,229,153,356      stalled-cycles-backend:u  #   30.78% backend cycles idle    
     4,948,279,348      instructions:u            #    1.24  insn per cycle         
                                                  #    0.25  stalled cycles per insn
       975,024,920      branches:u                #  943.275 M/sec                  
        20,612,498      branch-misses:u           #    2.11% of all branches        

       1.034353300 seconds time elapsed

       0.973504000 seconds user
       0.059968000 seconds sys

After:

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

            996.05 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             6,116      page-faults:u             #    6.140 K/sec                  
     3,926,308,401      cycles:u                  #    3.942 GHz                    
       263,648,312      stalled-cycles-frontend:u #    6.71% frontend cycles idle   
     1,349,401,185      stalled-cycles-backend:u  #   34.37% backend cycles idle    
     4,859,980,027      instructions:u            #    1.24  insn per cycle         
                                                  #    0.28  stalled cycles per insn
       920,523,681      branches:u                #  924.177 M/sec                  
        19,978,054      branch-misses:u           #    2.17% of all branches        

       0.998478582 seconds time elapsed

       0.932282000 seconds user
       0.063171000 seconds sys

The instruction counts are relatively stable across runs, so they are a more reliable comparison than wall-clock time.

I also measured the timings using hyperfine.

Before:

$ hyperfine --warmup=3 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs'
Benchmark 1: cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):      1.046 s ±  0.051 s    [User: 0.993 s, System: 0.052 s]
  Range (min … max):    1.001 s …  1.141 s    10 runs

After:

$ hyperfine --warmup=3 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs'
Benchmark 1: cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     986.4 ms ±  15.3 ms    [User: 930.6 ms, System: 54.3 ms]
  Range (min … max):   971.3 ms … 1024.4 ms    10 runs

@QuarticCat (Contributor, Author) commented Sep 27, 2022

Enabling PGO gets another >5% speedup. I used this tool to do the optimization: https://github.com/Kobzol/cargo-pgo.

$ cargo pgo run -- sample_files/slow_before.rs sample_files/slow_after.rs
$ cargo pgo optimize run -- sample_files/slow_before.rs sample_files/slow_after.rs

Before (this time I invoked the binary directly rather than through cargo run):

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            975.13 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,423      page-faults:u             #    1.459 K/sec                  
     3,873,705,459      cycles:u                  #    3.973 GHz                    
       312,841,226      stalled-cycles-frontend:u #    8.08% frontend cycles idle   
     1,356,026,289      stalled-cycles-backend:u  #   35.01% backend cycles idle    
     4,416,634,146      instructions:u            #    1.14  insn per cycle         
                                                  #    0.31  stalled cycles per insn
       834,316,552      branches:u                #  855.598 M/sec                  
        18,932,572      branch-misses:u           #    2.27% of all branches        

       0.975598289 seconds time elapsed

       0.947813000 seconds user
       0.026704000 seconds sys

After:

 Performance counter stats for '/home/qc/.cargo/target/x86_64-unknown-linux-gnu/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            907.64 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,441      page-faults:u             #    1.588 K/sec                  
     3,573,701,152      cycles:u                  #    3.937 GHz                    
        81,488,884      stalled-cycles-frontend:u #    2.28% frontend cycles idle   
     1,843,191,214      stalled-cycles-backend:u  #   51.58% backend cycles idle    
     4,175,713,686      instructions:u            #    1.17  insn per cycle         
                                                  #    0.44  stalled cycles per insn
       773,151,834      branches:u                #  851.826 M/sec                  
        18,228,346      branch-misses:u           #    2.36% of all branches        

       0.908512976 seconds time elapsed

       0.860485000 seconds user
       0.046691000 seconds sys

@QuarticCat changed the title from “Enable thin-LTO” to “Some optimizations” on Sep 27, 2022
@QuarticCat (Contributor, Author) commented Sep 27, 2022

Substituted rpds::Stack with a simplified version that removes all unused parts: ~20% speedup and ~12% less memory usage (without PGO).
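
A minimal sketch of the idea, assuming a classic Rc-based cons list (the actual struct in this PR may differ):

```rust
use std::rc::Rc;

// Hypothetical simplified persistent stack: each push allocates one node and
// shares the old tail via Rc, so cloning a stack handle is O(1). Dropping the
// parts of rpds::Stack this crate never used is where the speedup comes from.
#[derive(Clone)]
pub struct Stack<T>(Option<Rc<Node<T>>>);

struct Node<T> {
    value: T,
    next: Option<Rc<Node<T>>>,
}

impl<T> Stack<T> {
    pub fn new() -> Self {
        Stack(None)
    }

    // Returns a new stack with `value` on top; `self` is left untouched.
    pub fn push(&self, value: T) -> Self {
        Stack(Some(Rc::new(Node { value, next: self.0.clone() })))
    }

    pub fn peek(&self) -> Option<&T> {
        self.0.as_deref().map(|n| &n.value)
    }

    // Returns the stack without its top element, sharing the tail.
    pub fn pop(&self) -> Option<Self> {
        self.0.as_deref().map(|n| Stack(n.next.clone()))
    }
}
```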

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

            802.58 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             5,885      page-faults:u             #    7.333 K/sec                  
     3,161,969,385      cycles:u                  #    3.940 GHz                    
        91,496,359      stalled-cycles-frontend:u #    2.89% frontend cycles idle   
     1,254,938,698      stalled-cycles-backend:u  #   39.69% backend cycles idle    
     3,982,788,439      instructions:u            #    1.26  insn per cycle         
                                                  #    0.32  stalled cycles per insn
       752,314,449      branches:u                #  937.376 M/sec                  
        19,227,259      branch-misses:u           #    2.56% of all branches        

       0.803042017 seconds time elapsed

       0.752424000 seconds user
       0.049889000 seconds sys

Original memory usage:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.94s  user 0.07s system 99% cpu 1.007 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                549 
page faults from disk:     0
other page faults:         6810

Current memory usage:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.74s  user 0.07s system 99% cpu 0.806 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                479 
page faults from disk:     0
other page faults:         6742

@QuarticCat (Contributor, Author):

Switching from mimalloc to snmalloc brings a negligible speedup and slightly less memory usage.

Since the time difference is so small, I used hyperfine again.
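
For reference, swapping the global allocator in Rust is a one-line change, which is why trying different mallocs is cheap to benchmark. A minimal sketch, assuming the snmalloc-rs crate provides the binding (the PR's actual wiring may differ):

```rust
use snmalloc_rs::SnMalloc;

// Hypothetical sketch: route every heap allocation in the binary through
// snmalloc by installing it as the global allocator.
#[global_allocator]
static GLOBAL: SnMalloc = SnMalloc;

fn main() {
    // Allocations below now go through snmalloc rather than the default.
    let words: Vec<String> = (0..4).map(|i| format!("word{i}")).collect();
    println!("{}", words.join(" "));
}
```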

Before this change:

Benchmark 1: cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     863.1 ms ±  36.6 ms    [User: 816.1 ms, System: 45.9 ms]
  Range (min … max):   824.6 ms … 922.8 ms    10 runs

After this change:

Benchmark 1: cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     852.8 ms ±  40.2 ms    [User: 804.4 ms, System: 47.2 ms]
  Range (min … max):   817.9 ms … 929.6 ms    10 runs

Memory usage after this change:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.73s  user 0.06s system 99% cpu 0.791 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                464 
page faults from disk:     0
other page faults:         7519

If you think such an improvement is worthwhile, I will commit & push it.

@QuarticCat (Contributor, Author) commented Sep 27, 2022

Changed a RefCell in Vertex to a Cell, saving some memory.
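
A minimal sketch of why this helps (field and type names are illustrative, not the actual Vertex definition):

```rust
use std::cell::{Cell, RefCell};
use std::mem::size_of;

// RefCell<T> carries a runtime borrow counter alongside T and checks it on
// every borrow; Cell<T> stores just T and replaces borrows with get/set,
// which is enough for small Copy values.
struct VertexBefore {
    visited: RefCell<bool>,
}

struct VertexAfter {
    visited: Cell<bool>,
}

fn main() {
    // The borrow flag makes RefCell<bool> several times larger than
    // Cell<bool> (typically 16 bytes vs 1 on 64-bit targets).
    println!("{} vs {}", size_of::<VertexBefore>(), size_of::<VertexAfter>());

    let v = VertexAfter { visited: Cell::new(false) };
    v.visited.set(true); // no runtime borrow bookkeeping
    assert!(v.visited.get());
}
```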

Benchmark results (without PGO and snmalloc):

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

            808.79 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             5,874      page-faults:u             #    7.263 K/sec                  
     3,158,422,412      cycles:u                  #    3.905 GHz                    
       124,285,663      stalled-cycles-frontend:u #    3.94% frontend cycles idle   
     1,218,082,291      stalled-cycles-backend:u  #   38.57% backend cycles idle    
     3,999,740,470      instructions:u            #    1.27  insn per cycle         
                                                  #    0.30  stalled cycles per insn
       753,983,362      branches:u                #  932.233 M/sec                  
        19,338,603      branch-misses:u           #    2.56% of all branches        

       0.809403610 seconds time elapsed

       0.742209000 seconds user
       0.066496000 seconds sys

Memory usage:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.76s  user 0.04s system 99% cpu 0.802 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                463 
page faults from disk:     0
other page faults:         6740

I don't know why the instruction count rose a little.

@Enter-tainer (Contributor):

Cool! Nice work.

@QuarticCat (Contributor, Author):

Removed the len field in the stack; a sketch of the change follows.
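
A minimal sketch, extending the hypothetical Stack from the earlier comment: dropping the cached length shrinks every stack handle, and a length can still be computed on demand by walking the shared tail (O(n), which is fine if it is rarely needed):

```rust
// Hypothetical: before, something like `struct Stack<T> { head: ..., len: usize }`
// kept `len` in sync on every push/pop; after, compute it only when asked.
// `Stack`/`Node` are the illustrative types sketched above.
impl<T> Stack<T> {
    pub fn len(&self) -> usize {
        let mut n = 0;
        let mut cur = self.0.as_deref();
        while let Some(node) = cur {
            n += 1;
            cur = node.next.as_deref();
        }
        n
    }
}
```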

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

            805.14 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             5,863      page-faults:u             #    7.282 K/sec                  
     3,115,521,172      cycles:u                  #    3.870 GHz                    
        61,609,793      stalled-cycles-frontend:u #    1.98% frontend cycles idle   
     1,267,554,739      stalled-cycles-backend:u  #   40.69% backend cycles idle    
     3,956,185,667      instructions:u            #    1.27  insn per cycle         
                                                  #    0.32  stalled cycles per insn
       750,392,095      branches:u                #  932.006 M/sec                  
        19,234,997      branch-misses:u           #    2.56% of all branches        

       0.806000988 seconds time elapsed

       0.764996000 seconds user
       0.040091000 seconds sys

Memory usage:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.79s  user 0.05s system 99% cpu 0.839 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                435 
page faults from disk:     0
other page faults:         6727

@QuarticCat (Contributor, Author):

I have to focus on other work, so the optimization ends here.

In conclusion (without PGO and snmalloc):

  • ~25% speedup
  • ~20% less memory usage

@Wilfred merged commit 7e102e1 into Wilfred:master on Sep 29, 2022
@Wilfred (Owner) commented Sep 29, 2022

Wow, really great changes! It's incredible to see a ~25% speedup in code I've already tried to make fast :)

Thanks for mentioning snmalloc, I will take a look at it too. I've had a few problems with mimalloc (see #297) so I'm interested in looking at other malloc implementations.

@Masynchin mentioned this pull request on Oct 3, 2022
Wilfred added a commit that referenced this pull request on Oct 14, 2022