Some optimizations #393

Merged
merged 6 commits into Wilfred:master on Sep 29, 2022
Conversation

@QuarticCat (Contributor) commented Sep 27, 2022

First, enable thin-LTO. This brings a ~5% speedup.
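
A minimal sketch of the change, assuming it lands in the release profile of Cargo.toml (the exact diff isn't shown in this comment):

```toml
# Hypothetical sketch: enable thin LTO for release builds. Thin LTO gets
# most of the benefit of full ("fat") LTO at a fraction of the link time.
[profile.release]
lto = "thin"
```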

Before:

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

          1,033.66 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             6,127      page-faults:u             #    5.927 K/sec                  
     3,992,966,896      cycles:u                  #    3.863 GHz                    
       295,590,268      stalled-cycles-frontend:u #    7.40% frontend cycles idle   
     1,229,153,356      stalled-cycles-backend:u  #   30.78% backend cycles idle    
     4,948,279,348      instructions:u            #    1.24  insn per cycle         
                                                  #    0.25  stalled cycles per insn
       975,024,920      branches:u                #  943.275 M/sec                  
        20,612,498      branch-misses:u           #    2.11% of all branches        

       1.034353300 seconds time elapsed

       0.973504000 seconds user
       0.059968000 seconds sys

After:

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

            996.05 msec task-clock:u              #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             6,116      page-faults:u             #    6.140 K/sec                  
     3,926,308,401      cycles:u                  #    3.942 GHz                    
       263,648,312      stalled-cycles-frontend:u #    6.71% frontend cycles idle   
     1,349,401,185      stalled-cycles-backend:u  #   34.37% backend cycles idle    
     4,859,980,027      instructions:u            #    1.24  insn per cycle         
                                                  #    0.28  stalled cycles per insn
       920,523,681      branches:u                #  924.177 M/sec                  
        19,978,054      branch-misses:u           #    2.17% of all branches        

       0.998478582 seconds time elapsed

       0.932282000 seconds user
       0.063171000 seconds sys

The instruction counts are relatively stable across runs, so they are a more reliable comparison than wall-clock time.

I also measured the timings using hyperfine.

Before:

$ hyperfine --warmup=3 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs'
Benchmark 1: cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):      1.046 s ±  0.051 s    [User: 0.993 s, System: 0.052 s]
  Range (min … max):    1.001 s …  1.141 s    10 runs

After:

$ hyperfine --warmup=3 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs'
Benchmark 1: cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     986.4 ms ±  15.3 ms    [User: 930.6 ms, System: 54.3 ms]
  Range (min … max):   971.3 ms … 1024.4 ms    10 runs

@QuarticCat (Contributor, Author) commented Sep 27, 2022

Enabling PGO gets another >5% speedup. I used this tool to do the optimization: https://github.com/Kobzol/cargo-pgo.

$ cargo pgo run -- sample_files/slow_before.rs sample_files/slow_after.rs
$ cargo pgo optimize run -- sample_files/slow_before.rs sample_files/slow_after.rs

Before (this time I invoked the binary directly rather than through cargo run):

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            975.13 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,423      page-faults:u             #    1.459 K/sec                  
     3,873,705,459      cycles:u                  #    3.973 GHz                    
       312,841,226      stalled-cycles-frontend:u #    8.08% frontend cycles idle   
     1,356,026,289      stalled-cycles-backend:u  #   35.01% backend cycles idle    
     4,416,634,146      instructions:u            #    1.14  insn per cycle         
                                                  #    0.31  stalled cycles per insn
       834,316,552      branches:u                #  855.598 M/sec                  
        18,932,572      branch-misses:u           #    2.27% of all branches        

       0.975598289 seconds time elapsed

       0.947813000 seconds user
       0.026704000 seconds sys

After:

 Performance counter stats for '/home/qc/.cargo/target/x86_64-unknown-linux-gnu/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            907.64 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,441      page-faults:u             #    1.588 K/sec                  
     3,573,701,152      cycles:u                  #    3.937 GHz                    
        81,488,884      stalled-cycles-frontend:u #    2.28% frontend cycles idle   
     1,843,191,214      stalled-cycles-backend:u  #   51.58% backend cycles idle    
     4,175,713,686      instructions:u            #    1.17  insn per cycle         
                                                  #    0.44  stalled cycles per insn
       773,151,834      branches:u                #  851.826 M/sec                  
        18,228,346      branch-misses:u           #    2.36% of all branches        

       0.908512976 seconds time elapsed

       0.860485000 seconds user
       0.046691000 seconds sys

@QuarticCat changed the title from “Enable thin-LTO” to “Some optimizations” on Sep 27, 2022
@QuarticCat (Contributor, Author) commented Sep 27, 2022

Substituted rpds::Stack with a simplified version that removes all unused parts: ~20% speedup and ~12% less memory usage (without PGO).
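
A minimal sketch of the idea, assuming a classic Rc-based cons list (the actual struct in this PR may differ):

```rust
use std::rc::Rc;

// Hypothetical simplified persistent stack: each push allocates one node and
// shares the old tail via Rc, so cloning a stack handle is O(1). Dropping the
// parts of rpds::Stack this crate never used is where the speedup comes from.
#[derive(Clone)]
pub struct Stack<T>(Option<Rc<Node<T>>>);

struct Node<T> {
    value: T,
    next: Option<Rc<Node<T>>>,
}

impl<T> Stack<T> {
    pub fn new() -> Self {
        Stack(None)
    }

    // Returns a new stack with `value` on top; `self` is left untouched.
    pub fn push(&self, value: T) -> Self {
        Stack(Some(Rc::new(Node { value, next: self.0.clone() })))
    }

    pub fn peek(&self) -> Option<&T> {
        self.0.as_deref().map(|n| &n.value)
    }

    // Returns the stack without its top element, sharing the tail.
    pub fn pop(&self) -> Option<Self> {
        self.0.as_deref().map(|n| Stack(n.next.clone()))
    }
}
```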

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

            802.58 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             5,885      page-faults:u             #    7.333 K/sec                  
     3,161,969,385      cycles:u                  #    3.940 GHz                    
        91,496,359      stalled-cycles-frontend:u #    2.89% frontend cycles idle   
     1,254,938,698      stalled-cycles-backend:u  #   39.69% backend cycles idle    
     3,982,788,439      instructions:u            #    1.26  insn per cycle         
                                                  #    0.32  stalled cycles per insn
       752,314,449      branches:u                #  937.376 M/sec                  
        19,227,259      branch-misses:u           #    2.56% of all branches        

       0.803042017 seconds time elapsed

       0.752424000 seconds user
       0.049889000 seconds sys

Original memory usage:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.94s  user 0.07s system 99% cpu 1.007 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                549 
page faults from disk:     0
other page faults:         6810

Current memory usage:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.74s  user 0.07s system 99% cpu 0.806 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                479 
page faults from disk:     0
other page faults:         6742

@QuarticCat (Contributor, Author):

Switching from mimalloc to snmalloc brings a negligible speedup and slightly less memory usage.

Since the time difference is so small, I used hyperfine again.
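
For reference, swapping the global allocator in Rust is a one-line change, which is why trying different mallocs is cheap to benchmark. A minimal sketch, assuming the snmalloc-rs crate provides the binding (the PR's actual wiring may differ):

```rust
use snmalloc_rs::SnMalloc;

// Hypothetical sketch: route every heap allocation in the binary through
// snmalloc by installing it as the global allocator.
#[global_allocator]
static GLOBAL: SnMalloc = SnMalloc;

fn main() {
    // Allocations below now go through snmalloc rather than the default.
    let words: Vec<String> = (0..4).map(|i| format!("word{i}")).collect();
    println!("{}", words.join(" "));
}
```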

Before this change:

Benchmark 1: cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     863.1 ms ±  36.6 ms    [User: 816.1 ms, System: 45.9 ms]
  Range (min … max):   824.6 ms … 922.8 ms    10 runs

After this change:

Benchmark 1: cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     852.8 ms ±  40.2 ms    [User: 804.4 ms, System: 47.2 ms]
  Range (min … max):   817.9 ms … 929.6 ms    10 runs

Memory usage after this change:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.73s  user 0.06s system 99% cpu 0.791 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                464 
page faults from disk:     0
other page faults:         7519

If you think such an improvement is worthwhile, I will commit & push it.

@QuarticCat (Contributor, Author) commented Sep 27, 2022

Changed a RefCell in Vertex to a Cell, saving some memory.
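
A minimal sketch of why this helps (field and type names are illustrative, not the actual Vertex definition):

```rust
use std::cell::{Cell, RefCell};
use std::mem::size_of;

// RefCell<T> carries a runtime borrow counter alongside T and checks it on
// every borrow; Cell<T> stores just T and replaces borrows with get/set,
// which is enough for small Copy values.
struct VertexBefore {
    visited: RefCell<bool>,
}

struct VertexAfter {
    visited: Cell<bool>,
}

fn main() {
    // The borrow flag makes RefCell<bool> several times larger than
    // Cell<bool> (typically 16 bytes vs 1 on 64-bit targets).
    println!("{} vs {}", size_of::<VertexBefore>(), size_of::<VertexAfter>());

    let v = VertexAfter { visited: Cell::new(false) };
    v.visited.set(true); // no runtime borrow bookkeeping
    assert!(v.visited.get());
}
```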

Benchmark results (without PGO and snmalloc):

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

            808.79 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             5,874      page-faults:u             #    7.263 K/sec                  
     3,158,422,412      cycles:u                  #    3.905 GHz                    
       124,285,663      stalled-cycles-frontend:u #    3.94% frontend cycles idle   
     1,218,082,291      stalled-cycles-backend:u  #   38.57% backend cycles idle    
     3,999,740,470      instructions:u            #    1.27  insn per cycle         
                                                  #    0.30  stalled cycles per insn
       753,983,362      branches:u                #  932.233 M/sec                  
        19,338,603      branch-misses:u           #    2.56% of all branches        

       0.809403610 seconds time elapsed

       0.742209000 seconds user
       0.066496000 seconds sys

Memory usage:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.76s  user 0.04s system 99% cpu 0.802 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                463 
page faults from disk:     0
other page faults:         6740

I don't know why the instruction count rose a little.

@Enter-tainer (Contributor):

Cool! Nice work.

@QuarticCat (Contributor, Author):

Removed the len field in the stack; a sketch of the change follows.
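
A minimal sketch, extending the hypothetical Stack from the earlier comment: dropping the cached length shrinks every stack handle, and a length can still be computed on demand by walking the shared tail (O(n), which is fine if it is rarely needed):

```rust
// Hypothetical: before, something like `struct Stack<T> { head: ..., len: usize }`
// kept `len` in sync on every push/pop; after, compute it only when asked.
// `Stack`/`Node` are the illustrative types sketched above.
impl<T> Stack<T> {
    pub fn len(&self) -> usize {
        let mut n = 0;
        let mut cur = self.0.as_deref();
        while let Some(node) = cur {
            n += 1;
            cur = node.next.as_deref();
        }
        n
    }
}
```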

 Performance counter stats for 'cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs':

            805.14 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             5,863      page-faults:u             #    7.282 K/sec                  
     3,115,521,172      cycles:u                  #    3.870 GHz                    
        61,609,793      stalled-cycles-frontend:u #    1.98% frontend cycles idle   
     1,267,554,739      stalled-cycles-backend:u  #   40.69% backend cycles idle    
     3,956,185,667      instructions:u            #    1.27  insn per cycle         
                                                  #    0.32  stalled cycles per insn
       750,392,095      branches:u                #  932.006 M/sec                  
        19,234,997      branch-misses:u           #    2.56% of all branches        

       0.806000988 seconds time elapsed

       0.764996000 seconds user
       0.040091000 seconds sys

Memory usage:

cargo run --release -- sample_files/slow_before.rs sample_files/slow_after.rs   0.79s  user 0.05s system 99% cpu 0.839 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                435 
page faults from disk:     0
other page faults:         6727

@QuarticCat (Contributor, Author):

I have to focus on other work, so the optimization ends here.

In conclusion (without PGO and snmalloc):

  • ~25% speedup
  • ~20% less memory usage

@Wilfred merged commit 7e102e1 into Wilfred:master on Sep 29, 2022
@Wilfred (Owner) commented Sep 29, 2022

Wow, really great changes! It's incredible to see a ~25% speedup in code I've already tried to make fast :)

Thanks for mentioning snmalloc, I will take a look at it too. I've had a few problems with mimalloc (see #297) so I'm interested in looking at other malloc implementations.

@Masynchin mentioned this pull request on Oct 3, 2022
Wilfred added a commit that referenced this pull request on Oct 14, 2022