
perf: Skip the first goto call in full construction (-35%) #33

Closed
wants to merge 4 commits

Conversation

Marwes (Contributor) commented Jul 12, 2018

  • perf: Skip the first goto call in full construction (-35%)

This is probably less significant when a larger number of needles is used,
but it should still be a decent speed improvement.

```
 name                     old ns/iter  new ns/iter  diff ns/iter   diff %  speedup
 bench_construction       131,148      122,956            -8,192   -6.25%   x 1.07
 bench_full_construction  309,493      200,595          -108,898  -35.19%   x 1.54
```

  • perf: Use for_each_transition in AcAutomaton::fill (-84%)

```
 name                     old ns/iter  new ns/iter  diff ns/iter   diff %  speedup
 bench_construction       167,415      25,804           -141,611  -84.59%   x 6.49
 bench_full_construction  275,170      133,253          -141,917  -51.57%   x 2.07
```

(The two benchmarks were run on different computers.)

Marwes commented Jul 12, 2018

This should be the last PR :) It does complicate things a bit more, but it really helps when querying for the next state of Dense, as many states (at least in these short benchmarks) only have a distance of 1 to the first non-fail state.

However, if next_state needs to traverse more than one state, then it should be increasingly likely that the memoization kicks in, so chances are the memoization gives a performance improvement greater than 12% in that case instead!
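For concreteness, here is a minimal sketch of the scheme being described, with simplified types and invented names (`Sparse`, `build_full`, `FAIL_STATE`, `ROOT_STATE` are assumptions of the sketch, not the crate's actual internals): the state's own edge is checked first, the generic `next_state` walk is only a fallback, and rows filled earlier act as the memo.

```rust
const FAIL_STATE: usize = 0; // sentinel meaning "no edge" (assumption of this sketch)
const ROOT_STATE: usize = 1;

struct Sparse {
    trans: Vec<[usize; 256]>, // trans[s][b] == FAIL_STATE means state s has no edge on b
    fail: Vec<usize>,
}

impl Sparse {
    /// Generic lookup: walk fail links until some state has an edge on `b`.
    fn next_state(&self, mut s: usize, b: u8) -> usize {
        loop {
            let t = self.trans[s][b as usize];
            if t != FAIL_STATE {
                return t;
            }
            if s == ROOT_STATE {
                return ROOT_STATE;
            }
            s = self.fail[s];
        }
    }
}

/// Dense-table ("full") construction. Checking the state's own edge first skips
/// the redundant first lookup that a generic next_state(si, b) call would repeat,
/// and because rows are filled in increasing index order, a fail state with a
/// smaller index already has a finished row that can simply be copied from.
fn build_full(ac: &Sparse) -> Vec<[usize; 256]> {
    let n = ac.trans.len();
    let mut full = vec![[ROOT_STATE; 256]; n];
    for si in ROOT_STATE..n {
        for b in 0..=255u8 {
            let own = ac.trans[si][b as usize];
            full[si][b as usize] = if own != FAIL_STATE {
                own // the state's own edge: no fail-chain walk at all
            } else if si == ROOT_STATE {
                ROOT_STATE
            } else {
                let fsi = ac.fail[si];
                if fsi < si {
                    full[fsi][b as usize] // memoized: that row is already final
                } else {
                    ac.next_state(fsi, b) // fallback: walk the fail chain
                }
            };
        }
    }
    full
}
```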

@Marwes Marwes force-pushed the opt branch 2 times, most recently from 46fd1e8 to f4ccc69 on July 12, 2018 21:22
Marwes commented Jul 13, 2018

Another thing that could have improved things further would be to actively update full_automaton as the automaton is searched (essentially doing path compression, https://en.wikipedia.org/wiki/Disjoint-set_data_structure#Path_compression). Unfortunately, needing to query full_automaton and check for != FAIL_STATE on each loop iteration to see whether the state is already set, instead of just checking si < current_si, seems to negate any improvement from doing so.
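For reference, this is the textbook form of path compression from the linked article (standard union-find code, nothing from this crate); the idea above is the analogous rewrite of full_automaton entries while searching.

```rust
// Union-find `find` with path compression: rewrite entries along the chain so
// later lookups jump straight to the answer.
fn find(parent: &mut [usize], x: usize) -> usize {
    let p = parent[x];
    if p == x {
        x
    } else {
        let root = find(parent, p);
        parent[x] = root; // path compression: point directly at the root
        root
    }
}

fn main() {
    let mut parent = vec![0, 0, 1, 2]; // chain: 3 -> 2 -> 1 -> 0
    assert_eq!(find(&mut parent, 3), 0);
    // After one lookup the whole chain points at the root.
    assert_eq!(parent, vec![0, 0, 0, 0]);
}
```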

Marwes commented Jul 13, 2018

(*) Second to last PR: it is possible to improve (and complicate) the AcAutomaton a bit more as well (-24%).

BurntSushi (Owner) commented:
Thanks for this! I'm going to let this sit for at least a week to allow any additional changes you might have planned to accrue here in order to minimize churn and context switching.

Marwes commented Aug 4, 2018

> (*) Second to last PR: it is possible to improve (and complicate) the AcAutomaton a bit more as well (-24%).

This ended up not having that large of an impact so I scrapped it.

Instead, I realized that for_each_transition can be used pervasively with some slight restructuring, which LLVM seems to really like. Since Dense states are just stored as a Vec<(u8, StateIdx)>, we avoid iterating over the entire 0..=255 range and instead only iterate over the few transitions in the Vec, yielding a massive speedup in constructing the AcAutomaton.
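A rough sketch of where that saving comes from, using made-up types and a textbook-style fail-link computation rather than the crate's actual fill routine: the BFS only ever needs the transitions a state really has, so walking the stored pairs replaces a 256-iteration probe per state.

```rust
use std::collections::VecDeque;

type StateIdx = usize;

const ROOT: StateIdx = 0; // assumption of this sketch: state 0 is the root

struct State {
    transitions: Vec<(u8, StateIdx)>, // "dense" layout: sorted (byte, next) pairs
    fail: StateIdx,
}

/// Look up the edge on `b` out of `s`, if any.
fn goto(states: &[State], s: StateIdx, b: u8) -> Option<StateIdx> {
    states[s]
        .transitions
        .binary_search_by_key(&b, |&(tb, _)| tb)
        .ok()
        .map(|i| states[s].transitions[i].1)
}

/// BFS fail-link computation over an already-built trie. The inner loops walk
/// each state's stored pairs directly instead of probing every byte with
/// `for b in 0..=255 { ... }`.
fn fill(states: &mut [State]) {
    let mut queue = VecDeque::new();
    let root_children: Vec<StateIdx> =
        states[ROOT].transitions.iter().map(|&(_, next)| next).collect();
    for next in root_children {
        states[next].fail = ROOT;
        queue.push_back(next);
    }
    while let Some(si) = queue.pop_front() {
        let outgoing = states[si].transitions.clone();
        for (b, next) in outgoing {
            // The child's fail state is found by following the parent's fail
            // chain until some state has an edge on `b`.
            let mut f = states[si].fail;
            let child_fail = loop {
                if let Some(t) = goto(states, f, b) {
                    break t;
                }
                if f == ROOT {
                    break ROOT;
                }
                f = states[f].fail;
            };
            states[next].fail = child_fail;
            queue.push_back(next);
        }
    }
}
```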

The only idea I have left to try is to re-apply the above idea, as it could be more significant after this improvement, but I don't have those sources accessible atm so that will have to wait.

Since the previous commits were so significant, I find it safest to add an explicit function for this so we don't rely on LLVM doing the optimization for dense states.
Marwes commented Aug 13, 2018

Rebased and added for_each_ok_transition so we don't have to rely on LLVM for the improvement in 2b9df73.
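A guess at the shape of the two helpers (the bodies below are illustrative only, not the crate's code): one reports an entry for every byte, using a FAIL_STATE sentinel for absent edges, while the other reports only the transitions that exist, so a caller that ignores fail entries no longer depends on LLVM eliminating the dead work.

```rust
type StateIdx = u32;
const FAIL_STATE: StateIdx = 0; // assumption of this sketch

struct DenseState {
    transitions: Vec<(u8, StateIdx)>, // sorted by byte; only real edges stored
}

impl DenseState {
    /// Reports an entry for every possible byte, with FAIL_STATE for absent ones.
    fn for_each_transition<F: FnMut(u8, StateIdx)>(&self, mut f: F) {
        let mut it = self.transitions.iter().peekable();
        for b in 0..=255u8 {
            match it.peek() {
                Some(&&(tb, si)) if tb == b => {
                    it.next();
                    f(b, si);
                }
                _ => f(b, FAIL_STATE),
            }
        }
    }

    /// Reports only the transitions that actually exist.
    fn for_each_ok_transition<F: FnMut(u8, StateIdx)>(&self, mut f: F) {
        for &(b, si) in &self.transitions {
            f(b, si);
        }
    }
}

fn main() {
    let s = DenseState { transitions: vec![(b'a', 2), (b'z', 3)] };
    let mut all = 0;
    s.for_each_transition(|_, _| all += 1);
    let mut ok = 0;
    s.for_each_ok_transition(|_, _| ok += 1);
    assert_eq!((all, ok), (256, 2));
}
```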

> The only idea I have left to try is to re-apply the above idea, as it could be more significant after this improvement, but I don't have those sources accessible atm so that will have to wait.

Still no improvement from this so I am finished with all optimizations I am able to do here.

Marwes commented Aug 22, 2018

ping @BurntSushi (this is done)

BurntSushi (Owner) commented:
Awesome! Thanks so much for this. Sorry it took a while. I needed to make time to carefully review this. I think I buy everything! I've merged this locally (to drop the unneeded CI commit---thank you for that! I fixed it elsewhere) and pushed it to crates.io in aho-corasick 0.6.8.

These speed improvements are really really nice. It actually makes a huge difference in a real test case I've been playing with: BurntSushi/ripgrep#838

@Marwes Marwes deleted the opt branch August 27, 2018 08:04