More improvements to the DFA's inner loop. #205

Merged
merged 1 commit into from
Apr 22, 2016

BurntSushi
Member

There were two important changes:

  1. `self.at` is used sparingly in favor of a local `let at` binding.
     This seems to convince the compiler to use a register.
  2. Switch the transition table from a `Vec<Box<[StatePtr]>>` to a
     row-major `Vec<StatePtr>`.

(2) is the juicier of the two. It makes more efficient use of the cache.
In particular, a critical aspect is that a `StatePtr` points to the start
of a row in the table, which enables indexing in the inner loop with a
single ADD instruction. (i.e., `si + byte` instead of
`si * #classes + byte`.)
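
To make change (2) concrete, here is a minimal sketch of the two layouts. It is not the code from this PR. The three-state DFA (it detects the substring `ab`), the `byte_class` function, and the names `NestedDfa`, `FlatDfa`, and `flatten` are all hypothetical; only `StatePtr` and the `at` binding come from the description above.

```rust
// Illustrative only: a toy DFA, not the regex crate's implementation.
type StatePtr = u32;

// Hypothetical byte classes: 'a', 'b', and everything else.
const NUM_CLASSES: usize = 3;

fn byte_class(b: u8) -> usize {
    match b {
        b'a' => 0,
        b'b' => 1,
        _ => 2,
    }
}

// Old layout: one heap allocation per state; entries are state ids.
struct NestedDfa {
    rows: Vec<Box<[StatePtr]>>,
}

impl NestedDfa {
    fn run(&self, text: &[u8]) -> StatePtr {
        let mut si: StatePtr = 0;
        for &b in text {
            // Two dependent loads: the row pointer, then the entry.
            si = self.rows[si as usize][byte_class(b)];
        }
        si
    }
}

// New layout: one row-major allocation; entries are row start offsets.
struct FlatDfa {
    table: Vec<StatePtr>,
    at: usize,
}

impl FlatDfa {
    fn run(&mut self, text: &[u8]) -> StatePtr {
        let mut si: StatePtr = 0;
        // Local copy of the position, written back once after the loop.
        let mut at = self.at;
        while at < text.len() {
            // si already equals state_id * NUM_CLASSES, so a single add
            // yields the index of the next transition.
            si = self.table[si as usize + byte_class(text[at])];
            at += 1;
        }
        self.at = at;
        si
    }
}

// Premultiply every state id so it points at the start of its row.
fn flatten(nested: &NestedDfa) -> FlatDfa {
    let table = nested
        .rows
        .iter()
        .flat_map(|row| row.iter().map(|&next| next * NUM_CLASSES as StatePtr))
        .collect();
    FlatDfa { table, at: 0 }
}

// States: 0 = start, 1 = just saw 'a', 2 = saw "ab" (sticky).
fn example_dfa() -> NestedDfa {
    NestedDfa {
        rows: vec![
            vec![1, 0, 0].into_boxed_slice(),
            vec![1, 2, 0].into_boxed_slice(),
            vec![2, 2, 2].into_boxed_slice(),
        ],
    }
}

fn main() {
    let nested = example_dfa();
    let mut flat = flatten(&nested);
    println!("nested final state id: {}", nested.run(b"xxaby"));
    println!("flat final row offset: {}", flat.run(b"xxaby"));
}
```

In this sketch the flat search ends at row offset 6, which is state id 2 multiplied by the three byte classes, so both layouts agree on the final state.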

@BurntSushi
Member Author

Absolutely insane improvements, considering the change mostly amounts to removing a single pointer dereference:

$ cargo-benchcmp ~/rust/regex-master/rust-master rust --threshold 10
name                                     rust-master ns/iter    rust ns/iter            diff ns/iter   diff %
misc::anchored_literal_long_match        24 (16,250 MB/s)       20 (19,500 MB/s)                  -4  -16.67%
misc::anchored_literal_short_match       26 (1,000 MB/s)        20 (1,300 MB/s)                   -6  -23.08%
misc::anchored_literal_short_non_match   33 (787 MB/s)          22 (1,181 MB/s)                  -11  -33.33%
misc::easy0_1K                           31 (33,903 MB/s)       17 (61,823 MB/s)                 -14  -45.16%
misc::easy0_1MB                          39 (26,887,256 MB/s)   21 (49,933,476 MB/s)             -18  -46.15%
misc::easy0_32                           31 (1,903 MB/s)        17 (3,470 MB/s)                  -14  -45.16%
misc::easy0_32K                          31 (1,057,903 MB/s)    17 (1,929,117 MB/s)              -14  -45.16%
misc::easy1_1K                           100 (10,440 MB/s)      44 (23,727 MB/s)                 -56  -56.00%
misc::easy1_1MB                          101 (10,382,138 MB/s)  49 (21,399,918 MB/s)             -52  -51.49%
misc::easy1_32                           99 (525 MB/s)          43 (1,209 MB/s)                  -56  -56.57%
misc::easy1_32K                          100 (327,880 MB/s)     43 (762,511 MB/s)                -57  -57.00%
misc::hard_1K                            124 (8,475 MB/s)       55 (19,109 MB/s)                 -69  -55.65%
misc::hard_1MB                           129 (8,128,705 MB/s)   64 (16,384,421 MB/s)             -65  -50.39%
misc::hard_32                            124 (475 MB/s)         55 (1,072 MB/s)                  -69  -55.65%
misc::hard_32K                           123 (266,626 MB/s)     55 (596,272 MB/s)                -68  -55.28%
misc::literal                            26 (1,961 MB/s)        16 (3,187 MB/s)                  -10  -38.46%
misc::long_needle1                       4,154 (24,073 MB/s)    2,360 (42,373 MB/s)           -1,794  -43.19%
misc::long_needle2                       883,221 (113 MB/s)     637,339 (156 MB/s)          -245,882  -27.84%
misc::match_class                        142 (570 MB/s)         82 (987 MB/s)                    -60  -42.25%
misc::match_class_in_range               49 (1,653 MB/s)        27 (3,000 MB/s)                  -22  -44.90%
misc::match_class_unicode                620 (259 MB/s)         318 (506 MB/s)                  -302  -48.71%
misc::medium_1K                          32 (32,875 MB/s)       17 (61,882 MB/s)                 -15  -46.88%
misc::medium_1MB                         41 (25,575,707 MB/s)   21 (49,933,523 MB/s)             -20  -48.78%
misc::medium_32                          32 (1,875 MB/s)        17 (3,529 MB/s)                  -15  -46.88%
misc::medium_32K                         24 (1,366,500 MB/s)    17 (1,929,176 MB/s)               -7  -29.17%
misc::no_exponential                     402 (248 MB/s)         203 (492 MB/s)                  -199  -49.50%
misc::not_literal                        227 (224 MB/s)         112 (455 MB/s)                  -115  -50.66%
misc::one_pass_long_prefix               135 (192 MB/s)         64 (406 MB/s)                    -71  -52.59%
misc::one_pass_long_prefix_not           134 (194 MB/s)         64 (406 MB/s)                    -70  -52.24%
misc::one_pass_short                     100 (170 MB/s)         51 (333 MB/s)                    -49  -49.00%
misc::one_pass_short_not                 97 (175 MB/s)          48 (354 MB/s)                    -49  -50.52%
misc::reallyhard_1K                      3,856 (272 MB/s)       1,962 (535 MB/s)              -1,894  -49.12%
misc::reallyhard_1MB                     3,778,542 (277 MB/s)   1,939,431 (540 MB/s)      -1,839,111  -48.67%
misc::reallyhard_32                      220 (268 MB/s)         127 (464 MB/s)                   -93  -42.27%
misc::reallyhard_32K                     112,525 (291 MB/s)     60,702 (540 MB/s)            -51,823  -46.05%
sherlock::before_holmes                  2,074,066 (286 MB/s)   1,129,116 (526 MB/s)        -944,950  -45.56%
sherlock::holmes_coword_watson           1,036,968 (573 MB/s)   631,836 (941 MB/s)          -405,132  -39.07%
sherlock::ing_suffix                     2,300,410 (258 MB/s)   1,342,120 (443 MB/s)        -958,290  -41.66%
sherlock::ing_suffix_limited_space       2,257,070 (263 MB/s)   1,285,008 (462 MB/s)        -972,062  -43.07%
sherlock::letters_upper                  2,930,985 (202 MB/s)   2,002,650 (297 MB/s)        -928,335  -31.67%
sherlock::line_boundary_sherlock_holmes  2,044,897 (290 MB/s)   1,104,364 (538 MB/s)        -940,533  -45.99%
sherlock::quotes                         772,286 (770 MB/s)     549,898 (1,081 MB/s)        -222,388  -28.80%
sherlock::the_whitespace                 1,253,174 (474 MB/s)   1,097,548 (542 MB/s)        -155,626  -12.42%
sherlock::word_ending_n                  2,956,155 (201 MB/s)   1,984,136 (299 MB/s)        -972,019  -32.88%
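
For reading the table: the `diff %` column is consistent with the relative change `(new - old) / old`, so a negative value means the new build is faster. A small sketch that reproduces the first two rows (the `diff_percent` name is hypothetical and is not part of cargo-benchcmp):

```rust
/// Relative change from `old` to `new` in percent; negative means faster.
fn diff_percent(old: f64, new: f64) -> f64 {
    (new - old) / old * 100.0
}

fn main() {
    // misc::anchored_literal_long_match: 24 ns/iter -> 20 ns/iter
    println!("{:.2}%", diff_percent(24.0, 20.0)); // prints "-16.67%"
    // misc::anchored_literal_short_match: 26 ns/iter -> 20 ns/iter
    println!("{:.2}%", diff_percent(26.0, 20.0)); // prints "-23.08%"
}
```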

@BurntSushi
Member Author

BurntSushi commented Apr 22, 2016

A PR for this is coming soon, but I have RE2 hooked up to the benchmark harness. Not too shabby if I say so myself:

$ cargo-benchcmp re2 rust 
name                                     re2 ns/iter           rust ns/iter            diff ns/iter    diff %
misc::anchored_literal_long_match        92 (4,239 MB/s)       20 (19,500 MB/s)                 -72   -78.26%
misc::anchored_literal_long_non_match    20 (19,500 MB/s)      22 (17,727 MB/s)                   2    10.00%
misc::anchored_literal_short_match       92 (282 MB/s)         20 (1,300 MB/s)                  -72   -78.26%
misc::anchored_literal_short_non_match   20 (1,300 MB/s)       22 (1,181 MB/s)                    2    10.00%
misc::easy0_1K                           168 (6,255 MB/s)      17 (61,823 MB/s)                -151   -89.88%
misc::easy0_1MB                          39,244 (26,720 MB/s)  21 (49,933,476 MB/s)         -39,223   -99.95%
misc::easy0_32                           145 (406 MB/s)        17 (3,470 MB/s)                 -128   -88.28%
misc::easy0_32K                          944 (34,740 MB/s)     17 (1,929,117 MB/s)             -927   -98.20%
misc::easy1_1K                           157 (6,649 MB/s)      44 (23,727 MB/s)                -113   -71.97%
misc::easy1_1MB                          39,157 (26,779 MB/s)  49 (21,399,918 MB/s)         -39,108   -99.87%
misc::easy1_32                           130 (400 MB/s)        43 (1,209 MB/s)                  -87   -66.92%
misc::easy1_32K                          936 (35,029 MB/s)     43 (762,511 MB/s)               -893   -95.41%
misc::hard_1K                            2,684 (391 MB/s)      55 (19,109 MB/s)              -2,629   -97.95%
misc::hard_1MB                           2,587,145 (405 MB/s)  64 (16,384,421 MB/s)      -2,587,081  -100.00%
misc::hard_32                            213 (276 MB/s)        55 (1,072 MB/s)                 -158   -74.18%
misc::hard_32K                           80,978 (404 MB/s)     55 (596,272 MB/s)            -80,923   -99.93%
misc::literal                            88 (579 MB/s)         16 (3,187 MB/s)                  -72   -81.82%
misc::long_needle1                       195,583 (511 MB/s)    2,360 (42,373 MB/s)         -193,223   -98.79%
misc::long_needle2                       195,550 (511 MB/s)    637,339 (156 MB/s)           441,789   225.92%
misc::match_class                        256 (316 MB/s)        82 (987 MB/s)                   -174   -67.97%
misc::match_class_in_range               259 (312 MB/s)        27 (3,000 MB/s)                 -232   -89.58%
misc::match_class_unicode                814 (197 MB/s)        318 (506 MB/s)                  -496   -60.93%
misc::medium_1K                          2,247 (468 MB/s)      17 (61,882 MB/s)              -2,230   -99.24%
misc::medium_1MB                         2,150,720 (487 MB/s)  21 (49,933,523 MB/s)      -2,150,699  -100.00%
misc::medium_32                          221 (271 MB/s)        17 (3,529 MB/s)                 -204   -92.31%
misc::medium_32K                         67,204 (488 MB/s)     17 (1,929,176 MB/s)          -67,187   -99.97%
misc::no_exponential                     298 (335 MB/s)        203 (492 MB/s)                   -95   -31.88%
misc::not_literal                        187 (272 MB/s)        112 (455 MB/s)                   -75   -40.11%
misc::one_pass_long_prefix               87 (298 MB/s)         64 (406 MB/s)                    -23   -26.44%
misc::one_pass_long_prefix_not           140 (185 MB/s)        64 (406 MB/s)                    -76   -54.29%
misc::one_pass_short                     122 (139 MB/s)        51 (333 MB/s)                    -71   -58.20%
misc::one_pass_short_not                 120 (141 MB/s)        48 (354 MB/s)                    -72   -60.00%
misc::reallyhard_1K                      2,679 (392 MB/s)      1,962 (535 MB/s)                -717   -26.76%
misc::reallyhard_1MB                     2,584,739 (405 MB/s)  1,939,431 (540 MB/s)        -645,308   -24.97%
misc::reallyhard_32                      213 (276 MB/s)        127 (464 MB/s)                   -86   -40.38%
misc::reallyhard_32K                     80,821 (405 MB/s)     60,702 (540 MB/s)            -20,119   -24.89%
sherlock::before_holmes                  1,687,815 (352 MB/s)  1,129,116 (526 MB/s)        -558,699   -33.10%
sherlock::everything_greedy              8,615,908 (69 MB/s)   2,547,395 (233 MB/s)      -6,068,513   -70.43%
sherlock::everything_greedy_nl           3,792,735 (156 MB/s)  1,200,717 (495 MB/s)      -2,592,018   -68.34%
sherlock::holmes_cochar_watson           6,073,810 (97 MB/s)   218,081 (2,728 MB/s)      -5,855,729   -96.41%
sherlock::holmes_coword_watson           4,376,321 (135 MB/s)  631,836 (941 MB/s)        -3,744,485   -85.56%
sherlock::ing_suffix                     3,205,345 (185 MB/s)  1,342,120 (443 MB/s)      -1,863,225   -58.13%
sherlock::ing_suffix_limited_space       2,072,987 (286 MB/s)  1,285,008 (462 MB/s)        -787,979   -38.01%
sherlock::letters                        88,468,977 (6 MB/s)   22,487,206 (26 MB/s)     -65,981,771   -74.58%
sherlock::letters_lower                  86,175,788 (6 MB/s)   22,315,809 (26 MB/s)     -63,859,979   -74.10%
sherlock::letters_upper                  4,133,215 (143 MB/s)  2,002,650 (297 MB/s)      -2,130,565   -51.55%
sherlock::line_boundary_sherlock_holmes  4,525,051 (131 MB/s)  1,104,364 (538 MB/s)      -3,420,687   -75.59%
sherlock::name_alt1                      73,875 (8,053 MB/s)   36,391 (16,348 MB/s)         -37,484   -50.74%
sherlock::name_alt2                      3,424,970 (173 MB/s)  184,956 (3,216 MB/s)      -3,240,014   -94.60%
sherlock::name_alt3                      3,335,895 (178 MB/s)  1,249,494 (476 MB/s)      -2,086,401   -62.54%
sherlock::name_alt3_nocase               4,775,870 (124 MB/s)  1,335,119 (445 MB/s)      -3,440,751   -72.04%
sherlock::name_alt4                      3,095,974 (192 MB/s)  225,164 (2,642 MB/s)      -2,870,810   -92.73%
sherlock::name_alt4_nocase               3,208,627 (185 MB/s)  1,309,124 (454 MB/s)      -1,899,503   -59.20%
sherlock::name_alt5                      3,396,094 (175 MB/s)  314,663 (1,890 MB/s)      -3,081,431   -90.73%
sherlock::name_alt5_nocase               4,525,131 (131 MB/s)  1,309,990 (454 MB/s)      -3,215,141   -71.05%
sherlock::name_holmes                    149,393 (3,982 MB/s)  44,097 (13,491 MB/s)        -105,296   -70.48%
sherlock::name_holmes_nocase             4,325,634 (137 MB/s)  1,088,818 (546 MB/s)      -3,236,816   -74.83%
sherlock::name_sherlock                  56,467 (10,535 MB/s)  69,440 (8,567 MB/s)           12,973    22.97%
sherlock::name_sherlock_holmes           60,140 (9,892 MB/s)   35,987 (16,531 MB/s)         -24,153   -40.16%
sherlock::name_sherlock_holmes_nocase    4,519,393 (131 MB/s)  1,144,418 (519 MB/s)      -3,374,975   -74.68%
sherlock::name_sherlock_nocase           4,483,516 (132 MB/s)  1,143,839 (520 MB/s)      -3,339,677   -74.49%
sherlock::name_whitespace                61,915 (9,608 MB/s)   78,885 (7,541 MB/s)           16,970    27.41%
sherlock::no_match_common                484,129 (1,228 MB/s)  25,761 (23,094 MB/s)        -458,368   -94.68%
sherlock::no_match_really_common         484,433 (1,228 MB/s)  360,581 (1,649 MB/s)        -123,852   -25.57%
sherlock::no_match_uncommon              24,054 (24,733 MB/s)  25,764 (23,091 MB/s)           1,710     7.11%
sherlock::quotes                         5,392,893 (110 MB/s)  549,898 (1,081 MB/s)      -4,842,995   -89.80%
sherlock::the_lower                      2,156,154 (275 MB/s)  595,819 (998 MB/s)        -1,560,335   -72.37%
sherlock::the_nocase                     4,094,161 (145 MB/s)  1,571,111 (378 MB/s)      -2,523,050   -61.63%
sherlock::the_upper                      207,334 (2,869 MB/s)  48,469 (12,274 MB/s)        -158,865   -76.62%
sherlock::the_whitespace                 2,132,436 (278 MB/s)  1,097,548 (542 MB/s)      -1,034,888   -48.53%
sherlock::word_ending_n                  3,005,600 (197 MB/s)  1,984,136 (299 MB/s)      -1,021,464   -33.99%
sherlock::words                          26,468,150 (22 MB/s)  9,359,998 (63 MB/s)      -17,108,152   -64.64%

@BurntSushi
Member Author

BurntSushi commented Apr 22, 2016

And this is cool. Even though we still run more instructions than grep itself, we appear to be smarter about it and execute more instructions per cycle, presumably due to better cache usage. This regex was specifically chosen to exercise the performance of the inner loop:

$ ls -lh all
-rw-r--r-- 1 andrew users 2.1G Aug  4  2015 all
$ time LC_ALL=C egrep -a -c '(\w+\s+){7}' all
14794820

real    0m6.496s
user    0m6.307s
sys     0m0.187s
$ time xrep -c '(\w+\s+){7}' all
14794820

real    0m4.159s
user    0m4.057s
sys     0m0.100s

We of course aren't always this much better than grep, but it's a nice win. Bonus, we support Unicode with little to no decrease in performance:

$ time xrep -c '(?u)(\w+\s+){7}' all
14813488

real    0m4.417s
user    0m4.330s
sys     0m0.083s

grep on the same regex with LC_ALL=en_US.UTF-8 ran so long that I killed it. Ouch.

@BurntSushi
Member Author

Looks like the failure is spurious.

@BurntSushi BurntSushi merged commit b638d7a into master Apr 22, 2016
@BurntSushi BurntSushi deleted the tune-dfa branch April 22, 2016 10:40
@BurntSushi BurntSushi mentioned this pull request Apr 22, 2016