Use O(n*m) sequence alignment algorithm #98

MaskRay · 2018-03-25T18:36:34Z

tl;dr Using a sequence alignment algorithm can improve the fuzzy matching quality, as what fzf does. But this approach is much simpler without sacrifice of quality. See the following example:

    //////////// Ranks(pattern,  {best_candidate, ........, worst_candidate})
    // case
    CHECK(Ranks("monad", {"monad", "Monad", "mONAD"}));
    // initials
    CHECK(Ranks("ab", {"ab", "aoo_boo", "acb"}));
    CHECK(Ranks("CC", {"CamelCase", "camelCase", "camelcase"}));
    CHECK(Ranks("cC", {"camelCase", "CamelCase", "camelcase"}));
    CHECK(Ranks("c c", {"camel case", "camelCase", "CamelCase", "camelcase",
                        "camel ace"}));
    CHECK(Ranks("Da.Te",
                {"Data.Text", "Data.Text.Lazy", "Data.Aeson.Encoding.text"}));
    CHECK(Ranks("foo bar.h", {"foo/bar.h", "foobar.h"}));
    // prefix
    CHECK(Ranks("is", {"isIEEE", "inSuf"}));
    // shorter
    CHECK(Ranks("ma", {"map", "many", "maximum"}));
    CHECK(Ranks("print", {"printf", "sprintf"}));
    // score(PRINT) = kMinScore
    CHECK(Ranks("ast", {"ast", "AST", "INT_FAST16_MAX"}));
    // score(PRINT) > kMinScore
    CHECK(Ranks("Int", {"int", "INT", "PRINT"}));

What helm-M-x returns for the pattern c m:

0. c++-mode          ; deserved champion
1. customize
2. eldoc-mode
4. company-mode   ;;this should be the runner-up, because `c m` are Head-position matching

counsel-M-x:

0. customize-group   ; m is Tail-position matching, this should be given lower weight
1. chmod                ; this is short so it comes first, but `m` is a Tail position
4. c++-mode        ; our champion shouldn't be here
5. css-mode
..........

I have read many fuzzy matching algorithms, including the one used by fzf. It is too complex and also messes about heuristics.

I adapted and simplified the fuzzy matching algorithm used in clangd and created https://github.com/MaskRay/cquery/blob/fuzzy/src/fuzzy_match.cc

It takes many factors into account:

case matching.
whether pattern is case folding search
The initial position (Head) of each word should be given more weight
consecutive or prefix matching is preferred
Break ties by length

While keeps it conceptually simple in the main loop:

  dp[0][0][0] = dp[0][0][1] = 0;
  for (int j = 0; j < n; j++) {
    dp[0][j + 1][0] = dp[0][j][0] + MissScore(j, false);
    dp[0][j + 1][1] = kMinScore * 2;
  }
  for (int i = 0; i < int(pat.size()); i++) {
    int(*pre)[2] = dp[i & 1];
    int(*cur)[2] = dp[(i + 1) & 1];
    cur[i][0] = cur[i][1] = kMinScore;
    for (int j = i; j < n; j++) {
      cur[j + 1][0] = std::max(cur[j][0] + MissScore(j, false),
                               cur[j][1] + MissScore(j, true));
      // For the first char of pattern, apply extra restriction to filter bad
      // candidates (e.g. |int| in |PRINT|)
      if (low_pat[i] == low_text[j] &&
          (i || text_role[j] != Tail || pat[i] == text[j])) {
        cur[j + 1][1] = std::max(pre[j][0] + MatchScore(i, j, false),
                                 pre[j][1] + MatchScore(i, j, true));
      } else
        cur[j + 1][1] = kMinScore * 2;
    }
  }

The heuristics are gathered and coded in one place FuzzyMatcher::MatchScore, instead of scattering all over the code (which is unfortunately the case for many other fuzzy matching algorithms). I have also done something similar for rofi which you can experience by appending -sort -matching fuzzy to its command line.

I'll explain briefly about the current heuristics. Similar rules can be added.

int FuzzyMatcher::MatchScore(int i, int j, bool last) {
  int s = 0;
  // Bonus for case match
  if (pat[i] == text[j]) {
    s++;
    // Bonus for prefix match or case match when the pattern contains upper-case letters
    if ((pat_set & 1 << Upper) || i == j)
      s++;
  }
  // For initial positions of pattern words
  if (pat_role[i] == Head) {
    // Bonus if it is matched to an initial position of some text word
    if (text_role[j] == Head)
      s += 30;
    // Penalty for non-initial positions
    else if (text_role[j] == Tail)
      s -= 10;
  }
  if (text_role[j] == Tail && i && !last)
    s -= 30;
  // The first character of pattern is matched to a non-initial position in the text
  if (i == 0 && text_role[j] == Tail)
    s -= 40;
  return s;
}

One caveat is that for space complexity I use compressed two-row dp[2][n][2] instead of full dp[m][n][2]. This way, we lose the ability to reconstruct the optimal path (matching positions).

Given pattern cm, the sequence alignment algorithm will prefer [c]ompany-[m]ode over [c]o[m]pany-mode, as the initial positions of words are given more weight. However, I do not store the full sequence alignment table dp[m][n][2] so it is not possible to reconstruct the optimal path. But this should only cause a minor visual glitch.

I'm not that good at elisp (implying that I cannot implement this ....) but I believe it is a huge improvement. Appreciated if someone could implement this..........

If you speak Chinese, you may read my complain in https://emacs-china.org/t/topic/5368 ............

ivy integration

Use a subsequence filter because applying this O(n*m) algorithm on every candidate will be slow. And the algorithm will return an awful value for non-matching candidates anyway
Calculate the score for each candidate and use it to sort candidates
This can only be used for ivy--regex-fuzzy, as other modes do not allow subsequence filtering. Also we cannot use arbitrary regular expressions in patterns, because subsequence filter and this sequence alignment algorithm does not understand them. This is not a limitation, if the algorithm is smart enough to assign proper weights to consecutive matching, initial-character matching, .... You will find very rare case where regex is helpful
I don't read the ivy code carefully, because ivy--flx-sort and other regular expression constructions seem unnecessary.

The text was updated successfully, but these errors were encountered:

PythonNut · 2020-04-26T20:07:57Z

Sorry if this wasn't clear, but I believe flx already implements a O(nm) matching algorithm. The implementation is a bit non-traditional, but the asymptotic complexity should be fine (as far as I could tell when I wrote it). There's definitely room to do fuzzy matching more quickly by a constant factor, and I wouldn't be surprised if a large constant speedup could be achieved.

MaskRay mentioned this issue Mar 25, 2018

Use O(n*m) sequence alignment algorithm abo-abo/swiper#1503

Open

myimages mentioned this issue Jul 23, 2018

sequence alignment algorithem skim-rs/skim#87

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use O(n*m) sequence alignment algorithm #98

Use O(n*m) sequence alignment algorithm #98

MaskRay commented Mar 25, 2018 •

edited

Loading

PythonNut commented Apr 26, 2020

Use O(n*m) sequence alignment algorithm #98

Use O(n*m) sequence alignment algorithm #98

Comments

MaskRay commented Mar 25, 2018 • edited Loading

ivy integration

PythonNut commented Apr 26, 2020

MaskRay commented Mar 25, 2018 •

edited

Loading