Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use O(n*m) sequence alignment algorithm #98

Open
MaskRay opened this issue Mar 25, 2018 · 1 comment
Open

Use O(n*m) sequence alignment algorithm #98

MaskRay opened this issue Mar 25, 2018 · 1 comment

Comments

@MaskRay
Copy link

MaskRay commented Mar 25, 2018

tl;dr Using a sequence alignment algorithm can improve the fuzzy matching quality, as what fzf does. But this approach is much simpler without sacrifice of quality. See the following example:

    //////////// Ranks(pattern,  {best_candidate, ........, worst_candidate})
    // case
    CHECK(Ranks("monad", {"monad", "Monad", "mONAD"}));
    // initials
    CHECK(Ranks("ab", {"ab", "aoo_boo", "acb"}));
    CHECK(Ranks("CC", {"CamelCase", "camelCase", "camelcase"}));
    CHECK(Ranks("cC", {"camelCase", "CamelCase", "camelcase"}));
    CHECK(Ranks("c c", {"camel case", "camelCase", "CamelCase", "camelcase",
                        "camel ace"}));
    CHECK(Ranks("Da.Te",
                {"Data.Text", "Data.Text.Lazy", "Data.Aeson.Encoding.text"}));
    CHECK(Ranks("foo bar.h", {"foo/bar.h", "foobar.h"}));
    // prefix
    CHECK(Ranks("is", {"isIEEE", "inSuf"}));
    // shorter
    CHECK(Ranks("ma", {"map", "many", "maximum"}));
    CHECK(Ranks("print", {"printf", "sprintf"}));
    // score(PRINT) = kMinScore
    CHECK(Ranks("ast", {"ast", "AST", "INT_FAST16_MAX"}));
    // score(PRINT) > kMinScore
    CHECK(Ranks("Int", {"int", "INT", "PRINT"}));

What helm-M-x returns for the pattern c m:

0. c++-mode          ; deserved champion
1. customize
2. eldoc-mode
4. company-mode   ;;this should be the runner-up, because `c m` are Head-position matching

counsel-M-x:

0. customize-group   ; m is Tail-position matching, this should be given lower weight
1. chmod                ; this is short so it comes first, but `m` is a Tail position
4. c++-mode        ; our champion shouldn't be here
5. css-mode
..........

I have read many fuzzy matching algorithms, including the one used by fzf. It is too complex and also messes about heuristics.

I adapted and simplified the fuzzy matching algorithm used in clangd and created https://github.com/MaskRay/cquery/blob/fuzzy/src/fuzzy_match.cc

It takes many factors into account:

  • case matching.
  • whether pattern is case folding search
  • The initial position (Head) of each word should be given more weight
  • consecutive or prefix matching is preferred
  • Break ties by length

While keeps it conceptually simple in the main loop:

  dp[0][0][0] = dp[0][0][1] = 0;
  for (int j = 0; j < n; j++) {
    dp[0][j + 1][0] = dp[0][j][0] + MissScore(j, false);
    dp[0][j + 1][1] = kMinScore * 2;
  }
  for (int i = 0; i < int(pat.size()); i++) {
    int(*pre)[2] = dp[i & 1];
    int(*cur)[2] = dp[(i + 1) & 1];
    cur[i][0] = cur[i][1] = kMinScore;
    for (int j = i; j < n; j++) {
      cur[j + 1][0] = std::max(cur[j][0] + MissScore(j, false),
                               cur[j][1] + MissScore(j, true));
      // For the first char of pattern, apply extra restriction to filter bad
      // candidates (e.g. |int| in |PRINT|)
      if (low_pat[i] == low_text[j] &&
          (i || text_role[j] != Tail || pat[i] == text[j])) {
        cur[j + 1][1] = std::max(pre[j][0] + MatchScore(i, j, false),
                                 pre[j][1] + MatchScore(i, j, true));
      } else
        cur[j + 1][1] = kMinScore * 2;
    }
  }

The heuristics are gathered and coded in one place FuzzyMatcher::MatchScore, instead of scattering all over the code (which is unfortunately the case for many other fuzzy matching algorithms). I have also done something similar for rofi which you can experience by appending -sort -matching fuzzy to its command line.

I'll explain briefly about the current heuristics. Similar rules can be added.

int FuzzyMatcher::MatchScore(int i, int j, bool last) {
  int s = 0;
  // Bonus for case match
  if (pat[i] == text[j]) {
    s++;
    // Bonus for prefix match or case match when the pattern contains upper-case letters
    if ((pat_set & 1 << Upper) || i == j)
      s++;
  }
  // For initial positions of pattern words
  if (pat_role[i] == Head) {
    // Bonus if it is matched to an initial position of some text word
    if (text_role[j] == Head)
      s += 30;
    // Penalty for non-initial positions
    else if (text_role[j] == Tail)
      s -= 10;
  }
  if (text_role[j] == Tail && i && !last)
    s -= 30;
  // The first character of pattern is matched to a non-initial position in the text
  if (i == 0 && text_role[j] == Tail)
    s -= 40;
  return s;
}

One caveat is that for space complexity I use compressed two-row dp[2][n][2] instead of full dp[m][n][2]. This way, we lose the ability to reconstruct the optimal path (matching positions).

Given pattern cm, the sequence alignment algorithm will prefer [c]ompany-[m]ode over [c]o[m]pany-mode, as the initial positions of words are given more weight. However, I do not store the full sequence alignment table dp[m][n][2] so it is not possible to reconstruct the optimal path. But this should only cause a minor visual glitch.

I'm not that good at elisp (implying that I cannot implement this ....) but I believe it is a huge improvement. Appreciated if someone could implement this..........

If you speak Chinese, you may read my complain in https://emacs-china.org/t/topic/5368 ............

ivy integration

  • Use a subsequence filter because applying this O(n*m) algorithm on every candidate will be slow. And the algorithm will return an awful value for non-matching candidates anyway
  • Calculate the score for each candidate and use it to sort candidates
  • This can only be used for ivy--regex-fuzzy, as other modes do not allow subsequence filtering. Also we cannot use arbitrary regular expressions in patterns, because subsequence filter and this sequence alignment algorithm does not understand them. This is not a limitation, if the algorithm is smart enough to assign proper weights to consecutive matching, initial-character matching, .... You will find very rare case where regex is helpful
  • I don't read the ivy code carefully, because ivy--flx-sort and other regular expression constructions seem unnecessary.
@PythonNut
Copy link
Collaborator

Sorry if this wasn't clear, but I believe flx already implements a O(nm) matching algorithm. The implementation is a bit non-traditional, but the asymptotic complexity should be fine (as far as I could tell when I wrote it). There's definitely room to do fuzzy matching more quickly by a constant factor, and I wouldn't be surprised if a large constant speedup could be achieved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants