Rewrite scoring algorithm to support run of consecutive character, fix acronyms and add optimal selection of character. #22

jeancroy · 2015-07-09T15:28:50Z

This will be available for testing as an option in fuzzy-finder

Please address problems, comments, and suggestions here:
https://github.com/jeancroy/fuzzaldrin-plus

We start a tuning phase, so anything not quite right is welcome.

plrenaudin · 2015-07-09T15:44:35Z

Don't forget that Travis is running on Linux and AppVeyor on Windows -> Different path separators.
(hopefully it's the problem here)

…rence between window/linux)

walles · 2015-07-09T17:31:01Z

spec/filter-spec.coffee

+  it "returns the result in order", ->
+    candidates = [
+      'Find And Replace: Selet All',
+      'Settings View: Uninstall Packages',


Settings View: View Installed Themes is missing from this list.

Yes, because current ranking was not optimal so I did not wanted to set current ranking as spec.
On the other side if I place "optimal" ranking as per @plrenaudin the build will fail...

If I understand there's no "nice to have" spec, it's pass all or the build fail

Pity.

But even without that this version of the patch is still an improvement over the current situation.

Done :) Added support for your usecase in both spec and scoring (implemented as an extra bonus if exact match happens rigth after token separator)

walles · 2015-07-09T17:48:29Z

Maybe include the test case from #18 in the spec?

walles · 2015-07-09T17:50:06Z

And the test case from atom/atom#7783 perhaps?

plrenaudin · 2015-07-09T18:17:04Z

Brilliant! I hope this will get through!

jeancroy · 2015-07-09T19:06:26Z

One thing I miss by a long shot is this spec:
https://github.com/atom/fuzzaldrin/blob/master/spec/score-spec.coffee#L6

It only pass because that bypass
https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L18

Which with the new scoring scheme would be considered a bug.
I'd ask permission to change the perfect==2 spec if possible.

Reason is that the various bonus, make it very hard to compute a meaninfull upperbound, especially things that act in the middle of the string like CamelCase, position in string bonus etc.

Also I believe in this project usage a accurate relative score ordering matter more than saying I have a 85% match.

jeancroy · 2015-07-09T19:10:40Z

I'd also ask permission to remove this line
https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12

So if one search for "git push", i get token information and be able to do exact match for "git" then for "push". Or even simply for "git push" (because "gitpush" will never exact match "git push")

mnquintana · 2015-07-09T19:33:28Z

/cc @kevinsawicki

walles · 2015-07-09T19:33:40Z

@jeancroy, regarding score-spec.coffee, I think the only thing that makes sense is to compare relative scores, and possibly 0 scores (if that's the minimum).

What we want tests for is sort order, and absolute comparisons like the one at https://github.com/atom/fuzzaldrin/blob/master/spec/score-spec.coffee#L6 doesn't really do that, and IMO you should just remove that test together with the bypass at https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L18.

And I'd apply similar reasoning for https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12; if you can remove it and honestly say that any specs that start failing aren't relevant any more, go for it!

Those lines are there to solve a problem, and if that problem isn't there any more neither should those lines.

plrenaudin · 2015-07-09T19:35:07Z

+1 on what @walles said

…tched by either a another space or a slash. Non optional characters are called core. Strict isMatch is applied only on core characters of query. Soft score have access to the full query for scoring. This replace the regex that strip space from query and increase precision for multi word query eg "git push". To be able to do that without recomputing core for each item, it is done at the filter() level.

#21 Added suffix bonus to mirror prefix bonus, exact token match will get both bonus. Now define optional characters as a whitelist instead of a blacklist. This will better support text with no latin character.

…ild fails) Added spec to test for that preference Added spec to confirm we handle issue18 #18 Cleaned up comments

jeancroy · 2015-07-10T04:29:26Z

@walles I really appreciate what you do. Despite contributing a considerable amount of code, I'm still learning etiquette of open source.

walles · 2015-07-10T04:35:21Z

@jeancroy :)

OTOH, all these things are things I'd have said at work as well.

And I do appreciate you both coding and listening. Thanks @jeancroy!

The thing is also that I thought about diving into this myself, but since you're already doing such a great job with it I think my time is better spent reviewing.

plrenaudin · 2015-07-10T08:24:57Z

Great job guys, I'm currently testing the fuzzy-finder using this PR. As expected, it's quite slow but OK IMHO.
Just noticed that the highlight of the character is off when there is space in the query.
Edit: also when there is almost no match, I have some files with highlights and others without.
Edit2: also, the CamelCase didn't work completely. I type MCC (upper case) for MyCamelCase and the results somethingmccsomething (lower case) scores higher. MyCamelCase result is just after though.

walles · 2015-07-10T11:10:44Z

This patch (avoids many calls to toLowerCase()) improves the benchmark results for this PR by about 2x:

diff --git a/src/scorer.coffee b/src/scorer.coffee
index bbbc1f2..bf54696 100644
--- a/src/scorer.coffee
+++ b/src/scorer.coffee
@@ -49,6 +49,9 @@ opt_char_re = /[\ \-\_\xE3\xE0\xE1\xE4\xE2\xE6\u1EBD\xE8\xE9\xEB\xEA\xEC\xED\xEF
 #

 exports.score = score = (subject, query, ignore) ->
+  subject_l = subject.toLowerCase()
+  query_l = query.toLowerCase()
+
   m = query.length + 1
   n = subject.length + 1

@@ -78,7 +81,7 @@ exports.score = score = (subject, query, ignore) ->
       # Score the options
       gapA = gapArow[j] = Math.max(gapArow[j] + we, vrow[j] + wo)
       gapB = Math.max(gapB + we, vrow[j - 1] + wo)
-      align = vd + scoreChar(query, subject, i - 1, j - 1)
+      align = vd + scoreChar(query, query_l, subject, subject_l, i - 1, j - 1)
       vd = vrow[j]

       #Get the best option
@@ -95,7 +98,7 @@ exports.score = score = (subject, query, ignore) ->
   lpos = m - n - 1

   #sustring bonus, start of string bonus
-  if ( p = subject.toLowerCase().indexOf(query.toLowerCase())) > -1
+  if ( p = subject_l.indexOf(query_l)) > -1
     vmax += wex * m * (1.0 + 5.0 / (5.0 + p))

     #sustring happens right after a separator (prefix)
@@ -108,11 +111,13 @@ exports.score = score = (subject, query, ignore) ->

   return vmax

-scoreChar = (query, subject, i, j) ->
+scoreChar = (query, query_l, subject, subject_l, i, j) ->
   qi = query[i]
+  qi_l = query_l[i]
   sj = subject[j]
+  sj_l = subject_l[j]

-  if qi.toLowerCase() == sj.toLowerCase()
+  if qi_l == sj_l

     #Proper casing bonus
     bonus = if qi == sj then wc else 0
@@ -129,13 +134,14 @@ scoreChar = (query, subject, i, j) ->

     #get previous char
     prev_s = subject[j - 1]
+    prev_s_l = subject_l[j - 1]
     prev_q = query[i - 1]

     #match FOLLOW a separator
     return wa + bonus if ( prev_s of sep_map) or ( prev_q of sep_map )

     #match IS Capital in camelCase (preceded by lowercase)
-    return wa + bonus if (sj == sj.toUpperCase() and prev_s == prev_s.toLowerCase())
+    return wa + bonus if (prev_s == prev_s_l and sj == sj.toUpperCase())

     #normal Match, add proper case bonus
     return wm + bonus

Optional are now only space and space like

jeancroy · 2015-07-10T11:31:31Z

Hi @walles I like your idea, but I moved the lowercase test out of scoreChar because it was getting a bit ugly with all those arguments, and the whole function was basically an large if lowercase match. Now we gate the call to this function with the lowercase test.

walles · 2015-07-10T11:36:16Z

I agree your version is more readable. Nice!

kshnurov · 2015-09-23T19:51:03Z

"We don't have time for this" and "it's not our priority" after 2 months of waiting and so much people asking. That's why I hate pseudo-opensource - it doesn't matter that you can see the code if you can't fix it.

jesseleite · 2015-09-23T20:00:42Z

@kshnurov Hey now... Even open source needs direction. Atom shouldn't blindly merge changes in, especially overhauls like this. If you care about the future of the editor, a core team reviewing PRs is smart. The fact that @jeancroy is doing all this work is an amazing community contribution, and the fact that Atom's core team is honest about their stance is also a good thing.

All that said, I hope fuzzy finder doesn't get ignored for too long! It's one of Atom's biggest flaws IMO.

jeancroy · 2015-09-23T20:01:49Z

Honestly 2 weeks of test as an optional feature is not that bad of a compromise. Personally I was going for at least bumping to a major version.

kshnurov · 2015-09-23T20:17:19Z

@jesseleite users feedback should be taken into account when the direction is chosen, not ignored. Core team reviewing PRs is smart, core team saying "we won't even review this" on the ready PR for one of the biggest flaw (and you're 100% right on that) is not smart at all.

I've seen that a dozens of times - it starts with "it's not our priority" on a feature/PR everyone asks for and ends up with a dead software no one uses.

jesseleite · 2015-09-23T20:34:35Z

@jesseleite users feedback should be taken into account when the direction is chosen, not ignored.

@kshnurov: Absolutely, but I don't think they are saying "we won't even review this".

Yes, maybe make a new node module, and make a PR on fuzzy-finder to add an option to use the new module. Then we can try it out for a while. I say we just do fuzzy-finder for now, and try it for a week or two.

^ Seems to me that they are open to the PR, but not ready to skim and merge without thorough review.

This PR is a behemoth. My 1st beef with Atom is the fuzzy finder. BUT I also want any solution(s) to be well tested and properly implemented, otherwise we're just layering bandaids on top of bandaids. Be a bit more positive and trust the process :)

benogle · 2015-09-23T21:00:40Z

but I don't think they are saying "we won't even review this".

Definitely not. I think getting it in as an option is the first step to changing this.

jeancroy · 2015-09-23T21:03:24Z

@benogle
I'm a bit stuck at the step where fuzzy finder don't use fuzzaldrin for filtering/scoring
Instead it rely on atom-space-pen-views which has non optional call to the current version.

https://github.com/atom/fuzzy-finder/blob/master/lib/fuzzy-finder-view.coffee#L135
https://github.com/atom/atom-space-pen-views/blob/master/src/select-list-view.coffee#L161

Should I bypass the super ?

benogle · 2015-09-23T21:15:11Z

Yeah, I suppose you will need to override populateList

jeancroy · 2015-09-24T14:41:46Z

As requested there is now a PR in fuzzy-finder.

benogle · 2015-09-24T21:18:55Z

This has been added to atom/master via atom/fuzzy-finder#142. If you build atom from source, it is available via option:

benogle · 2015-09-24T21:20:05Z

Closing this as we can continue the discussion over at https://github.com/jeancroy/fuzzaldrin-plus

jesseleite · 2015-09-24T21:29:38Z

Thank you <3

jeancroy · 2015-10-06T00:07:08Z

For those who want to test it, fuzzaldrin-plus will be an option of autocomplete-plus 2.22

jesseleite · 2015-10-06T00:12:04Z

Awesome, look forward to it!

AbdullahAli · 2015-11-18T15:49:09Z

The alternate scoring option makes the search really slow and "jittery" 👎

jeancroy · 2015-11-18T15:53:28Z

Hi @AbdullahAli , can you give an example of query and context ?
Is it fuzzy finder, how many file indexed ?

The benchmak show it'll run anywhere between 2x faster to 2x slower depending on query.
(And for most of the query, on most condition I expect to err on the side of faster.)

Are you using this from a official stable distribution of atom or using some hack needed before ?

jesseleite · 2015-11-18T16:03:16Z

I haven't found any performance issues with alternate scoring option. It's been great for me! @jeancroy how do we find out how many files are indexed?

jeancroy · 2015-11-18T16:14:01Z

Very first run of fuzzy finder in a project. If you have many file it show a waiting message with number of file so far.

But to create such scenario I have to open atom in my document folder or something multi-project like that.

jesseleite · 2015-11-18T16:46:47Z

No way to tell index count after it's already indexed? Maybe via console command or something?

AbdullahAli · 2015-11-18T16:46:51Z

Hey @jeancroy I have ~36k files

This is what it looks like when the alternate scoring option is enabled:
https://vid.me/NBz0

This is what it looks like with it disabled:
https://vid.me/hQUp

In the second video you can see it is smoother and not "jittery" - although the video quality might not be the best, so sorry about that

jeancroy · 2015-11-18T17:06:32Z

In both video it show "reindexing project" so that does not helps.
What you see as "jerky" is that you type faster than the time needed to compute a result.

I'll open an issue on fuzzy finder about the debounce delay. (Maybe making a search each 2 char will help reduce pressure)

Btw what kind of system do you have ? There's a lot of ruby user that use this and the difference should not be that large.

AbdullahAli · 2015-11-18T17:17:33Z

It is the same effect after the "reindexing project" message goes away.

I am using a macbook pro on yosemite

jeancroy · 2015-11-18T17:28:43Z

Ok so a recent build? Are the file-name confidential ?
Would you mind sharing the file listing ? Maybe to a gist.

You can save it to a file like so:

  find /Path/To/Project -type f > FilesWithPaths.txt

I'm ok with you redacting out name of company , software, and whatever it need to be comfortable.
Ideally replacing by another string of the same length.

Updated scoring algorithm, and merged @plrenaudin filter spec

9511715

jeancroy added 2 commits July 9, 2015 13:01

Removed SlahCount from baseNameScore (Test if this is source of diffe…

bc509dd

…rence between window/linux)

Increased BasePath bonus

b3d47c5

walles reviewed Jul 9, 2015
View reviewed changes

mnquintana added the needs-review label Jul 9, 2015

plrenaudin mentioned this pull request Jul 9, 2015

attempt to improve search pertinence #19

Closed

This was referenced Jul 9, 2015

Fix scoring to favor exact substring matches #15

Closed

Wrongly ordered candidates with exact file name match #18

Open

walles mentioned this pull request Jul 9, 2015

Fuzzy file finder should prefer direct match atom/atom#7783

Closed

Increase exact match bonus if happens right after a separator

82dfcc3

jeancroy added 2 commits July 9, 2015 16:46

Added test for issue

7539792

#21 Added suffix bonus to mirror prefix bonus, exact token match will get both bonus. Now define optional characters as a whitelist instead of a blacklist. This will better support text with no latin character.

jeancroy mentioned this pull request Jul 9, 2015

Score Capitol Letters Higher in Fuzzy Search #21

Closed

Added back preference for shallower path. (got removed after first bu…

7c7f5d1

…ild fails) Added spec to test for that preference Added spec to confirm we handle issue18 #18 Cleaned up comments

Compute lowercase once (thnaks to @walles)

0e2c1a4

Optional are now only space and space like

jeancroy mentioned this pull request Sep 24, 2015

Optionally allow to use some improvement to fuzzaldrin. atom/fuzzy-finder#142

Merged

jeancroy reopened this Sep 24, 2015

benogle closed this Sep 24, 2015

jeancroy mentioned this pull request Sep 28, 2015

Suggestion order atom/autocomplete-plus#508

Closed

peppy mentioned this pull request Nov 3, 2015

Add support for alternate scoring (fuzzaldrin) atom/command-palette#59

Closed

lee-dohm mentioned this pull request Nov 7, 2015

Finder should prioritize order based on most frequently used. atom/fuzzy-finder#150

Open

jeancroy mentioned this pull request Nov 26, 2015

Extremely slow since 1.2.0 (Atom.) atom/fuzzy-finder#153

Open

Rewrite scoring algorithm to support run of consecutive character, fix acronyms and add optimal selection of character. #22

Rewrite scoring algorithm to support run of consecutive character, fix acronyms and add optimal selection of character. #22

Conversation

jeancroy commented Jul 9, 2015

plrenaudin commented Jul 9, 2015

walles Jul 9, 2015

Choose a reason for hiding this comment

jeancroy Jul 9, 2015

Choose a reason for hiding this comment

jeancroy Jul 9, 2015

Choose a reason for hiding this comment

walles Jul 9, 2015

Choose a reason for hiding this comment

jeancroy Jul 9, 2015

Choose a reason for hiding this comment

walles commented Jul 9, 2015

walles commented Jul 9, 2015

plrenaudin commented Jul 9, 2015

jeancroy commented Jul 9, 2015

jeancroy commented Jul 9, 2015

mnquintana commented Jul 9, 2015

walles commented Jul 9, 2015

plrenaudin commented Jul 9, 2015

jeancroy commented Jul 10, 2015

walles commented Jul 10, 2015

plrenaudin commented Jul 10, 2015

walles commented Jul 10, 2015

jeancroy commented Jul 10, 2015

walles commented Jul 10, 2015

kshnurov commented Sep 23, 2015

jesseleite commented Sep 23, 2015

jeancroy commented Sep 23, 2015

kshnurov commented Sep 23, 2015

jesseleite commented Sep 23, 2015

benogle commented Sep 23, 2015

jeancroy commented Sep 23, 2015

benogle commented Sep 23, 2015

jeancroy commented Sep 24, 2015

benogle commented Sep 24, 2015

benogle commented Sep 24, 2015

jesseleite commented Sep 24, 2015

jeancroy commented Oct 6, 2015

jesseleite commented Oct 6, 2015

AbdullahAli commented Nov 18, 2015

jeancroy commented Nov 18, 2015

jesseleite commented Nov 18, 2015

jeancroy commented Nov 18, 2015

jesseleite commented Nov 18, 2015

AbdullahAli commented Nov 18, 2015

jeancroy commented Nov 18, 2015

AbdullahAli commented Nov 18, 2015

jeancroy commented Nov 18, 2015