This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Rewrite scoring algorithm to support run of consecutive character, fix acronyms and add optimal selection of character. #22

Closed
wants to merge 77 commits into from
Changes from 3 commits
Commits
9511715
Updated scoring algorithm, and merged @plrenaudin filter spec
jeancroy Jul 9, 2015
bc509dd
Removed SlahCount from baseNameScore (Test if this is source of diffe…
jeancroy Jul 9, 2015
b3d47c5
Increased BasePath bonus
jeancroy Jul 9, 2015
82dfcc3
Increase exact match bonus if happens right after a separator
jeancroy Jul 9, 2015
f779333
Introduce concept of optional characters, for example space can be ma…
jeancroy Jul 9, 2015
7539792
Added test for issue
jeancroy Jul 9, 2015
7c7f5d1
Added back preference for shallower path. (got removed after first bu…
jeancroy Jul 10, 2015
0e2c1a4
Compute lowercase once (thnaks to @walles)
jeancroy Jul 10, 2015
6dd4eaa
Last refactoring permit to drop -Infinity value and remove one test c…
jeancroy Jul 10, 2015
d2eb476
re-use precomputed lowercase
jeancroy Jul 10, 2015
09aa694
corrected math for lpos, in the suffix bonus case.
jeancroy Jul 10, 2015
6b44306
case sensitive exact match is now a bypass
jeancroy Jul 10, 2015
957da9b
Improved accuracy for CamelCase
jeancroy Jul 10, 2015
dec3278
account for case in CamelCase vs Substring matches
jeancroy Jul 10, 2015
67a0a0e
CamelCase Matches now prefer smaller haystack, matches that happens s…
jeancroy Jul 10, 2015
4290552
More robust basePath score
jeancroy Jul 10, 2015
8d1d713
Added more real-life test cases
jeancroy Jul 11, 2015
25d7b3c
Adjusted some weight so close call test have more legroom
jeancroy Jul 11, 2015
ab040aa
Query that have slashes also get bonus for matching base file.
jeancroy Jul 11, 2015
9ab128c
fix some comments
jeancroy Jul 11, 2015
e0ff69b
Merge remote-tracking branch 'origin/master'
jeancroy Jul 11, 2015
28f2248
Now using improved scorer to do aligments (matches)
jeancroy Jul 11, 2015
2c011a9
Option to allow error in query, disabled by default.
jeancroy Jul 11, 2015
8f42f93
Updated match to mirror behavior of filter
jeancroy Jul 12, 2015
99d3237
rework sequence merging (basePath, completePath) in match method
jeancroy Jul 12, 2015
8196b5e
added some mid difficulty highlight test case
jeancroy Jul 12, 2015
e811fa4
Introduce knowledge of consecutive characters in both match and the c…
jeancroy Jul 12, 2015
d6c2882
Implemented forward search for the number of consecutive chars.
jeancroy Jul 12, 2015
fa66d51
Improve Match:
jeancroy Jul 13, 2015
f0c492f
corrected exit condition of perfect camelCase match to account for ha…
jeancroy Jul 13, 2015
a862b89
added test for case-sensitive exact matches.
jeancroy Jul 13, 2015
e3c0fff
added test case to ensure correct result event in the non-exact subst…
jeancroy Jul 13, 2015
ba5d196
with the introduction of bonus relative to the number of consecutive …
jeancroy Jul 13, 2015
616346e
Remove un-needed (since last change) gap tracking variables
jeancroy Jul 13, 2015
03727bf
Merge branch 'after22'
jeancroy Jul 13, 2015
4b9550e
Merge remote-tracking branch 'origin/master'
jeancroy Jul 13, 2015
f9197a3
Merge highlight (match) feature using improved scoring algorithm.
jeancroy Jul 13, 2015
329e7c1
make score() follow filter() behavior more closely
jeancroy Jul 13, 2015
beee187
clean-up, fix comments
jeancroy Jul 14, 2015
c06cc8f
added more non-exact (fuzzy) test.
jeancroy Jul 14, 2015
cf8a43d
Replace test by an easier one for now.
jeancroy Jul 14, 2015
da12540
unified handling of snake_case and CamelCase.
jeancroy Jul 14, 2015
be99411
faster exit condition abbrPrefix
jeancroy Jul 14, 2015
1908a56
Faster abbrPrefix
jeancroy Jul 14, 2015
39e92f7
Bring back string_score as an optional faster but less accurate algor…
jeancroy Jul 14, 2015
e163e38
Speed: Math.max with 3 args is slow - use ternary, ternary
jeancroy Jul 15, 2015
1376514
Speed: use Object for multiple return value of abbrPrefix a bit faste…
jeancroy Jul 15, 2015
8993e06
Cleaner definition of sepmap
jeancroy Jul 15, 2015
93d283c
Show worst case scenario in benchmark
jeancroy Jul 15, 2015
0987c51
One extra test case needed to explain speed behavior
jeancroy Jul 15, 2015
8ac9178
show acronym quick exit in benchmark
jeancroy Jul 15, 2015
f129adc
Propose Worst case mitigation strategy
jeancroy Jul 15, 2015
b7c2422
Added a setting to control of the length subject considered for the s…
jeancroy Jul 16, 2015
7a063b4
Improve scoring of a match neighborhood quality.
jeancroy Jul 16, 2015
cb15663
remove the +1 offset between score and string position
jeancroy Jul 16, 2015
f47fc76
More uniform naming, use less memory
jeancroy Jul 16, 2015
0fcc59d
Merge pull request #2 from jeancroy/structural_change
jeancroy Jul 16, 2015
af723ee
Code review, missed one switch i<->j
jeancroy Jul 16, 2015
8bf3f60
"Worser" worst case scenario in benchmark
jeancroy Jul 16, 2015
99a8a3b
Strip space and space like character in score() and match() to mirror…
jeancroy Jul 17, 2015
887a577
test case for https://github.com/substantial/atomfiles/issues/43
jeancroy Jul 18, 2015
753c6c7
simplification:
jeancroy Jul 18, 2015
e6f103b
Faster test for Case-Sensitive Exact match
jeancroy Jul 18, 2015
a14bb8d
compute lowercase() upfront, enable mitigation by default
jeancroy Jul 19, 2015
da72fa9
when queryHasSlashes, basepath contain as many folder as query, start…
jeancroy Jul 20, 2015
98af35f
Clarify behavior of filter() on empty string or empty array
jeancroy Jul 21, 2015
a0099df
Added test for Suffix feature of exact match
jeancroy Jul 21, 2015
1cb9ce8
Allow to interchange forward and backward slashes in query.
jeancroy Aug 29, 2015
6cd2b3f
Prepare for review / V3
jeancroy Sep 17, 2015
46c8d2b
Corrected a bug with isMatch, reworked acronym weigth
jeancroy Sep 18, 2015
2f56dcb
Scoring improvement for single character query.
jeancroy Sep 19, 2015
8086648
Scoring improvement for single character query.
jeancroy Sep 19, 2015
2ba093a
Speed: Hit Count Optimisation & various improvements
jeancroy Sep 22, 2015
00ccd02
Delay computing candidate lowercase until IsMatch confirmed.
jeancroy Sep 23, 2015
f7cd989
faster end condition for scoreAcronyms
jeancroy Sep 23, 2015
a67fe76
clean up around scoreAcronyms added a new benchmark case
jeancroy Sep 23, 2015
149d937
fix indentation
jeancroy Sep 23, 2015
31 changes: 28 additions & 3 deletions spec/filter-spec.coffee
@@ -16,12 +16,12 @@ rootPath = (segments...) ->
describe "filtering", ->
it "returns an array of the most accurate results", ->
candidates = ['Gruntfile','filter', 'bile', null, '', undefined]
expect(filter(candidates, 'file')).toEqual ['filter', 'Gruntfile']
expect(filter(candidates, 'file')).toEqual ['Gruntfile', 'filter']

describe "when the maxResults option is set", ->
it "limits the results to the result size", ->
candidates = ['Gruntfile', 'filter', 'bile']
expect(bestMatch(candidates, 'file')).toBe 'filter'
expect(bestMatch(candidates, 'file')).toBe 'Gruntfile'

describe "when the entries contains slashes", ->
it "weighs basename matches higher", ->
@@ -111,6 +111,31 @@ describe "filtering", ->
expect(bestMatch(['a_b_c', 'a_b'], 'ab')).toBe 'a_b'
expect(bestMatch(['z_a_b', 'a_b'], 'ab')).toBe 'a_b'
expect(bestMatch(['a_b_c', 'c_a_b'], 'ab')).toBe 'a_b_c'
expect(bestMatch(['Unin-stall', path.join('dir1', 'dir2', 'dir3', 'Installation')], 'install')).toBe path.join('dir1', 'dir2', 'dir3', 'Installation')
expect(bestMatch(['Uninstall', path.join('dir', 'Install')], 'install')).toBe path.join('dir', 'Install')

it "weighs substring higher than individual characters", ->
candidates = [
'Git Plus: Stage Hunk',
'Git Plus: Reset Head',
'Git Plus: Push',
'Git Plus: Show'
]
expect(bestMatch(candidates, 'push')).toBe 'Git Plus: Push'
expect(bestMatch(['a_b_c', 'somethingabc'], 'abc')).toBe 'somethingabc'

it "returns the result in order", ->
candidates = [
'Find And Replace: Selet All',
'Settings View: Uninstall Packages',

Settings View: View Installed Themes is missing from this list.

Author


Yes. The current ranking is not optimal, so I did not want to set it as the spec.
On the other hand, if I use the "optimal" ranking proposed by @plrenaudin, the build will fail...

Author


If I understand correctly, there is no "nice to have" spec: either every test passes or the build fails.


Pity.

But even without that, this version of the patch is still an improvement over the current situation.

Author


Done :) Added support for your use case in both the spec and the scoring (implemented as an extra bonus when an exact match happens right after a token separator).
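
For readers following the thread, the idea of an extra bonus for an exact match that starts right after a token separator could be sketched roughly as below. This is a hypothetical illustration, not the code from this PR: the function name `exactMatchBonus` and the doubling factor are assumptions, while `separators` and `wex` reuse the values shown in the `src/scorer.coffee` diff.

```coffeescript
# Hypothetical sketch, not the implementation from this PR.
separators = ' .-_/\\'   # same separator set as in src/scorer.coffee
wex = 10                 # per-character exact-match bonus, as in the diff

# Bonus for a case-insensitive exact occurrence of query in subject.
# The bonus is doubled (an assumed factor) when the occurrence starts the
# string or directly follows a separator; the best occurrence wins.
exactMatchBonus = (subject, query) ->
  return 0 unless subject and query
  ls = subject.toLowerCase()
  lq = query.toLowerCase()
  best = 0
  pos = ls.indexOf(lq)
  while pos > -1
    atTokenStart = pos is 0 or separators.indexOf(subject[pos - 1]) > -1
    bonus = wex * query.length * (if atTokenStart then 2 else 1)
    best = bonus if bonus > best
    pos = ls.indexOf(lq, pos + 1)
  best

console.log exactMatchBonus('Settings View: Uninstall Packages', 'install')
console.log exactMatchBonus('Application: Install Update', 'install')
```

Under these assumptions, 'Application: Install Update' earns twice the bonus of 'Settings View: Uninstall Packages' for the query 'install', because "Install" follows a space while the match inside "Uninstall" starts mid-word.
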

'Application: Install Update',
'install'
]
result = filter(candidates, 'install')
expect(result[0]).toBe candidates[3]
expect(result[1]).toBe candidates[2]
expect(result[2]).toBe candidates[1]
expect(result[3]).toBe candidates[0]

describe "when the entries are of differing directory depths", ->
it "places exact matches first, even if they're deeper", ->
@@ -136,4 +161,4 @@ describe "filtering", ->
path.join('app', 'models', 'cars', 'car.rb')
path.join('spec', 'cars.rb')
]
expect(bestMatch(candidates, 'car.rb')).toBe candidates[0]
expect(bestMatch(candidates, 'car.rb')).toBe candidates[0]
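
The "weighs substring higher than individual characters" expectations above lean on the exact-substring bonus that `src/scorer.coffee` adds at the end of `score`. The sketch below isolates that one expression so the arithmetic is easier to follow. The helper name `substringBonus` is an assumption made for illustration; `wex` and the formula are copied from the diff, including its use of `query.length + 1` for `m`.

```coffeescript
wex = 10  # bonus per character of an exact match, as in the diff

# Isolates the one-line bonus from score(): m is query.length + 1 there,
# and p is the position of the case-insensitive exact occurrence.
substringBonus = (subject, query) ->
  m = query.length + 1
  p = subject.toLowerCase().indexOf(query.toLowerCase())
  if p > -1 then wex * m * (1.0 + 1.0 / (1.0 + p)) else 0

console.log substringBonus('abcdef', 'abc')        # occurrence at p = 0
console.log substringBonus('somethingabc', 'abc')  # occurrence at p = 9
console.log substringBonus('a_b_c', 'abc')         # no contiguous occurrence
```

The factor `1 + 1/(1 + p)` equals 2 when the occurrence is a prefix and decays toward 1 as the occurrence moves later in the string. A candidate such as 'a_b_c' gets no bonus at all for 'abc', which is the extra weight that lets 'somethingabc' win in the spec.
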
245 changes: 184 additions & 61 deletions src/scorer.coffee
@@ -1,22 +1,193 @@
# Original ported from:
#
# string_score.js: String Scoring Algorithm 0.1.10
# Score similarity between two string
#
# http://joshaven.com/string_score
# https://github.com/joshaven/string_score
# isMatch: Fast check that every character of the needle appears in the haystack
# score: Find string similarity using a Smith-Waterman-Gotoh algorithm
# Modified to account for programming scenarios (CamelCase folder/file.ext object.property)
#
# Copyright (C) 2009-2011 Joshaven Potter <yourtech@gmail.com>
# Special thanks to all of the contributors listed here https://github.com/joshaven/string_score
# MIT license: http://www.opensource.org/licenses/mit-license.php
# Copyright (C) 2015 Jean Christophe Roy and contributors
# MIT License: http://opensource.org/licenses/MIT
#
# Date: Tue Mar 1 2011
# Previous version of scorer used string_score from Joshaven Potter
# https://github.com/joshaven/string_score/


wm = 10 # base score of making a match
ws = 30 # bonus of making a separator match
wa = 20 # bonus of making an acronym match
wc = 10 # bonus for proper case

wo = -8 # penalty to open a gap
we = -2 # penalty to continue an open gap (inside a match)
wh = -0.1 # penalty for haystack size (outside match)

wst = 20 # bonus for match near start of string (fade one per position until 0)
wex = 10 # bonus per character of an exact match. If the exact match coincides with the prefix, the bonus is 2*wex; it fades toward 1*wex as the match occurs later in the string.

#Note: separators are likely to trigger both an
# "acronym" and a "proper case" bonus in addition to their own bonus.


separators = ' .-_/\\'
PathSeparator = require('path').sep

separator_map = ->
sep_map = {}
k = -1
while ++k < separators.length
sep_map[separators[k]] = k

sep_map

sep_map = separator_map()

exports.score = score = (subject, query, ignore) ->

#bypassing isMatch would allow inexact matches, but would be slower
return 0 if !( subject and query and isMatch(query, subject) )

m = query.length + 1
n = subject.length + 1

#Init
vrow = new Array(n)
gapArow = new Array(n)
gapA = 0
gapB = 0
vmax = 0

#DEBUG
#VV = []

#Fill with 0
j = -1
while ++j < n
gapArow[j] = 0
vrow[j] = 0

i = 0 #1..m-1
while ++i < m
#foreach char of query
gapB = 0
vd = vrow[0]

#DEBUG
#VV[i] = []

j = 0 #1..n-1
while ++j < n
#foreach char of subject

# Score the options
gapA = gapArow[j] = Math.max(gapArow[j] + we, vrow[j] + wo)
gapB = Math.max(gapB + we, vrow[j - 1] + wo)
align = vd + char_score(query, subject, i - 1, j - 1)
vd = vrow[j]

#Get the best option
v = vrow[j] = Math.max(align, gapA, gapB, 0)

#DEBUG
#VV[i][j] = v

#Record best score
if v > vmax
vmax = v

#DEBUG
#console.log(query,subject)
#console.table(VV);


#haystack penalty
vmax = Math.max(vmax / 2, vmax + wh * (n - m))

#substring bonus, start of string bonus
vmax += if (p = subject.toLowerCase().indexOf(query.toLowerCase())) > -1 then wex * m * (1.0 + 1.0 / (1.0 + p)) else 0

return vmax

char_score = (query, subject, i, j) ->
qi = query[i]
sj = subject[j]

if qi.toLowerCase() == sj.toLowerCase()

#Proper casing bonus
bonus = if qi == sj then wc else 0

#start of string bonus
bonus += Math.max(wst - j, 0)

#match IS a separator
if qi of sep_map
return ws + bonus

#match is first char ( place a virtual token separator before first char of string)
return wa + bonus if ( j == 0 or i == 0)

#get previous char
prev_s = subject[j - 1]
prev_q = query[i - 1]

#match FOLLOW a separator
return wa + bonus if ( prev_s of sep_map) or ( prev_q of sep_map )

#match IS Capital in camelCase (preceded by lowercase)
return wa + bonus if (sj == sj.toUpperCase() and prev_s == prev_s.toLowerCase())

#normal Match, add proper case bonus
return wm + bonus

#No match, best move will be to take a gap in either query or subject.
return -Infinity


isMatch = (query, subject) ->
m = query.length
n = subject.length

if !m or !n or m > n
return false

lq = query.toLowerCase()
ls = subject.toLowerCase()

i = -1
j = -1
k = n - 1

while ++i < m

qi = lq[i]

while ++j < n

if ls[j] == qi
break

else if j == k
return false


true

exports.basenameScore = (string, query, score) ->
index = string.length - 1
index-- while string[index] is PathSeparator # Skip trailing slashes

return 0 if score == 0
end = string.length - 1
end-- while string[end] is PathSeparator # Skip trailing slashes

basePos = string.lastIndexOf(PathSeparator, end)
baseScore = if (basePos == -1) then score else Math.max(score, exports.score(string.substring(basePos + 1, end+1), query))
score = 0.15*score + 0.85*baseScore

score

###

slashCount = 0
baseScore = 0
lastCharacter = index
base = null
while index >= 0
@@ -30,60 +201,12 @@ exports.basenameScore = (string, query, score) ->
base ?= string
index--

# Basename matches count for more.
if base is string
score *= 2
else if base
score += exports.score(base, query)

# Shallow files are scored higher
segmentCount = slashCount + 1
depth = Math.max(1, 10 - segmentCount)
score *= depth * 0.01
score

exports.score = (string, query) ->
return 1 if string is query

# Return a perfect score if the file name itself matches the query.
return 1 if queryIsLastPathSegment(string, query)

totalCharacterScore = 0
queryLength = query.length
stringLength = string.length

indexInQuery = 0
indexInString = 0

while indexInQuery < queryLength
character = query[indexInQuery++]
lowerCaseIndex = string.indexOf(character.toLowerCase())
upperCaseIndex = string.indexOf(character.toUpperCase())
minIndex = Math.min(lowerCaseIndex, upperCaseIndex)
minIndex = Math.max(lowerCaseIndex, upperCaseIndex) if minIndex is -1
indexInString = minIndex
return 0 if indexInString is -1

characterScore = 0.1
# Shallow files are scored higher
score += baseScore*( 3.0 + 3.0/(3.0+slashCount) )

# Same case bonus.
characterScore += 0.1 if string[indexInString] is character
###

if indexInString is 0 or string[indexInString - 1] is PathSeparator
# Start of string bonus
characterScore += 0.8
else if string[indexInString - 1] in ['-', '_', ' ']
# Start of word bonus
characterScore += 0.7

# Trim string to after current abbreviation match
string = string.substring(indexInString + 1, stringLength)

totalCharacterScore += characterScore

queryScore = totalCharacterScore / queryLength
((queryScore * (queryLength / stringLength)) + queryScore) / 2

queryIsLastPathSegment = (string, query) ->
if string[string.length - query.length - 1] is PathSeparator
string.lastIndexOf(query) is string.length - query.length
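
As a companion to the `isMatch` pre-check in the diff above, here is a minimal standalone sketch of an in-order, case-insensitive character check. It is an illustration under assumptions rather than the PR's code: the name `isSubsequence` is hypothetical, and the sketch uses `indexOf` with a start offset instead of the nested loops in the diff. One deliberate difference is that it returns false whenever a query character cannot be found, including the case where the previous character matched on the very last position of the subject.

```coffeescript
# Hypothetical sketch of an in-order (subsequence) check, case-insensitive.
isSubsequence = (query, subject) ->
  return false unless query and subject and query.length <= subject.length
  lq = query.toLowerCase()
  ls = subject.toLowerCase()
  j = 0                      # next position to search from in the subject
  for qi in lq
    j = ls.indexOf(qi, j)    # find the next occurrence at or after j
    return false if j is -1  # this query character is missing
    j++                      # continue searching after the matched position
  true

console.log isSubsequence('file', 'Gruntfile')
console.log isSubsequence('file', 'bile')
console.log isSubsequence('ab', 'xa')
```

A check like this is cheap compared with filling the alignment matrix, so running it first lets `score` return 0 early for candidates that cannot match.
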