Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP HTML Reporter: Fuzzy search using fuse #1442

Closed
wants to merge 1 commit into from

Conversation

ventuno
Copy link
Member

@ventuno ventuno commented May 19, 2020

Sample implementation of: #1438 (Fuse.js-based).
Replaces #1441.

@trentmwillis
Copy link
Member

RE your comment from the other PR. I do agree the results aren't as good as the fuzzysearch implementation. Would you mind tweaking the settings a bit to see if you can tune it to get better results?

I'm more partial to this approach because I like the API a bit more and like that the package seems to still be maintained and developed (unlike fuzzysearch). That said, ultimately, the user experience is more important, so if we can't get good results with this library, we'll go the other route.

@ventuno
Copy link
Member Author

ventuno commented May 21, 2020

Thanks @trentmwillis. I agree that code-wise Fuse.js is the obvious choice. It's definitely a very active project and calling search without having to pass the targets each time is a much cleaner API. Moreover, in order to import fuzzysort we need to rely on an older version rollup-plugin-commonjs (but updating the way qunit.js is built is a different conversation).

I did a few tests, but tweaking the configuration parameters doesn't seem to be very helpful, let me take you briefly through my thought process. There are four parameters:

  1. Key Weight: this is useless in our case as we only search based on one key (module name);
  2. location (default: 0)/distance (default: 100)/threshold (default: 0.6): these are used to bind the search in the string between location and distance*threshold. This means that by default a match will only trigger a result if it's in the range [0, 60] of the search string.
    As none of the modules in our example is longer than 60 characters I don't believe these values are affecting at all our search results (and nothing in tweaking these values will improve results).

In general it looks like Fuse.js will favor results which are closer to the beginning of string, that's probably why in my tests "before" or "before (skip)" score better than "Module with Promise-aware beforeEach".

I hope this helps, let me know if you have other ideas, more than happy to brainstorm and try out more things.

@ventuno
Copy link
Member Author

ventuno commented May 29, 2020

@trentmwillis any thoughts on the above?

@trentmwillis
Copy link
Member

Hey @ventuno, sorry been busy recently. After looking into the problem a bit more, I think what we'd ideally want is the ability to get a separate "fuzziness" score from Fuse.js, which may be possible eventually but not right now: krisk/Fuse#397. If we had that, we could probably provide a better set of results and sorting, since we don't really care about where in the title a match is found (in other words, we don't care about distance really).

Given the current API, the only potential adjustments I think we could try would be to:

  1. Try to filter out low score matches (i.e., the results for <script> that seem way off)
  2. Use the includeMatches option to get the actual matched term and use that to group/sort the results (i.e., make the results for beforeEach seem less randomly ordered)

Let me know if you think either of those are worth pursuing.

@ventuno
Copy link
Member Author

ventuno commented Jun 1, 2020

Thanks @trentmwillis!

  1. I tried filtering the results before, but unlike fuzzysort the documentation doesn't provide a good reference value to filter out bad results, so it was pretty difficult to identify a value that wouldn't remove relevant results. In general, I don't care how many results I get as long as the most relevant ones are on top (think of "page 2" in a Google search);
  2. is actually an interesting idea. I played around with it a bit, see my thoughts below.

First, let's take a look at the result object, so we're on the same page (I redacted the content of item as it's just the module object):

{
  "item": { ... },
  "refIndex": 40,
  "matches": [
    {
      "indices": [
        [0, 5],
        [7, 16]
      ],
      "value": "before/beforeEach/afterEach/after",
      "key": "name"
}]}

Let's focus on matches. It's an array of objects, each object represents the matches for a given key, in our case as we only have one key (the module name) so we only have one item in the array of matches. The matches object has an array of indices which contains "tuples" indicating the [<start_index>, <end_index>] of a match. In the above example the library matched: "before/beforeEach/afterEach/after".

My first idea was to compare two results by finding the one with longest unique match, let's call this computeScore1:

function computeScore1(result) {
  return Math.max(...result.matches[0].indices.map(m => m[1] - m[0]));
}

This works well if we are searching entire words (e.g.: "beforeEach"), but doesn't work as well if compound words (e.g.: "promiseawarebeforeEach"), because we will still give more weight to the longest continuous match, so I introduced computeScore2 which by adding the overall match length improves searches for compound words:

function computeScore2(result) {
  return result.matches[0].indices.map(m => m[1] - m[0]).reduce((acc, curr) => acc+curr, 0);
}

Of course this approach has its own issues as many 1-character matches could overweight an overall smaller but continuous match. We could potentially "weight" each "tuple" so that longest ones are more relevant, but this gets complicated very quickly.

I used the above functions in filterModules (replace computeScore<N> with computeScore1 or computeScore2):

function filterModules( searchText ) {
  if ( searchText === "" ) {
    return config.modules;
  }
  return fuse.search( searchText )
    .sort( ( m1, m2 ) => {
      return computeScore<N>(m2) - computeScore<N>(m1);
    })
    .map(result => result.item);
}

And collected the results (see screenshots below):

Search "beforeEach" with computeScore1

beforeEach - computeScore1

Search "promiseawarebeforeEach" with computeScore1

promiseawarebeforeEach - computeScore1

Search "beforeEach" with computeScore2

beforeEach - computeScore2

Search "promiseawarebeforeEach" with computeScore2

promiseawarebeforeEach - computeScore2

Search "beforeEach" with fuzzysort

beforeEach - fuzzysort

Search "promiseawarebeforeEach" with fuzzysort

promiseawarebeforeEach - fuzzysort

Search "promiseawarebeforeEach" with "vanilla" Fuse.js

promiseawarebeforeEach - vanilla fuse

In general, as much as I like the idea and seems to yield promising results, I'm not sure if it's a step in the right direction: ideally I would like to pick a library that works out of the box. I'm also not sure if custom sorting of results will make sense with future versions of Fuse.js.
Thoughts?

@trentmwillis
Copy link
Member

Thanks for the detailed write-up. I agree that this does seem to yield better results but the added complexity is undesirable.

After thinking about it a bit more, I think we should go with the fuzzysearch. Even though it hasn't had a new release in sometime, it is likely not something we'll need to update (assuming it gives good results). Agree?

@ventuno
Copy link
Member Author

ventuno commented Jun 4, 2020

Thanks @trentmwillis I agree with you and it's probably better to go with fuzzysort instead (#1440). I use qunit at work so I'll be heavily using this and should know if anything is off.
This doesn't stop us from keeping an eye on Fuse.js (and krisk/Fuse#397) and switch to Fuse.js later as this PR will serve as a template for future testing. Closing for now.

@ventuno ventuno closed this Jun 4, 2020
@ventuno ventuno deleted the ftr-1438-fuse-3 branch June 4, 2020 00:43
@krisk
Copy link

krisk commented Jun 13, 2020

Happened to see this in my mentions. So, hope you don't mind my jumping in here 😄

In general it looks like Fuse.js will favor results which are closer to the beginning of string, that's probably why in my tests "before" or "before (skip)" score better than "Module with Promise-aware beforeEach".

Yes. That is correct.

I tried filtering the results before, but unlike fuzzysort the documentation doesn't provide a good reference value to filter out bad results, so it was pretty difficult to identify a value that wouldn't remove relevant results. In general, I don't care how many results I get as long as the most relevant ones are on top (think of "page 2" in a Google search);

Did you try setting includeScore? This would add the relevance score in each search result, which you could then use to further refine the results.

Hey @ventuno, sorry been busy recently. After looking into the problem a bit more, I think what we'd ideally want is the ability to get a separate "fuzziness" score from Fuse.js, which may be possible eventually but not right now: krisk/Fuse#397. If we had that, we could probably provide a better set of results and sorting, since we don't really care about where in the title a match is found (in other words, we don't care about distance really).

A little more complicated, but see here for details on Fuse.js scoring mechanism. Nonetheless, I do actually plan to add an option to Fuse.js to ignore distance. It seems to have become a desired feature.

cc: @ventuno, @trentmwillis

@ventuno
Copy link
Member Author

ventuno commented Jun 13, 2020

Thanks @krisk for commenting on this.

Did you try setting includeScore? This would add the relevance score in each search result, which you could then use to further refine the results.

I tried, but I had issues identifying a good "threshold" to use to filter out bad results.

Nonetheless, I do actually plan to add an option to Fuse.js to ignore distance. It seems to have become a desired feature.

We already merged the fuzzysort version, but feel free to ping me directly or this conversation and I can give Fuse.js another try.

@krisk
Copy link

krisk commented Jun 13, 2020

@ventuno you can use fuse.js@6.1.0-alpha.0. Set the option ignoreLocation to true should produce the desired behavior.

const list = [
  'beforeEach',
  'async beforeEach test',
  'assert.async in beforeEach',
  'Module with Promise-aware beforeEach',
  'Promise-aware return values without beforeEach/afterEach',
  'before',
  'before (skip)'
]

When searching "beforeEach". Default behavior (ignoreLocation is false):

[
  { item: 'beforeEach', refIndex: 0, score: 0 },
  {
    item: 'async beforeEach test',
    refIndex: 1,
    score: 0.1972392177586917
  },
  { item: 'before', refIndex: 5, score: 0.4 },
  {
    item: 'assert.async in beforeEach',
    refIndex: 2,
    score: 0.4493775633055149
  },
  { item: 'before (skip)', refIndex: 6, score: 0.5231863610884103 },
  {
    item: 'Module with Promise-aware beforeEach',
    refIndex: 3,
    score: 0.5916079783099616
  },
  {
    item: 'Promise-aware return values without beforeEach/afterEach',
    refIndex: 4,
    score: 0.699819425905295
  }
]

When ignoreLocation is true:

[
  { item: 'beforeEach', refIndex: 0, score: 0 },
  {
    item: 'async beforeEach test',
    refIndex: 1,
    score: 0.01857804455091699
  },
  {
    item: 'assert.async in beforeEach',
    refIndex: 2,
    score: 0.01857804455091699
  },
  {
    item: 'Module with Promise-aware beforeEach',
    refIndex: 3,
    score: 0.03162277660168379
  },
  {
    item: 'Promise-aware return values without beforeEach/afterEach',
    refIndex: 4,
    score: 0.045603691595129614
  },
  { item: 'before', refIndex: 5, score: 0.4 },
  { item: 'before (skip)', refIndex: 6, score: 0.5231863610884103 }
]

Let me how that works out for you. If it's good, I'll accelerate to beta and latest.

[Tracking krisk/Fuse#438]

@ventuno
Copy link
Member Author

ventuno commented Jun 13, 2020

Thanks @krisk. I replayed these changes with the recommended version and configuration in #1447 (search results screenshots are attached to the description).

The results for "beforeEach" are indeed a lot better. I'm still puzzled by the order for "promiseawarebeforeEach", see how "Promise-aware return values without beforeEach/afterEach" comes after "Module with Promise-aware afterEach" or "async beforeEach test".

Please let's continue this conversation in #1447, I wanted to tag you there, but GH won't let me (cc @trentmwillis).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants