
WIP: better fuzzy matches #1238

Closed
wants to merge 4 commits into from

Conversation

ghost

@ghost ghost commented Oct 21, 2015

Rough Demo for better results as described #1237

To see the difference, try helm-M-x with the default helm-candidate-number-limit of 100
and see what "elm" brings up compared to unpatched version.

@thierryvolpiatto
Member

Thanks for the PR, it is easy to see what has been done, comments, etc.

IIUC the point of this PR is to take into account the separators in the matching regexp.
This is interesting, we are indeed missing this, but I guess it could be implemented differently and more efficiently (I will see what I can do as soon as I have a day to work on this, I have very little time at the moment for complex stuff).
Also, your PR does not implement the in-buffer method (i.e. helm-source-in-buffer).
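[For readers following along: the separator-aware idea can be sketched as follows. This is a hypothetical Python illustration with made-up names, not helm's actual code, which builds the equivalent regexp in Emacs Lisp.]

```python
import re

# Hypothetical sketch: each query character after the first must begin
# a new "word", i.e. appear immediately after a separator character.
SEPARATORS = "-_./ "

def initials_pattern(query, seps=SEPARATORS):
    cls = "[" + re.escape(seps) + "]"      # matches one separator
    ncls = "[^" + re.escape(seps) + "]*"   # skips the rest of a word
    parts = [re.escape(query[0])]
    for ch in query[1:]:
        parts.append(ncls + cls + re.escape(ch))
    return "^" + "".join(parts)

pat = re.compile(initials_pattern("elm"))
print(bool(pat.search("emacs-lisp-mode")))         # True: e-l-m are word initials
print(bool(pat.search("helm-insert-latex-math")))  # False: candidate must start with "e"
```

The stock fuzzy pattern (roughly `e.*l.*m`) accepts both candidates; anchoring each query character at a word boundary is what separates the initials match from the accidental one.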

@ghost
Author

ghost commented Oct 22, 2015

Thanks for reviewing. Clearly I should have given you a branch to begin with so that review would be easier; I'll remember that in the future.

As for "more efficiently", see my response above; I think you missed something crucial about
how this works. One thing I've tried which works very well is to force the regex to require
that the first letter of the query match the first letter of the candidate. That way the regex FSM
can exit very quickly when examining candidates, and the speedup is noticeable.
I think the performance is acceptable even without it, except for longer queries (4-5 characters),
where the extra regex pass does become a little laggy without that change.

I'll look into helm-source-in-buffer because I have no idea what that is. If you mean candidates-in-buffer, this supports it but the improvement in matches is not as dramatic.

@ghost
Author

ghost commented Oct 22, 2015

IIUC the point of this PR is to take into account the separators in the matching regexp.

That's only half of it, and part of the reason why flx does so much better than helm. The other half is that you must look through all the candidates to locate the promising candidates first. It's not enough to find the first 100 "matches" and stop, because those matches are very likely to score very low.

For example, the top result helm currently returns for elm (from #1237) is:
helm-insert-latex-math.

That's a terrible fuzzy match.

@ghost
Author

ghost commented Oct 22, 2015

@thierryvolpiatto Here's the source-in-buffer support you repeatedly mentioned. I'll note again that this also worked fine before this change, but gave even better results if candidates-in-buffer was disabled. Now there's no need, and I've reset candidates-in-buffer to t again.

To enable us to find really "good" candidates before the general pass, we pin a new slot on the source called all-candidates, which holds the entire candidate list, and use that for the first pass only. This should be faster and should apply to all types of sources, although it may mean making small changes in several places to support it. Right now, anything that uses helm-comp-read should work.
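[The two-pass idea described above can be sketched roughly as follows. This is a hypothetical Python illustration with invented names, not helm's API: a strict initials scan over the full candidate list first, then a loose fuzzy scan to fill the remaining slots.]

```python
import re

def collect_candidates(query, all_candidates, limit=100):
    """Two-pass collection: scan *all* candidates for high-quality
    initials matches first, then top up with loose fuzzy matches."""
    seps = re.escape("-_./ ")
    strict = re.compile(
        "^" + re.escape(query[0]) +
        "".join("[^%s]*[%s]%s" % (seps, seps, re.escape(c)) for c in query[1:]))
    loose = re.compile(".*".join(re.escape(c) for c in query))

    preferred = [c for c in all_candidates if strict.search(c)]
    if len(preferred) >= limit:
        return preferred[:limit]
    seen = set(preferred)
    for c in all_candidates:      # second pass: loose matches fill the rest
        if c not in seen and loose.search(c):
            preferred.append(c)
            if len(preferred) == limit:
                break
    return preferred

cands = ["helm-insert-latex-math", "helm-eval-last-sexp", "emacs-lisp-mode"]
print(collect_candidates("elm", cands, 2))
# ['emacs-lisp-mode', 'helm-insert-latex-math']
```

The point of the first pass is exactly the one made earlier in the thread: stopping at the first `limit` loose matches would return the two `helm-*` candidates and never reach `emacs-lisp-mode`.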

This is still not meant to be merged, it's just an example to see what you think about doing
things this way.

For anyone following along, this is currently only enabled for the "Emacs Commands" source.

@ghost
Author

ghost commented Oct 22, 2015

Even better matches now with a tweaked regex; this works in describe-{variables,commands} too now.

@ghost
Author

ghost commented Oct 22, 2015

Better fuzzy matches are now implemented for all describe-* functions (in-source-buffer);
hcnl now brings up helm-candidate-number-limit, in case we get to tests later.

@thierryvolpiatto
Member

Alfred Kapote notifications@github.com writes:

@thierryvolpiatto Here's the source-in-buffer support you repeatedly
mentioned. I'll note again that this also worked fine before this change, but gave even better
results if candidates-in-buffer was disabled. Now there's no need and I've reset candidates-in-buffer to t again.

To enable us to find really "good" candidates before the general pass, we pin a new slot on the source called all-candidates which holds the entire candidate list and
use that for the first pass only. This should be faster and should apply to all types of sources, although it may mean
making small changes in several places to support it. Right now, anything that uses helm-comp-read should work.

This is still not meant to be merged, it's just an example to see what you think about doing
things this way.

I don't think doing two passes is a good thing, it will significantly
hurt performance. Also, in your second pass the regexp you use is not
cached, is it?
When the regexp is not cached, things become really slow.

Keep in mind that for fuzzy matching, the difficult part is finding the
good balance between "being fast/being smart".

Thierry
https://emacs-helm.github.io/helm/

@thierryvolpiatto
Member

Alfred Kapote notifications@github.com writes:

IIUC the point of this PR is to take into account the separators in the matching regexp.
That's only half of it. The other half is that you must look through all the candidates
to locate some promising candidates first. It's not enough to find a 100 "matches" and stop,
because those matches are very likely to score very low.

As an example, the top result helm currently returns for elm (from #1237) is:
helm-insert-latex-math.

That's a terrible fuzzy match.

It depends, maybe yes, maybe no. If you focus on elm = emacs-lisp-mode, yes it
is a bad match, but people may be trying to find something else too.
If you write something more precise you match immediately, so what we
have currently is not as bad as you suggest.

e.g "emlsp" => emacs-lisp-mode.

Thierry
https://emacs-helm.github.io/helm/

@ghost
Author

ghost commented Oct 23, 2015

I don't think doing two passes is a good thing, it will significantly hurt performance.

Yes, but slower isn't in itself a problem, only "too slow" is, and it should be an opt-in feature like fuzzy matching anyway. My CPU isn't very fast, and for me this is acceptable even in apropos with 10k+ entries.

in your second pass the regexp you use is not cached, is it?

I don't see why. Constructing the regex happens once after a query is updated, when the closure is constructed. Do you mean something else?

Keep in mind that for fuzzy matching, the difficult part is finding the good balance between "being
fast/being smart".

Absolutely, otherwise you could simply run the score function over all candidates. But I would add
that the idea of "fuzzy" matching is to DWIM: behave as the user expects. Querying by initials
is a very common expectation, I think.

elm matches helm-insert-latex-math That's a terrible fuzzy match.

Maybe yes, maybe no, people may try to find something else too.
If you write something more precise you match immediately: "emlsp" => emacs-lisp-mode.

Sure, it works for some queries, but often enough it doesn't do what I expect. The problem is
specifically that for initials queries the results almost never bring up the best candidates, even though
they express a very specific class of matches. Part of the idea is to get there with a minimum of keystrokes and a minimum of surprise, without trial and error to find a working query.
That's why I think initials queries deserve better support. I agree that for longer queries the current code works very well.

I'd like helm to guess what I'm thinking more of the time, and I think this demo shows that could
be done with acceptable (I think) performance. For me, querying by initials is a major use case.

Also, I would love to do the same kind of "preferred-matching" for "EMacs-Lisp-Mode" but I think the regex for all combinations will be too slow.

@ghost
Author

ghost commented Oct 23, 2015

@thierryvolpiatto, not sure if this is what you meant by "regex caching", but if it was, it's a good idea.

Once a scan of all candidates for QUERY returns fewer than LIMIT matches, we can be certain that any match for an extension of QUERY we might find in a full scan is already present in the (much smaller) list of matches we previously found for QUERY. By caching results for previous queries we can avoid doing an extra full scan fairly often, so the full scan usually happens just once or twice overall.

This doesn't make a huge difference afaict (it seems fast enough already) but maybe it will for some queries.
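[A hypothetical Python sketch of this caching trick; the class and method names are invented for illustration, not taken from helm:]

```python
import re

class MatchCache:
    """When a previous, shorter query already returned fewer than LIMIT
    matches, every match for the extended query must be among them, so
    we can rescan that small list instead of all candidates."""
    def __init__(self, limit=100):
        self.limit = limit
        self.cache = {}           # query -> list of matching candidates

    def matches(self, query, all_candidates):
        pool = all_candidates
        for n in range(len(query) - 1, 0, -1):
            prev = self.cache.get(query[:n])
            if prev is not None and len(prev) < self.limit:
                pool = prev       # a full scan is provably redundant
                break
        pat = re.compile(".*".join(re.escape(c) for c in query))
        found = [c for c in pool if pat.search(c)]
        self.cache[query] = found
        return found[:self.limit]

cache = MatchCache(limit=10)
cands = ["helm-candidate-number-limit", "helm-current-buffer", "find-file"]
cache.matches("hcn", cands)        # full scan, result cached
print(cache.matches("hcnl", cands))  # rescans only the cached "hcn" matches
```

The correctness argument is the one stated above: extending a query can only shrink its match set, so once a prefix query's matches fit under the limit, they are a superset of any longer query's matches.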

Please consider cherry-picking:

  • 97f3394 unrelated docstring addition.
  • e3d54f1 unrelated optimization: hash table resizing is usually an expensive operation, and a better initial size guess can improve performance.

@ghost
Author

ghost commented Oct 23, 2015

Cleaned up. Much smaller patch now and there's no longer a need to modify the existing sources.
If :fuzzy-match is set for the source it should just work.

(defun helm-match-from-candidates (cands matchfns match-part-fn limit source)
(let (matches)

(defun helm--mapconcat-initials-pattern (pattern seperators)
Member


Can you explain why this pattern is so restrictive?

If this is a preliminary filtering step, then inserting ".*" in between characters should be enough. Is there benchmarking to prove this is necessary?

The more candidates flx sees the better your results will be.

Author


What you're suggesting I've already covered in the original issue #1237, please read it.
Also, inserting greedy matching operators can and will drop the best match from the list in some cases.

The more candidates flx sees the better your results will be.

Right, but at one point I had to set the limit to 2500 to get the matches I wanted; perf was acceptable but not great, and it occasionally caused long GCs (yes, I know about the GC tuning you recommend). Most of that work is completely unnecessary: ranking 2000 poor-quality "matches" with a complicated lisp function to get the one I want is inefficient.

As for benchmarks, if you think this can all be handled by flx directly I'd love for you to finish that implementation and make all this work obsolete. For now, this is better than any other solution available for helm afaik.
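[To make the "greedy operators can drop the best match" point concrete, here is a small hypothetical Python illustration: with a loose `.*`-between-characters pattern and a hard candidate limit, the scan can fill up and stop before the best candidate is even examined. The candidate names are from this thread; the limit is artificially small.]

```python
import re

# Loose stock-style fuzzy pattern for the query "elm".
loose = re.compile("e.*l.*m")

candidates = ["helm-insert-latex-math", "helm-eval-last-sexp",
              "emacs-lisp-mode"]         # the best initials match comes last
limit = 2
matches = []
for c in candidates:
    if loose.search(c):
        matches.append(c)
        if len(matches) == limit:        # limit reached: scan stops early
            break
print(matches)  # emacs-lisp-mode never makes the list
```

Nothing about the loose pattern ranks candidates, so whether the best one survives depends entirely on where it happens to sit in the scan order.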

Member


So this is designed to not need flx? That's really interesting. I will follow along and see how it goes.

Author


No, it works best as a companion to helm-flx. The idea is to keep the number of matches sent to flx small-ish while trying harder to have the best candidates included in those flx gets to see.

You could use the latest iteration of the regex to replace helm's stock one for fuzzy matching
and even without flx it should work quite well. I haven't tried it though.

@ghost
Author

ghost commented Oct 24, 2015

Also, I would love to do the same kind of "preferred-matching" for "EMacs-Lisp-Mode" but I think
the regex for all combinations will be too slow.

Turns out this works just fine after all, so I've added a new knob to trade off speed for accuracy.

helm-preferred-max-group-length now lets you control how hard to try variations.
For the query abcd:
With a value of 1 it will match a-b-c-d.
With a value of 2 it will also match ab-c-d, a-bc-d, etc.
And so on.

I don't notice a slowdown even with this on, and it seems to work fantastically well.
Even with a very low (setq helm-candidate-number-limit 10), all the following queries bring up helm-candidate-number-limit as the first match in describe-variable:

hcnl
hec
hcn
henl
...
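[A hypothetical Python sketch of how such grouped-initials patterns could be generated; function names are invented for illustration, and helm's actual implementation builds the regexp in Emacs Lisp:]

```python
import re

def groupings(query, max_group_len):
    """All ways to split QUERY into contiguous groups of at most
    max_group_len characters, e.g. "abcd" -> a|b|c|d, ab|c|d, ..."""
    if not query:
        return [[]]
    out = []
    for n in range(1, min(max_group_len, len(query)) + 1):
        for rest in groupings(query[n:], max_group_len):
            out.append([query[:n]] + rest)
    return out

def preferred_pattern(query, max_group_len, seps="-_./ "):
    # Each group must sit at the start of a word (right after a separator).
    cls = "[%s]" % re.escape(seps)
    ncls = "[^%s]*" % re.escape(seps)
    alts = [(ncls + cls).join(re.escape(p) for p in g)
            for g in groupings(query, max_group_len)]
    return "^(?:%s)" % "|".join(alts)

pat = re.compile(preferred_pattern("hcnl", 2))
print(bool(pat.search("helm-candidate-number-limit")))  # True
```

The number of alternatives grows with the group length (for "abcd", 1 grouping at length 1, 5 at length 2), which is the speed/accuracy trade-off the knob controls.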

@ghost
Author

ghost commented Oct 24, 2015

Now that the full candidate list is cached on first invocation instead of living in a slot on the source,
there's a mysterious bug when a query is typed quickly instead of one character at a time. I'll fix this later.

(fixed)

@thierryvolpiatto, I've cleaned this up, added support for source-in-buffer and as much caching as I could. Performance is good even in apropos, the fuzzy matches are vastly better IMO and afaik nothing is broken by it (I've tested find-files, describe-, M-x, apropos).

What else should I test?

helm-preferred-max-group-length controls the generation of the
regexp for matching preferred candidates.

query: "abcd"
With a value of 1 it will match a-b-c-d.
With a value of 2 it will also match ab-c-d, a-bc-d, etc.
With a value of 3 it will also match abc-d, a-bcd.
And so on.
@thierryvolpiatto
Member

MikeSanMichel2 notifications@github.com writes:

@thierryvolpiatto, I've cleaned this up, added support for
source-in-buffer and as much caching as I could. Performance is good even
in apropos, the fuzzy matches are vastly better IMO and afaik nothing
is broken by it (I've tested find-files, describe-, M-x, apropos).

What else should I test?

I haven't had the time to look at all your code carefully, but from the
little I have seen, it still has errors and it changes far too many things,
which I guess will create tons of bug reports. I am also not sure the
performance will not be affected.

As I said, I think there is a better approach to fix this issue
(better fuzzy matching and scoring on prefixes after a separator).

So thanks for your work, we can keep it around and work on it
eventually, but it is not ready to merge into master; it would break helm
badly IMO.
Also, your account has now disappeared and your branch is gone...

Thierry
https://emacs-helm.github.io/helm/

@thierryvolpiatto
Member

I am closing this because I see you now have a ghost account; feel free to submit again.


3 participants