
WIP: better fuzzy matches #1238

Closed
wants to merge 4 commits into from

Conversation

ghost

@ghost ghost commented Oct 21, 2015

Rough Demo for better results as described #1237

To see the difference, try helm-M-x with the default helm-candidate-number-limit of 100
and see what "elm" brings up compared to unpatched version.

@thierryvolpiatto
Member

Thanks for the PR, it is easy to see what has been done, comments, etc.

IIUC the point of this PR is to take into account the separators in the matching regexp.
This is interesting, we are indeed missing this, but I guess it could be implemented differently and more efficiently (I will see what I can do as soon as I have a day to work on this, I have very little time at the moment for complex stuff).
Also, your PR does not implement the in-buffer method (i.e. helm-source-in-buffer).
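[For readers following along: the separator-aware idea can be sketched as follows. This is a hypothetical Python illustration with made-up names, not helm's actual code, which builds the equivalent regexp in Emacs Lisp.]

```python
import re

# Hypothetical sketch: each query character after the first must begin
# a new "word", i.e. appear immediately after a separator character.
SEPARATORS = "-_./ "

def initials_pattern(query, seps=SEPARATORS):
    cls = "[" + re.escape(seps) + "]"      # matches one separator
    ncls = "[^" + re.escape(seps) + "]*"   # skips the rest of a word
    parts = [re.escape(query[0])]
    for ch in query[1:]:
        parts.append(ncls + cls + re.escape(ch))
    return "^" + "".join(parts)

pat = re.compile(initials_pattern("elm"))
print(bool(pat.search("emacs-lisp-mode")))         # True: e-l-m are word initials
print(bool(pat.search("helm-insert-latex-math")))  # False: candidate must start with "e"
```

The stock fuzzy pattern (roughly `e.*l.*m`) accepts both candidates; anchoring each query character at a word boundary is what separates the initials match from the accidental one.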

@ghost
Author

ghost commented Oct 22, 2015

Thanks for reviewing. Clearly I should have given you a branch to begin with so that review would be easier; I'll remember that in the future.

As for "more efficiently", see my response above; I think you missed something crucial about
how this works. One thing I've tried which works very well is to force the regex to require
that the first letter of the query match the first letter of the candidate. That way the regex FSM
can exit very quickly when examining candidates, and the speedup is noticeable.
I think the performance is acceptable even without it, except for longer queries (4-5 characters),
where the extra regex pass does become a little laggy without that change.

I'll look into helm-source-in-buffer because I have no idea what that is. If you mean candidates-in-buffer, this supports it but the improvement in matches is not as dramatic.

@ghost
Author

ghost commented Oct 22, 2015

IIUC the point of this PR is to take into account the separators in the matching regexp.

That's only half of it, and part of the reason why flx does so much better than helm. The other half is that you must look through all the candidates to locate the promising candidates first. It's not enough to find the first 100 "matches" and stop, because those matches are very likely to score very low.

For example, the top result helm currently returns for elm (from #1237) is:
helm-insert-latex-math.

That's a terrible fuzzy match.

@ghost
Author

ghost commented Oct 22, 2015

@thierryvolpiatto Here's the source-in-buffer support you repeatedly mentioned. I'll note again that this also worked fine before this change, but gave even better results if candidates-in-buffer was disabled. Now there's no need, and I've reset candidates-in-buffer to t again.

To enable us to find really "good" candidates before the general pass, we pin a new slot on the source called all-candidates, which holds the entire candidate list, and use that for the first pass only. This should be faster and should apply to all types of sources, although it may mean making small changes in several places to support it. Right now, anything that uses helm-comp-read should work.
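[The two-pass idea described above can be sketched roughly as follows. This is a hypothetical Python illustration with invented names, not helm's API: a strict initials scan over the full candidate list first, then a loose fuzzy scan to fill the remaining slots.]

```python
import re

def collect_candidates(query, all_candidates, limit=100):
    """Two-pass collection: scan *all* candidates for high-quality
    initials matches first, then top up with loose fuzzy matches."""
    seps = re.escape("-_./ ")
    strict = re.compile(
        "^" + re.escape(query[0]) +
        "".join("[^%s]*[%s]%s" % (seps, seps, re.escape(c)) for c in query[1:]))
    loose = re.compile(".*".join(re.escape(c) for c in query))

    preferred = [c for c in all_candidates if strict.search(c)]
    if len(preferred) >= limit:
        return preferred[:limit]
    seen = set(preferred)
    for c in all_candidates:      # second pass: loose matches fill the rest
        if c not in seen and loose.search(c):
            preferred.append(c)
            if len(preferred) == limit:
                break
    return preferred

cands = ["helm-insert-latex-math", "helm-eval-last-sexp", "emacs-lisp-mode"]
print(collect_candidates("elm", cands, 2))
# ['emacs-lisp-mode', 'helm-insert-latex-math']
```

The point of the first pass is exactly the one made earlier in the thread: stopping at the first `limit` loose matches would return the two `helm-*` candidates and never reach `emacs-lisp-mode`.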

This is still not meant to be merged, it's just an example to see what you think about doing
things this way.

For anyone following along, this is currently only enabled for the "Emacs Commands" source.

@ghost
Author

ghost commented Oct 22, 2015

Even better matches now with a tweaked regex; this works in describe-{variables,commands} too now.

@ghost
Author

ghost commented Oct 22, 2015

Better fuzzy matches are now implemented for all describe-* functions (in-source-buffer);
hcnl now brings up helm-candidate-number-limit, in case we get to tests later.

@thierryvolpiatto
Member

Alfred Kapote notifications@github.com writes:

@thierryvolpiatto Here's the source-in-buffer support you repeatedly
mentioned. I'll note again that this also worked fine before this change, but gave even better
results if candidates-in-buffer was disabled. Now there's no need and I've reset candidates-in-buffer to t again.

To enable us to find really "good" candidates before the general pass, we pin a new slot on the source called all-candidates which holds the entire candidate list and
use that for the first pass only. This should be faster and should apply to all types of sources, although it may mean
making small changes in several places to support it. Right now, anything that uses helm-comp-read should work.

This is still not meant to be merged, it's just an example to see what you think about doing
things this way.

I don't think doing two passes is a good thing, it will significantly
hurt performance. Also, in your second pass the regexp you use is not
cached, is it?
When the regexp is not cached, things become really slow.

Keep in mind that for fuzzy matching, the difficult part is finding the
good balance between "being fast/being smart".

Thierry
https://emacs-helm.github.io/helm/

@thierryvolpiatto
Member

Alfred Kapote notifications@github.com writes:

IIUC the point of this PR is to take into account the separators in the matching regexp.
That's only half of it. The other half is that you must look through all the candidates
to locate some promising candidates first. It's not enough to find a 100 "matches" and stop,
because those matches are very likely to score very low.

As an example, the top result helm currently returns for elm (from #1237) is:
helm-insert-latex-math.

That's a terrible fuzzy match.

It depends, maybe yes, maybe no. If you focus on elm = emacs-lisp-mode, yes it
is a bad match, but people may be trying to find something else too.
If you write something more precise you match immediately, so what we
have currently is not as bad as you suggest.

e.g "emlsp" => emacs-lisp-mode.

Thierry
https://emacs-helm.github.io/helm/

@ghost
Author

ghost commented Oct 23, 2015

I don't think doing two passes is a good thing, it will significantly hurt performance.

Yes, but slower isn't in itself a problem, only "too slow" is, and it should be an opt-in feature like fuzzy matching anyway. My CPU isn't very fast, and for me this is acceptable even in apropos with 10k+ entries.

in your second pass the regexp you use is not cached, is it?

I don't see why. Constructing the regex happens once after a query is updated, when the closure is constructed. Do you mean something else?

Keep in mind that for fuzzy matching, the difficult part is finding the good balance between "being
fast/being smart".

Absolutely, otherwise you could simply run the score function over all candidates. But I would add
that the idea of "fuzzy" matching is to DWIM: behave as the user expects. Querying by initials
is a very common expectation, I think.

elm matches helm-insert-latex-math That's a terrible fuzzy match.

Maybe yes, maybe no, people may try to find something else too.
If you write something more precise you match immediately: "emlsp" => emacs-lisp-mode.

Sure, it works for some queries, but often enough it doesn't do what I expect. The problem is
specifically that for initials queries the results almost never bring up the best candidates, even though
they express a very specific class of matches. Part of the idea is to get there with a minimum of keystrokes and a minimum of surprise, without trial and error to find a working query.
That's why I think initials queries deserve better support. I agree that for longer queries the current code works very well.

I'd like helm to guess what I'm thinking more of the time, and I think this demo shows that could
be done with acceptable (I think) performance. For me, querying by initials is a major use case.

Also, I would love to do the same kind of "preferred-matching" for "EMacs-Lisp-Mode" but I think the regex for all combinations will be too slow.

@ghost
Author

ghost commented Oct 23, 2015

@thierryvolpiatto, not sure if this is what you meant by "regex caching", but if it was, it's a good idea.

Once a scan of all candidates for QUERY returns fewer than LIMIT matches, we can be certain that any match for an extension of QUERY we might find in a full scan is already present in the (much smaller) list of matches we previously found for QUERY. By caching results for previous queries we can avoid doing an extra full scan fairly often, so the full scan usually happens just once or twice overall.

This doesn't make a huge difference afaict (it seems fast enough already) but maybe it will for some queries.
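[A hypothetical Python sketch of this caching trick; the class and method names are invented for illustration, not taken from helm:]

```python
import re

class MatchCache:
    """When a previous, shorter query already returned fewer than LIMIT
    matches, every match for the extended query must be among them, so
    we can rescan that small list instead of all candidates."""
    def __init__(self, limit=100):
        self.limit = limit
        self.cache = {}           # query -> list of matching candidates

    def matches(self, query, all_candidates):
        pool = all_candidates
        for n in range(len(query) - 1, 0, -1):
            prev = self.cache.get(query[:n])
            if prev is not None and len(prev) < self.limit:
                pool = prev       # a full scan is provably redundant
                break
        pat = re.compile(".*".join(re.escape(c) for c in query))
        found = [c for c in pool if pat.search(c)]
        self.cache[query] = found
        return found[:self.limit]

cache = MatchCache(limit=10)
cands = ["helm-candidate-number-limit", "helm-current-buffer", "find-file"]
cache.matches("hcn", cands)        # full scan, result cached
print(cache.matches("hcnl", cands))  # rescans only the cached "hcn" matches
```

The correctness argument is the one stated above: extending a query can only shrink its match set, so once a prefix query's matches fit under the limit, they are a superset of any longer query's matches.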

Please consider cherry-picking:

  • 97f3394 unrelated docstring addition.
  • e3d54f1 unrelated optimization: hash table resizing is usually an expensive operation, and a better initial size guess can improve performance.

@ghost
Author

ghost commented Oct 23, 2015

Cleaned up. Much smaller patch now and there's no longer a need to modify the existing sources.
If :fuzzy-match is set for the source it should just work.

(defun helm-match-from-candidates (cands matchfns match-part-fn limit source)
(let (matches)

(defun helm--mapconcat-initials-pattern (pattern seperators)
Member


Can you explain why this pattern is so restrictive?

If this is a preliminary filtering step, then inserting ".*" in between characters should be enough. Is there benchmarking to prove this is necessary?

The more candidates flx sees the better your results will be.

Author


What you're suggesting I've already covered in the original issue #1237, please read it.
Also, inserting greedy matching operators can and will drop the best match from the list in some cases.

The more candidates flx sees the better your results will be.

Right, but at one point I had to set the limit to 2500 to get the matches I wanted; perf was acceptable but not great, and it occasionally caused long GCs (yes, I know about the GC tuning you recommend). Most of that work is completely unnecessary: ranking 2000 poor-quality "matches" with a complicated lisp function to get the one I want is inefficient.

As for benchmarks, if you think this can all be handled by flx directly I'd love for you to finish that implementation and make all this work obsolete. For now, this is better than any other solution available for helm afaik.
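[To make the "greedy operators can drop the best match" point concrete, here is a small hypothetical Python illustration: with a loose `.*`-between-characters pattern and a hard candidate limit, the scan can fill up and stop before the best candidate is even examined. The candidate names are from this thread; the limit is artificially small.]

```python
import re

# Loose stock-style fuzzy pattern for the query "elm".
loose = re.compile("e.*l.*m")

candidates = ["helm-insert-latex-math", "helm-eval-last-sexp",
              "emacs-lisp-mode"]         # the best initials match comes last
limit = 2
matches = []
for c in candidates:
    if loose.search(c):
        matches.append(c)
        if len(matches) == limit:        # limit reached: scan stops early
            break
print(matches)  # emacs-lisp-mode never makes the list
```

Nothing about the loose pattern ranks candidates, so whether the best one survives depends entirely on where it happens to sit in the scan order.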

Member


So this is designed to not need flx? That's really interesting. I will follow along and see how it goes.

Author


No, it works best as a companion to helm-flx. The idea is to keep the number of matches sent to flx small-ish while trying harder to have the best candidates included in those flx gets to see.

You could use the latest iteration of the regex to replace helm's stock one for fuzzy matching
and even without flx it should work quite well. I haven't tried it though.

@ghost
Author

ghost commented Oct 24, 2015

Also, I would love to do the same kind of "preferred-matching" for "EMacs-Lisp-Mode" but I think
the regex for all combinations will be too slow.

Turns out this works just fine after all, so I've added a new knob to trade off speed for accuracy.

helm-preferred-max-group-length now lets you control how hard to try variations.
For the query abcd:
With a value of 1 it will match a-b-c-d.
With a value of 2 it will also match ab-c-d, a-bc-d, etc.
And so on.

I don't notice a slowdown even with this on, and it seems to work fantastically well.
Even with a very low (setq helm-candidate-number-limit 10), all the following queries bring up helm-candidate-number-limit as the first match in describe-variable:

hcnl
hec
hcn
henl
...
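[A hypothetical Python sketch of how such grouped-initials patterns could be generated; function names are invented for illustration, and helm's actual implementation builds the regexp in Emacs Lisp:]

```python
import re

def groupings(query, max_group_len):
    """All ways to split QUERY into contiguous groups of at most
    max_group_len characters, e.g. "abcd" -> a|b|c|d, ab|c|d, ..."""
    if not query:
        return [[]]
    out = []
    for n in range(1, min(max_group_len, len(query)) + 1):
        for rest in groupings(query[n:], max_group_len):
            out.append([query[:n]] + rest)
    return out

def preferred_pattern(query, max_group_len, seps="-_./ "):
    # Each group must sit at the start of a word (right after a separator).
    cls = "[%s]" % re.escape(seps)
    ncls = "[^%s]*" % re.escape(seps)
    alts = [(ncls + cls).join(re.escape(p) for p in g)
            for g in groupings(query, max_group_len)]
    return "^(?:%s)" % "|".join(alts)

pat = re.compile(preferred_pattern("hcnl", 2))
print(bool(pat.search("helm-candidate-number-limit")))  # True
```

The number of alternatives grows with the group length (for "abcd", 1 grouping at length 1, 5 at length 2), which is the speed/accuracy trade-off the knob controls.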

@ghost
Author

ghost commented Oct 24, 2015

Now that the full candidate list is cached on first invocation instead of living in a slot on the source,
there's a mysterious bug when a query is typed quickly instead of one character at a time. I'll fix this later.

(fixed)

@thierryvolpiatto, I've cleaned this up, added support for source-in-buffer and as much caching as I could. Performance is good even in apropos, the fuzzy matches are vastly better IMO and afaik nothing is broken by it (I've tested find-files, describe-, M-x, apropos).

What else should I test?

helm-preferred-max-group-length controls the generation of the
regexp for matching preferred candidates.

query: "abcd"
With a value of 1 it will match a-b-c-d.
With a value of 2 it will also match ab-c-d, a-bc-d, etc.
With a value of 3 it will also match abc-d, a-bcd.
And so on.
@thierryvolpiatto
Member

MikeSanMichel2 notifications@github.com writes:

@thierryvolpiatto, I've cleaned this up, added support for
source-in-buffer and as much caching as I could. Performance is good even
in apropos, the fuzzy matches are vastly better IMO and afaik nothing
is broken by it (I've tested find-files, describe-, M-x, apropos).

What else should I test?

I haven't had the time to look at all your code carefully, but from the
little I have seen, it still has errors and it changes far too many things,
which I guess will create tons of bug reports. I am also not sure the
performance will not be affected.

As I said, I think there is a better approach to fix this issue
(better fuzzy matching and scoring on prefixes after a separator).

So thanks for your work, we can keep it around and work on it
eventually, but it is not ready to merge into master; it would break helm
badly IMO.
Also, your account has now disappeared and your branch is gone...

Thierry
https://emacs-helm.github.io/helm/

@thierryvolpiatto
Member

I am closing this because I see you now have a ghost account; feel free to submit again.


3 participants