Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve approximate/fuzzy string matching in quick open dialog search #82200

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

samsface
Copy link
Contributor

@samsface samsface commented Sep 23, 2023

PR to achieve what this issue is asking for:
godotengine/godot-proposals#7771

This PR modifies the editor quick open dialog to use a fuzzy search algorithm inspired by the fuzzy file search in Visual Studio Code.

It has a few improvements for the user:

  • Highlights matching chars in strings to give the user feedback on how the search works. Most fuzzy search algorithms can seem random/bizarre without this feedback.
  • The fuzzy search scores and sorts matches by a heuristic that ranks matching sequence length over simple string similarity.
  • The fuzzy search algorithm is also specialized for files paths and will weigh the file name heaviest.
  • It allows the user to arbitrarily enter multiple tokens for their query.
  • Written decoupled from the dialog so we can apply this freely throughout the editor.

image

@djrain
Copy link

djrain commented Sep 23, 2023

I'm guessing there is some technical reason you didn't use highlights for matches? like this graphic mockup

search_mockup

@AThousandShips AThousandShips changed the title fuzzy search poc Fuzzy search proof of concept Sep 23, 2023
@RPicster
Copy link
Contributor

Would it be possible to expose the fuzzy search in the API so it would be possible to use in GDScript?

@samsface
Copy link
Contributor Author

Would it be possible to expose the fuzzy search in the API so it would be possible to use in GDScript?

Yep totally doable.

@samsface
Copy link
Contributor Author

samsface commented Sep 24, 2023

I'm guessing there is some technical reason you didn't use highlights for matches? like this graphic mockup

search_mockup

I wanted to do exactly this but tree view items don’t allow individual character background or font colors

I explored the idea of making the tree items all rich text views, but abandoned. We loose too much nice behaviors the default tree view items give us.

I also explored adding ascii controls characters to the text views. So like how console terminals can print colors/italics by printing special chars hidden to the user but tells the terminal to start printing all text as red from now or whatever. But this kinda breaks the contribution guidelines as I’d be altering core components to achieve something far away (in terms of the arch diagram distance).

Atm I’m hacking the underline from my screenshot by using a special Unicode char that underlines the previous letter. If anyone has an idea to achieve better highlights would be amazing.

Update on the last sentence. @djrain figured out a way to property highlight the matches and going with this solution.

editor/editor_quick_open.cpp Outdated Show resolved Hide resolved
editor/fuzzy_search.cpp Outdated Show resolved Hide resolved
editor/fuzzy_search.cpp Outdated Show resolved Hide resolved
editor/fuzzy_search.cpp Outdated Show resolved Hide resolved
editor/fuzzy_search.cpp Outdated Show resolved Hide resolved
editor/fuzzy_search.cpp Outdated Show resolved Hide resolved
editor/fuzzy_search.cpp Outdated Show resolved Hide resolved
editor/fuzzy_search.cpp Outdated Show resolved Hide resolved
editor/fuzzy_search.h Outdated Show resolved Hide resolved
editor/fuzzy_search.cpp Outdated Show resolved Hide resolved
@samsface samsface marked this pull request as ready for review October 4, 2023 18:42
@samsface samsface changed the title Fuzzy search proof of concept Improved fuzzy search in quick open dialog Oct 4, 2023
@samsface samsface changed the title Improved fuzzy search in quick open dialog Improve approximate/fuzzy string matching in quick open dialog search Oct 4, 2023
@a-johnston
Copy link

a-johnston commented Sep 24, 2024

I wanted to add this functionality as well and found this PR. I've rebased it to the latest master and made some changes here https://github.com/a-johnston/godot/tree/fuzzy-search if you'd care to incorporate or consider them. Or if it's easier I could commandeer/make a new pr.

High level changes from this branch are

  • The query is case sensitive iff it includes uppercase characters, otherwise it is case insensitive
  • Non-Levenshtein distance heuristic with a subsequence matching algorithm inspired by one of fzf's algorithms
    • The fzf inspired backwards and forwards pass biases for compact results later in the string
    • New heuristic score biases for longer matches, exact matches, matches along word boundaries, and filename matches
    • Typos are still handled by allowing a fixed number of characters of the subsequence to be missing (at the moment, 2)
  • All results which were not pruned for low score are included in the sorting step, rather than quitting early at max results
    • I found that for queries with common letters, ie "test", there were so many results that even with pruning (although I'm sure all these knobs can be tweaked more) that picking the first N non-pruned results could omit higher scoring results.
    • At least on my system, I did not notice any pause/stutter after removing the early stop.
  • Sorting breaks score ties on target string length, to help ensure the list stays consistently ordered for minor query changes

I tested the functionality of the changes in a project containing 1400+ files and at least for my project and queries I seemed to get overall high quality results, although I'm sure the various magic numbers could be tweaked further.

A missing a still results in the items within the "cameras" dir being the top results:
image

A typo where the a would go also still results in these items being prioritized:
image

Multiple terms can still be included (although unlike the example in this pr, they match in order):
image

@a-johnston
Copy link

a-johnston commented Sep 24, 2024

I tested the same 1400+ file project on a fairly underpowered linux laptop (i5-7200U) and still encountered no issues with allowing more results into the sorting step. I also ran into a case where a suboptimal match was being selected and scored (it still showed up in the results but not as high as I'd like) so I may try to extend the matcher to return the optimal match rather than the greedy one and then re-verify performance.

@samsface
Copy link
Contributor Author

samsface commented Sep 27, 2024

@a-johnston There's also another PR on going to reinvent the quick search dialog: #56772. I've been meaning to reach out to it's creator and ask could we merge our changes.

Just forked and tested your algorithm. In my opinion it's noticeably slower but it does seem to give better matches some of the time. Let me set up a suite of tests cases to benchmark for accuracy, speed and misspelling correction just so we can be scientists here and pick the optimum and also help out anyone who wants to modify or try a new algo in the future.

Also there seems to be some weird bug where the quick search is showing the result, dropping it and then showing it again. Did you notice that? It's even in my orignal branch after rebasing on master.

@a-johnston
Copy link

Ah I hadn't seen that pr. I especially like the idea of adding new behavior controls and new editor settings; it seems worthwhile also adding an option for fuzzy vs exact matching. I also haven't noticed the quick search bug you mention; does it happen every time you change the query or just when the dialogue initially opens?

As far as performance goes, I'm not surprised if it feels slower considering it is sorting all results above the cutoff rather than the first N. I wanted to output the time to filter and graph the score distribution for some of the queries in order to be more guided about setting scoring and filtering criteria but for some, probably silly, reason I couldn't get any new debug output to show up. I wouldn't be surprised if changes to the scoring and threshold could substantially speed it up without much degradation in quality. There's also the option to heapify and pop up to the number of max results times to avoid sorting the presumably long tail of worse results (especially for short queries). I started to do that earlier but I wasn't sure the best way to do so, and it seemed already fast enough on my systems/projects. Definitely room for improvement.

@samsface
Copy link
Contributor Author

I'll post back here when I have some data.

@a-johnston
Copy link

Turns out SortArray already had what I was hoping to do so I've updated my branch to use partial_sort. On my older laptop (i5-7200u), sorting and filtering on the 1400+ file project now feels basically immediate whereas before there was definitely a slight delay.

@samsface
Copy link
Contributor Author

Cool, will make sure to grab that change in my tests.

@samsface
Copy link
Contributor Author

samsface commented Sep 28, 2024

Here's the results on a 10k file project I have. Its dir tree is part of the unit test data now. Godot was compiled with scons production=yes.

Overall the new fzf algorithm is actually a bit faster. Especially with the short query optimization.

# Algorithm Query Test File StdDev in Millis Total Time in Millis Top Result
0 fzf sm.png project_dir_tree.txt 0 8 ./junk/sam.png
1 lev sm.png project_dir_tree.txt 0.3 74.1 ./junk/sam.png
2 fzf ham project_dir_tree.txt 0 11 ./entity/hamer/data.gd
3 lev ham project_dir_tree.txt 0.6 53.2 ./entity/game_trap/ha_missed_me.wav
4 fzf push background project_dir_tree.txt 0 6 ./menu/widgets/background_hint.gd
5 lev push background project_dir_tree.txt 0.3 61.1 ./entity/background_zone1/background/push.png
6 fzf push/background project_dir_tree.txt 0 6 ./campaign/throne/junk/background.png
7 lev push/background project_dir_tree.txt 0.4 61.2 ./entity/background_zone1/background/push.png
8 fzf wav missed me ha project_dir_tree.txt 0 7
9 lev wav missed me ha project_dir_tree.txt 0 61 ./entity/game_trap/ha_missed_me.wav

But it's not giving (in my opinion) results a user would expect when searching with multiple query tokens. For example, 4 - 9 in the comparison table above .

@a-johnston
Copy link

a-johnston commented Sep 28, 2024

Thanks for putting that benchmark together! I'm not surprised those queries do poorly since it expects the tokens to be in order, so I would expect it to improve for background push instead of push background for example, but I can understand why one might prefer writing the query that way. I'll try updating it to allow each token to match any point in the string, as long as it does not overlap an existing match.

edit- I just noticed it looks like a few string unit tests I added and the partial_sort use didn't make it in when you cherry picked. If it's easier, I can open a pr against your branch to collaborate

auto dataset_path = line[1];
auto expected_result = line[2];

bench(query, dataset_path, expected_result, "fzf");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the test runner accepts optional arguments to only run certain files; would it be possible to keep the benchmark as a standalone test that is disabled by default, and then a separate test which verifies that an obvious query match is the top result?

@samsface
Copy link
Contributor Author

samsface commented Sep 29, 2024

edit- I just noticed it looks like a few string unit tests I added and the partial_sort use didn't make it in when you cherry picked. If it's easier, I can open a pr against your branch to collaborate

@a-johnston Yeah I should of mentioned that, sorry. I didn't merge those because I noticed the issue of out of order queries and wanted to address that first. Yeah please open a PR against my branch so we can keep the tests and stuff. Not saying this is the best way but I had the same issue with the algorithm I used and addressed it by scoring parts of the path individually and then summing the score but heavily weighting the last part of the path. This was inspired by visual studio code's approach.

I know the test runner accepts optional arguments to only run certain files; would it be possible to keep the benchmark as a standalone test that is disabled by default, and then a separate test which verifies that an obvious query match is the top result?

Yeah that was my exact idea too. We can leave the benchmark off by default but add a bunch of smaller tests to verify the search is working.

@akien-mga akien-mga mentioned this pull request Oct 2, 2024
@samsface
Copy link
Contributor Author

samsface commented Oct 2, 2024

@a-johnston I have an idea to try tweak your algorithm to better score out of sequence tokens. Are you working on anything or should I try?

@a-johnston
Copy link

@a-johnston I have an idea to try tweak your algorithm to better score out of sequence tokens. Are you working on anything or should I try?

Feel free to give it a shot. I do have some stuff in progress and thoughts about what approaches might be best but was sorta distracted from this the last few days. I'm leaving for a trip Friday morning so I'm hoping to have a shareable commit tonight or tomorrow.

@a-johnston
Copy link

@samsface I finally updated my branch and if I have time tonight I'll rebase it onto yours and pr it there. I ended up changing almost all of how it works for better or worse; it now considers multiple subsequences and only matches the best one which does not conflict with prior token matches, so I removed the fzf inspired back-then-forward search. I also removed the special case for short queries and it didn't seem to affect much.

@samsface
Copy link
Contributor Author

samsface commented Oct 4, 2024

@samsface I finally updated my branch and if I have time tonight I'll rebase it onto yours and pr it there. I ended up changing almost all of how it works for better or worse; it now considers multiple subsequences and only matches the best one which does not conflict with prior token matches, so I removed the fzf inspired back-then-forward search. I also removed the special case for short queries and it didn't seem to affect much.

I really liked the short query optimization. For me, it made the first few keystrokes feel really responsive.

@a-johnston
Copy link

I really liked the short query optimization. For me, it made the first few keystrokes feel really responsive.

It would be interesting to see what difference it makes in the benchmark. Originally it wasn't added for performance but because short queries were more likely to have low relevance subsequence matches later in the string. Since the current implementation always does one linear scan of the target string, it no longer helps with relevance. It might help allow a full scan to be skipped in favor of a partial scan, but it does have the downside that a target it doesn't match ends up being searched twice. In any case, that part of the most recent commit can be reverted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants