Improve approximate/fuzzy string matching in quick open dialog search #82200

samsface · 2023-09-23T17:17:45Z

PR to achieve what this issue is asking for:
godotengine/godot-proposals#7771

This PR modifies the editor quick open dialog to use a fuzzy search algorithm inspired by the fuzzy file search in Visual Studio Code.

It has a few improvements for the user:

- Highlights matching chars in strings to give the user feedback on how the search works. Most fuzzy search algorithms can seem random/bizarre without this feedback.
- The fuzzy search scores and sorts matches by a heuristic that ranks matching sequence length over simple string similarity.
- The fuzzy search algorithm is also specialized for file paths and weighs the file name heaviest.
- It allows the user to enter multiple tokens for their query, in any order.
- It is written decoupled from the dialog so we can apply it freely throughout the editor.
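
To illustrate the highlighting idea, here is a minimal sketch that records which characters of a target were matched so a UI could highlight them. It uses a plain greedy, case-insensitive subsequence match, which is an assumption made for brevity and not the scoring algorithm in this PR. The names `match_positions` and `highlight` are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: greedy, case-insensitive subsequence match.
// Fills `positions` with the indices in `target` that matched `query`.
// `positions` is only meaningful when the function returns true.
bool match_positions(const std::string &query, const std::string &target, std::vector<size_t> &positions) {
	positions.clear();
	size_t qi = 0;
	for (size_t ti = 0; ti < target.size() && qi < query.size(); ++ti) {
		unsigned char t = static_cast<unsigned char>(target[ti]);
		unsigned char q = static_cast<unsigned char>(query[qi]);
		if (std::tolower(t) == std::tolower(q)) {
			positions.push_back(ti);
			++qi;
		}
	}
	return qi == query.size();
}

// Hypothetical helper: wraps each matched character in brackets as a
// stand-in for real highlighting in a UI.
std::string highlight(const std::string &target, const std::vector<size_t> &positions) {
	std::string out;
	size_t pi = 0;
	for (size_t i = 0; i < target.size(); ++i) {
		if (pi < positions.size() && positions[pi] == i) {
			out += '[';
			out += target[i];
			out += ']';
			++pi;
		} else {
			out += target[i];
		}
	}
	return out;
}
```

With the query `sm` and the target `sam.png`, the matched indices are 0 and 2, so the bracketed form is `[s]a[m].png`.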

djrain · 2023-09-23T19:00:25Z

I'm guessing there is some technical reason you didn't use highlights for matches? like this graphic mockup

RPicster · 2023-09-23T19:24:19Z

Would it be possible to expose the fuzzy search in the API so it would be possible to use in GDScript?

samsface · 2023-09-23T21:44:22Z

> Would it be possible to expose the fuzzy search in the API so it would be possible to use in GDScript?

Yep totally doable.

samsface · 2023-09-24T10:04:08Z

> I'm guessing there is some technical reason you didn't use highlights for matches? like this graphic mockup

I wanted to do exactly this, but tree view items don't allow individual character background or font colors.

I explored the idea of making the tree items all rich text views, but abandoned it. We lose too many of the nice behaviors the default tree view items give us.

I also explored adding ASCII control characters to the text views, like how console terminals can print colors/italics by printing special chars that are hidden from the user but tell the terminal to start printing all text as red from now on, or whatever. But this kinda breaks the contribution guidelines, as I'd be altering core components to achieve something far away (in terms of the arch diagram distance).

Atm I'm hacking the underline from my screenshot by using a special Unicode char that underlines the previous letter. If anyone has an idea to achieve better highlights, that would be amazing.

Update on the last sentence: @djrain figured out a way to properly highlight the matches, and I'm going with this solution.

editor/editor_quick_open.cpp

editor/fuzzy_search.cpp

editor/fuzzy_search.h

a-johnston · 2024-09-24T01:01:57Z

I wanted to add this functionality as well and found this PR. I've rebased it to the latest master and made some changes here https://github.com/a-johnston/godot/tree/fuzzy-search if you'd care to incorporate or consider them. Or if it's easier I could commandeer/make a new pr.

High level changes from this branch are

- The query is case sensitive if and only if it includes uppercase characters; otherwise it is case insensitive.
- A non-Levenshtein-distance heuristic with a subsequence matching algorithm inspired by one of fzf's algorithms.
  - The fzf-inspired backward and forward passes bias toward compact results later in the string.
  - The new heuristic score favors longer matches, exact matches, matches along word boundaries, and filename matches.
  - Typos are still handled by allowing a fixed number of characters of the subsequence to be missing (at the moment, 2).
- All results that were not pruned for low score are included in the sorting step, rather than quitting early at max results.
  - I found that for queries with common letters, e.g. "test", there were so many results that even with pruning (although I'm sure all these knobs can be tweaked more), picking the first N non-pruned results could omit higher scoring results.
  - At least on my system, I did not notice any pause/stutter after removing the early stop.
- Sorting breaks score ties on target string length, to help ensure the list stays consistently ordered for minor query changes.
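
As a rough illustration of two of the points above, here is a sketch of the "smart case" rule and of a tie-breaking sort. The struct `Result` and both function names are hypothetical. Preferring the shorter target on a tie and the final lexicographic fallback are assumptions added for determinism; the branch may order ties differently.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Smart case: the query is treated as case sensitive only if it contains
// at least one uppercase character.
bool is_case_sensitive(const std::string &query) {
	return std::any_of(query.begin(), query.end(),
			[](unsigned char c) { return std::isupper(c) != 0; });
}

// Hypothetical result record: a candidate path and its heuristic score.
struct Result {
	std::string target;
	int score;
};

// Higher score first. Ties are broken by target length (shorter first, an
// assumption), then lexicographically so the order is fully deterministic.
void sort_results(std::vector<Result> &results) {
	std::sort(results.begin(), results.end(), [](const Result &a, const Result &b) {
		if (a.score != b.score) {
			return a.score > b.score;
		}
		if (a.target.size() != b.target.size()) {
			return a.target.size() < b.target.size();
		}
		return a.target < b.target;
	});
}
```

A stable tie-break matters because scores often collide for short queries; without it, typing one more character could reshuffle equally scored entries.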

I tested the functionality of the changes in a project containing 1400+ files and at least for my project and queries I seemed to get overall high quality results, although I'm sure the various magic numbers could be tweaked further.

A missing "a" still results in the items within the "cameras" dir being the top results:

A typo where the "a" would go also still results in these items being prioritized:

Multiple terms can still be included (although unlike the example in this pr, they match in order):

a-johnston · 2024-09-24T21:25:25Z

I tested the same 1400+ file project on a fairly underpowered linux laptop (i5-7200U) and still encountered no issues with allowing more results into the sorting step. I also ran into a case where a suboptimal match was being selected and scored (it still showed up in the results but not as high as I'd like) so I may try to extend the matcher to return the optimal match rather than the greedy one and then re-verify performance.

samsface · 2024-09-27T17:12:38Z

@a-johnston There's also another ongoing PR to reinvent the quick search dialog: #56772. I've been meaning to reach out to its creator and ask whether we could merge our changes.

Just forked and tested your algorithm. In my opinion it's noticeably slower, but it does seem to give better matches some of the time. Let me set up a suite of test cases to benchmark accuracy, speed, and misspelling correction, just so we can be scientists here and pick the optimum, and also help out anyone who wants to modify or try a new algo in the future.

Also, there seems to be some weird bug where the quick search is showing the result, dropping it, and then showing it again. Did you notice that? It's even in my original branch after rebasing on master.

a-johnston · 2024-09-27T20:50:36Z

Ah I hadn't seen that pr. I especially like the idea of adding new behavior controls and new editor settings; it seems worthwhile also adding an option for fuzzy vs exact matching. I also haven't noticed the quick search bug you mention; does it happen every time you change the query or just when the dialogue initially opens?

As far as performance goes, I'm not surprised if it feels slower considering it is sorting all results above the cutoff rather than the first N. I wanted to output the time to filter and graph the score distribution for some of the queries in order to be more guided about setting scoring and filtering criteria but for some, probably silly, reason I couldn't get any new debug output to show up. I wouldn't be surprised if changes to the scoring and threshold could substantially speed it up without much degradation in quality. There's also the option to heapify and pop up to the number of max results times to avoid sorting the presumably long tail of worse results (especially for short queries). I started to do that earlier but I wasn't sure the best way to do so, and it seemed already fast enough on my systems/projects. Definitely room for improvement.

samsface · 2024-09-27T21:57:44Z

I'll post back here when I have some data.

a-johnston · 2024-09-27T23:40:44Z

Turns out SortArray already had what I was hoping to do so I've updated my branch to use partial_sort. On my older laptop (i5-7200u), sorting and filtering on the 1400+ file project now feels basically immediate whereas before there was definitely a slight delay.
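
For readers unfamiliar with the technique, here is a minimal sketch of keeping only the best N results with a partial sort. It uses the standard library's `std::partial_sort` as a stand-in for Godot's `SortArray`, and the names `Scored` and `top_n` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scored candidate.
struct Scored {
	std::string path;
	int score;
};

// Returns the `max_results` highest scoring entries in descending order.
// Only the first n positions end up sorted; the long tail of worse results
// is never fully ordered, which is the point of a partial sort.
std::vector<Scored> top_n(std::vector<Scored> results, size_t max_results) {
	const size_t n = std::min(max_results, results.size());
	const std::ptrdiff_t offset = static_cast<std::ptrdiff_t>(n);
	std::partial_sort(results.begin(), results.begin() + offset, results.end(),
			[](const Scored &a, const Scored &b) { return a.score > b.score; });
	results.erase(results.begin() + offset, results.end());
	return results;
}
```

Compared with a full sort, this does roughly O(M log N) work for M candidates and N shown results, which helps most for short queries that match many files.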

samsface · 2024-09-28T07:37:12Z

Cool, will make sure to grab that change in my tests.

samsface · 2024-09-28T13:03:44Z

Here are the results on a 10k file project I have. Its dir tree is part of the unit test data now. Godot was compiled with scons production=yes.

Overall the new fzf algorithm is actually a bit faster. Especially with the short query optimization.

#	Algorithm	Query	Test File	StdDev in Millis	Total Time in Millis	Top Result
0	fzf	sm.png	project_dir_tree.txt	0	8	./junk/sam.png
1	lev	sm.png	project_dir_tree.txt	0.3	74.1	./junk/sam.png
2	fzf	ham	project_dir_tree.txt	0	11	./entity/hamer/data.gd
3	lev	ham	project_dir_tree.txt	0.6	53.2	./entity/game_trap/ha_missed_me.wav
4	fzf	push background	project_dir_tree.txt	0	6	./menu/widgets/background_hint.gd
5	lev	push background	project_dir_tree.txt	0.3	61.1	./entity/background_zone1/background/push.png
6	fzf	push/background	project_dir_tree.txt	0	6	./campaign/throne/junk/background.png
7	lev	push/background	project_dir_tree.txt	0.4	61.2	./entity/background_zone1/background/push.png
8	fzf	wav missed me ha	project_dir_tree.txt	0	7
9	lev	wav missed me ha	project_dir_tree.txt	0	61	./entity/game_trap/ha_missed_me.wav

But it's not giving (in my opinion) the results a user would expect when searching with multiple query tokens. For example, see rows 4-9 in the comparison table above.

a-johnston · 2024-09-28T22:04:52Z

Thanks for putting that benchmark together! I'm not surprised those queries do poorly since it expects the tokens to be in order, so I would expect it to improve for background push instead of push background for example, but I can understand why one might prefer writing the query that way. I'll try updating it to allow each token to match any point in the string, as long as it does not overlap an existing match.

edit- I just noticed it looks like a few string unit tests I added and the partial_sort use didn't make it in when you cherry picked. If it's easier, I can open a pr against your branch to collaborate
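
The idea of letting each token match anywhere, as long as it does not overlap an existing match, could look roughly like the sketch below. To stay short it uses exact substring matching per token instead of fuzzy subsequence matching, and it places tokens greedily, so it is an illustration of the non-overlap rule only. The function name `matches_all_tokens` is hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns true if every whitespace separated token of `query` can be placed
// somewhere in `target` without overlapping a previously placed token.
// Tokens are placed greedily at their first free occurrence, so in rare
// cases an earlier token can block a later one.
bool matches_all_tokens(const std::string &query, const std::string &target) {
	std::vector<bool> used(target.size(), false);
	std::istringstream tokens(query);
	std::string token;
	while (tokens >> token) {
		bool placed = false;
		for (size_t pos = target.find(token); pos != std::string::npos;
				pos = target.find(token, pos + 1)) {
			// Check that no character of this occurrence is already taken.
			bool free_span = true;
			for (size_t i = pos; i < pos + token.size(); ++i) {
				if (used[i]) {
					free_span = false;
					break;
				}
			}
			if (free_span) {
				for (size_t i = pos; i < pos + token.size(); ++i) {
					used[i] = true;
				}
				placed = true;
				break;
			}
		}
		if (!placed) {
			return false;
		}
	}
	return true;
}
```

Under this rule, both `push background` and `background push` match `./entity/background_zone1/background/push.png`, while `push push` does not because the only occurrence of `push` is already taken.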

a-johnston · 2024-09-28T22:14:00Z

tests/core/string/test_fuzzy_search.h

+		auto dataset_path = line[1];
+		auto expected_result = line[2];
+
+		bench(query, dataset_path, expected_result, "fzf");


I know the test runner accepts optional arguments to only run certain files; would it be possible to keep the benchmark as a standalone test that is disabled by default, and then a separate test which verifies that an obvious query match is the top result?

samsface · 2024-09-29T06:44:36Z

> edit- I just noticed it looks like a few string unit tests I added and the partial_sort use didn't make it in when you cherry picked. If it's easier, I can open a pr against your branch to collaborate

@a-johnston Yeah, I should have mentioned that, sorry. I didn't merge those because I noticed the issue with out-of-order queries and wanted to address that first. Yeah, please open a PR against my branch so we can keep the tests and stuff. Not saying this is the best way, but I had the same issue with the algorithm I used and addressed it by scoring parts of the path individually and then summing the scores, while heavily weighting the last part of the path. This was inspired by Visual Studio Code's approach.

> I know the test runner accepts optional arguments to only run certain files; would it be possible to keep the benchmark as a standalone test that is disabled by default, and then a separate test which verifies that an obvious query match is the top result?

Yeah that was my exact idea too. We can leave the benchmark off by default but add a bunch of smaller tests to verify the search is working.
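
The per-component scoring mentioned above (score each part of the path, sum, and weight the last part heavily) can be sketched as follows. The per-component score here is a deliberately crude substring check, and the weight of 4 for the file name is an arbitrary assumption; neither is taken from the PR or from Visual Studio Code. All names are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a path on '/' and drops empty components.
std::vector<std::string> split_path(const std::string &path) {
	std::vector<std::string> parts;
	std::string current;
	for (char c : path) {
		if (c == '/') {
			if (!current.empty()) {
				parts.push_back(current);
			}
			current.clear();
		} else {
			current += c;
		}
	}
	if (!current.empty()) {
		parts.push_back(current);
	}
	return parts;
}

// Crude placeholder score: the token length if the component contains the
// token as a substring, otherwise 0.
int component_score(const std::string &token, const std::string &component) {
	return component.find(token) != std::string::npos ? static_cast<int>(token.size()) : 0;
}

// Sums component scores, weighting the last component (the file name)
// more heavily than directories. The weight is an illustrative guess.
int path_score(const std::string &token, const std::string &path) {
	const int filename_weight = 4;
	const std::vector<std::string> parts = split_path(path);
	int total = 0;
	for (size_t i = 0; i < parts.size(); ++i) {
		const int weight = (i + 1 == parts.size()) ? filename_weight : 1;
		total += weight * component_score(token, parts[i]);
	}
	return total;
}
```

With these numbers, the token `push` scores 16 against `./entity/background_zone1/background/push.png` (file name hit) but only 4 against a path where `push` is just a directory name.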

samsface · 2024-10-02T20:04:21Z

@a-johnston I have an idea to try to tweak your algorithm to better score out-of-sequence tokens. Are you working on anything, or should I try?

a-johnston · 2024-10-02T20:45:18Z

> @a-johnston I have an idea to try to tweak your algorithm to better score out-of-sequence tokens. Are you working on anything, or should I try?

Feel free to give it a shot. I do have some stuff in progress and thoughts about what approaches might be best but was sorta distracted from this the last few days. I'm leaving for a trip Friday morning so I'm hoping to have a shareable commit tonight or tomorrow.

a-johnston · 2024-10-04T02:11:23Z

@samsface I finally updated my branch and if I have time tonight I'll rebase it onto yours and pr it there. I ended up changing almost all of how it works for better or worse; it now considers multiple subsequences and only matches the best one which does not conflict with prior token matches, so I removed the fzf inspired back-then-forward search. I also removed the special case for short queries and it didn't seem to affect much.

samsface · 2024-10-04T09:08:48Z

> @samsface I finally updated my branch and if I have time tonight I'll rebase it onto yours and pr it there. I ended up changing almost all of how it works for better or worse; it now considers multiple subsequences and only matches the best one which does not conflict with prior token matches, so I removed the fzf inspired back-then-forward search. I also removed the special case for short queries and it didn't seem to affect much.

I really liked the short query optimization. For me, it made the first few keystrokes feel really responsive.

a-johnston · 2024-10-04T15:49:52Z

> I really liked the short query optimization. For me, it made the first few keystrokes feel really responsive.

It would be interesting to see what difference it makes in the benchmark. Originally it wasn't added for performance but because short queries were more likely to have low relevance subsequence matches later in the string. Since the current implementation always does one linear scan of the target string, it no longer helps with relevance. It might help allow a full scan to be skipped in favor of a partial scan, but it does have the downside that a target it doesn't match ends up being searched twice. In any case, that part of the most recent commit can be reverted.

AThousandShips changed the title ~~fuzzy search poc~~ Fuzzy search proof of concept Sep 23, 2023

YuriSizov added enhancement feature proposal topic:editor usability labels Sep 23, 2023

YuriSizov added this to the 4.x milestone Sep 23, 2023

AThousandShips reviewed Sep 24, 2023

samsface force-pushed the fuzzy-search branch from e6a8aae to f664725 Compare September 25, 2023 12:30

samsface marked this pull request as ready for review October 4, 2023 18:42

samsface changed the title ~~Fuzzy search proof of concept~~ Improved fuzzy search in quick open dialog Oct 4, 2023

samsface changed the title ~~Improved fuzzy search in quick open dialog~~ Improve approximate/fuzzy string matching in quick open dialog search Oct 4, 2023

Cammymoop mentioned this pull request Feb 23, 2024

Add tokenized search support to Quick Open dialog and FileSystem filter #88660

Merged

samsface added 6 commits September 27, 2024 19:45

fuzzy search poc

45e7edb

format fix

fbdcece

remove implicit returns

56fb210

add draw boxes and tidy up code

8322110

speed up sorting

81e4ed0

fix divide by zero error

d95b3e9

benching

7edf8de

samsface force-pushed the fuzzy-search branch from b8e5ea4 to 7edf8de Compare September 28, 2024 09:08

samsface requested review from a team as code owners September 28, 2024 09:08

test by txt file

3b122ea

a-johnston reviewed Sep 28, 2024

akien-mga mentioned this pull request Oct 2, 2024

Redesign Quick Open #56772

Merged

a-johnston mentioned this pull request Oct 17, 2024

Add fuzzy string matching to quick open search #98278

Open
