Add string fuzzy matching #1023
Conversation
The search algorithm that we use for the documentation just adds a few weights to the current matching algorithm and seems to work well in practice (and is fast enough for interactive usage): https://github.com/leanprover-community/doc-gen/blob/2b8cb2a3e7471c109f031eda59e5090b1c9b7fe3/search.js#L1-L23 There are certainly some rough edge cases since it's greedy and doesn't find an optimal score, but if the optimal version is already slow on the core Lean code base then it is certainly unusable on any larger project like mathlib.
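For reference, the heuristic in the linked search.js boils down to a greedy in-order scan with bonus weights. A rough Lean sketch of that shape (the weights and the name `greedyScore` are illustrative, not the actual doc-gen tuning):

```lean
-- Greedy, non-backtracking sketch: each pattern character is matched at
-- its first (case-insensitive) occurrence, with small bonuses for
-- consecutive and case-exact matches. Weights are made up for
-- illustration; doc-gen's search.js uses its own values.
def greedyScore (pat word : String) : Option Nat := Id.run do
  let mut it := word.mkIterator
  let mut score := 0
  let mut prevMatched := false
  for c in pat.toList do
    let mut found := false
    while it.hasNext do
      let w := it.curr
      it := it.next
      if c.toLower == w.toLower then
        score := score + (if prevMatched then 3 else 1)
                       + (if c == w then 1 else 0)
        prevMatched := true
        found := true
        break
      else
        prevMatched := false
    if !found then
      return none  -- some pattern character never occurs: no match
  return some score
```

Being greedy, it can consume a character that a later, better-scoring run needed, which is exactly the kind of rough edge case mentioned above.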
(Force-pushed from 93702b7 to ab07463.)
Some benchmarks I did: Routine, Setup, Results. User time in seconds as measured by zsh's built-in.
Nice data! And I'm surprised at the speed of your machine: my Ryzen 5600X @ 4.2GHz takes 1.2s on the following quick stand-alone variant of "Single", spending ~70% of the time inside
I haven't taken a very close look at the code yet, but here is a quick ~20% speedup fix for the "Single" benchmark:
After that, perf report looks like this for me:
So we are now spending more time in
In summary, I'm optimistic that we can at least halve the execution time here, perhaps even more. The big question is of course: will it be enough? If we can get it up to reasonable speeds, I believe the improvement in search quality would absolutely be worth it. I haven't done a thorough evaluation yet, but here is just one sample query on mathlib-docs where I'd hope this implementation would do a better job:
I'm a bit confused about the algorithm implementation though: is this still dynamic programming? I would expect some kind of array like in the LLVM implementation that avoids identical recursive calls. Your current implementation might be exponential on e.g. matching
I guess that's true; the idea is still the same as with dynamic programming, but the implementation is less efficient.
You probably mean matching aⁿ on a²ⁿ; matching a single a is merely quadratic. 😄
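Concretely, the blowup comes from branching at every match without caching. A deliberately naive sketch of such a recursion (not the PR's actual code, and with bare subsequence length standing in for the real score):

```lean
-- Naive recursive scorer: on a match we branch into "take this character"
-- and "skip it", so on pat = "aa…a" (n chars) and word = "aa…a" (2n chars)
-- the call tree explores exponentially many paths. The result is the
-- longest prefix of pat that occurs, in order, in word.
def naiveScore : List Char → List Char → Nat
  | [], _ => 0
  | _, [] => 0
  | p :: ps, w :: ws =>
    if p == w then
      max (naiveScore ps ws + 1) (naiveScore (p :: ps) ws)
    else
      naiveScore (p :: ps) ws
```

Caching results keyed on the pair of remaining lengths, as in an LLVM-style array, collapses this to a quadratic number of distinct states.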
The lists are also the first thing that caught my eye. As Wojciech said before, it's also worth a try to skip the precomputation entirely and compute character roles etc. on demand. Another low-hanging optimization is that

Could you also please commit (or post) the benchmarking code you've been using?
I just re-implemented the fuzzy matching non-recursively using arrays and, to my surprise, it was slower than before. I pushed it to fuzzy-matching-arrays, if you want to take a look. The case when
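For comparison, a non-recursive array formulation has the following general shape. This is only a sketch of the technique (a bottom-up DP over a flat table, computing how many leading pattern characters can be matched in order), not the actual code on the fuzzy-matching-arrays branch, whose scoring is richer:

```lean
-- Bottom-up DP over a flat (|pat|+1) × (|word|+1) table, row-major.
-- dp[i * cols + j] is the longest prefix of pat[i:] that occurs, in
-- order, in word[j:]; the whole pattern matches iff the final answer
-- equals the pattern length.
def dpScore (pat word : String) : Nat := Id.run do
  let p := pat.toList.toArray
  let w := word.toList.toArray
  let cols := w.size + 1
  let mut dp : Array Nat := Array.mkArray ((p.size + 1) * cols) 0
  for i' in [0:p.size] do
    let i := p.size - 1 - i'      -- iterate i backwards
    for j' in [0:w.size] do
      let j := w.size - 1 - j'    -- iterate j backwards
      let skip := dp[i * cols + (j + 1)]!
      let v :=
        if p[i]! == w[j]! then
          max skip (dp[(i + 1) * cols + (j + 1)]! + 1)
        else
          skip
      dp := dp.set! (i * cols + j) v
  return dp[0]!
```

One subtlety this form makes visible: the table is always filled completely, even for words that could be rejected almost immediately, which may be part of why the array version can lose to the recursive one.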
My benchmark code is just (with varying patterns and iterations):

```lean
def main : IO Unit := do
  let mathlibDecls ← IO.FS.lines (System.FilePath.mk "data/mathlib-decls")
  for i in [:100] do
    IO.println <| mathlibDecls.filter (fun s => fuzzyMatch "categ_th" s)
```

(Output is redirected to
I also tried to pre-initialize the array with
That is surprising (though I suppose it's definitely possible for the recursive version to require less than

Let's move on to

After that, I didn't see a clear bottleneck left: a good amount of time is still spent on allocations, but not overwhelmingly much. Spending 8% total / 20% relative just on

I could make some educated guesses at further potential optimizations, like avoiding more
There is, of course, always the optimization of decreasing the input size. For example, while @gebner has convinced me in the previous thread that we should not ignore namespaces in e.g. the workspace symbols request, it still isn't clear to me whether that is the correct default even just for relevance of results, not just speed. A possible compromise could be to include them only if a
There are very few definitions that you can find without looking at the namespace (think of
To elaborate on this proposal, how about matching name components separately? That is, say the query is
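A minimal sketch of what "matching name components separately" could look like. Everything here is hypothetical: `componentWiseMatch` is made up, and `fuzzyMatch` is passed in as a parameter standing for the single-component matcher from this PR. Scoring and any restrictions are left out:

```lean
-- Split query and declaration name on '.', then match query components
-- against name components in order, allowing name components to be
-- skipped entirely. Purely illustrative.
def componentWiseMatch (fuzzyMatch : String → String → Bool)
    (query declName : String) : Bool :=
  go (query.splitOn ".") (declName.splitOn ".")
where
  go : List String → List String → Bool
    | [], _ => true
    | _ :: _, [] => false
    | q :: qs, n :: ns =>
      -- either match q against n and move on, or skip this name component
      (fuzzyMatch q n && go qs ns) || go (q :: qs) ns
```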
Yes, I think this is the natural conclusion. But doing so anywhere within the compound name seems problematic when combined with the "only on
Right, I don't think the "only on

In my proposal,
I don't think the last-component-matches-last-component restriction affects expressivity too much. If you want to search for declarations in a namespace you can just append a dot (e.g.,

Lifting the restriction is useful for

It's also useful if you search for theorems and want a theorem about

You also type

To sum up, I think the last-component-matches-last-component restriction could be sensible for declarations. The "only on
Okay, if we want to do component-wise matching, I'd say let's leave out the restriction for now and see if we need it for either performance or relevance (for which it could simply be a score adjustment). But for what it's worth, Lars' original benchmarks are no slower than ~220ms on my machine (which isn't the slowest, sure) now with my branch, so maybe this is good enough to be merged for now and evaluate performance & relevance in practice? |
For me it's between 60-340ms with your branch (and 60-460ms with the PR), which I think is still acceptable for interactive usage. I agree we should merge this now; it's much better than what's currently in Lean, and we can iterate on it later.
Of course, here are the results:
The new version is slightly slower, but in the same range. The
Currently the fuzzy matching algorithm is used in two places: the workspace symbols request and the (id, dot, option, ...) completion request. The workspace symbols request tries to match all available declarations, while the completion request keeps the previous step-wise matching. This benefits performance and, I would say, also usability. (Imagine using a naming scheme similar to an imported namespace, resulting in a lot of completion items.)
The threshold for displaying matched items is currently set to
Makes sense. I suppose there is also the option of mixing the two solutions: a recursive backward search that uses the array for memoization. Not sure if it's worth investigating that direction.
I also thought of this, but haven't had time to try it yet.
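For what it's worth, a sketch of that mixed shape: top-down recursion over (pattern index, word index) pairs with a flat array as cache. The names are hypothetical and, as in the sketches above, bare subsequence length stands in for the real score:

```lean
-- memoGo i j: longest prefix of pat[i:] matchable, in order, in word[j:].
-- The cache is a flat (|pat|+1) × (|word|+1) array of Option Nat;
-- `none` marks a cell that has not been computed yet.
partial def memoGo (p w : Array Char) (cols i j : Nat) :
    StateM (Array (Option Nat)) Nat := do
  if i ≥ p.size || j ≥ w.size then
    return 0
  match (← get)[i * cols + j]! with
  | some v => return v
  | none =>
    let skip ← memoGo p w cols i (j + 1)
    let best ←
      if p[i]! == w[j]! then do
        let take ← memoGo p w cols (i + 1) (j + 1)
        pure (max skip (take + 1))
      else
        pure skip
    modify (·.set! (i * cols + j) (some best))
    return best

def memoScore (pat word : String) : Nat :=
  let p := pat.toList.toArray
  let w := word.toList.toArray
  let cols := w.size + 1
  (memoGo p w cols 0 0).run' (Array.mkArray ((p.size + 1) * cols) none)
```

Whether this wins depends on how sparse the reachable states are: the recursion only touches cells that are actually needed, at the price of a cache lookup and store per visited cell.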
```lean
return true
let mut aIt := a.mkIterator
for i in [:b.bsize] do
  if aIt.curr.toLower == (b.get i).toLower then
```
Note that `b.get i` returns an arbitrary character (= `'a'`) if `i` is not at a UTF-8 char boundary, so this should reject any Unicode queries. Which is fine I think, and we definitely don't want to pay any extra overhead for a use case we don't even know is relevant at all in practice. In the future, we should introduce a new primitive `String.getByte : String.Pos -> UInt8`, which would allow us to further optimize the ASCII case and at least accept Unicode queries, even if the results might not be quite as expected.
Since `Iterator.curr` uses `get` as well, doesn't this mean that any characters outside the UTF-8 boundary are effectively equal to `'a'`, `'A'`, and any other characters outside the boundaries here?
Yes, but note that `next` only visits char boundaries :)
So `next` would have to reject the move when the target is not a valid UTF-8 character? But this code increments the position by 1 for non-UTF-8 characters (from `lean_string_utf8_next`):

```c
/* invalid UTF-8 encoded string */
return lean_box(i+1);
```
I know you already approved the PR, but I just tried the combination of backward search and array memoization. It is slightly slower for short patterns, but provides a speedup for longer patterns (compared to the current array implementation).

Benchmark
Interesting. I was actually thinking about this approach as well just now: if we think of the recursion as paths through the array from the bottom right corner to the top left, the only "join points" of paths, i.e. where we should memoize, are the coordinates where needle and haystack agree, no? So could we elide loading and storing the cache at all other coordinates?
If we start in the same cell, the two paths
I tried this just out of curiosity, but using a
Fuzzy matching based on the dynamic programming algorithm used in clang/LLVM. Works well, but is painfully slow when used for the workspace symbols request.
closes #960