-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow completing buffer words with unicode #392
Conversation
If this seems like a possible solution, I would also like to add some rust test cases for this regex. In my experience running tests with cargo works very nicely in the github CI pipelines. Also, I forgot to mention that this does not address #388 - I think that one requires changing the logic for trigger characters on the lua side and cannot be done in rust. |
I noticed a big refactoring (#389) is on the way - maybe this should wait until that is complete. |
This looks great, thanks! To address #388, it'd probably make sense to move most of the regex bits to Rust but that'll be a future PR. I imagine that since the keyword regex doesn't include non-ascii letters, the fuzzy matching still won't work well on non-ascii items. I'm open to adding tests, been meaning to for a while :) |
Issue ===== When using the "buffer" provider in a buffer with the following contents: ``` työmaa | <- cursor ``` When typing "ty" the word "työmaa" is not matched, even though it's part of the buffer and should be shown as a suggestion. > Trivia: työmaa is Finnish and can be translated as "construction site" > in English. 🙂 The point is to have a word with non-ASCII characters. Solution ======== Change the regex pattern to match any word character, including unicode characters. The change allows matching the word "työmaa" when typing "ty" in a buffer. Considerations ============== Looking at the Unicode section on rust regular expressions https://docs.rs/regex/latest/regex/#unicode For maximum performance, it's recommended to stick to ASCII characters when possible. However, I'm using the following regex in blink-ripgrep.nvim to fetch completions for all the files in the project: ```lua -- completions are fetched using rg (ripgrep), which is written in rust -- and I suppose also uses the regex crate. return { "rg", "--no-config", "--json", "--context=" .. (opts.context_size or 5), "--word-regexp", "--max-filesize=" .. (opts.max_filesize or "1M"), "--ignore-case", "--", prefix .. "[\\w_-]+", -- 👈🏻 notice the usage of \w vim.fn.fnameescape(vim.fs.root(0, ".git") or vim.fn.getcwd()), } ``` https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72 The pattern is evaluated against all lines in every file in the current project, and I have found performance to be very good. I also ran the proposed pattern in a large codebase I work in. I ran this in a project with 4154 files, totalling 815283 lines of code (calculated with the `tokei` cli application, v. 12.1.2). `rg` is able to search for this regex pattern with the following results: ```sh $ hyperfine 'rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null' Benchmark 1: rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null Time (mean ± σ): 194.2 ms ± 7.7 ms [User: 120.8 ms, System: 316.0 ms] Range (min … max): 185.6 ms … 214.7 ms 14 runs ``` My guess is that since the buffer provider is only used in the current buffer, the performance impact should be minimal.
Nice! I explored adding some simple tests yesterday, but hit some issues running the tests with Running `/opt/homebrew/bin/sccache /Users/mikavilpas/.rustup/toolchains/nightly-aarch64-apple-darwin/bin/rustc --crate-name blink_cmp_fuzzy --edition=2021 lua/blink/cmp/fuzzy/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --diagnostic-width=162 --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 -C split-debuginfo=unpacked --test --check-cfg 'cfg(docsrs)' --check-cfg 'cfg(feature, values())' -C metadata=42401737eca38356 -C extra-filename=-42401737eca38356 --out-dir /Users/mikavilpas/git/blink.cmp/target/debug/deps -C incremental=/Users/mikavilpas/git/blink.cmp/target/debug/incremental -L dependency=/Users/mikavilpas/git/blink.cmp/target/debug/deps --extern frizbee=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libfrizbee-b1c5712b42b5fd4d.rlib --extern heed=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libheed-c2470d32b5c0d47b.rlib --extern lazy_static=/Users/mikavilpas/git/blink.cmp/target/debug/deps/liblazy_static-4bc53d643ec30e40.rlib --extern mlua=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libmlua-c7c7bf5d922ec258.rlib --extern regex=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libregex-c8f4a5a6cd3ab516.rlib --extern serde=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libserde-c106b9172bbcf292.rlib -C link-arg=-undefined -C link-arg=dynamic_lookup -L native=/Users/mikavilpas/git/blink.cmp/target/debug/build/lmdb-master-sys-8d040ff7be1c0c6c/out`
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.82s
Running `/Users/mikavilpas/git/blink.cmp/target/debug/deps/blink_cmp_fuzzy-42401737eca38356`
dyld[40776]: symbol not found in flat namespace '_luaopen_base'
error: test failed, to rerun pass `--lib`
Caused by:
process didn't exit successfully: `/Users/mikavilpas/git/blink.cmp/target/debug/deps/blink_cmp_fuzzy-42401737eca38356` (signal: 6, SIGABRT: process abort signal) I have a hunch that this might be caused by my system lua being 5.4 because I installed luarocks using homebrew It might be worth it to merge this as-is now, and worry about testing later. Do you have a preference? edit: also rebased to the latest main |
Yep sounds good. Maybe we could use nix for the test environment |
Issue
When using the "buffer" provider in a buffer with the following contents:
When typing "ty" the word "työmaa" is not matched, even though it's part of the buffer and should be shown as a suggestion.
Solution
Change the regex pattern to match any word character, including unicode characters.
The change allows matching the word "työmaa" when typing "ty" in a buffer.
Considerations
Looking at the Unicode section on rust regular expressions https://docs.rs/regex/latest/regex/#unicode
For maximum performance, it's recommended to stick to ASCII characters when possible. However, I'm using the following regex in blink-ripgrep.nvim to fetch completions for all the files in the project:
https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72
The pattern is evaluated against all lines in every file in the current project, and I have found performance to be very good.
I also ran the proposed pattern in a large codebase I work in. I ran this in a project with 4154 files, totalling 815283 lines of code (calculated with the
tokei
cli application, v. 12.1.2).rg
is able to search for this regex pattern with the following results:My guess is that since the buffer provider is only used in the current buffer, the performance impact should be minimal.