feat: allow completing buffer words with unicode #392

mikavilpas · 2024-11-27T15:58:16Z

Issue

When using the "buffer" provider in a buffer with the following contents:

työmaa
| <- cursor

When typing "ty" the word "työmaa" is not matched, even though it's part of the buffer and should be shown as a suggestion.

Trivia: työmaa is Finnish and can be translated as "construction site"
in English. 🙂 The point is to have a word with non-ASCII characters.

Solution

Change the regex pattern to match any word character, including unicode characters.

The change allows matching the word "työmaa" when typing "ty" in a buffer.

Considerations

Looking at the Unicode section on rust regular expressions https://docs.rs/regex/latest/regex/#unicode

For maximum performance, it's recommended to stick to ASCII characters when possible. However, I'm using the following regex in blink-ripgrep.nvim to fetch completions for all the files in the project:

-- completions are fetched using rg (ripgrep), which is written in rust
-- and I suppose also uses the regex crate.
return {
  "rg",
  "--no-config",
  "--json",
  "--context=" .. (opts.context_size or 5),
  "--word-regexp",
  "--max-filesize=" .. (opts.max_filesize or "1M"),
  "--ignore-case",
  "--",
  prefix .. "[\\w_-]+", -- 👈🏻 notice the usage of \w
  vim.fn.fnameescape(vim.fs.root(0, ".git") or vim.fn.getcwd()),
}

https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72

The pattern is evaluated against all lines in every file in the current project, and I have found performance to be very good.

I also ran the proposed pattern in a large codebase I work in. I ran this in a project with 4154 files, totalling 815283 lines of code (calculated with the tokei cli application, v. 12.1.2).

rg is able to search for this regex pattern with the following results:

$ hyperfine 'rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null'
Benchmark 1: rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null
  Time (mean ± σ):     194.2 ms ±   7.7 ms    [User: 120.8 ms, System: 316.0 ms]
  Range (min … max):   185.6 ms … 214.7 ms    14 runs

My guess is that since the buffer provider is only used in the current buffer, the performance impact should be minimal.

mikavilpas · 2024-11-27T16:04:22Z

If this seems like a possible solution, I would also like to add some rust test cases for this regex. In my experience running tests with cargo works very nicely in the github CI pipelines.

Also, I forgot to mention that this does not address #388 - I think that one requires changing the logic for trigger characters on the lua side and cannot be done in rust.

mikavilpas · 2024-11-27T17:59:49Z

I noticed a big refactoring (#389) is on the way - maybe this should wait until that is complete.

Saghen · 2024-11-27T20:51:49Z

This looks great, thanks! To address #388, it'd probably make sense to move most of the regex bits to Rust but that'll be a future PR. I imagine that since the keyword regex doesn't include non-ascii letters, the fuzzy matching still won't work well on non-ascii items. I'm open to adding tests, been meaning to for a while :)

Issue ===== When using the "buffer" provider in a buffer with the following contents: ``` työmaa | <- cursor ``` When typing "ty" the word "työmaa" is not matched, even though it's part of the buffer and should be shown as a suggestion. > Trivia: työmaa is Finnish and can be translated as "construction site" > in English. 🙂 The point is to have a word with non-ASCII characters. Solution ======== Change the regex pattern to match any word character, including unicode characters. The change allows matching the word "työmaa" when typing "ty" in a buffer. Considerations ============== Looking at the Unicode section on rust regular expressions https://docs.rs/regex/latest/regex/#unicode For maximum performance, it's recommended to stick to ASCII characters when possible. However, I'm using the following regex in blink-ripgrep.nvim to fetch completions for all the files in the project: ```lua -- completions are fetched using rg (ripgrep), which is written in rust -- and I suppose also uses the regex crate. return { "rg", "--no-config", "--json", "--context=" .. (opts.context_size or 5), "--word-regexp", "--max-filesize=" .. (opts.max_filesize or "1M"), "--ignore-case", "--", prefix .. "[\\w_-]+", -- 👈🏻 notice the usage of \w vim.fn.fnameescape(vim.fs.root(0, ".git") or vim.fn.getcwd()), } ``` https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72 The pattern is evaluated against all lines in every file in the current project, and I have found performance to be very good. I also ran the proposed pattern in a large codebase I work in. I ran this in a project with 4154 files, totalling 815283 lines of code (calculated with the `tokei` cli application, v. 12.1.2). `rg` is able to search for this regex pattern with the following results: ```sh $ hyperfine 'rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null' Benchmark 1: rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null Time (mean ± σ): 194.2 ms ± 7.7 ms [User: 120.8 ms, System: 316.0 ms] Range (min … max): 185.6 ms … 214.7 ms 14 runs ``` My guess is that since the buffer provider is only used in the current buffer, the performance impact should be minimal.

mikavilpas · 2024-11-29T06:01:03Z

Nice! I explored adding some simple tests yesterday, but hit some issues running the tests with cargo test -v:

     Running `/opt/homebrew/bin/sccache /Users/mikavilpas/.rustup/toolchains/nightly-aarch64-apple-darwin/bin/rustc --crate-name blink_cmp_fuzzy --edition=2021 lua/blink/cmp/fuzzy/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --diagnostic-width=162 --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 -C split-debuginfo=unpacked --test --check-cfg 'cfg(docsrs)' --check-cfg 'cfg(feature, values())' -C metadata=42401737eca38356 -C extra-filename=-42401737eca38356 --out-dir /Users/mikavilpas/git/blink.cmp/target/debug/deps -C incremental=/Users/mikavilpas/git/blink.cmp/target/debug/incremental -L dependency=/Users/mikavilpas/git/blink.cmp/target/debug/deps --extern frizbee=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libfrizbee-b1c5712b42b5fd4d.rlib --extern heed=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libheed-c2470d32b5c0d47b.rlib --extern lazy_static=/Users/mikavilpas/git/blink.cmp/target/debug/deps/liblazy_static-4bc53d643ec30e40.rlib --extern mlua=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libmlua-c7c7bf5d922ec258.rlib --extern regex=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libregex-c8f4a5a6cd3ab516.rlib --extern serde=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libserde-c106b9172bbcf292.rlib -C link-arg=-undefined -C link-arg=dynamic_lookup -L native=/Users/mikavilpas/git/blink.cmp/target/debug/build/lmdb-master-sys-8d040ff7be1c0c6c/out`
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.82s
     Running `/Users/mikavilpas/git/blink.cmp/target/debug/deps/blink_cmp_fuzzy-42401737eca38356`
dyld[40776]: symbol not found in flat namespace '_luaopen_base'
error: test failed, to rerun pass `--lib`

Caused by:
  process didn't exit successfully: `/Users/mikavilpas/git/blink.cmp/target/debug/deps/blink_cmp_fuzzy-42401737eca38356` (signal: 6, SIGABRT: process abort signal)

I have a hunch that this might be caused by my system lua being 5.4 because I installed luarocks using homebrew ☹️ I think luajit (5.1?) needs to be available at runtime, which it probably is inside the neovim process, but don't know how to configure this for testing.

It might be worth it to merge this as-is now, and worry about testing later. Do you have a preference?

edit: also rebased to the latest main

Saghen · 2024-11-29T06:30:05Z

Yep sounds good. Maybe we could use nix for the test environment

mikavilpas force-pushed the utf8 branch from 804d85c to a46815e Compare November 29, 2024 06:02

Saghen merged commit e1d3e9d into Saghen:main Nov 29, 2024

mikavilpas deleted the utf8 branch November 29, 2024 07:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow completing buffer words with unicode #392

feat: allow completing buffer words with unicode #392

mikavilpas commented Nov 27, 2024

mikavilpas commented Nov 27, 2024

mikavilpas commented Nov 27, 2024

Saghen commented Nov 27, 2024 •

edited

Loading

mikavilpas commented Nov 29, 2024 •

edited

Loading

Saghen commented Nov 29, 2024

feat: allow completing buffer words with unicode #392

feat: allow completing buffer words with unicode #392

Conversation

mikavilpas commented Nov 27, 2024

Issue

Solution

Considerations

mikavilpas commented Nov 27, 2024

mikavilpas commented Nov 27, 2024

Saghen commented Nov 27, 2024 • edited Loading

mikavilpas commented Nov 29, 2024 • edited Loading

Saghen commented Nov 29, 2024

Saghen commented Nov 27, 2024 •

edited

Loading

mikavilpas commented Nov 29, 2024 •

edited

Loading