Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow completing buffer words with unicode #392

Merged
merged 1 commit into from
Nov 29, 2024

Conversation

mikavilpas
Copy link
Contributor

Issue

When using the "buffer" provider in a buffer with the following contents:

työmaa
| <- cursor

When typing "ty" the word "työmaa" is not matched, even though it's part of the buffer and should be shown as a suggestion.

Trivia: työmaa is Finnish and can be translated as "construction site"
in English. 🙂 The point is to have a word with non-ASCII characters.

Solution

Change the regex pattern to match any word character, including unicode characters.

The change allows matching the word "työmaa" when typing "ty" in a buffer.

Considerations

Looking at the Unicode section on rust regular expressions https://docs.rs/regex/latest/regex/#unicode

For maximum performance, it's recommended to stick to ASCII characters when possible. However, I'm using the following regex in blink-ripgrep.nvim to fetch completions for all the files in the project:

-- completions are fetched using rg (ripgrep), which is written in rust
-- and I suppose also uses the regex crate.
return {
  "rg",
  "--no-config",
  "--json",
  "--context=" .. (opts.context_size or 5),
  "--word-regexp",
  "--max-filesize=" .. (opts.max_filesize or "1M"),
  "--ignore-case",
  "--",
  prefix .. "[\\w_-]+", -- 👈🏻 notice the usage of \w
  vim.fn.fnameescape(vim.fs.root(0, ".git") or vim.fn.getcwd()),
}

https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72

The pattern is evaluated against all lines in every file in the current project, and I have found performance to be very good.

I also ran the proposed pattern in a large codebase I work in. I ran this in a project with 4154 files, totalling 815283 lines of code (calculated with the tokei cli application, v. 12.1.2).

rg is able to search for this regex pattern with the following results:

$ hyperfine 'rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null'
Benchmark 1: rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null
  Time (mean ± σ):     194.2 ms ±   7.7 ms    [User: 120.8 ms, System: 316.0 ms]
  Range (min … max):   185.6 ms … 214.7 ms    14 runs

My guess is that since the buffer provider is only used in the current buffer, the performance impact should be minimal.

@mikavilpas
Copy link
Contributor Author

If this seems like a possible solution, I would also like to add some rust test cases for this regex. In my experience running tests with cargo works very nicely in the github CI pipelines.

Also, I forgot to mention that this does not address #388 - I think that one requires changing the logic for trigger characters on the lua side and cannot be done in rust.

@mikavilpas
Copy link
Contributor Author

I noticed a big refactoring (#389) is on the way - maybe this should wait until that is complete.

@Saghen
Copy link
Owner

Saghen commented Nov 27, 2024

This looks great, thanks! To address #388, it'd probably make sense to move most of the regex bits to Rust but that'll be a future PR. I imagine that since the keyword regex doesn't include non-ascii letters, the fuzzy matching still won't work well on non-ascii items. I'm open to adding tests, been meaning to for a while :)

Issue
=====

When using the "buffer" provider in a buffer with the following
contents:

```
työmaa
| <- cursor
```

When typing "ty" the word "työmaa" is not matched, even though it's part
of the buffer and should be shown as a suggestion.

> Trivia: työmaa is Finnish and can be translated as "construction site"
> in English. 🙂 The point is to have a word with non-ASCII characters.

Solution
========

Change the regex pattern to match any word character, including unicode
characters.

The change allows matching the word "työmaa" when typing "ty" in a
buffer.

Considerations
==============

Looking at the Unicode section on rust regular expressions
https://docs.rs/regex/latest/regex/#unicode

For maximum performance, it's recommended to stick to ASCII
characters when possible. However, I'm using the following regex in
blink-ripgrep.nvim to fetch completions for all the files in the
project:

```lua
-- completions are fetched using rg (ripgrep), which is written in rust
-- and I suppose also uses the regex crate.
return {
  "rg",
  "--no-config",
  "--json",
  "--context=" .. (opts.context_size or 5),
  "--word-regexp",
  "--max-filesize=" .. (opts.max_filesize or "1M"),
  "--ignore-case",
  "--",
  prefix .. "[\\w_-]+", -- 👈🏻 notice the usage of \w
  vim.fn.fnameescape(vim.fs.root(0, ".git") or vim.fn.getcwd()),
}
```

https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72

The pattern is evaluated against all lines in every file in the current
project, and I have found performance to be very good.

I also ran the proposed pattern in a large codebase I work in. I ran
this in a project with 4154 files, totalling 815283 lines of code
(calculated with the `tokei` cli application, v. 12.1.2).

`rg` is able to search for this regex pattern with the following
results:

```sh
$ hyperfine 'rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null'
Benchmark 1: rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null
  Time (mean ± σ):     194.2 ms ±   7.7 ms    [User: 120.8 ms, System: 316.0 ms]
  Range (min … max):   185.6 ms … 214.7 ms    14 runs
```

My guess is that since the buffer provider is only used in the current
buffer, the performance impact should be minimal.
@mikavilpas
Copy link
Contributor Author

mikavilpas commented Nov 29, 2024

Nice! I explored adding some simple tests yesterday, but hit some issues running the tests with cargo test -v:

     Running `/opt/homebrew/bin/sccache /Users/mikavilpas/.rustup/toolchains/nightly-aarch64-apple-darwin/bin/rustc --crate-name blink_cmp_fuzzy --edition=2021 lua/blink/cmp/fuzzy/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --diagnostic-width=162 --emit=dep-info,link -C embed-bitcode=no -C debuginfo=2 -C split-debuginfo=unpacked --test --check-cfg 'cfg(docsrs)' --check-cfg 'cfg(feature, values())' -C metadata=42401737eca38356 -C extra-filename=-42401737eca38356 --out-dir /Users/mikavilpas/git/blink.cmp/target/debug/deps -C incremental=/Users/mikavilpas/git/blink.cmp/target/debug/incremental -L dependency=/Users/mikavilpas/git/blink.cmp/target/debug/deps --extern frizbee=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libfrizbee-b1c5712b42b5fd4d.rlib --extern heed=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libheed-c2470d32b5c0d47b.rlib --extern lazy_static=/Users/mikavilpas/git/blink.cmp/target/debug/deps/liblazy_static-4bc53d643ec30e40.rlib --extern mlua=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libmlua-c7c7bf5d922ec258.rlib --extern regex=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libregex-c8f4a5a6cd3ab516.rlib --extern serde=/Users/mikavilpas/git/blink.cmp/target/debug/deps/libserde-c106b9172bbcf292.rlib -C link-arg=-undefined -C link-arg=dynamic_lookup -L native=/Users/mikavilpas/git/blink.cmp/target/debug/build/lmdb-master-sys-8d040ff7be1c0c6c/out`
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.82s
     Running `/Users/mikavilpas/git/blink.cmp/target/debug/deps/blink_cmp_fuzzy-42401737eca38356`
dyld[40776]: symbol not found in flat namespace '_luaopen_base'
error: test failed, to rerun pass `--lib`

Caused by:
  process didn't exit successfully: `/Users/mikavilpas/git/blink.cmp/target/debug/deps/blink_cmp_fuzzy-42401737eca38356` (signal: 6, SIGABRT: process abort signal)

I have a hunch that this might be caused by my system lua being 5.4 because I installed luarocks using homebrew ☹️ I think luajit (5.1?) needs to be available at runtime, which it probably is inside the neovim process, but don't know how to configure this for testing.

It might be worth it to merge this as-is now, and worry about testing later. Do you have a preference?

edit: also rebased to the latest main

@Saghen Saghen merged commit e1d3e9d into Saghen:main Nov 29, 2024
@Saghen
Copy link
Owner

Saghen commented Nov 29, 2024

Yep sounds good. Maybe we could use nix for the test environment

@mikavilpas mikavilpas deleted the utf8 branch November 29, 2024 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants