feat: allow completing buffer words with unicode · Saghen/blink.cmp@804d85c

feat: allow completing buffer words with unicode

Issue
=====

Consider using the "buffer" provider in a buffer with the following
contents:

```
työmaa
| <- cursor
```

Typing "ty" does not match the word "työmaa", even though it's part of
the buffer and should be shown as a suggestion.

> Trivia: työmaa is Finnish and can be translated as "construction site"
> in English. 🙂 The point is to have a word with non-ASCII characters.

Solution
========

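Change the regex pattern from `[A-Za-z][A-Za-z0-9_\\-]{2,32}` to
`\w[\w0-9_\\-]{2,32}`. In the Rust regex crate, `\w` is Unicode-aware by
default, so it matches any word character, including non-ASCII letters
such as "ö".

With this change, the word "työmaa" is matched when typing "ty" in a
buffer.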
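
To illustrate why the old pattern misses "työmaa", here is a minimal
std-only sketch that mimics the shape of both patterns with a hand-written
scanner. It is not the code used in blink.cmp: `Mode`, `is_start`,
`is_rest` and `extract_words` are hypothetical names, and
`char::is_alphanumeric()` is only a stand-in for `\w`, which does not
cover exactly the same set of characters.

```rust
/// Which character set the scanner accepts.
#[derive(Clone, Copy)]
enum Mode {
    /// Mimics the old pattern `[A-Za-z][A-Za-z0-9_\\-]{2,32}`.
    Ascii,
    /// Approximates the new pattern `\w[\w0-9_\\-]{2,32}`.
    Unicode,
}

/// Can `c` start a word?
fn is_start(c: char, mode: Mode) -> bool {
    match mode {
        Mode::Ascii => c.is_ascii_alphabetic(),
        // Assumption: stand-in for `\w`, not the exact same set.
        Mode::Unicode => c.is_alphanumeric() || c == '_',
    }
}

/// Can `c` continue a word?
fn is_rest(c: char, mode: Mode) -> bool {
    // As written, both character classes also admit `_`, `-` and `\`.
    let extra = c == '_' || c == '-' || c == '\\';
    match mode {
        Mode::Ascii => c.is_ascii_alphanumeric() || extra,
        Mode::Unicode => c.is_alphanumeric() || extra,
    }
}

/// Collects non-overlapping words (1 start character followed by
/// 2 to 32 continuation characters), scanning left to right.
fn extract_words(text: &str, mode: Mode) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut words: Vec<String> = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if is_start(chars[i], mode) {
            // Greedily take up to 32 continuation characters.
            let mut end = i + 1;
            while end < chars.len() && end - i - 1 < 32 && is_rest(chars[end], mode) {
                end += 1;
            }
            // At least 2 continuation characters are required.
            if end - i - 1 >= 2 {
                let word: String = chars[i..end].iter().collect();
                words.push(word);
                i = end;
                continue;
            }
        }
        i += 1;
    }
    words
}
```

In this sketch, `extract_words("työmaa", Mode::Ascii)` yields only "maa",
because "ty" is too short to count as a word and "ö" cannot start or
continue one, so nothing starts with "ty". With `Mode::Unicode` the whole
word "työmaa" is collected.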

Considerations
==============

According to the Unicode section of the Rust regex documentation
(https://docs.rs/regex/latest/regex/#unicode), it's recommended to stick
to ASCII characters when possible for maximum performance. However, I'm
using the following regex in blink-ripgrep.nvim to fetch completions
from all the files in the project:

```lua
-- completions are fetched using rg (ripgrep), which is written in rust
-- and I suppose also uses the regex crate.
return {
  "rg",
  "--no-config",
  "--json",
  "--context=" .. (opts.context_size or 5),
  "--word-regexp",
  "--max-filesize=" .. (opts.max_filesize or "1M"),
  "--ignore-case",
  "--",
  prefix .. "[\\w_-]+", -- 👈🏻 notice the usage of \w
  vim.fn.fnameescape(vim.fs.root(0, ".git") or vim.fn.getcwd()),
}
```

https://github.com/mikavilpas/blink-cmp-rg.nvim/blob/dbbfb4d94432f82757bc38facbf87566f6bbd67c/lua/blink-ripgrep/init.lua?plain=1#L72

The pattern is evaluated against all lines in every file in the current
project, and I have found performance to be very good.

I also ran the proposed pattern against a large codebase I work in: a
project with 4154 files, totalling 815283 lines of code (calculated
with the `tokei` cli application, v. 12.1.2).

`rg` is able to search for this regex pattern with the following
results:

```sh
$ hyperfine 'rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null'
Benchmark 1: rg --word-regexp -- "\w[\w0-9_\\-]{2,32}" > /dev/null
  Time (mean ± σ):     194.2 ms ±   7.7 ms    [User: 120.8 ms, System: 316.0 ms]
  Range (min … max):   185.6 ms … 214.7 ms    14 runs
```

My guess is that since the buffer provider is only used in the current
buffer, the performance impact should be minimal.

mikavilpas committed Nov 27, 2024

1 parent 12d9ecd commit 804d85c

lua/blink/cmp/fuzzy/lib.rs

@@ -12,7 +12,7 @@ mod fuzzy;
     mod lsp_item;
     lazy_static! {
-        static ref REGEX: Regex = Regex::new(r"[A-Za-z][A-Za-z0-9_\\-]{2,32}").unwrap();
+        static ref REGEX: Regex = Regex::new(r"\w[\w0-9_\\-]{2,32}").unwrap();
         static ref FRECENCY: RwLock<Option<FrecencyTracker>> = RwLock::new(None);
     }