Updated tokenizer to better matching when search for code snippets #32261
…earch case insensitive) Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
20b22db to 312f706
I think you can change the version. Users upgrading from the stable release to the next release will still only have to upgrade the index once.
tests/gitea-repositories-meta/org42/search-by-path.git/objects/info/commit-graph
OK :)
Done!!!
Bleve's version should also be upgraded.
Done
Last call, @go-gitea/technical-oversight-committee
* giteaofficial/main:
  Add new index for action to resolve the performance problem (go-gitea#32333)
  Include file extension checks in attachment API (go-gitea#32151)
  Updated tokenizer to better matching when search for code snippets (go-gitea#32261)
  Correctly query the primary button in a form (go-gitea#32438)

# Conflicts:
# web_src/js/utils/dom.ts
This PR improves the accuracy of Gitea's code search.

Currently, Gitea does not consider statements such as `console.log("hello")` as hits when the user searches for `log`. The culprit is how both ES and Bleve tokenize the file contents: in both cases, `console.log` is treated as a single token.

In ES' case, we changed the tokenizer to [simple_pattern_split](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html). With this change, tokens are words formed by digits and letters. In Bleve's case, the index now employs a [letter](https://blevesearch.com/docs/Tokenizers/) tokenizer.

Resolves #32220