Updated tokenizer to better matching when search for code snippets #32261

Merged
merged 5 commits into from
Nov 6, 2024

Conversation

bsofiato
Contributor

@bsofiato bsofiato commented Oct 15, 2024

This PR improves the accuracy of Gitea's code search.

Currently, Gitea does not count statements such as console.log("hello") as hits when the user searches for log. The culprit is how both Elasticsearch (ES) and Bleve tokenize file contents: in both cases, console.log is indexed as a single token.

For ES, this PR changes the tokenizer to simple_pattern_split, so that tokens are words formed by letters and digits. For Bleve, it switches to the letter tokenizer, so that tokens are runs of letters.

Resolves #32220

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Oct 15, 2024
@pull-request-size pull-request-size bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 15, 2024
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label Oct 15, 2024
…earch case insensitive)

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
@bsofiato bsofiato force-pushed the feature/sane-tokenizer branch from 20b22db to 312f706 Compare October 23, 2024 14:38
@lunny
Member

lunny commented Oct 23, 2024

I think you can change the version. Users upgrading from the stable release to the next release will still only have to upgrade the index once.

@lunny lunny added the type/enhancement An improvement of existing functionality label Oct 23, 2024
@lunny lunny added this to the 1.23.0 milestone Oct 23, 2024
@bsofiato
Contributor Author

I think you can change the version. Users upgrading from the stable release to the next release will still only have to upgrade the index once.

OK :)

@bsofiato
Contributor Author

I think you can change the version. Users upgrading from the stable release to the next release will still only have to upgrade the index once.

OK :)

Done !!!

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Oct 24, 2024
@lunny
Member

lunny commented Oct 24, 2024

bleve's version should also be upgraded.

@bsofiato
Contributor Author

bleve's version should also be upgraded.

Done
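
The version bumps requested in this thread matter because a tokenizer change only takes effect once existing content is re-indexed. A minimal sketch of that idea follows, assuming the indexer stores a version number next to the index and rebuilds when it differs from the one compiled into the binary. The constant codeIndexVersion, its value, and needsRebuild are hypothetical names, not Gitea's actual code.

```go
package main

import "fmt"

// codeIndexVersion is a hypothetical counter that is bumped whenever the
// analyzer or tokenizer settings change in a way that invalidates old indexes.
const codeIndexVersion = 8

// needsRebuild reports whether an index written with storedVersion must be
// rebuilt by the running binary. Any mismatch triggers exactly one rebuild,
// no matter how many bumps happened between the two versions.
func needsRebuild(storedVersion int) bool {
	return storedVersion != codeIndexVersion
}

func main() {
	// Index written by an older release (assumed version 6): rebuild once.
	fmt.Println(needsRebuild(6)) // true

	// Index already written by this binary: nothing to do.
	fmt.Println(needsRebuild(codeIndexVersion)) // false
}
```

Under this scheme a user moving from one stable release to the next rebuilds once, even if the version was bumped several times during development.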

@lunny
Member

lunny commented Nov 6, 2024

Last call @go-gitea/technical-oversight-committee

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Nov 6, 2024
@lunny lunny added the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Nov 6, 2024
@lunny lunny enabled auto-merge (squash) November 6, 2024 06:31
@lunny lunny merged commit f64fbd9 into go-gitea:main Nov 6, 2024
26 checks passed
@GiteaBot GiteaBot removed the reviewed/wait-merge This pull request is part of the merge queue. It will be merged soon. label Nov 6, 2024
zjjhot added a commit to zjjhot/gitea that referenced this pull request Nov 7, 2024
* giteaofficial/main:
  Add new index for action to resolve the performance problem (go-gitea#32333)
  Include file extension checks in attachment API (go-gitea#32151)
  Updated tokenizer to better matching when search for code snippets (go-gitea#32261)
  Correctly query the primary button in a form (go-gitea#32438)

# Conflicts:
#	web_src/js/utils/dom.ts
matera-bs pushed a commit to matera-ar/gitea that referenced this pull request Nov 14, 2024
…matching when search for code snippets (go-gitea#32261)

This PR improves the accuracy of Gitea's code search.

Currently, Gitea does not count statements such as
`console.log("hello")` as hits when the user searches for `log`. The
culprit is how both ES and Bleve tokenize file contents: in
both cases, `console.log` is indexed as a single token.

For ES, this PR changes the tokenizer to
[simple_pattern_split](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepatternsplit-tokenizer.html#:~:text=The%20simple_pattern_split%20tokenizer%20uses%20a,the%20tokenization%20is%20generally%20faster.),
so that tokens are words formed by letters and digits. For
Bleve, it switches to the
[letter](https://blevesearch.com/docs/Tokenizers/) tokenizer.

Resolves go-gitea#32220

---------

Signed-off-by: Bruno Sofiato <bruno.sofiato@gmail.com>
matera-bs pushed a commit to matera-ar/gitea that referenced this pull request Dec 17, 2024
…matching when search for code snippets (go-gitea#32261)
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/go Pull requests that update Go code size/M Denotes a PR that changes 30-99 lines, ignoring generated files. type/enhancement An improvement of existing functionality
Development

Successfully merging this pull request may close these issues.

Use a more sane tokenizer for source code search
4 participants