CharGroupTokenizerFactory should allow users to set the maximum character limit for a word #56676
Comments
Pinging @elastic/es-search (:Search/Analysis)
Hi @ADBalici, thanks for opening this. Since most other tokenizers have a configurable `max_token_length`, this seems like a reasonable addition. Could you share a bit more about the use case where you run into the current limit?
Hey @cbuescher! Thanks for replying! My team and I work at a bioinformatics company. The use case is to augment the text in scientific publications with specific units of knowledge, so that users can retrieve relevant documents. One such KU (knowledge unit) is a biomedical interaction (say, between a protein and a chemical).
We would like our users to search for relevant documents containing these sorts of interactions. As is common in the biomedical domain, some interactions can get quite complex and their textual representation rather long. We have developed our own plugin to help us index what we need: we simply supply all KUs separated by a specific char group, and we have just found out that the tokenizer is truncating the tokens. We solved the problem by creating our own CharGroupTokenizerFactory alternative that lets us control the length, but it would still be valuable to have this option available for other people. Our team can even work on this feature if it's accepted. Hope this helps!
That was my first guess about this use case indeed.
Great, I'll label this as "help wanted" to indicate this is a good issue to pick up. Feel free to open a PR; if you need guidance, I think looking at e.g. how other tokenizers expose their `max_token_length` setting is a good starting point.
@cbuescher opened a PR just now :D
Adds a `max_token_length` option to the CharGroupTokenizer. Updates the documentation as well to reflect the change. Closes #56676
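For illustration only, here is a minimal sketch of how such an option could be wired through to Lucene. This is not the code from the PR; the class name, constructor, and the `tokenizeOnChars` set are simplified stand-ins for the real factory plumbing:

```java
import java.util.Set;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.AttributeFactory;
import org.elasticsearch.common.settings.Settings;

// Simplified, hypothetical sketch; the real CharGroupTokenizerFactory has more plumbing.
class CharGroupTokenizerSketch {

    private final Set<Integer> tokenizeOnChars; // code points parsed from `tokenize_on_chars`
    private final int maxTokenLength;           // the option proposed in this issue

    CharGroupTokenizerSketch(Settings settings, Set<Integer> tokenizeOnChars) {
        this.tokenizeOnChars = tokenizeOnChars;
        // Fall back to Lucene's 255-character default when the setting is absent.
        this.maxTokenLength = settings.getAsInt("max_token_length", CharTokenizer.DEFAULT_MAX_WORD_LEN);
    }

    Tokenizer create() {
        // Lucene's CharTokenizer takes the maximum token length as a constructor argument,
        // so the configured value only needs to be passed through at this point.
        return new CharTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, maxTokenLength) {
            @Override
            protected boolean isTokenChar(int c) {
                // Emit token characters until one of the configured split characters is seen.
                return !tokenizeOnChars.contains(c);
            }
        };
    }
}
```

With something like this in place, an index could set `max_token_length` next to `tokenize_on_chars` in its tokenizer definition.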
Elasticsearch Version: 7.6.1
It appears that the only supported setting for the `CharGroupTokenizer` is `tokenize_on_chars`. This is fine for most users as long as the resulting words (after the split) are less than 256 characters long; longer words are truncated. This behaviour is caused by a default setting in `org.apache.lucene.analysis.util.CharTokenizer`. However, Lucene allows this default value to be overridden, which is something that should be done here as well.
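For reference, the relevant part of that Lucene class looks roughly like the following. This is a paraphrase from memory rather than a verbatim copy of the Lucene source, so the exact names should be verified against the Lucene version bundled with 7.6.1:

```java
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.util.AttributeFactory;

// Paraphrased sketch of the relevant parts of
// org.apache.lucene.analysis.util.CharTokenizer (not a verbatim copy of the Lucene source).
public abstract class CharTokenizer extends Tokenizer {

  // Default maximum word length: tokens are cut off at 255 characters.
  public static final int DEFAULT_MAX_WORD_LEN = 255;

  private final int maxTokenLen;

  // The no-arg constructor applies the default limit.
  public CharTokenizer() {
    this.maxTokenLen = DEFAULT_MAX_WORD_LEN;
  }

  // Lucene also offers a constructor that takes the limit explicitly, so an
  // Elasticsearch-level `max_token_length` setting could simply be forwarded here.
  public CharTokenizer(AttributeFactory factory, int maxTokenLen) {
    super(factory);
    this.maxTokenLen = maxTokenLen;
  }

  // Subclasses decide which code points belong to a token.
  protected abstract boolean isTokenChar(int c);
}
```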