CharGroupTokenizerFactory should allow users to set the maximum character limit for a word #56676

Closed
ADBalici opened this issue May 13, 2020 · 5 comments
Labels: >enhancement, help wanted, :Search Relevance/Analysis, Team:Search Relevance

Comments

@ADBalici
Contributor

Elasticsearch Version: 7.6.1

It appears that the only supported setting for the CharGroupTokenizer is tokenize_on_chars. This is fine for most users, as long as the resulting words (after the split) are shorter than 256 characters; longer words are truncated.

This behaviour is caused by a default setting in org.apache.lucene.analysis.util.CharTokenizer:

public static final int DEFAULT_MAX_WORD_LEN = 255;

However, Lucene allows this default to be overridden, which is something that should be done here as well.
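For illustration, here is a minimal, self-contained sketch of how a CharTokenizer subclass can raise that limit through the constructor Lucene already exposes. This is not the actual CharGroupTokenizerFactory code; the class name, the 1024 limit, the semicolon delimiter, and the sample input are made up for the example:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharTokenizer;

public class LongTokenSketch {
  public static void main(String[] args) throws Exception {
    // Sketch only: raise the limit from DEFAULT_MAX_WORD_LEN (255) to a
    // hypothetical 1024 characters via the constructor Lucene provides.
    Tokenizer tokenizer =
        new CharTokenizer(TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, 1024) {
          @Override
          protected boolean isTokenChar(int c) {
            // Treat ';' as the only separator; every other character stays
            // inside the token, so long strings are not split further.
            return c != ';';
          }
        };

    tokenizer.setReader(new StringReader(
        "regulates(chemical(id_1), chemical(id_2));regulates(chemical(id_3), chemical(id_4))"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```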

@ADBalici added the >enhancement and needs:triage labels on May 13, 2020
@cbuescher added the :Search Relevance/Analysis label on May 13, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine added the Team:Search label on May 13, 2020
@cbuescher removed the Team:Search and needs:triage labels on May 13, 2020
@cbuescher
Member

Hi @ADBalici, thanks for opening this. Since most other tokenizers have a configurable max_token_length setting, I think it makes sense to allow it here as well. Out of curiosity though: what's the use case for having such long tokens? Having a good overview of these use cases helps when thinking about reasonable defaults and limits in the future.

@ADBalici
Contributor Author

Hey @cbuescher! Thanks for replying!

My team and I work at a bioinformatics company. The use case is to augment the text of scientific publications with specific units of knowledge so that users can retrieve relevant documents.

One such KU (knowledge unit) is a biomedical interaction (say, between a protein and a chemical).
These interactions are extracted offline and have a textual representation that needs to be indexed, e.g.:

# represents the fact that chemical with id_1 interacts in a specific way with chemical with id_2
regulates(chemical(id_1), chemical(id_2))

We would like our users to be able to search for relevant documents containing this sort of interaction.

As is often the case in the biomedical industry, some interactions can get quite complex and their textual representations a bit long.

We have developed our own plugin to help us index what we need. We simply supply all KUs separated by a specific char group, and we have just found out that the tokenizer truncates the tokens.

We have solved the problem by creating our own alternative to CharGroupTokenizerFactory that lets us control the length, but it would still be valuable to have this option available for other people.

Our team can even work on this feature if it's accepted.

Hope this helps!

@cbuescher added the help wanted label on May 13, 2020
@cbuescher
Member

> in the biomedical industry, some interactions can get quite complex and their textual representations a bit long.

That was my first guess about this use case indeed.

> Our team can even work on this feature

Great, I'll label this as "help wanted" to indicate that this is a good issue to pick up. Feel free to open a PR; if you need guidance, looking at e.g. how WhitespaceTokenizerFactory handles and passes on the max_token_length setting should help a great deal. No need to pick this up yourself though, only if you're interested.
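For anyone picking this up, the rough shape of that approach (heavily simplified, not the real WhitespaceTokenizerFactory source; the class, method, and constant names below are illustrative) is to read the optional setting with a default and pass it straight to the Lucene constructor:

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.elasticsearch.common.settings.Settings;

// Sketch only: reading an optional max_token_length setting and handing it
// to a Lucene tokenizer, in the spirit of what WhitespaceTokenizerFactory does.
public class MaxTokenLengthSketch {

  static final String MAX_TOKEN_LENGTH = "max_token_length";
  static final int DEFAULT_MAX_TOKEN_LENGTH = 255; // Lucene's existing default

  static Tokenizer create(Settings settings) {
    // Fall back to the current default so existing indices keep behaving the same.
    int maxTokenLength = settings.getAsInt(MAX_TOKEN_LENGTH, DEFAULT_MAX_TOKEN_LENGTH);
    // WhitespaceTokenizer (like other CharTokenizer subclasses) accepts the
    // maximum token length as a constructor argument.
    return new WhitespaceTokenizer(TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, maxTokenLength);
  }

  public static void main(String[] args) {
    Settings settings = Settings.builder().put(MAX_TOKEN_LENGTH, 1024).build();
    Tokenizer tokenizer = create(settings);
    System.out.println("tokenizer created with custom max_token_length: " + tokenizer);
  }
}
```

Defaulting to the current 255 keeps behaviour unchanged for any index that doesn't set the new option.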

@ADBalici
Contributor Author

@cbuescher opened a PR just now :D

cbuescher pushed a commit that referenced this issue May 20, 2020
Adds `max_token_length` option to the CharGroupTokenizer.
Updates documentation as well to reflect the changes.

Closes #56676
@javanna added the Team:Search Relevance label on Jul 16, 2024