CharGroupTokenizerFactory should allow users to set the maximum character limit for a word #56676

Closed
ADBalici opened this issue May 13, 2020 · 5 comments
Labels: >enhancement, help wanted, :Search Relevance/Analysis, Team:Search Relevance

Comments

@ADBalici
Contributor

Elasticsearch Version: 7.6.1

It appears that the only supported setting for the CharGroupTokenizer is tokenize_on_chars. This is fine for most users, as long as the resulting words (after the split) are shorter than 256 characters; longer words are truncated.

This behaviour is caused by a default setting in org.apache.lucene.analysis.util.CharTokenizer:

public static final int DEFAULT_MAX_WORD_LEN = 255;

However, Lucene allows this default to be overridden, which is something that should be done here as well.
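For illustration, here is a minimal, self-contained sketch of how a CharTokenizer subclass can raise that limit through the constructor Lucene already exposes. This is not the actual CharGroupTokenizerFactory code; the class name, the 1024 limit, the semicolon delimiter, and the sample input are made up for the example:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharTokenizer;

public class LongTokenSketch {
  public static void main(String[] args) throws Exception {
    // Sketch only: raise the limit from DEFAULT_MAX_WORD_LEN (255) to a
    // hypothetical 1024 characters via the constructor Lucene provides.
    Tokenizer tokenizer =
        new CharTokenizer(TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, 1024) {
          @Override
          protected boolean isTokenChar(int c) {
            // Treat ';' as the only separator; every other character stays
            // inside the token, so long strings are not split further.
            return c != ';';
          }
        };

    tokenizer.setReader(new StringReader(
        "regulates(chemical(id_1), chemical(id_2));regulates(chemical(id_3), chemical(id_4))"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
```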

@ADBalici added the >enhancement and needs:triage labels on May 13, 2020
@cbuescher added the :Search Relevance/Analysis label on May 13, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine added the Team:Search label on May 13, 2020
@cbuescher removed the Team:Search and needs:triage labels on May 13, 2020
@cbuescher
Member

Hi @ADBalici, thanks for opening this. Since most other tokenizers have a configurable max_token_length setting, I think it makes sense to allow it here as well. Out of curiosity though: what's the use case for having such long tokens? Having a good overview of these use cases helps when thinking about reasonable defaults and limits in the future.

@ADBalici
Contributor Author

Hey @cbuescher! Thanks for replying!

My team and I work at a bioinformatics company. The use case is to augment the text of scientific publications with specific units of knowledge so that users can retrieve relevant documents.

One such KU (knowledge unit) is a biomedical interaction (say, between a protein and a chemical).
These interactions are extracted offline and have a textual representation that needs to be indexed, e.g.:

# represents the fact that chemical with id_1 interacts in a specific way with chemical with id_2
regulates(chemical(id_1), chemical(id_2))

We would like our users to be able to search for relevant documents containing this sort of interaction.

As is often the case in the biomedical industry, some interactions can get quite complex and their textual representations a bit long.

We have developed our own plugin to help us index what we need. We simply supply all KUs separated by a specific char group, and we have just found out that the tokenizer truncates the tokens.

We have solved the problem by creating our own alternative to CharGroupTokenizerFactory that lets us control the length, but it would still be valuable to have this option available for other people.

Our team can even work on this feature if it's accepted.

Hope this helps!

@cbuescher added the help wanted label on May 13, 2020
@cbuescher
Member

> in the biomedical industry, some interactions can get quite complex and their textual representations a bit long.

That was my first guess about this use case indeed.

> Our team can even work on this feature

Great, I'll label this as "help wanted" to indicate that this is a good issue to pick up. Feel free to open a PR; if you need guidance, looking at e.g. how WhitespaceTokenizerFactory handles and passes on the max_token_length setting should help a great deal. No need to pick this up yourself though, only if you're interested.
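For anyone picking this up, the rough shape of that approach (heavily simplified, not the real WhitespaceTokenizerFactory source; the class, method, and constant names below are illustrative) is to read the optional setting with a default and pass it straight to the Lucene constructor:

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.elasticsearch.common.settings.Settings;

// Sketch only: reading an optional max_token_length setting and handing it
// to a Lucene tokenizer, in the spirit of what WhitespaceTokenizerFactory does.
public class MaxTokenLengthSketch {

  static final String MAX_TOKEN_LENGTH = "max_token_length";
  static final int DEFAULT_MAX_TOKEN_LENGTH = 255; // Lucene's existing default

  static Tokenizer create(Settings settings) {
    // Fall back to the current default so existing indices keep behaving the same.
    int maxTokenLength = settings.getAsInt(MAX_TOKEN_LENGTH, DEFAULT_MAX_TOKEN_LENGTH);
    // WhitespaceTokenizer (like other CharTokenizer subclasses) accepts the
    // maximum token length as a constructor argument.
    return new WhitespaceTokenizer(TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY, maxTokenLength);
  }

  public static void main(String[] args) {
    Settings settings = Settings.builder().put(MAX_TOKEN_LENGTH, 1024).build();
    Tokenizer tokenizer = create(settings);
    System.out.println("tokenizer created with custom max_token_length: " + tokenizer);
  }
}
```

Defaulting to the current 255 keeps behaviour unchanged for any index that doesn't set the new option.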

@ADBalici
Contributor Author

@cbuescher opened a PR just now :D

cbuescher pushed a commit that referenced this issue May 20, 2020
Adds `max_token_length` option to the CharGroupTokenizer.
Updates documentation as well to reflect the changes.

Closes #56676
@javanna added the Team:Search Relevance label on Jul 16, 2024