Allow separator config in tokenizer #260
Replies: 7 comments
-
Hi @igaul 👋 Indeed, it would be very interesting to provide a way to customize soft separators as well as hard separators, to adjust tokenization to the dataset. Would it make sense, in your opinion, to configure this per index? I imagine a case where a separator is important for documents in one index but makes much less sense for documents in another. Can you tell me more about your business use case? 🤓 Thanks for your answer!
-
Hey @gmourier Per-index would be great, maybe with some index metadata to keep settings consistent. I'm currently running a separate Meilisearch instance for this use case (documents with many acronyms containing periods).
-
I'm currently working with a list of tags consisting mostly of code languages and frameworks, and I'd really like to make the switch from Elastic to Meili, but the current tokenization severely impacts the search.
-
Related to #160
-
Hello everyone 👋 We just released a 🧪 prototype that allows customizing tokenization and we'd love your feedback.

**How to get the prototype?**

Using Docker, use the following command:

From source, compile Meilisearch on the

**How to use the prototype?**

You can find all the details in the PR. Feedback and bug reporting when using this prototype are encouraged! Thanks in advance for your involvement. It means a lot to us ❤️
-
Hello everyone 👋 We have just released the first RC (release candidate) of Meilisearch containing this new feature! You can test it by using:

You are welcome to leave your feedback in this discussion. If you encounter any bugs, please report them here. 🎉 The official and stable release containing this change will be available on September 25th, 2023.
-
Hey folks 👋 v1.4.0 has been released! 🦓 You can now customize tokenization by adding tokens to, or removing tokens from, the lists of separator tokens and non-separator tokens. ✨ Note:
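As a minimal sketch of what such a settings update could look like, here is a hypothetical payload using the `separatorTokens` / `nonSeparatorTokens` field names from the v1.4 release (verify the exact names and endpoint against the documentation for your version). It adds `|` as an extra separator and stops treating `.` as one, so acronyms like `U.S.A.` survive tokenization:

```python
import json

# Hypothetical tokenizer-customization settings payload (field names
# assumed from the v1.4 release notes; check your version's docs).
settings_patch = {
    "separatorTokens": ["|"],       # "|" now splits tokens
    "nonSeparatorTokens": ["."],    # "." no longer splits tokens
}

# This JSON body would typically be sent to the index settings endpoint,
# e.g. PATCH /indexes/<index_uid>/settings on a running instance.
print(json.dumps(settings_patch, indent=2))
```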
-
Hi,
I ran into an issue where I needed periods to be recognized as characters rather than as spaces. For now I have simply removed the hardcoded separator from the tokenizer crate, but I think it would be helpful to pass a configuration (with validation) from Meilisearch to the tokenizer.
I'm not sure how desired this feature is, but I'd be happy to open a PR if there's interest, so I can keep my project on main. Have a good weekend!
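The effect of that hardcoded separator can be illustrated with a toy splitter (not Meilisearch's actual tokenizer) whose separator set is configurable. With `.` in the separator set, acronyms are shredded into single letters; with it removed, they stay intact:

```python
# Toy illustration only: a character-level splitter with a configurable
# separator set, showing why a hardcoded "." hurts acronym-heavy data.
def tokenize(text, separators):
    tokens, current = [], []
    for ch in text:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

text = "Works at N.A.S.A. since 2019"
# "." treated as a separator: the acronym is split into letters.
print(tokenize(text, {" ", "."}))
# → ['Works', 'at', 'N', 'A', 'S', 'A', 'since', '2019']
# "." removed from the separator set: the acronym survives.
print(tokenize(text, {" "}))
# → ['Works', 'at', 'N.A.S.A.', 'since', '2019']
```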