Allow separator config in tokenizer #260
Replies: 7 comments
-
Hi @igaul 👋 Indeed, it would be very interesting to provide a way to customize soft separators as well as hard separators, to adjust tokenization to the dataset. Would it make sense, in your opinion, to configure this per index? I imagine a case where a separator is important for documents in one index but makes much less sense for documents in another. Can you tell me more about your business use case? 🤓 Thanks for your answer!
-
Hey @gmourier Per-index would be great, maybe with some index metadata to keep settings consistent. I'm currently running a separate Meilisearch instance for this use case (documents with many acronyms containing periods).
-
I'm currently working with a list of tags consisting mostly of code languages and frameworks, and I'd really like to make the switch from Elastic to Meili, but the current tokenization severely impacts the search.
-
Related to #160
-
Hello everyone 👋 We just released a 🧪 prototype that allows customizing tokenization and we'd love your feedback.

**How to get the prototype?**

Using Docker, use the following command:

From source, compile Meilisearch on the

**How to use the prototype?**

You can find all the details in the PR. Feedback and bug reporting when using this prototype are encouraged! Thanks in advance for your involvement. It means a lot to us ❤️
-
Hello everyone 👋 We have just released the first RC (release candidate) of Meilisearch containing this new feature! You can test it by using:

You are welcome to leave your feedback in this discussion. If you encounter any bugs, please report them here. 🎉 The official and stable release containing this change will be available on September 25th, 2023.
-
Hey folks 👋 v1.4.0 has been released! 🦓 You can now customize tokenization by adding tokens to, or removing tokens from, the lists of separator tokens and non-separator tokens. ✨ Note:
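As a minimal sketch of what such a settings update could look like, here is a hypothetical payload using the `separatorTokens` / `nonSeparatorTokens` field names from the v1.4 release (verify the exact names and endpoint against the documentation for your version). It adds `|` as an extra separator and stops treating `.` as one, so acronyms like `U.S.A.` survive tokenization:

```python
import json

# Hypothetical tokenizer-customization settings payload (field names
# assumed from the v1.4 release notes; check your version's docs).
settings_patch = {
    "separatorTokens": ["|"],       # "|" now splits tokens
    "nonSeparatorTokens": ["."],    # "." no longer splits tokens
}

# This JSON body would typically be sent to the index settings endpoint,
# e.g. PATCH /indexes/<index_uid>/settings on a running instance.
print(json.dumps(settings_patch, indent=2))
```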
-
Hi,
I ran into an issue where I needed periods to be recognized as characters rather than as spaces. For now I have simply removed the hardcoded separator from the tokenizer crate, but I think it would be helpful to pass a configuration (with validation) from Meilisearch to the tokenizer.
I'm not sure how desired this feature is, but I'd be happy to open a PR if there's interest, so I can keep my project on main. Have a good weekend!
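The effect of that hardcoded separator can be illustrated with a toy splitter (not Meilisearch's actual tokenizer) whose separator set is configurable. With `.` in the separator set, acronyms are shredded into single letters; with it removed, they stay intact:

```python
# Toy illustration only: a character-level splitter with a configurable
# separator set, showing why a hardcoded "." hurts acronym-heavy data.
def tokenize(text, separators):
    tokens, current = [], []
    for ch in text:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

text = "Works at N.A.S.A. since 2019"
# "." treated as a separator: the acronym is split into letters.
print(tokenize(text, {" ", "."}))
# → ['Works', 'at', 'N', 'A', 'S', 'A', 'since', '2019']
# "." removed from the separator set: the acronym survives.
print(tokenize(text, {" "}))
# → ['Works', 'at', 'N.A.S.A.', 'since', '2019']
```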