Increase limit: number of positions (~ words) per attribute #1770
Comments
Milli 0.18.0 is out containing this change 🎉 |
I was a bit surprised to find this in my evaluation of MeiliSearch. 1000 words is a three-page document, that is a very low limit and I was wondering what kind of use-case MeiliSearch was targeting (conversations?). 65535 seems much more reasonable so I am looking forward to this release! |
Hello @remram44!
Question
What sort of effort, code wise, would it be to remove the 1,000 word field limit? It’s the primary reason I stay on Typesense. I know there are workarounds such as splitting into multiple fields, but I’d just like to understand a bit more about the decision to limit it and what sort of architectural changes would be needed to remove it. Thanks.
Answer
I have several answers to this, depending on the point of view.
Relevancy
Meilisearch is a search engine: the goal is to return the most relevant documents for a given search request, so we want to keep the most relevant words in each document. The premise is: "the deeper a word is in an attribute, the less relevant it is." The current version considers any word positioned after position 1000 too irrelevant to be taken into account in the search. Because more words mean more noise, raising this limit could lead to a loss of relevancy.
Performance & Memory
Meilisearch has to respond as fast as possible, so we pre-compute a lot of things while indexing documents. Raising this limit will lead to higher disk usage and a longer indexing time. Moreover, because there is more data, search time could be impacted.
Technical limit
Does Meilisearch have a technical limit?
The arbitrary limit of 1000
Why do we have this limit of 1000 positions per attribute? |
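To make the effect of such a limit concrete, here is a minimal Python sketch of dropping every word past a fixed position while indexing a single attribute. Whitespace tokenization, the function name `index_positions`, and the constant `MAX_POSITIONS` are assumptions made for illustration; this is not Meilisearch's tokenizer or its internal data structure.

```python
MAX_POSITIONS = 1000  # the limit discussed in this thread (hypothetical constant name)

def index_positions(text, max_positions=MAX_POSITIONS):
    """Map each word to the positions where it occurs, ignoring words past the limit."""
    index = {}
    # Assumption: words are separated by whitespace and compared case-insensitively.
    for position, word in enumerate(text.lower().split()):
        if position >= max_positions:
            break  # everything from here on is left out of the index
        index.setdefault(word, []).append(position)
    return index

# A document whose only interesting word is the 1001st one.
doc = " ".join(["filler"] * 1000 + ["needle"])
print("needle" in index_positions(doc))                       # False
print("needle" in index_positions(doc, max_positions=65535))  # True
```

With the lower limit the last word can never be found by a search, which is the behaviour the comments in this thread are reacting to.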
Wow, that is an extremely bad fit for document search. Can I ask where this assumption comes from? MeiliSearch seems optimized for a use case that I do not understand (if it even exists). What kind of content has progressively decreasing relevance? |
If you think this is not the relevancy you expect, you can remove This is something that some users need. For example with the following dataset: [
{ "id": 1, "title": "Harry Potter and the Half-Blood Prince", "description": "A story about a wizard" },
{ "id": 2, "title": "Fantastic Beasts and Where to Find Them", "description": "A movie in the universe of Harry Potter" }
] If you type |
This seems unrelated, it's about favoring some attributes over other attributes according to an order. Not about favoring some words over other words in a single attribute. |
The depth considered by MeiliSearch is in the same attribute but also between the attributes. With [
{ "id": 1, "description": "Harry Potter and his friends live a lot of adventures." },
{ "id": 2, "description": "A movie in the universe of Harry Potter" }
] Doc 1 is considered more relevant than doc 2. My example is really trivial but it can be useful when you have attributes with a lot of words. Again, it depends on your own usecase so if it's something you don't want you can remove |
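The ordering described in this comment can be approximated with a short Python sketch: a match is ranked first by which attribute it occurs in, then by how early it occurs inside that attribute. The function names, the single-word query, and the whitespace tokenization are simplifying assumptions, not Meilisearch's actual implementation.

```python
def best_match_key(doc, word, attribute_order):
    """Return (attribute rank, word position) of the earliest match, or None."""
    for rank, attribute in enumerate(attribute_order):
        words = doc.get(attribute, "").lower().split()
        if word in words:
            return (rank, words.index(word))
    return None

def rank_by_attribute(docs, word, attribute_order):
    """Sort matching documents: earlier attribute first, then earlier position."""
    keyed = [(best_match_key(doc, word, attribute_order), doc) for doc in docs]
    matches = [(key, doc) for key, doc in keyed if key is not None]
    return [doc for key, doc in sorted(matches, key=lambda pair: pair[0])]

docs = [
    {"id": 2, "description": "A movie in the universe of Harry Potter"},
    {"id": 1, "description": "Harry Potter and his friends live a lot of adventures."},
]
ranked = rank_by_attribute(docs, "harry", ["title", "description"])
print([doc["id"] for doc in ranked])  # [1, 2]
```

Document 1 comes first because "harry" is its first word, while in document 2 the same word appears near the end of the attribute.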
You mean removing "words" from the ranking rules? |
No Sorry for that! |
Why do you have a "words" ranking rule if the ranking by words is controlled by the "attributes" rule? The more I dig the more MeiliSearch is inscrutable. I would have liked a portable, memory-safe solution but I am staying with TypeSense, nothing in here makes sense to me. |
Hello @remram44, the "words" criterion targets the word of the user query and will remove/ignore the last query words 1 by 1 to fill the response. For instance, we have a query The "attribute" criterion will rank documents depending on the position of the matching word in the document:
We could rename this criterion I hope my explanations were clear, and I encourage you to try meilisearch and see if it can fit your needs despite our weaknesses. Anyway, Thanks a lot for your feedback! |
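As a rough illustration of the "words" behaviour described above (dropping the last query words one by one to fill the response), the following Python sketch generates the progressively shorter queries and uses them to fill a result list. The helper names, the exact-word matching, and the sample documents are hypothetical; the real engine also handles typos, prefixes, and the other ranking rules.

```python
def relaxed_queries(query):
    """Yield the full query first, then versions with trailing words removed."""
    words = query.split()
    return [" ".join(words[:n]) for n in range(len(words), 0, -1)]

def matches_all(doc_text, query):
    """Assumption: a document matches when it contains every query word exactly."""
    doc_words = set(doc_text.lower().split())
    return all(word in doc_words for word in query.lower().split())

def search(docs, query, limit):
    """Fill the response with full matches first, then with shorter-query matches."""
    results = []
    for relaxed in relaxed_queries(query):
        for doc in docs:
            if doc not in results and matches_all(doc, relaxed):
                results.append(doc)
                if len(results) == limit:
                    return results
    return results

print(relaxed_queries("harry potter prince"))
# ['harry potter prince', 'harry potter', 'harry']
```

Documents matching the whole query are returned first; shorter queries are only used to fill the remaining slots.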
Very excited for this enhancement! It allows me to come back to MeiliSearch (I've been using typesense since I encountered the 1,000 word limit) One suggestion: I spent half a day debugging my search queries to try and figure out why I couldn't find a document, turned out it was because MeiliSearch silently dropped all the data beyond the 1,000th word. When getting the update status, it would be great if the response contained something to show that data was ignored. {
"warnings": [
{ "id": "<document_id>", "truncatedAttributes": [ "<attribute_id>" ] }
]
} |
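Until something like this exists in the update status, a similar report can be produced on the client side before sending documents. The Python sketch below builds the structure proposed in this comment; the function name `truncation_warnings`, the whitespace word count, and the default `id` field are assumptions, and the real tokenizer may count positions differently.

```python
def truncation_warnings(documents, max_positions=1000, id_field="id"):
    """List, per document, the string attributes longer than the position limit."""
    warnings = []
    for doc in documents:
        truncated = [
            name
            for name, value in doc.items()
            # Assumption: one whitespace-separated word occupies one position.
            if isinstance(value, str) and len(value.split()) > max_positions
        ]
        if truncated:
            warnings.append({"id": doc[id_field], "truncatedAttributes": truncated})
    return {"warnings": warnings}

documents = [
    {"id": 1, "title": "Long report", "body": "word " * 1001},
    {"id": 2, "title": "Short note", "body": "just a few words"},
]
print(truncation_warnings(documents))
# {'warnings': [{'id': 1, 'truncatedAttributes': ['body']}]}
```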
Hello @Sembiance! I opened a ticket in the |
Thanks :) |
@ManyTheFish Thanks a lot for the explanation. I'm not sure if this is useful to anyone else, but this brings up another issue: "attribute" technically handles two different cases, which should be separated into two different rules. To me it seems like there should be an "attribute" ranking, which boosts a result when the match is in a higher attribute, and then a separate "matchPosition" or "wordPosition" rule, which boosts a result when the match is closer to the beginning of an attribute. This has become an issue because I can't separate the two rules: I'm indexing long documents as multiple docs, so the "matchPosition" doesn't matter to me. I would want to leave it out as a rule (ignoring it), but still sort by importance of the attribute (such as the title being more important than the description of an article). Maybe this has already been discussed, but I was unable to find any mention of it in the product discussions. It seems like a useful addition/improvement for those of us indexing long documents as multiple MeiliSearch documents referring to the same piece of content. When indexing these long documents, the position of the match within the attribute is not useful at all, and should be ignored. |
Hey @mikerogerz! Thanks for your complete response! I think that nothing will be changed before 2022, but we will do at least a "public response" of "why did we choose or not to split the attribute ranking rule in 2?". Thanks a lot for your feedback! 👍 |
Hello @mikerogerz, I just opened a ticket in the product repository to ensure we take your feedback into consideration -> meilisearch/product#329 |
Hey @curquiza I really appreciate it. I'll follow that discussion to see how it progresses. |
Related to this tiny spec: meilisearch/specifications#80
The number of positions per attribute is currently 1000.
See the docs page
This limit will be increased to 65 535.
@meilisearch/docs-team: the whole explanation remains unchanged; only the word 1000 should be replaced by 65 535.
TODO: