-
Notifications
You must be signed in to change notification settings - Fork 81
Fix prefix level position docids database #300
Conversation
// iter over all prefixes in the prefix fst. | ||
let mut word_stream = prefix_fst.stream(); | ||
while let Some(prefix_bytes) = word_stream.next() { | ||
let prefix = str::from_utf8(prefix_bytes).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats the guarantee that the prefix if valid utf8? If you have the guarantee that it is, can you comment it above, and replace the unwrap with an expect with a usefull error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarinPostma the prefix comes from the prefix FST which already check if the prefix is valid, but I can easily replace this unwrap
by
let prefix = str::from_utf8(prefix_bytes).unwrap(); | |
let prefix = str::from_utf8(prefix_bytes)?; |
Which is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well if it isn't a recoverable error, crashing is ok. You told me that this should have been checked before, so I imagine that something went bad if it didn't. I think that expect
is a reasonable solution here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarinPostma at the end, this error will be categorized as an Internal error, which, I think, is a good fallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that means that the db is corrupted right? Internal error will not stop further usage of the db, Wich will be in an invalid state. Also, the error message will not be super helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well no because the write txn will never be committed, but you're right, the error message will not be super helpful. I'll change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarinPostma done
Isn't it a big performance hit? |
The prefix DBs extraction doesn't take time compared to the other DB. |
829004e
to
357c0d5
Compare
357c0d5
to
fcc20e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let prefix = str::from_utf8(prefix_bytes).expect(&format!( | ||
"prefix {:?}, comming from prefix FST, is not a valid UTF-8 string", | ||
prefix_bytes | ||
)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing this will force a string creation at every single loop, don't do that.
Please map_err
with a format!
, include the related error and use the ?
operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the ?
instead of unwrap
s as it is better to use "pure" functions instead of functions that can panic everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kerollmops @MarinPostma I can suggest:
let prefix = str::from_utf8(prefix_bytes).expect(&format!( | |
"prefix {:?}, comming from prefix FST, is not a valid UTF-8 string", | |
prefix_bytes | |
)); | |
let prefix = str::from_utf8(prefix_bytes).map_err(|_| { | |
SerializationError::Decoding { db_name: Some(WORDS_PREFIXES_FST_KEY) } | |
})?; |
We return an error (Internal) and we know which database is "corrupted".
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello! Not sure to understand everything here.
We now check that the prefix is encoded in UTF-8, if it's not the case we raise an error but MeiliSearch doesn't crash and keeps working, right?
I have some questions to better understand 🤓.
- What does it imply for the user?
- What can he do about that?
- Do we have this type of behavior anywhere else? (Having a corrupted DB but continuing to operate for a search request, we talked about that some times ago, is it related?)
- Why can this only happen on large datasets?
The prefix search was inverted when we generated the DB. Instead of searching if word had a prefix in prefix fst, we were searching if the word was a prefix of a prefix contained in the prefix fst. The indexer, now, iterate over prefix contained in the fst and search them by prefix in the word-level-position-docids database, aggregating matches in a sorter. Fix #299
fcc20e9
to
cdeb07f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add the test from the issue #299, please?
We don't index real datasets in our tests, and tinies test sets don't seem to trigger prefix-databases generation. 🤔 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's OK for me then. This prefix bug can only be triggered with relatively big datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors merge
302: Update milli to v0.9.0 r=curquiza a=curquiza Updating the minor and not patch since #300 seems to be breaking: it involves a re-indexation to get the fix, so it involves an additional step from the users, not only downloading the latest version. Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
The prefix search was inverted when we generated the DB.
Instead of searching if word had a prefix in prefix fst,
we were searching if the word was a prefix of a prefix contained in the prefix fst.
The indexer, now, iterate over prefix contained in the fst
and search them by prefix in the word-level-position-docids database,
aggregating matches in a sorter.
Fix #299