This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Fix prefix level position docids database #300

Merged
merged 1 commit into main from fix-prefix-level-position-database on Aug 4, 2021

Conversation

ManyTheFish
Member

The prefix search was inverted when we generated the DB.
Instead of checking whether a word had a prefix in the prefix FST,
we were checking whether the word was itself a prefix of a prefix contained in the prefix FST.
The indexer now iterates over the prefixes contained in the FST
and searches for them by prefix in the word-level-position-docids database,
aggregating the matches in a sorter.

Fix #299
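To make the corrected flow concrete, here is a rough sketch. The function name and the insert callback (standing in for the grenad sorter used by the indexer) are hypothetical, and a heed-style prefix_iter is assumed for the prefix-bounded lookup; this is not the exact code of the PR.

use std::str;

use fst::{Set, Streamer};
use heed::types::ByteSlice;
use heed::{Database, RoTxn};

// Hypothetical sketch of the corrected extraction flow described above.
fn extract_word_prefix_level_position_docids(
    rtxn: &RoTxn,
    prefix_fst: &Set<Vec<u8>>,
    word_level_position_docids: Database<ByteSlice, ByteSlice>,
    mut insert: impl FnMut(&[u8], &[u8]) -> anyhow::Result<()>,
) -> anyhow::Result<()> {
    // Iterate over every prefix stored in the prefix FST...
    let mut stream = prefix_fst.stream();
    while let Some(prefix_bytes) = stream.next() {
        let prefix = str::from_utf8(prefix_bytes)?;
        // ...and fetch only the word-level-position-docids entries whose key
        // starts with that prefix, instead of scanning the whole database.
        let mut iter = word_level_position_docids.prefix_iter(rtxn, prefix.as_bytes())?;
        while let Some((key, docids)) = iter.next().transpose()? {
            // Aggregate the matches; in the indexer this goes into a sorter
            // whose merge function unions the docids per prefix key.
            insert(key, docids)?;
        }
    }
    Ok(())
}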

// iter over all prefixes in the prefix fst.
let mut word_stream = prefix_fst.stream();
while let Some(prefix_bytes) = word_stream.next() {
    let prefix = str::from_utf8(prefix_bytes).unwrap();
Contributor

What's the guarantee that the prefix is valid UTF-8? If you have the guarantee that it is, can you document it in a comment above, and replace the unwrap with an expect carrying a useful error message?

Member Author

@MarinPostma the prefix comes from the prefix FST, which already checks that the prefix is valid, but I can easily replace this unwrap with:

Suggested change
let prefix = str::from_utf8(prefix_bytes).unwrap();
let prefix = str::from_utf8(prefix_bytes)?;

Which is better.
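As a side note on the ? suggestion: a bare ? only compiles if the surrounding function's error type can be built from a Utf8Error, which is part of why the thread later converges on mapping the error explicitly. A tiny illustration with a hypothetical helper:

use std::str::{self, Utf8Error};

// Hypothetical helper: the error type must accept Utf8Error for `?` to work.
fn decode_prefix(prefix_bytes: &[u8]) -> Result<&str, Utf8Error> {
    let prefix = str::from_utf8(prefix_bytes)?;
    Ok(prefix)
}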

Contributor

Well, if it isn't a recoverable error, crashing is OK. You told me that this should have been checked before, so I imagine something went wrong if it wasn't. I think that expect is a reasonable solution here.

Member Author

@MarinPostma in the end, this error will be categorized as an Internal error, which, I think, is a good fallback.

Contributor

But that means that the DB is corrupted, right? An Internal error will not stop further usage of the DB, which will be in an invalid state. Also, the error message will not be super helpful.

Member Author
@ManyTheFish ManyTheFish Aug 4, 2021

Well, no, because the write txn will never be committed, but you're right, the error message will not be super helpful. I'll change that.
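For context, a minimal sketch of why a propagated error leaves the on-disk data untouched; the function and values are hypothetical and a heed-style transaction API is assumed:

use heed::types::{ByteSlice, Str};
use heed::{Database, Env};

// Hypothetical write path: everything goes through a write transaction.
fn write_prefix_entry(env: &Env, db: Database<Str, ByteSlice>) -> anyhow::Result<()> {
    let mut wtxn = env.write_txn()?;
    db.put(&mut wtxn, "hel", &b"docids bytes"[..])?;
    // If any `?` above returns early, `wtxn` is dropped without being
    // committed and LMDB rolls the pending writes back, so the database
    // keeps its previous, consistent state.
    wtxn.commit()?;
    Ok(())
}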

Member Author

@MarinPostma
Contributor

Isn't it a big performance hit?

@ManyTheFish
Member Author

@MarinPostma

Isn't it a big performance hit?

The prefix DBs extraction takes negligible time compared to the other DBs.
Moreover, in the previous implementation we were iterating over the whole database searching for prefixes; now we only fetch the parts of the database that are prefixed by the words in the prefix FST.
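To illustrate the difference with a hypothetical helper (heed-style API assumed, not code from the PR): the old strategy visits every entry of the database, while the new one only reads the key range sharing the prefix.

use heed::types::ByteSlice;
use heed::{Database, RoTxn};

// Hypothetical helper counting the entries under one prefix.
fn entries_for_prefix(
    rtxn: &RoTxn,
    db: Database<ByteSlice, ByteSlice>,
    prefix: &[u8],
) -> Result<usize, heed::Error> {
    // Old approach (for comparison): db.iter(rtxn)? walks the whole database
    // and every key has to be tested against the prefix FST.
    //
    // New approach: the cursor is positioned on the first key starting with
    // `prefix` and iteration stops at the end of that range, so only the
    // matching entries are read from LMDB.
    let mut count = 0;
    let mut iter = db.prefix_iter(rtxn, prefix)?;
    while iter.next().transpose()?.is_some() {
        count += 1;
    }
    Ok(count)
}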

@ManyTheFish ManyTheFish force-pushed the fix-prefix-level-position-database branch from 829004e to 357c0d5 on August 4, 2021 at 08:56
@ManyTheFish ManyTheFish force-pushed the fix-prefix-level-position-database branch from 357c0d5 to fcc20e9 on August 4, 2021 at 10:08
MarinPostma
MarinPostma previously approved these changes Aug 4, 2021
Contributor
@MarinPostma MarinPostma left a comment

👍

Comment on lines 107 to 110
let prefix = str::from_utf8(prefix_bytes).expect(&format!(
    "prefix {:?}, comming from prefix FST, is not a valid UTF-8 string",
    prefix_bytes
));
Member
@Kerollmops Kerollmops Aug 4, 2021

Doing this will force a string allocation on every single loop iteration; don't do that.
Please use map_err with a format!, include the related error, and use the ? operator.
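To make the allocation point concrete, an illustrative snippet (the values are hypothetical): the argument of expect is evaluated before the call, so the format! string is built on every iteration even when decoding succeeds, whereas a closure-based variant only pays that cost on the error path.

use std::str;

fn main() {
    let prefix_bytes: &[u8] = b"hel";

    // Eager: the format! allocation happens on every call, success or not.
    let _p = str::from_utf8(prefix_bytes)
        .expect(&format!("prefix {:?} is not a valid UTF-8 string", prefix_bytes));

    // Lazy: the closure (and its allocation) only runs if decoding fails.
    let _p = str::from_utf8(prefix_bytes)
        .unwrap_or_else(|_| panic!("prefix {:?} is not a valid UTF-8 string", prefix_bytes));
}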

Member
@Kerollmops Kerollmops Aug 4, 2021

Use the ? operator instead of unwraps, as it is better to use "pure" functions instead of functions that can panic everywhere.

Member Author
@ManyTheFish ManyTheFish Aug 4, 2021

@Kerollmops @MarinPostma I can suggest:

Suggested change
let prefix = str::from_utf8(prefix_bytes).expect(&format!(
    "prefix {:?}, comming from prefix FST, is not a valid UTF-8 string",
    prefix_bytes
));
let prefix = str::from_utf8(prefix_bytes).map_err(|_| {
    SerializationError::Decoding { db_name: Some(WORDS_PREFIXES_FST_KEY) }
})?;

We return an error (Internal) and we know which database is "corrupted".
What do you think?

Member
@gmourier gmourier Aug 4, 2021

Hello! I'm not sure I understand everything here.

We now check that the prefix is encoded in UTF-8; if it's not the case, we raise an error, but MeiliSearch doesn't crash and keeps working, right?

I have some questions to better understand 🤓.

  • What does it imply for the user?
  • What can they do about it?
  • Do we have this type of behavior anywhere else? (Having a corrupted DB but continuing to serve search requests; we talked about that some time ago, is it related?)
  • Why can this only happen on large datasets?

Member
@Kerollmops Kerollmops left a comment

Would it be possible to add the test from the issue #299, please?

@ManyTheFish
Member Author

Would it be possible to add the test from the issue #299, please?

We don't index real datasets in our tests, and tiny test sets don't seem to trigger the prefix-databases generation. 🤔

Member
@Kerollmops Kerollmops left a comment

That's OK for me then. This prefix bug can only be triggered with relatively big datasets.

@curquiza curquiza added the DB breaking (The related changes break the DB) label on Aug 4, 2021
@curquiza curquiza mentioned this pull request Aug 4, 2021
Member
@curquiza curquiza left a comment

bors merge

@bors
Contributor

bors bot commented Aug 4, 2021

@bors bors bot merged commit 89b9b61 into main Aug 4, 2021
@bors bors bot deleted the fix-prefix-level-position-database branch August 4, 2021 17:21
bors bot added a commit that referenced this pull request Aug 5, 2021
302: Update milli to v0.9.0 r=curquiza a=curquiza

Updating the minor version and not the patch since #300 seems to be breaking: it requires a re-indexation to get the fix, so it involves an additional step for the users, not only downloading the latest version.

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>