This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Fix prefix level position docids database #300

Merged
merged 1 commit into main from fix-prefix-level-position-database on Aug 4, 2021

Conversation

ManyTheFish
Member

The prefix search was inverted when we generated the DB.
Instead of checking whether a word had a prefix in the prefix FST,
we were checking whether the word was itself a prefix of a prefix contained in the prefix FST.
The indexer now iterates over the prefixes contained in the FST
and searches for them by prefix in the word-level-position-docids database,
aggregating the matches in a sorter.

Fix #299
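To make the corrected flow concrete, here is a rough sketch. The function name and the insert callback (standing in for the grenad sorter used by the indexer) are hypothetical, and a heed-style prefix_iter is assumed for the prefix-bounded lookup; this is not the exact code of the PR.

use std::str;

use fst::{Set, Streamer};
use heed::types::ByteSlice;
use heed::{Database, RoTxn};

// Hypothetical sketch of the corrected extraction flow described above.
fn extract_word_prefix_level_position_docids(
    rtxn: &RoTxn,
    prefix_fst: &Set<Vec<u8>>,
    word_level_position_docids: Database<ByteSlice, ByteSlice>,
    mut insert: impl FnMut(&[u8], &[u8]) -> anyhow::Result<()>,
) -> anyhow::Result<()> {
    // Iterate over every prefix stored in the prefix FST...
    let mut stream = prefix_fst.stream();
    while let Some(prefix_bytes) = stream.next() {
        let prefix = str::from_utf8(prefix_bytes)?;
        // ...and fetch only the word-level-position-docids entries whose key
        // starts with that prefix, instead of scanning the whole database.
        let mut iter = word_level_position_docids.prefix_iter(rtxn, prefix.as_bytes())?;
        while let Some((key, docids)) = iter.next().transpose()? {
            // Aggregate the matches; in the indexer this goes into a sorter
            // whose merge function unions the docids per prefix key.
            insert(key, docids)?;
        }
    }
    Ok(())
}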

// iter over all prefixes in the prefix fst.
let mut word_stream = prefix_fst.stream();
while let Some(prefix_bytes) = word_stream.next() {
    let prefix = str::from_utf8(prefix_bytes).unwrap();
Contributor

What's the guarantee that the prefix is valid UTF-8? If you have the guarantee that it is, can you document it in a comment above, and replace the unwrap with an expect carrying a useful error message?

Member Author

@MarinPostma the prefix comes from the prefix FST, which already checks that the prefix is valid, but I can easily replace this unwrap with:

Suggested change
let prefix = str::from_utf8(prefix_bytes).unwrap();
let prefix = str::from_utf8(prefix_bytes)?;

Which is better.
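As a side note on the ? suggestion: a bare ? only compiles if the surrounding function's error type can be built from a Utf8Error, which is part of why the thread later converges on mapping the error explicitly. A tiny illustration with a hypothetical helper:

use std::str::{self, Utf8Error};

// Hypothetical helper: the error type must accept Utf8Error for `?` to work.
fn decode_prefix(prefix_bytes: &[u8]) -> Result<&str, Utf8Error> {
    let prefix = str::from_utf8(prefix_bytes)?;
    Ok(prefix)
}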

Contributor

Well, if it isn't a recoverable error, crashing is OK. You told me that this should have been checked before, so I imagine something went wrong if it wasn't. I think that expect is a reasonable solution here.

Member Author

@MarinPostma in the end, this error will be categorized as an Internal error, which, I think, is a good fallback.

Contributor

But that means that the DB is corrupted, right? An Internal error will not stop further usage of the DB, which will be in an invalid state. Also, the error message will not be super helpful.

Member Author
@ManyTheFish ManyTheFish Aug 4, 2021

Well, no, because the write txn will never be committed, but you're right, the error message will not be super helpful. I'll change that.
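For context, a minimal sketch of why a propagated error leaves the on-disk data untouched; the function and values are hypothetical and a heed-style transaction API is assumed:

use heed::types::{ByteSlice, Str};
use heed::{Database, Env};

// Hypothetical write path: everything goes through a write transaction.
fn write_prefix_entry(env: &Env, db: Database<Str, ByteSlice>) -> anyhow::Result<()> {
    let mut wtxn = env.write_txn()?;
    db.put(&mut wtxn, "hel", &b"docids bytes"[..])?;
    // If any `?` above returns early, `wtxn` is dropped without being
    // committed and LMDB rolls the pending writes back, so the database
    // keeps its previous, consistent state.
    wtxn.commit()?;
    Ok(())
}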

Member Author

@MarinPostma
Contributor

Isn't it a big performance hit?

@ManyTheFish
Member Author

@MarinPostma

Isn't it a big performance hit?

The prefix DBs extraction takes negligible time compared to the other DBs.
Moreover, in the previous implementation we were iterating over the whole database searching for prefixes; now we only fetch the parts of the database that are prefixed by the words in the prefix FST.
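To illustrate the difference with a hypothetical helper (heed-style API assumed, not code from the PR): the old strategy visits every entry of the database, while the new one only reads the key range sharing the prefix.

use heed::types::ByteSlice;
use heed::{Database, RoTxn};

// Hypothetical helper counting the entries under one prefix.
fn entries_for_prefix(
    rtxn: &RoTxn,
    db: Database<ByteSlice, ByteSlice>,
    prefix: &[u8],
) -> Result<usize, heed::Error> {
    // Old approach (for comparison): db.iter(rtxn)? walks the whole database
    // and every key has to be tested against the prefix FST.
    //
    // New approach: the cursor is positioned on the first key starting with
    // `prefix` and iteration stops at the end of that range, so only the
    // matching entries are read from LMDB.
    let mut count = 0;
    let mut iter = db.prefix_iter(rtxn, prefix)?;
    while iter.next().transpose()?.is_some() {
        count += 1;
    }
    Ok(count)
}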

@ManyTheFish ManyTheFish force-pushed the fix-prefix-level-position-database branch from 829004e to 357c0d5 on August 4, 2021 at 08:56
@ManyTheFish ManyTheFish force-pushed the fix-prefix-level-position-database branch from 357c0d5 to fcc20e9 on August 4, 2021 at 10:08
MarinPostma
MarinPostma previously approved these changes Aug 4, 2021
Contributor
@MarinPostma MarinPostma left a comment

👍

Comment on lines 107 to 110
let prefix = str::from_utf8(prefix_bytes).expect(&format!(
    "prefix {:?}, comming from prefix FST, is not a valid UTF-8 string",
    prefix_bytes
));
Member
@Kerollmops Kerollmops Aug 4, 2021

Doing this will force a string allocation on every single loop iteration; don't do that.
Please use map_err with a format!, include the related error, and use the ? operator.
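To make the allocation point concrete, an illustrative snippet (the values are hypothetical): the argument of expect is evaluated before the call, so the format! string is built on every iteration even when decoding succeeds, whereas a closure-based variant only pays that cost on the error path.

use std::str;

fn main() {
    let prefix_bytes: &[u8] = b"hel";

    // Eager: the format! allocation happens on every call, success or not.
    let _p = str::from_utf8(prefix_bytes)
        .expect(&format!("prefix {:?} is not a valid UTF-8 string", prefix_bytes));

    // Lazy: the closure (and its allocation) only runs if decoding fails.
    let _p = str::from_utf8(prefix_bytes)
        .unwrap_or_else(|_| panic!("prefix {:?} is not a valid UTF-8 string", prefix_bytes));
}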

Member
@Kerollmops Kerollmops Aug 4, 2021

Use the ? operator instead of unwraps, as it is better to use "pure" functions instead of functions that can panic everywhere.

Member Author
@ManyTheFish ManyTheFish Aug 4, 2021

@Kerollmops @MarinPostma I can suggest:

Suggested change
let prefix = str::from_utf8(prefix_bytes).expect(&format!(
    "prefix {:?}, comming from prefix FST, is not a valid UTF-8 string",
    prefix_bytes
));
let prefix = str::from_utf8(prefix_bytes).map_err(|_| {
    SerializationError::Decoding { db_name: Some(WORDS_PREFIXES_FST_KEY) }
})?;

We return an error (Internal) and we know which database is "corrupted".
What do you think?

Member
@gmourier gmourier Aug 4, 2021

Hello! I'm not sure I understand everything here.

We now check that the prefix is encoded in UTF-8; if it's not the case, we raise an error, but MeiliSearch doesn't crash and keeps working, right?

I have some questions to better understand 🤓.

  • What does it imply for the user?
  • What can they do about it?
  • Do we have this type of behavior anywhere else? (Having a corrupted DB but continuing to serve search requests; we talked about that some time ago, is it related?)
  • Why can this only happen on large datasets?

Member
@Kerollmops Kerollmops left a comment

Would it be possible to add the test from the issue #299, please?

@ManyTheFish
Member Author

Would it be possible to add the test from the issue #299, please?

We don't index real datasets in our tests, and tiny test sets don't seem to trigger the prefix-databases generation. 🤔

Member
@Kerollmops Kerollmops left a comment

That's OK for me then. This prefix bug can only be triggered with relatively big datasets.

@curquiza curquiza added the DB breaking (The related changes break the DB) label on Aug 4, 2021
@curquiza curquiza mentioned this pull request Aug 4, 2021
Member
@curquiza curquiza left a comment

bors merge

@bors
Contributor

bors bot commented Aug 4, 2021

@bors bors bot merged commit 89b9b61 into main Aug 4, 2021
@bors bors bot deleted the fix-prefix-level-position-database branch August 4, 2021 17:21
bors bot added a commit that referenced this pull request Aug 5, 2021
302: Update milli to v0.9.0 r=curquiza a=curquiza

Updating the minor version and not the patch since #300 seems to be breaking: it requires a re-indexation to get the fix, so it involves an additional step for the users, not only downloading the latest version.

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>