Enriched documents batch reader #561

Kerollmops · 2022-06-20T11:52:02Z

~~This PR is based on #555 and must be rebased on main after it has been merged to ease the review.~~
This PR contains the work in #555 and can be merged on main as soon as reviewed and approved.

- Create an EnrichedDocumentsBatchReader that contains the external document ids.
- Extract the primary key name and make it accessible in the EnrichedDocumentsBatchReader.
- Use the external id from the EnrichedDocumentsBatchReader in Transform::read_documents.
- Remove update_primary_key from the transform.rs file.
- Really generate the auto-generated document ids.
- Insert the (auto-generated) document ids in the document while processing it in Transform::read_documents.
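To illustrate the idea behind the changes listed above, here is a minimal sketch of "enriching" a batch: every document is paired with its external id, and a missing id is generated and written back into the document. This is not the code from this PR. The `Document` alias, the `ExternalId` enum, the `enrich` function and the `generate` callback are all hypothetical names, and documents are simplified to flat string maps.

```rust
use std::collections::BTreeMap;

/// A document reduced to a flat map of field name -> string value (a simplification).
type Document = BTreeMap<String, String>;

/// Hypothetical type: records where the external id of a document came from.
#[derive(Debug, Clone, PartialEq)]
enum ExternalId {
    /// The id was found in the document under the primary key.
    Retrieved(String),
    /// The document had no id, so one was generated for it.
    Generated(String),
}

/// Pairs every document with its external id, generating one when it is missing.
/// `generate` stands in for a real id source (for example a UUID generator).
fn enrich(
    documents: Vec<Document>,
    primary_key: &str,
    mut generate: impl FnMut() -> String,
) -> Vec<(Document, ExternalId)> {
    documents
        .into_iter()
        .map(|mut doc| {
            let id = match doc.get(primary_key).cloned() {
                Some(value) => ExternalId::Retrieved(value),
                None => {
                    let value = generate();
                    // Write the generated id back so the stored document carries it.
                    doc.insert(primary_key.to_string(), value.clone());
                    ExternalId::Generated(value)
                }
            };
            (doc, id)
        })
        .collect()
}
```

Keeping the id next to the document means a later stage can use it directly instead of looking the primary key up again.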

Kerollmops · 2022-06-21T12:45:18Z

I just ran a benchmark to check that we improve the time we take to index documents compared to what we lost in #555.

https://github.com/meilisearch/milli/actions/runs/2540950078

loiclec

It is a fairly big PR for me, so I did my best, but overall I think it looks good :) There are mostly two potential issues:

- Should we verify that the floats inside the geo search fields are valid longitudes/latitudes, for example by checking that they are not NaN, inf, etc.?
- I think there might be a bug in the code that finds the nested primary key.
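For the first point, a check along these lines could reject non-finite coordinates. This is only a sketch: `is_valid_geo_point` is a hypothetical helper, and the range bounds (±90 for latitude, ±180 for longitude) are an assumption that goes beyond the NaN/inf check mentioned in the review.

```rust
/// Returns true when both coordinates are finite (not NaN, not infinite)
/// and fall within the usual geographic ranges.
fn is_valid_geo_point(lat: f64, lng: f64) -> bool {
    lat.is_finite()
        && lng.is_finite()
        && (-90.0..=90.0).contains(&lat)
        && (-180.0..=180.0).contains(&lng)
}
```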

Apart from that, I had two minor comments:

- We'll need to reintroduce the functionality provided by serde_impl.rs, as I don't think it will be acceptable, performance-wise, to deserialise the whole array of added documents at once.
- The “guessed” primary ID uses the whole DocumentsBatchIndex, but the Meilisearch documentation says that the guess should only use the first document. This is not a change in behaviour introduced by this PR though, so it's not blocking as far as I am concerned.
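As a rough illustration of the second comment, a guess restricted to the first document could look like the sketch below. The function name is hypothetical, and the matching rule (first field whose name ends with "id", case-insensitively) is my reading of the documented behaviour, not something taken from this PR.

```rust
/// Guesses the primary key from the field names of the first document only.
/// Assumed rule: pick the first field whose name ends with "id", ignoring case.
fn guess_primary_key<'a>(first_document_fields: &[&'a str]) -> Option<&'a str> {
    first_document_fields
        .iter()
        .copied()
        .find(|name| name.to_lowercase().ends_with("id"))
}
```

Fields that only appear in later documents never reach this function, which is the difference from guessing over the whole fields index.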


loiclec · 2022-07-12T08:14:55Z

Thank you for the changes! :)

Regarding the nested primary keys again. Could you check that the function works correctly when we have the document:

{ "a" : { "b" : { "c" :  1 }}}

and the primary keys a.b or a.b.c.d?
It is my understanding that both would select the value Number(1).
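The behaviour one would expect here can be sketched with a strict lookup that follows every segment of the dotted key and fails as soon as a segment is missing or the current value is not an object. This is not the function from the PR: `Value`, `select` and `select_id` are hypothetical stand-ins, and the sketch ignores flat fields whose names themselves contain dots.

```rust
use std::collections::BTreeMap;

/// A tiny stand-in for a JSON value, enough to express the example document.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Number(i64),
    Object(BTreeMap<String, Value>),
}

/// Follows every segment of a dotted key. Returns None if a segment is
/// missing or if the path tries to descend into a non-object value.
fn select<'a>(doc: &'a Value, primary_key: &str) -> Option<&'a Value> {
    let mut current = doc;
    for segment in primary_key.split('.') {
        match current {
            Value::Object(map) => current = map.get(segment)?,
            _ => return None,
        }
    }
    Some(current)
}

/// Accepts the selected value as an id only when it is a number;
/// an object reached by a too-short path is rejected.
fn select_id(doc: &Value, primary_key: &str) -> Option<i64> {
    match select(doc, primary_key) {
        Some(Value::Number(n)) => Some(*n),
        _ => None,
    }
}
```

With the document above, only `a.b.c` yields an id under this sketch: `a.b` stops on an object and `a.b.c.d` tries to descend into a number.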

Kerollmops · 2022-07-12T12:50:27Z

Thank you very much for this bug-finding again @loiclec, I don't know how you find them but that's very impressive 👀 💪

Kerollmops · 2022-07-12T13:26:12Z

I just rebased this PR on top of the main branch, but I would like @loiclec's review, as I think you changed the transform file. Please just review that file if the whole PR is too much 😃 🙏

loiclec

Thanks! I can't spot anything that's wrong, although I mostly looked at the transform file and (relatively) quickly read through the changes since the last review.

irevoire

Overall it's a super cool PR!
I think we can simplify the transform even more with your enriched batch reader 😁


irevoire · 2022-07-13T13:41:32Z

milli/src/update/index_documents/transform.rs

@@ -205,8 +191,7 @@ impl<'a, 'i> Transform<'a, 'i> {
            // it, transform it into a string and validate it, and then update it in the
            // document. If none is found, and we were told to generate missing document ids, then
            // we create the missing field, and update the new document.
-            let mut uuid_buffer = [0; uuid::fmt::Hyphenated::LENGTH];
-            let external_id = if primary_key_id_nested {
+            if primary_key_id_nested {


I think we can get rid of this part too and delete the flatten_from_field_mapping function!
We'll flatten the document later with the function that works with the field id map 😁

Also, in case we need to keep this if, the comment above is outdated

Hum... Thank you indeed, it may have emerged from the rebase I did on main.

I have removed this section and the code that supported it :)


Kerollmops · 2022-07-13T16:17:09Z

> I think we can simplify the transform even more with your enriched batch reader 😁

Thank you for your review 🙏, would it be possible for you to do that next week, as I am on vacation, please?


loiclec · 2022-07-20T14:31:22Z

@ManyTheFish I committed your suggested change to add a comment. Was there anything else to do?

Changes applied, or an issue will be opened.

curquiza · 2022-07-21T07:08:35Z

Checked with @loiclec, we can finally merge!!!

bors merge

bors · 2022-07-21T07:52:39Z

Build succeeded:


Kerollmops force-pushed the enriched-documents-batch-reader branch 2 times, most recently from b02c7e8 to cb490c3, June 21, 2022 10:15

Kerollmops added the "no breaking" label (The related changes are not breaking (DB nor API)) Jun 21, 2022

Kerollmops mentioned this pull request Jun 21, 2022

Improve the auto-batching error reporting #555

Closed

Kerollmops force-pushed the enriched-documents-batch-reader branch from fc1c80d to b6ab40b, June 22, 2022 08:29

loiclec mentioned this pull request Jul 5, 2022

Change update file format from (OBKV + Fields Ids Map) to a grenad of Json values #576

Closed

irevoire added the "performance" label (Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption) Jul 5, 2022

loiclec self-requested a review July 11, 2022 10:20

loiclec reviewed Jul 11, 2022

milli/src/documents/serde_impl.rs

milli/src/update/index_documents/enrich.rs

milli/src/update/index_documents/enrich.rs (outdated)

milli/src/update/index_documents/enrich.rs

Kerollmops force-pushed the enriched-documents-batch-reader branch from e9c406e to 805c513, July 11, 2022 16:38

Kerollmops added 17 commits July 12, 2022 14:52

Do not allocate when parsing CSV headers

048e174

Update grenad to 0.4.2

eb63af1

Rework the DocumentsBatchBuilder/Reader to use grenad

419ce39

Fix the tests for the new DocumentsBatchBuilder/Reader

e8297ad

Fix the fuzz tests

6d0498d

Fix the cli for the new DocumentsBatchBuilder/Reader structs

a4ceef9

Fix http-ui to fit with the new DocumentsBatchBuilder/Reader structs

f29114f

Fix the benchmarks

a97d4d6

Introduce the validate_documents_batch function

bdc4263

Improve the .gitignore of the fuzz crate

cefffde

Introduce the validate_documents_batch function

0146175

Move the Object type in the lib.rs file and use it everywhere

fcfc4ca

Fix the indexation tests

399eec5

Support the auto-generated ids when validating documents

2ceeb51

Make sur that we do not accept floats as documents ids

19eb3b4

Make the nested primary key work

8ebf5ee

Do not leak an internal grenad Error

dc3f092

Kerollmops force-pushed the enriched-documents-batch-reader branch from 903a3e0 to 448114c, July 12, 2022 13:22

Kerollmops marked this pull request as ready for review July 12, 2022 13:22

Kerollmops requested a review from loiclec July 12, 2022 13:23

Kerollmops requested a review from irevoire July 12, 2022 13:30

loiclec previously approved these changes Jul 13, 2022


irevoire previously requested changes Jul 13, 2022


irevoire added a commit to meilisearch/specifications that referenced this pull request Jul 13, 2022

Update the geosearch error

b7b7a23

Implemented in meilisearch/milli#561

irevoire mentioned this pull request Jul 13, 2022

Geosearch - Add error variants for invalid_geo_field error meilisearch/specifications#161

Merged

ManyTheFish self-requested a review July 18, 2022 09:31

Simplify Transform::read_documents, enabled by enriched documents reader

ab1571c

loiclec dismissed their stale review via 3201801 July 18, 2022 11:40

Change DocumentsBatchReader to access cursor and index at same time

fc9f3f3

Otherwise it is not possible to iterate over all documents while using the fields index at the same time.

loiclec force-pushed the enriched-documents-batch-reader branch from 3201801 to fc9f3f3, July 18, 2022 14:08

ManyTheFish suggested changes Jul 20, 2022

milli/src/documents/builder.rs

milli/src/update/index_documents/enrich.rs

Add a code comment, as suggested in PR review

41a0ce0

Co-authored-by: Many the fish <many@meilisearch.com>

loiclec mentioned this pull request Jul 20, 2022

validate_document_id function trims the id, but maybe shouldn't #593

Closed

ManyTheFish approved these changes Jul 20, 2022


bors bot merged commit 941af58 into main Jul 21, 2022

bors bot deleted the enriched-documents-batch-reader branch July 21, 2022 07:52

This was referenced Aug 24, 2022

Bad performance or crash during indexation meilisearch/meilisearch#2132

Closed

Add new error code: invalid_geo_field meilisearch/meilisearch#2707

Closed

gmourier pushed a commit to meilisearch/specifications that referenced this pull request Oct 3, 2022

Update the geosearch error (#161)

515224d

Implemented in meilisearch/milli#561

gmourier pushed a commit to meilisearch/specifications that referenced this pull request Oct 3, 2022

Update the geosearch error (#161)

953cbd1

Implemented in meilisearch/milli#561
