Range lookup during terms indexing for pages with large amounts of terms #213
Conversation
- rather than N individual lookups, now doing a single range lookup over all terms in the DB, taking the value if it exists, else setting it to `null`
- for pages with a smaller number of terms, it's a lot faster to do N individual lookups than a range lookup over the terms index - now simply switching between the lookup types depending on whether the # of terms > 1000 (arbitrary choice for now; will need to play around with this value; see the sketch below)
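A minimal sketch of how that switch could look, assuming a promise-based levelup-style store; `lookupTerms`, `termsIndex`, and `LOOKUP_THRESHOLD` are illustrative names, not the PR's actual identifiers:

```js
const LOOKUP_THRESHOLD = 1000 // the magic number discussed in this thread

// Pick a lookup strategy based on how many terms need checking
async function lookupTerms(termsIndex, termKeys) {
    return termKeys.length > LOOKUP_THRESHOLD
        ? rangeLookup(termsIndex, termKeys)
        : singleLookups(termsIndex, termKeys)
}

// Strategy A: N individual gets - fast when N is small
async function singleLookups(termsIndex, termKeys) {
    const results = new Map()
    await Promise.all(
        termKeys.map(key =>
            termsIndex
                .get(key)
                .then(value => results.set(key, value))
                .catch(() => results.set(key, null)), // not found => null
        ),
    )
    return results
}

// Strategy B: one range read over the whole terms index
function rangeLookup(termsIndex, termKeys) {
    const wanted = new Set(termKeys)
    const results = new Map(termKeys.map(key => [key, null])) // default: null
    return new Promise((resolve, reject) =>
        termsIndex
            .createReadStream()
            .on('data', ({ key, value }) => {
                if (wanted.has(key)) {
                    results.set(key, value)
                }
            })
            .on('error', reject)
            .on('end', () => resolve(results)),
    )
}
```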
For documentation, I put my comment from #187 (comment) in here:
Could this not lead to a problem, as the whole term index is loaded into memory? Or how is that lookup handled internally? If yes, maybe we need to batch it?
Range lookup uses a Node Stream-like API to read the data. These are usually very space-efficient, as you're only reading chunks of the source data (the terms index) into memory at a time, as they become available (signalled by `data` events).
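To make that concrete, here's a minimal sketch of consuming such a read stream, assuming a levelup-style `db` (the `'term/'` key prefix is just for illustration). Entries arrive one `data` event at a time, so only a small window of the index sits in memory at once:

```js
// Stream over a bounded key range; entries are delivered incrementally
// via 'data' events rather than materialised as one big array.
function streamTermEntries(db, onEntry) {
    return new Promise((resolve, reject) => {
        db.createReadStream({ gte: 'term/', lt: 'term/\xff' })
            .on('data', ({ key, value }) => onEntry(key, value))
            .on('error', reject)
            .on('end', resolve)
    })
}
```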
Simply wow. This gave the whole thing such a boost. Looking forward to giving it a test ride with a larger set! Great work and great find!
Could we also use this range lookup when updating terms, and therefore get faster re-indexing for pages already in the DB? Also, can we use the range lookup when streaming the search results? Maybe it helps us make the search terms static and delivers a good tradeoff?
The logic that adds a page's terms to the search index makes no distinction between indexing/adding and re-indexing/updating. So yes, the range lookup will be used to find existing terms whenever the term input size is large enough (the current magic number of 1000).
Range lookups are already used in the search algorithms where it makes sense: those that need to perform lookups over the timestamp indexes - bookmarks and visits. Timestamp-based searches are essentially looking for "ranges" of time, so these are a perfect fit. It doesn't make sense for searches on the term-based indexes, however (terms, URL, title, etc.), as most searches only deal with a small number of terms - searches for 1000+ terms (to use the magic number again) are unlikely/fairly out of scope.
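For comparison, here's a hedged sketch of the kind of time-bounded range read the bookmarks/visits searches can do, assuming timestamp-prefixed keys like `visit/<timestamp>` (an illustrative key scheme, not necessarily the real one):

```js
// Collect all visit entries whose timestamps fall within [startTime, endTime].
// Assumes timestamps are fixed-width, so lexicographic key order matches
// numeric order.
function visitsBetween(db, startTime, endTime) {
    const visits = []
    return new Promise((resolve, reject) => {
        db.createReadStream({
            gte: `visit/${startTime}`,
            lte: `visit/${endTime}`,
        })
            .on('data', entry => visits.push(entry))
            .on('error', reject)
            .on('end', () => resolve(visits))
    })
}
```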
- spent a bit of time playing with this on different-sized DBs - still a bit of a magic (arbitrary) number; however, apart from an empty DB, 3000 gives much better results
Spent a bit of time today playing around with the two lookup strategies with different-sized inputs and DB sizes. It prompted me to increase the number from 1000 to 3000 (so: range lookup if the # of terms to look up > 3000, else N single lookups). Note that it's still a bit of a magic (arbitrary) number, as both strategies' times should also depend on the # of terms in the DB, but apart from when the DB is empty, a higher number like 3000 generally seems to give better results than 1000 (doing 3000 single lookups is still fairly fast). Maybe later we can come up with a smarter way of switching between the strategies that also factors in DB size; a rough sketch of that idea follows.
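One hypothetical shape for that smarter switch - purely a sketch of the idea, not what this PR implements - would factor in the index size, since the range lookup's cost scales with the whole terms index while N single gets scale with N:

```js
// Hypothetical heuristic: prefer the range lookup either past a fixed
// input size, or once the input is large relative to the index itself.
function shouldUseRangeLookup(numTermsToLookup, numTermsInDb) {
    const FIXED_THRESHOLD = 3000 // current magic number
    const DB_FRACTION = 0.25     // made-up tuning knob
    return (
        numTermsToLookup > FIXED_THRESHOLD ||
        numTermsToLookup > numTermsInDb * DB_FRACTION
    )
}
```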
Can we merge this as well? Or do you still want some code review?
- also renamed return value from `data` to `results`; feel it conveys the meaning better
- fixed up return type to be a `Promise` of the prev type (minor mistake)
Had a bit of a look over this today. Minor change to add some clarity to a condition. Can't see any reason not to bring it in now, as it should be a nice improvement for bigger pages. Feel free to get anyone else to look over it if you want; however, this is only a fairly small change code-wise.
Ok, I'll merge it then for now. |
Discussion in #187 prompted me to look more into the indexing algo to find the slowdown on larger pages (this page is a good example with ~6.5k terms: https://en.wikipedia.org/wiki/United_States - takes me ~90s to index the terms with 6k pages in my DB). After putting some timers around the place, it was pretty clear the main slowdown was the lookup of existing terms (this line). Specifically, it does N lookups for N terms.
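Roughly, the pre-change hot spot amounts to one index get per term - so a ~6.5k-term page issues ~6.5k individual lookups (illustrative names again, assuming the same promise-based store as above):

```js
// The old behaviour, in essence: N gets for N terms
async function lookupTermsOneByOne(termsIndex, termKeys) {
    return Promise.all(
        termKeys.map(key => termsIndex.get(key).catch(() => null)),
    )
}
```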
For pages with a large number of terms, it can take a while to run all the queries and get the data back. This change decides, based on the # of terms to index, whether to do a single range lookup over the entire terms index or the N individual lookups. For large pages this speeds up the process by a lot (the above wiki page goes from ~90s to ~20s for me); for smaller pages it increases the time, hence the decision based on # of terms.
Currently it simply keeps the old N-lookups behaviour for pages with < 1000 terms, and does the range lookup otherwise. The 1000 is completely arbitrary, so I'll play around with it and try to get a more meaningful value, but it's definitely a big improvement on larger pages for now!