Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range lookup during terms indexing for pages with large amounts of terms #213

Merged
merged 4 commits into from
Dec 10, 2017

Conversation

poltak
Copy link
Member

@poltak poltak commented Nov 25, 2017

Discussion in #187 prompted me to look more into the indexing algo to find slowdown on larger pages (this page is a good example with ~6.5k terms: https://en.wikipedia.org/wiki/United_States - takes me ~90s to index the terms with 6k pages in my DB). After putting some timers around the place, it was pretty clear the main slowdown was from the lookup of existing terms (this line). Specifically it does N lookups for N terms.

For pages with large amount of terms, it can take a while to do all the queries and get all this data back. This change makes a decision based on the # of terms needed to index whether to do a single range lookup over the entire terms index vs doing the N individual lookups. For large pages this speeds up the process by a lot (the above wiki page goes from ~90s to ~20s for me) - for smaller pages it increases time, hence the decision based on # of terms.

Currently it simply does the old N lookups behaviour on pages with < 1000 terms, and range lookup otherwise. This 1000 number is completely arbitrary, so will play around with it and try to get a more meaningful value but it's definitely a big improvement for now on larger pages!

- rather than N individual lookups, now doing a single range lookup over all terms in the DB, taking the value if exists else setting it to `null`
- for pages with smaller amount of terms, it's a lot faster to do N individual lookups rather than a range lookup over the terms index
- now simply switching between the lookup types depending if the # terms > 1000 (arbitrary choice for now; will need to play around with this value)
@blackforestboi
Copy link
Member

For documentation I put my comment from #187 (comment) in here:

Converted this to a single range lookup over the terms index

Could this not lead to a problem as the whole term index is loaded into memory? Or how is that lookup internally handled? If yes, maybe we need to batch it?

@poltak
Copy link
Member Author

poltak commented Nov 26, 2017

Range lookup uses a Node Stream-like API to read the data. These are usually very space-efficient as you're only reading in chunks of the source data (terms index) into memory at once, when they become available (by emitting a 'data' event, which you attach an event listener to). In this case the "chunk" is a single key value pair from the terms index, and in the event listener we decide whether to save that value or ignore it allowing it to be scheduled for garbage collection straight away. The whole term index is never loaded into memory. The space complexity of both the single range lookup and N single lookups (batched) should be the same: # of the terms for the current page.

@blackforestboi
Copy link
Member

Simply wow. This gave the whole thing such a boost.
30s vs 15s for 30 (some very large wikipedia) pages on a new installation

Looking forward to give it a test ride with a larger set!

Great work and great find!

@blackforestboi
Copy link
Member

blackforestboi commented Nov 30, 2017

Could we use this range lookup also when updating terms and therefore have faster re-indexing for pages already in the DB?

Also can we use the range lookup when streaming the search results? Maybe it helps us to make the search terms static and delivers a good tradeoff?

@poltak
Copy link
Member Author

poltak commented Dec 5, 2017

Could we use this range lookup also when updating terms and therefore have faster re-indexing for pages already in the DB?

There's no difference between indexing/adding and re-indexing/updating of a page's terms in the search index adding logic. So yeah range lookup will be used to find existing terms if term input size is large enough (that current magic number of 1000).

Also can we use the range lookup when streaming the search results?

Range lookups are already used in the search algorithms where it makes sense to: those that need to perform lookups over the timestamp indexes - bookmarks and visits. The timestamp-involved searches are essentially looking for "ranges" of time, hence these are perfect. Doesn't make sense for searches on the term-based indexes however (terms, URL, title, etc.), as most searches are only dealing with a small number of terms - searches for 1000+ terms (to use the magic number again) are unlikely/fairly out-of-scope.

- spent a bit of time playing with this with different sized DBs
- still a bit of a magic(arbitrary) number however, apart from empty DB, 3000 gives much better results
@poltak
Copy link
Member Author

poltak commented Dec 8, 2017

Spent a bit of time today playing around with the two lookup strategies with different sized inputs and DB sizes. Prompted me to increase the number a bit from 1000 to 3000 (so: range lookup if # of terms to lookup > 3000, else N single lookups). Note it's still a bit of a magic (arbitrary) number as both strategy times should also be dependent on the # of terms in DB, but apart from when the DB is empty, a higher number like 3000 generally seems to give better results than 1000 (as doing 3000 single lookups is still fairly fast). Maybe later we can come up with a more smart way of switching between the strategies which also factors in DB size.

For ref:

  • standard N lookups strategy should take something like O(m * log n) time where n is size of terms index and m is # of terms to lookup
  • range lookup strategy should be O(n) time where n size of terms index (no coupling to # of terms to lookup)

@blackforestboi
Copy link
Member

can we merge this as well? Or do you still want some code-review?

- also renamed return value from `data` to `results`; feel it conveys meaning better
- fixed up return type to be `Promise` of prev type (minor mistake)
@poltak
Copy link
Member Author

poltak commented Dec 10, 2017

Done a bit of a look over this today. Minor change to add some clarity to a condition. Can't see any reason not to bring in now, as should be a nice improvement for bigger pages. Feel free to get anyone else to look over it if you want, however this is only a fairly small change code-wise.

@blackforestboi
Copy link
Member

Ok, I'll merge it then for now.

@blackforestboi blackforestboi merged commit 9b0975c into master Dec 10, 2017
@poltak poltak deleted the fix/terms-range-lookup branch December 11, 2017 03:08
@poltak poltak mentioned this pull request Feb 6, 2018
3 tasks
@blackforestboi blackforestboi changed the title Range lookup during terms indexing for pages with large amounts of terms MTNI-244 ⁃ Range lookup during terms indexing for pages with large amounts of terms Apr 19, 2018
@blackforestboi blackforestboi changed the title MTNI-244 ⁃ Range lookup during terms indexing for pages with large amounts of terms Range lookup during terms indexing for pages with large amounts of terms Apr 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants