Range lookup during terms indexing for pages with large amounts of terms #213
Conversation
- rather than N individual lookups, now doing a single range lookup over all terms in the DB, taking the value if it exists, else setting it to `null`
- for pages with a smaller number of terms, it's a lot faster to do N individual lookups than a range lookup over the terms index - now simply switching between the lookup types depending on whether the # of terms > 1000 (arbitrary choice for now; will need to play around with this value; see the sketch below)
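A minimal sketch of how that switch could look, assuming a promise-based levelup-style store; `lookupTerms`, `termsIndex`, and `LOOKUP_THRESHOLD` are illustrative names, not the PR's actual identifiers:

```js
const LOOKUP_THRESHOLD = 1000 // the magic number discussed in this thread

// Pick a lookup strategy based on how many terms need checking
async function lookupTerms(termsIndex, termKeys) {
    return termKeys.length > LOOKUP_THRESHOLD
        ? rangeLookup(termsIndex, termKeys)
        : singleLookups(termsIndex, termKeys)
}

// Strategy A: N individual gets - fast when N is small
async function singleLookups(termsIndex, termKeys) {
    const results = new Map()
    await Promise.all(
        termKeys.map(key =>
            termsIndex
                .get(key)
                .then(value => results.set(key, value))
                .catch(() => results.set(key, null)), // not found => null
        ),
    )
    return results
}

// Strategy B: one range read over the whole terms index
function rangeLookup(termsIndex, termKeys) {
    const wanted = new Set(termKeys)
    const results = new Map(termKeys.map(key => [key, null])) // default: null
    return new Promise((resolve, reject) =>
        termsIndex
            .createReadStream()
            .on('data', ({ key, value }) => {
                if (wanted.has(key)) {
                    results.set(key, value)
                }
            })
            .on('error', reject)
            .on('end', () => resolve(results)),
    )
}
```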
For documentation, I put my comment from #187 (comment) in here:
Could this not lead to a problem, as the whole term index is loaded into memory? Or how is that lookup handled internally? If yes, maybe we need to batch it?
Range lookup uses a Node Stream-like API to read the data. These are usually very space-efficient, as you're only reading chunks of the source data (the terms index) into memory at a time, as they become available (signalled by `data` events).
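To make that concrete, here's a minimal sketch of consuming such a read stream, assuming a levelup-style `db` (the `'term/'` key prefix is just for illustration). Entries arrive one `data` event at a time, so only a small window of the index sits in memory at once:

```js
// Stream over a bounded key range; entries are delivered incrementally
// via 'data' events rather than materialised as one big array.
function streamTermEntries(db, onEntry) {
    return new Promise((resolve, reject) => {
        db.createReadStream({ gte: 'term/', lt: 'term/\xff' })
            .on('data', ({ key, value }) => onEntry(key, value))
            .on('error', reject)
            .on('end', resolve)
    })
}
```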
Simply wow. This gave the whole thing such a boost. Looking forward to giving it a test ride with a larger set! Great work and great find!
Could we also use this range lookup when updating terms, and therefore get faster re-indexing for pages already in the DB? Also, can we use the range lookup when streaming the search results? Maybe it helps us make the search terms static and delivers a good tradeoff?
The logic that adds a page's terms to the search index makes no distinction between indexing/adding and re-indexing/updating. So yes, the range lookup will be used to find existing terms whenever the term input size is large enough (the current magic number of 1000).
Range lookups are already used in the search algorithms where it makes sense: those that need to perform lookups over the timestamp indexes - bookmarks and visits. Timestamp-based searches are essentially looking for "ranges" of time, so these are a perfect fit. It doesn't make sense for searches on the term-based indexes, however (terms, URL, title, etc.), as most searches only deal with a small number of terms - searches for 1000+ terms (to use the magic number again) are unlikely/fairly out of scope.
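For comparison, here's a hedged sketch of the kind of time-bounded range read the bookmarks/visits searches can do, assuming timestamp-prefixed keys like `visit/<timestamp>` (an illustrative key scheme, not necessarily the real one):

```js
// Collect all visit entries whose timestamps fall within [startTime, endTime].
// Assumes timestamps are fixed-width, so lexicographic key order matches
// numeric order.
function visitsBetween(db, startTime, endTime) {
    const visits = []
    return new Promise((resolve, reject) => {
        db.createReadStream({
            gte: `visit/${startTime}`,
            lte: `visit/${endTime}`,
        })
            .on('data', entry => visits.push(entry))
            .on('error', reject)
            .on('end', () => resolve(visits))
    })
}
```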
- spent a bit of time playing with this on different-sized DBs - still a bit of a magic (arbitrary) number; however, apart from an empty DB, 3000 gives much better results
Spent a bit of time today playing around with the two lookup strategies with different-sized inputs and DB sizes. It prompted me to increase the number from 1000 to 3000 (so: range lookup if the # of terms to look up > 3000, else N single lookups). Note that it's still a bit of a magic (arbitrary) number, as both strategies' times should also depend on the # of terms in the DB, but apart from when the DB is empty, a higher number like 3000 generally seems to give better results than 1000 (doing 3000 single lookups is still fairly fast). Maybe later we can come up with a smarter way of switching between the strategies that also factors in DB size; a rough sketch of that idea follows.
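One hypothetical shape for that smarter switch - purely a sketch of the idea, not what this PR implements - would factor in the index size, since the range lookup's cost scales with the whole terms index while N single gets scale with N:

```js
// Hypothetical heuristic: prefer the range lookup either past a fixed
// input size, or once the input is large relative to the index itself.
function shouldUseRangeLookup(numTermsToLookup, numTermsInDb) {
    const FIXED_THRESHOLD = 3000 // current magic number
    const DB_FRACTION = 0.25     // made-up tuning knob
    return (
        numTermsToLookup > FIXED_THRESHOLD ||
        numTermsToLookup > numTermsInDb * DB_FRACTION
    )
}
```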
Can we merge this as well? Or do you still want some code review?
- also renamed return value from `data` to `results`; feel it conveys the meaning better
- fixed up return type to be a `Promise` of the prev type (minor mistake)
Had a bit of a look over this today. Minor change to add some clarity to a condition. Can't see any reason not to bring it in now, as it should be a nice improvement for bigger pages. Feel free to get anyone else to look over it if you want; however, this is only a fairly small change code-wise.
Ok, I'll merge it then for now. |
Discussion in #187 prompted me to look more into the indexing algo to find the slowdown on larger pages (this page is a good example with ~6.5k terms: https://en.wikipedia.org/wiki/United_States - takes me ~90s to index the terms with 6k pages in my DB). After putting some timers around the place, it was pretty clear the main slowdown was the lookup of existing terms (this line). Specifically, it does N lookups for N terms.
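Roughly, the pre-change hot spot amounts to one index get per term - so a ~6.5k-term page issues ~6.5k individual lookups (illustrative names again, assuming the same promise-based store as above):

```js
// The old behaviour, in essence: N gets for N terms
async function lookupTermsOneByOne(termsIndex, termKeys) {
    return Promise.all(
        termKeys.map(key => termsIndex.get(key).catch(() => null)),
    )
}
```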
For pages with a large number of terms, it can take a while to run all the queries and get the data back. This change decides, based on the # of terms to index, whether to do a single range lookup over the entire terms index or the N individual lookups. For large pages this speeds up the process by a lot (the above wiki page goes from ~90s to ~20s for me); for smaller pages it increases the time, hence the decision based on # of terms.
Currently it simply keeps the old N-lookups behaviour for pages with < 1000 terms, and does the range lookup otherwise. The 1000 is completely arbitrary, so I'll play around with it and try to get a more meaningful value, but it's definitely a big improvement on larger pages for now!