Look into making search index values static #187
Idea as mentioned in Slack: "Is it possible to make the time order implicit by making sure the last visited doc is always put first in the array?"
This is something we could use your help and ideas on, @blahah, as it is pretty fundamental. The user-facing problem we're trying to solve is that after a couple of thousand pages have been indexed, re-indexing a page can take a while, especially if it is huge, like this one. The problem on the index level is that we store a lastVisitTime in every page-ID value of the term keys, so that we can score the results and retrieve them quickly, because we can stream them. Next problem: if we were to remove the lastVisitTime from a term, we could update/append only the terms that have changed (which is good). But that would increase the time for a search to resolve when a lot of pages are associated with a term, as we could not stream them anymore and would need to request the lastVisitTime of each page. That means with 800 results, 800 additional requests. I had another idea that might also solve this problem from another angle, at least partly. It of course doesn't solve the time it takes to re-index a page when the store is already quite full, but it would be another layer of improvement.
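To make the trade-off concrete, here is a rough sketch of the two scoring paths; `db.get` and the shapes of the values are illustrative assumptions, not Memex's actual API:

```js
// With lastVisitTime stored inside each term value, results can be
// scored straight from the term value itself -- no extra reads.
function sortByRecency(termValue) {
  return Object.entries(termValue) // [pageId, { latest }] pairs
    .sort(([, a], [, b]) => b.latest - a.latest)
    .map(([pageId]) => pageId);
}

// Without the timestamps, every page in a result set needs its own
// lookup just to recover the visit time: 800 results => 800 requests.
async function sortByRecencyWithLookups(db, pageIds) {
  const times = await Promise.all(
    pageIds.map((id) => db.get(id).then((page) => page.lastVisitTime))
  );
  return pageIds
    .map((id, i) => ({ id, time: times[i] }))
    .sort((a, b) => b.time - a.time)
    .map((r) => r.id);
}
```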
I have another idea for how to handle the removal of time stamps from the terms. Since we can pretty easily determine the number of results, we could just show a disclaimer to the user that with 300+ results it takes longer to load them. That way it doesn't look like the tool is fucked up; we're being upfront that loading takes a bit. While it is loading, we can show something like: "We have found more than 500 pages fitting these results, loading may take a bit." This could go in combination with the domain clustering of common terms for a given domain, as we could score results coming from a domain much lower than results coming from a page where the words were unique to the article. Problems: removing the time stamp from the term might also affect the ability to quickly filter by time. @blahah, are there other ways to speed up the lookup/write time in IndexedDB that don't slow down as the terms grow?
The key question is where the slowdown is coming from. If it's from having to update the […], then instead of the […], the term value could look like:

```
{
  docs: ['id1', 'id2'],
  lastVisittimes: {
    id1: 032492304923, // this is the timestamp
    id2: 032492307722
  }
}
```

Indexing: When indexing, you only have to update the `docs` array. By always putting the updated document ID at the beginning of the array, you don't need the `lastVisittimes` at all. Note that if you were to store an array of objects […].

Searching: Do exactly what you do already, except pass the `docs` array, which is already in time order.
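A minimal sketch of that implicit-ordering scheme, assuming plain JS term values; the function names are made up for illustration:

```js
// Term value is just an ordered array of page IDs, most recent first.
// On a (re)visit: remove the ID if present, then put it at the front.
function recordVisit(termValue, pageId) {
  const docs = termValue.docs.filter((id) => id !== pageId);
  docs.unshift(pageId);
  return { docs };
}

// Searching: the array order *is* the recency order, so the top results
// can be taken directly, with no timestamps to sort by.
function topResults(termValue, limit) {
  return termValue.docs.slice(0, limit);
}
```

Note that removing an existing ID before re-inserting it is still a linear scan of the array, which the reply below picks up on.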
Thanks for sharing your thoughts!
How can we find out? What tests can we run to find other blockers? If that is a problem, I can imagine there are others as well.
Yeah, I just thought about that problem as well, in the context of searching for multiple terms and having to combine the results.
Thanks for the ideas @blahah. Though I can't really see how this structure would improve things time-wise for indexing (I understand how using an array may make things simpler at the sorting stage in search, but I'm focusing on indexing only, as that's the main issue here). Currently the term values have a dictionary-like structure, stored as:

```
{
  id1: { latest: '032492304923' },
  id2: { latest: '032492307722' },
}
```

How each term value would be reduced at indexing time: […]

I think this is O(# existing entries) time because of step 3. I think your term value reduction algorithm is also O(# existing entries) time, as step 2 would be linear to the length of the `docs` array.
It has only ever been reported when there is a lot of existing data (correct me if I'm wrong @oliversauter); I've never been able to reproduce it, although that is probably because my DB is fairly empty most of the time, as I'm nuking it during dev multiple times a day depending on what feature I'm working on. I suppose this makes sense if the current time is linear to # existing entries (for a given term). After writing out the existing algorithm above, I think the complexity could be reduced to […]. I'm pretty terrible at explaining things, so let me know if my explanation of what currently happens doesn't make sense, and also if you think I'm missing anything in my understanding of your algorithm (probably those JS optimizations come into play here).
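The numbered reduction steps above were lost from this comment; as a hedged illustration of what a reduction with an O(# existing entries) copy step could look like (a guess at the shape of the algorithm, not the actual Memex code):

```js
// Merge the existing term value with the new visit, keeping the latest
// timestamp per page ID. Copying over the existing entries is what
// makes this O(# existing entries) per term.
function reduceTermValue(existingValue = {}, pageId, visitTime) {
  const next = {};
  for (const [id, entry] of Object.entries(existingValue)) {
    next[id] = entry; // linear in the number of existing entries
  }
  next[pageId] = { latest: String(visitTime) };
  return next;
}
```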
Another idea I had today is to get rid of the time sorting altogether for now. This way we still have quick searches, but could also enable quick indexing by skipping terms that have stayed the same.
Pretty sure this would make search non-deterministic, as in a search for "google" could return different results every time it's run. The big complication would be with pagination, as you can't simply say "give me the next page of results" anymore. You would have to keep getting more results and compare them to the ones you already have in state until you have enough new results to make a new page.

RE: […]

Imported ~6k docs this morning and compared these algorithm changes, but there was not much noticeable improvement. After putting some timers around the place, it was clear that the terms-reduction part of the indexing algo is not really an issue; I suppose that makes sense, as it's all done in-memory, so it's fairly fast, and the time bound to the existing term value size is nothing major (relative to the actual input size of # terms per page). The real slowdown was the N individual lookups for existing terms for N terms in a page (when N is sufficiently large; for smaller inputs it's fast enough). Converting this to a single range lookup over the terms index took my indexing time for a page with 6.5k terms down from ~90s to ~20s (95% of that time is this lookup stage). More info in PR #213, where I've made a change to the indexing algo to decide whether to do a single range lookup vs N individual lookups, depending on how many terms are present in a page. 20s still feels fairly slow, but it's definitely a big improvement.
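A sketch of the two lookup strategies in raw IndexedDB terms; the store handle and promise wrapper are placeholder scaffolding, and PR #213 has the real change:

```js
// Wrap an IDBRequest in a promise.
function reqToPromise(req) {
  return new Promise((resolve, reject) => {
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

// Strategy 1: N individual lookups -- one get() request per term.
// Fine for small N, but each request carries its own overhead.
function lookupIndividually(store, terms) {
  return Promise.all(terms.map((term) => reqToPromise(store.get(term))));
}

// Strategy 2: one cursor over a key range spanning all wanted terms,
// filtering as it goes. A single request, so it wins when N is large.
function lookupByRange(store, terms) {
  const wanted = new Set(terms);
  const sorted = [...terms].sort();
  const range = IDBKeyRange.bound(sorted[0], sorted[sorted.length - 1]);
  const found = new Map();
  return new Promise((resolve, reject) => {
    const req = store.openCursor(range);
    req.onerror = () => reject(req.error);
    req.onsuccess = () => {
      const cursor = req.result;
      if (!cursor) return resolve(found);
      if (wanted.has(cursor.key)) found.set(cursor.key, cursor.value);
      cursor.continue();
    };
  });
}
```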
Ok, understood. This would indeed be a problem.
Amazing find!!!!
Could this not lead to a problem, as the whole term index is loaded into memory? Or how is that lookup handled internally? EDIT: we can continue this question in #213
@oliversauter brought up a good point recently: if the term values don't need to change, any time we revisit an existing site we would only need to index the new terms.
This isn't so much a problem with static pages, as each visit should give roughly the same input size. However, with dynamic sites the associated terms will grow over time with each visit (as we remember old terms), hence the indexing time will increase linearly with the # terms.
Currently we need to re-index existing terms to update their latest visit time value, which is used in search for scoring.
So, the main question to look into and break down more later: […]