
Look into making search index values static #187

Open
poltak opened this issue Nov 12, 2017 · 9 comments

Comments

poltak commented Nov 12, 2017

@oliversauter brought up a good point recently: if the term values don't need to change, then any time we revisit an existing site we would only need to index new terms.

This isn't so much a problem with static pages, as the input should be roughly the same size each time. However, with dynamic sites the associated terms will grow over time with each visit (as we remember old terms), hence the indexing time will increase linearly with the # of terms.

Currently we need to re-index existing terms to update their latest-visit-time value, which is used for scoring in search.

So, the main question to look into and break down further later:

  • how would search have to change if this value is removed

blackforestboi commented Nov 13, 2017

how would search have to change if this value is removed

Idea as mentioned in Slack:

"Is it possible to make the time order implicit by making sure the last-visited doc is always put first in the array? This way we would make sure a chronological order is maintained, without storing the lastVisitTime?"


blackforestboi commented Nov 14, 2017

This is something we could use your help and ideas on, @blahah, as it is pretty fundamental.

The user-facing problem we are trying to solve is that once a couple of thousand pages are indexed, re-indexing pages can take a while, especially if they are huge, like this one.
But mostly we would just re-index the same words again, which causes unnecessary work/load.

The problem on the index level is that we store a lastVisitTime with every page-ID value of the term keys, so that we can score the results and retrieve them quickly, because we can stream them.
But this also requires that we go over each term of a page that is indexed again, because we need to update the lastVisitTime for that particular page.

Next problem: if we were to remove the lastVisitTime from a term, we could update/append only the terms that have changed (which is good). But that would increase the time for a search to resolve if a lot of pages are associated with a term, as we cannot stream them anymore and would need to request the lastVisitTime of each page. That means with 800 results, 800 additional requests.
It seems that could take a while. How can we solve this? What is the best tradeoff between search speed and indexing speed?

I had another idea that might also solve this problem from another angle, at least in part.
The problem with long indexing times at the moment is not only the time it takes for an article to re-index, but also that it blocks searches from being executed, as they are queued up behind it. Maybe a solution would be to have the queue operate on a term level as well; this way we could prioritise a search operation so that it is executed between the terms being indexed.

This of course doesn't solve the issue of the time it takes to re-index a page when the store is already quite full, but it would be another layer of improvement.
Another thing we would need before we can do this is a module that can go through the index and rewrite it.

@poltak poltak changed the title Look into making terms index values static Look into making search index values static Nov 16, 2017
@blackforestboi

I have another idea for how to handle the removal of timestamps from the terms.

Since we can pretty easily determine the number of results, we could just show a disclaimer to the user that, with 300+ results, it takes longer to load them. That way it doesn't look like the tool is broken; we are simply upfront that loading takes a while.

While it is loading, we can show something like: "We have found more than 500 pages matching this search; loading may take a bit."

This could work in combination with the domain clustering of common terms for a given domain, as we could score results coming from a domain much lower than results from a page where the words are unique to the article.

Problems:

Removing the timestamp from the term might also affect the ability to quickly filter by time.

@blahah, are there other ways to speed up the lookup/write time in IndexedDB that don't slow down as the terms grow in size?


blahah commented Nov 23, 2017

The key question is where the slowdown is coming from. If it's from having to update the lastVisitTime of each page, there's a solution based on your idea above that solves it without slowing down search.

Instead of the term keys having an array of document IDs as the value, they would have an object like this:

{
  docs: ['id1', 'id2'],
  lastVisitTimes: {
    id1: 32492304923, // this is the timestamp
    id2: 32492307722
  }
}

Indexing

When indexing, you only have to update the entry for each term being indexed, as follows:

  1. entry is the value for the term key
  2. check if the current document ID is in the entry.docs array; if so, remove it
  3. add the document ID to the beginning of the entry.docs array
  4. set entry.lastVisitTimes[docId] to the last visit time of the current document
  5. write the entry object back to the db

By always putting the updated document ID at the beginning of the array, you don't need the lastVisitTime for sorting the results - the array is already sorted in the correct order, with the most recently visited documents first.
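
For concreteness, a minimal sketch of that update step, assuming a promise-based key-value store with get/put methods (the store interface and all names here are hypothetical, not Memex's actual API):

// Sketch of the proposed update step (hypothetical store interface).
async function updateTermEntry(db, term, docId, lastVisitTime) {
  let entry
  try {
    entry = await db.get(term)
  } catch (err) {
    entry = { docs: [], lastVisitTimes: {} } // term not indexed yet
  }

  // Step 2: remove the doc ID if it is already present...
  const existingIndex = entry.docs.indexOf(docId)
  if (existingIndex !== -1) {
    entry.docs.splice(existingIndex, 1)
  }

  // Step 3: ...then prepend it, keeping the array sorted
  // most-recently-visited first.
  entry.docs.unshift(docId)

  // Step 4: record the visit time for this doc.
  entry.lastVisitTimes[docId] = lastVisitTime

  // Step 5: write the entry back.
  await db.put(term, entry)
}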

Note that if you were to store an array of objects [{ docid: '...', lastVisittime: '...' }] rather than an object of arrays as I suggest, it would be much much slower to check whether the docid exists in the array and remove it, which is crucial to maintaining the array presorted in the order required for returning results. Finding an exact string match in an array is very fast (js is optimised for this), and doesn't require checking each entry in the array. But checking for the presence of an object with a key matching a string requires checking each entry in the array.

Searching

Do exactly what you do already, except pass the lastVisitTime from the term object with each result, instead of getting it from the document itself.

@blackforestboi

Thanks for sharing your thoughts!

If it's from having to update the lastVisitTime of each page

How can we find out? What tests can we run to find other blockers? If that is a problem, I can imagine there are others as well.

Do exactly what you do already, except pass the lastVisitTime from the term object with each result, instead of getting it from the document itself.

Yeah, I just thought about that problem as well, in the context of searching for multiple terms and having to combine both result sets.
How would we, for example, compare both lists of docs? Idea: we could take the bigger list and its order, and remove the pages that are not in both. This way we make sure we have all the docs and their order sorted; then we can query more data, like the exact visit time etc., from the DB.
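
As a sketch of that intersection idea, assuming each term lookup yields an array of doc IDs already sorted most-recently-visited first (names hypothetical):

// Intersect two per-term doc lists, preserving the bigger list's order.
function intersectResults(listA, listB) {
  const [bigger, smaller] =
    listA.length >= listB.length ? [listA, listB] : [listB, listA]
  const inSmaller = new Set(smaller)

  // Keep only docs present in both lists, in the bigger list's order.
  return bigger.filter(docId => inSmaller.has(docId))
}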


poltak commented Nov 24, 2017

Thanks for the ideas @blahah. Though I can't really see how this structure would improve things time-wise for indexing (I understand how using an array may make things simpler at the sorting stage in search, but I'm focusing on indexing only, as that's the main issue here).

Currently the term values have a dictionary-like structure (stored as Maps): doc IDs pointing to a metadata object. That metadata object currently contains only a latest string value with the timestamp (it was made an object because I had no idea whether more data would need to be stored per-doc per-term when it was first implemented; the object itself is ignored in our algorithms, and latest is always accessed directly per doc).

{
  id1: { latest: '032492304923' },
  id2: { latest: '032492307722' },
}

How each term value is reduced at indexing time (see the sketch below):

  1. create a new dict with a single entry { 'newDocId': { latest: 'currentTime' } }
  2. if there is no existing dict, finish with the result of step 1
  3. merge the existing dict with the result of step 1 (union on keys, the new entry taking precedence if the same key exists in the existing dict)
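
A minimal sketch of those steps, assuming the Map-based term values described above (names hypothetical):

// Current term-value reduction, as described in steps 1-3.
function reduceTermValue(existing, docId, currentTime) {
  // Step 1: new dict with a single entry for the current doc.
  const update = new Map([[docId, { latest: currentTime }]])

  // Step 2: nothing to merge if the term has no existing entries.
  if (existing == null) {
    return update
  }

  // Step 3: union on keys; the new entry wins on conflict.
  // Spreading both Maps makes this linear in the number of existing entries.
  return new Map([...existing, ...update])
}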

I think this is O(# existing entries) time because of step 3. I think your term-value reduction algorithm is also O(# existing entries) time, as step 2 would be linear in entry.docs.length (although you said JS optimises this somehow and doesn't need to check each value? How does that work?), and I think step 3 would be as well if inserting at the beginning, as it would need to re-index/re-key all existing array values (couldn't you reduce this step to constant time by appending and assuming the opposite order? Maybe I misunderstand the Array object).

The key question is where the slowdown is coming from.

It has only ever been reported when there is a lot of existing data (correct me if I'm wrong @oliversauter); I've never been able to reproduce it, although that is probably because my DB is fairly empty most of the time, as I'm nuking it multiple times a day during dev, depending on what feature I'm working on. I suppose this makes sense if the current time is linear in the # of existing entries (for a given term).

After writing out the existing algorithm above, I think complexity could be reduced to "Map insertion time" by just doing existingDict.set('newDocId', { latest: 'currentTime' }) for step 3 rather than a merge/union (as it's not really necessary here; each run is only ever adding a single entry)?
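
That variant would look something like this (same hypothetical names as the sketch above):

// Constant-time variant: mutate the existing Map in place instead of merging.
function reduceTermValueFast(existing, docId, currentTime) {
  const dict = existing != null ? existing : new Map()
  dict.set(docId, { latest: currentTime }) // Map insertion, no full merge
  return dict
}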

I'm pretty terrible at explaining things, so let me know if my explanation of what currently happens doesn't make sense, and also whether you think I'm missing anything in my understanding of your algorithm (probably those JS optimisations come into play here).

@blackforestboi

Another idea I had today is to get rid of the time sorting altogether for now.
We would simply stream the results into the list, regardless of the order of pages.

This way we would still have quick searches, but could also enable quick indexing by skipping terms that have stayed the same.
The weighting of results would then, for now, only be via the separation between title + URL + body content.


poltak commented Nov 25, 2017

Another idea I had today is to get rid of the time sorting altogether for now.

Pretty sure this would make search non-deterministic; a search for "google" could return different results every time it's run. The big complication would be with pagination, as you can't simply say "give me the next page of results" anymore. You would have to keep getting more results and compare them to the ones you already have in state until you have enough new results to make a new page.
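
To illustrate that complication, a sketch of the pattern described above (hypothetical names, not actual Memex code):

// Without a stable sort, each "next page" must be de-duplicated against
// everything the UI has already shown.
async function fetchNextPage(getResults, seenIds, pageSize) {
  const fresh = []
  let offset = 0
  while (fresh.length < pageSize) {
    const batch = await getResults({ offset, limit: pageSize })
    if (batch.length === 0) break // no more results at all
    for (const result of batch) {
      if (!seenIds.has(result.id)) {
        seenIds.add(result.id)
        fresh.push(result)
      }
    }
    offset += batch.length
  }
  return fresh
}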

RE:

After writing out the existing algorithm above, I think complexity could be reduced to "Map insertion time" by just doing existingDict.set('newDocId', { latest: 'currentTime' }) for step 3 rather than a merge/union (as it's not really necessary here; each run is only ever adding a single entry)?

Imported ~6k docs this morning and compared these algorithm changes, but there was no noticeable improvement. After putting some timers around the place, it was clear that the terms-reduction part of the indexing algo is not really an issue; I suppose that makes sense as it's all done in-memory, so it can run fairly fast, and its time bound on the existing term-value size is nothing major (relative to the actual input size of # terms per page).

The real slowdown was the N individual lookups for existing terms, for N terms in a page (when N is sufficiently large; for smaller inputs it's fast enough). Converted this to a single range lookup over the terms index and it took my indexing time for a page with 6.5k terms down from ~90s to ~20s (95% of that time is this lookup stage). More info in PR #213, where I've made a change to the indexing algo to decide whether to do a single range lookup vs N individual lookups, depending on how many terms are present in a page. 20s still feels fairly slow, but it's definitely a big improvement.
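
For illustration, the decision might look roughly like this, assuming a levelup-style store where term keys share a common prefix (the names, prefix, and threshold are hypothetical; the actual change is in PR #213):

const LOOKUP_THRESHOLD = 1000 // hypothetical cutoff

async function lookupExistingTerms(db, termKeys) {
  const entries = new Map()

  if (termKeys.length < LOOKUP_THRESHOLD) {
    // Few terms: N individual gets are cheap enough.
    for (const key of termKeys) {
      try {
        entries.set(key, await db.get(key))
      } catch (err) {
        // term not indexed yet; skip it
      }
    }
    return entries
  }

  // Many terms: one range scan over the terms index beats N random gets.
  const wanted = new Set(termKeys)
  return new Promise((resolve, reject) => {
    db.createReadStream({ gte: 'term/', lt: 'term/\uffff' })
      .on('data', ({ key, value }) => {
        if (wanted.has(key)) entries.set(key, value)
      })
      .on('error', reject)
      .on('end', () => resolve(entries))
  })
}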


blackforestboi commented Nov 25, 2017

Big complication would be with pagination

OK, understood; this would indeed be a problem.

Converted this to a single range lookup over the terms index and it took my indexing time for a page with 6.5k terms down from ~90s to ~20s

Amazing find!!!!

Converted this to a single range lookup over the terms index

Could this not lead to a problem, as the whole term index is loaded into memory? Or how is that lookup handled internally?
If so, maybe we need to batch it?

EDIT: we can continue this question in #213

@blackforestboi blackforestboi changed the title Look into making search index values static MTNI-17 ⁃ Look into making search index values static Apr 19, 2018
@blackforestboi blackforestboi changed the title MTNI-17 ⁃ Look into making search index values static Look into making search index values static Apr 19, 2018