Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial page visit indexing #268

Merged
merged 12 commits into from
Feb 6, 2018
Merged

Conversation

poltak
Copy link
Member

@poltak poltak commented Jan 23, 2018

  • fixes Store urls and titles for all pages, even those which are closed quickly again #232
  • description of new page visits process in 8cbda3d
  • a lot of cleanup of modules related to the page visit process (log-page-visit, store-page, page-analysis)
  • new index method for stage ii allowing updating just the terms for an existing page
  • removal of unused pouch visits during page visit process
  • cleanup of index pipeline code, and allowing it to work with no main terms content (stage i)
  • update of terms search implementation to merge in title/URL terms results with content terms results
  • implementation of new PausableTimer class: state wrapper around setTimeout (7b8c4c9)

TODO/enhancements:

  • stage ii still does the old behaviour of "waiting" for at least 10s. This waiting includes when the tab is not active. Recent work on the new tabs state means we have access to each tab's accumulated active time. Maybe we could try to use this as a smarter way to decide when to trigger stage ii
  • see if we can get the webNav API's transition data stored in tabs state. This is only available from webNav.onCommitted event (happens before the webNav.onCompleted event that we use for stage i), but it should be able to persist in tab state for later use
  • the webNavigation API events used for stage i seem to both fire on certain pages - find a better way to ensure stage i only can ever happen per page

@poltak poltak force-pushed the feature/initial-page-index branch 3 times, most recently from 64f5c94 to bc969c5 Compare January 26, 2018 04:41
@blackforestboi
Copy link
Member

I tried it with this article: newyorker.com/tech/elements/the-mission-to-decentralize-the-internet

It shows up in the list of results when no query is typed, and also in the index all the data is there. However, when searching for "decentralize" or "internet", no results come up.

@poltak
Copy link
Member Author

poltak commented Jan 29, 2018

@oliversauter looked into this, but seems like it is because of a logical bug in the existing terms search implementation which ignored title and URL terms if there wasn't a corresponding content term entry. Can reproduce on master. I've made an update to the search algo in here (as init visit is pretty useless without working title/URL search!) to fix that behaviour. Great find!

@poltak poltak force-pushed the feature/initial-page-index branch from 7564252 to 2238857 Compare January 29, 2018 04:33
@blackforestboi
Copy link
Member

Went over it again, and a few changes still necessary:

  1. Renaming the text for indexing into:
  • 'Indexing page title & url: url' for the first index
  • 'Indexing Page Content (X Terms): url' for the content index
    Whereas 'url'is the current url.
  1. Somehow this page now takes 18 seconds to index the content. Wonder if it is just an error in calculation in respect to the start of the index, or the actual index time. The latter would not be ideal :)

screen shot 2018-02-05 at 18 36 30

- will be needed for #232: when tab first loads, the title will be set to URL. We want to index right after the tab title changes
- this won't get every site. some more dynamic sites (like youtube) will go through multiple title changes in a load; still should cover a large majority
- terms extraction happening multiple places when it is a common behaviour
- exported term extraction + URL extraction to use outside of main pipeline
- was just weird before; passing the Promise from the pipeline to `performIndexing` - why not wait until resolve first?
- updated JSDoc
- remove unused conc-unsafe `addPage`
- general use case until now is to not index any pages without terms content (reject at pipeline stage)
- also `store-page` always invoked analysis
- now we're re-using these modules for "stub"/initial indexing so we want to be able to skip it sometimes
@poltak poltak force-pushed the feature/initial-page-index branch from 1e7fe90 to 382b069 Compare February 6, 2018 03:18
- a lot of shared code - need to unify this stuff
- TODO: general clean up of log-page-visit and called modules (store-page, store-visit) - huge mess
- broken state

Move init page index to `webNav.onCompleted` event

- prev was the "first title update" event, derived from `tabs.onUpdated` event
- this one seems more appropriate after learning about the webnav API - title and URL should always be ready after the first fire

Remove pouch visit generation in log-page-visit

- completely unused in the extension - still being created in the imports scenario but we can clean later
- now the init page + visit should happen when the page first loads, then the full page index happens later (no need to revisit or update non-terms content)

Update init indexing event trigger

- `webNav.onCompleted` event triggers for every frame including nested iframes
- filter out just the top level page and do for that (no need to worry about debounce)

Clean up log-page-visit module; working with init index

- the init and delayed page indexing now set up to work alright
- can see about optimising the delayed indexing (page content) to skip steps that are already taken care of in the init (title, url)
- should also stop the scheduled indexing if the init indexing was skipped (too recent, or error in creating stub)

Add missing docs on store-page, log-page-visit modules

- this part of codebase starting to become more understandable

Allow clearing of content index if init index fails

- or if init index skipped (because last index was too recent - currently set at 20s)

Write "add page terms" indexing method

Move index queuing HOF from tags module to search-index

- can be reused in existing search-index code, not just tags
- updated add module to use this instead of wrapping a Promise around a indexQueue.push

Write "add page terms" indexing method

- similar to other page indexing methods, but assumes existence of page, simply merging the new terms with existing terms
- reverse page and terms indexes are updated by this method
- set up as the delayed stage of page visit indexing
Clean up a lot of existing page visit logic

- mainly store-page and page-analysis modules were overly complicated with type signatures - returning promises of promises and nested properties within objects
- simplified as much as possible to simply return or resolve to the page doc that is the main piece of data produced from this process
- added a bunch of JSDoc to interface fns

Quick fix for error spam with scrolling on untracked tabs

- if a browser tab isn't in tab state (reload bg script, for example), the content script's scroll event will still try to send data and update that tab's state
- now allow it to fail gracefully and not spam the console
Add stage i on `onHistoryStateUpdated` event

- SPAs using client side routing (History API) won't trigger the webNav.onCompleted event
- still treated the same though

Keep track of last nav event's data for each tab

- explained more in the JSDoc for the event listener
- TabManager is essentially a Map<number, Tab>
- makes sense to put own Tab state mutations in own class
Ensure init page indexing only called once

- some pages (news.google) fire off both events in their navigation process
- pass same debounced handler to both to ensure only the latest one actually invokes the log

Implement active-time-based page content indexing

- instead of simply being a ~10s delay from the time the page is opened, it now does stage ii processing (page content) after ~10s of a user being active on the page (time away from the page doesn't count)

Modularize page-analysis code

- refactoring commit
Create PauseableTimer class wrapping setTimeout

- acts like setTimeout but affords being paused and resumed, while keeping remaining time state

Move init visit indexing to tab.onUpdated event

- exactly the same spot as stage ii, but stage ii is scheduled for later execution
- will happen as soon as DOM is loaded
- the webNavigation API events proved to be quite problematic on some dynamic pages that have slightly different navigation process

Update Tab class to use PausableTimer for scheduling log

- pauses and resumes on active state changes
- meaning the timer will count down while that tab is active by the user; time in background doesn't count

Update search to handle title/URL terms w/o content terms

- if pages are indexed with title/URL terms that don't appear in the content terms (`terms/` index), search won't find them! (short circuited early if not enough results for content terms search)
- now this runs title/URL terms search regardless of outcome of content terms, merging the results later

Put a catch on missing tab errs - too much console spam

- missing tab can occur if you have tabs not assoc. with the ext (any existing tabs when you install/reload the ext)
- this module has been cleaned up a lot now, so nice to have a brief overview of how it works to clear things up for others in the future
- also updated some comments to correspond to listed stages

Add missing docs for Tab class
@poltak poltak force-pushed the feature/initial-page-index branch from ce99210 to 61018bb Compare February 6, 2018 03:40
@poltak
Copy link
Member Author

poltak commented Feb 6, 2018

Forgot this work was still pending! @oliversauter added those requested dev console logs, although there's no access to the URL at the indexing terms stage, so logs the encoded ID version instead.

Can't reproduce 2 here. If reproducible for you, will have to get some info like # docs in DB. I think main thing is the size of your terms index. Terms indexing time is described here. Shouldn't be related to this work though.

You could paste this into background script dev console to check your terms index size, and I could try to reproduce with similar size:

((i = 0) => index.db.createReadStream({ gte: 'term/', lte: 'term/\uffff' }).on('data', () => (i += 1)).on('end', () => console.log(i)))()

@blackforestboi
Copy link
Member

I could reproduce it :)

It was because I had Memex active twice (one prod, one testing). So it had to do the indexing on each. If only one is activated it takes 1.6 seconds.

I'll merge this one in.

@blackforestboi blackforestboi merged commit bba8706 into master Feb 6, 2018
@poltak poltak deleted the feature/initial-page-index branch February 13, 2018 03:20
@blackforestboi blackforestboi changed the title Initial page visit indexing MTNI-203 ⁃ Initial page visit indexing Apr 19, 2018
@blackforestboi blackforestboi changed the title MTNI-203 ⁃ Initial page visit indexing Initial page visit indexing Apr 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants