Initial page visit indexing #268

poltak · 2018-01-23T04:49:57Z

fixes Store urls and titles for all pages, even those which are closed quickly again #232
description of new page visits process in 8cbda3d
a lot of cleanup of modules related to the page visit process (log-page-visit, store-page, page-analysis)
new index method for stage ii allowing updating just the terms for an existing page
removal of unused pouch visits during page visit process
cleanup of index pipeline code, and allowing it to work with no main terms content (stage i)
update of terms search implementation to merge in title/URL terms results with content terms results
implementation of new PausableTimer class: state wrapper around setTimeout (7b8c4c9)

TODO/enhancements:

stage ii still does the old behaviour of "waiting" for at least 10s. This waiting includes when the tab is not active. Recent work on the new tabs state means we have access to each tab's accumulated active time. Maybe we could try to use this as a smarter way to decide when to trigger stage ii
see if we can get the webNav API's transition data stored in tabs state. This is only available from webNav.onCommitted event (happens before the webNav.onCompleted event that we use for stage i), but it should be able to persist in tab state for later use
the webNavigation API events used for stage i seem to both fire on certain pages - find a better way to ensure stage i only can ever happen per page

blackforestboi · 2018-01-26T19:36:00Z

I tried it with this article: newyorker.com/tech/elements/the-mission-to-decentralize-the-internet

It shows up in the list of results when no query is typed, and also in the index all the data is there. However, when searching for "decentralize" or "internet", no results come up.

poltak · 2018-01-29T04:08:59Z

@oliversauter looked into this, but seems like it is because of a logical bug in the existing terms search implementation which ignored title and URL terms if there wasn't a corresponding content term entry. Can reproduce on master. I've made an update to the search algo in here (as init visit is pretty useless without working title/URL search!) to fix that behaviour. Great find!

blackforestboi · 2018-02-05T17:41:10Z

Went over it again, and a few changes still necessary:

Renaming the text for indexing into:

'Indexing page title & url: url' for the first index
'Indexing Page Content (X Terms): url' for the content index
Whereas 'url'is the current url.

Somehow this page now takes 18 seconds to index the content. Wonder if it is just an error in calculation in respect to the start of the index, or the actual index time. The latter would not be ideal :)

- will be needed for #232: when tab first loads, the title will be set to URL. We want to index right after the tab title changes - this won't get every site. some more dynamic sites (like youtube) will go through multiple title changes in a load; still should cover a large majority

- terms extraction happening multiple places when it is a common behaviour - exported term extraction + URL extraction to use outside of main pipeline

- was just weird before; passing the Promise from the pipeline to `performIndexing` - why not wait until resolve first? - updated JSDoc - remove unused conc-unsafe `addPage`

- general use case until now is to not index any pages without terms content (reject at pipeline stage) - also `store-page` always invoked analysis - now we're re-using these modules for "stub"/initial indexing so we want to be able to skip it sometimes

- a lot of shared code - need to unify this stuff - TODO: general clean up of log-page-visit and called modules (store-page, store-visit) - huge mess - broken state Move init page index to `webNav.onCompleted` event - prev was the "first title update" event, derived from `tabs.onUpdated` event - this one seems more appropriate after learning about the webnav API - title and URL should always be ready after the first fire Remove pouch visit generation in log-page-visit - completely unused in the extension - still being created in the imports scenario but we can clean later - now the init page + visit should happen when the page first loads, then the full page index happens later (no need to revisit or update non-terms content) Update init indexing event trigger - `webNav.onCompleted` event triggers for every frame including nested iframes - filter out just the top level page and do for that (no need to worry about debounce) Clean up log-page-visit module; working with init index - the init and delayed page indexing now set up to work alright - can see about optimising the delayed indexing (page content) to skip steps that are already taken care of in the init (title, url) - should also stop the scheduled indexing if the init indexing was skipped (too recent, or error in creating stub) Add missing docs on store-page, log-page-visit modules - this part of codebase starting to become more understandable Allow clearing of content index if init index fails - or if init index skipped (because last index was too recent - currently set at 20s) Write "add page terms" indexing method Move index queuing HOF from tags module to search-index - can be reused in existing search-index code, not just tags - updated add module to use this instead of wrapping a Promise around a indexQueue.push Write "add page terms" indexing method - similar to other page indexing methods, but assumes existence of page, simply merging the new terms with existing terms - reverse page and terms indexes are updated by this method - set up as the delayed stage of page visit indexing

Clean up a lot of existing page visit logic - mainly store-page and page-analysis modules were overly complicated with type signatures - returning promises of promises and nested properties within objects - simplified as much as possible to simply return or resolve to the page doc that is the main piece of data produced from this process - added a bunch of JSDoc to interface fns Quick fix for error spam with scrolling on untracked tabs - if a browser tab isn't in tab state (reload bg script, for example), the content script's scroll event will still try to send data and update that tab's state - now allow it to fail gracefully and not spam the console

Add stage i on `onHistoryStateUpdated` event - SPAs using client side routing (History API) won't trigger the webNav.onCompleted event - still treated the same though Keep track of last nav event's data for each tab - explained more in the JSDoc for the event listener

- TabManager is essentially a Map<number, Tab> - makes sense to put own Tab state mutations in own class

Ensure init page indexing only called once - some pages (news.google) fire off both events in their navigation process - pass same debounced handler to both to ensure only the latest one actually invokes the log Implement active-time-based page content indexing - instead of simply being a ~10s delay from the time the page is opened, it now does stage ii processing (page content) after ~10s of a user being active on the page (time away from the page doesn't count) Modularize page-analysis code - refactoring commit

Create PauseableTimer class wrapping setTimeout - acts like setTimeout but affords being paused and resumed, while keeping remaining time state Move init visit indexing to tab.onUpdated event - exactly the same spot as stage ii, but stage ii is scheduled for later execution - will happen as soon as DOM is loaded - the webNavigation API events proved to be quite problematic on some dynamic pages that have slightly different navigation process Update Tab class to use PausableTimer for scheduling log - pauses and resumes on active state changes - meaning the timer will count down while that tab is active by the user; time in background doesn't count Update search to handle title/URL terms w/o content terms - if pages are indexed with title/URL terms that don't appear in the content terms (`terms/` index), search won't find them! (short circuited early if not enough results for content terms search) - now this runs title/URL terms search regardless of outcome of content terms, merging the results later Put a catch on missing tab errs - too much console spam - missing tab can occur if you have tabs not assoc. with the ext (any existing tabs when you install/reload the ext)

- this module has been cleaned up a lot now, so nice to have a brief overview of how it works to clear things up for others in the future - also updated some comments to correspond to listed stages Add missing docs for Tab class

- requested

poltak · 2018-02-06T03:59:07Z

Forgot this work was still pending! @oliversauter added those requested dev console logs, although there's no access to the URL at the indexing terms stage, so logs the encoded ID version instead.

Can't reproduce 2 here. If reproducible for you, will have to get some info like # docs in DB. I think main thing is the size of your terms index. Terms indexing time is described here. Shouldn't be related to this work though.

You could paste this into background script dev console to check your terms index size, and I could try to reproduce with similar size:

((i = 0) => index.db.createReadStream({ gte: 'term/', lte: 'term/\uffff' }).on('data', () => (i += 1)).on('end', () => console.log(i)))()

blackforestboi · 2018-02-06T13:32:31Z

I could reproduce it :)

It was because I had Memex active twice (one prod, one testing). So it had to do the indexing on each. If only one is activated it takes 1.6 seconds.

I'll merge this one in.

poltak force-pushed the feature/initial-page-index branch 3 times, most recently from 64f5c94 to bc969c5 Compare January 26, 2018 04:41

poltak force-pushed the feature/initial-page-index branch from 7564252 to 2238857 Compare January 29, 2018 04:33

poltak added 4 commits February 6, 2018 10:18

Split up pipeline module into reusable parts

8fff43f

- terms extraction happening multiple places when it is a common behaviour - exported term extraction + URL extraction to use outside of main pipeline

Clean up async handling in search-index add module

3b248ca

- was just weird before; passing the Promise from the pipeline to `performIndexing` - why not wait until resolve first? - updated JSDoc - remove unused conc-unsafe `addPage`

poltak force-pushed the feature/initial-page-index branch from 1e7fe90 to 382b069 Compare February 6, 2018 03:18

poltak added 8 commits February 6, 2018 10:39

Move TabManager Tab states to own class

4546b99

- TabManager is essentially a Map<number, Tab> - makes sense to put own Tab state mutations in own class

Write README for activity-logger feature module

8eec153

- this module has been cleaned up a lot now, so nice to have a brief overview of how it works to clear things up for others in the future - also updated some comments to correspond to listed stages Add missing docs for Tab class

Add some more dev console logs

61018bb

- requested

poltak force-pushed the feature/initial-page-index branch from ce99210 to 61018bb Compare February 6, 2018 03:40

blackforestboi merged commit bba8706 into master Feb 6, 2018

poltak deleted the feature/initial-page-index branch February 13, 2018 03:20

poltak mentioned this pull request Feb 28, 2018

[WIP] Dexie-based search index implementation #322

Merged

44 tasks

blackforestboi changed the title ~~Initial page visit indexing~~ MTNI-203 ⁃ Initial page visit indexing Apr 19, 2018

blackforestboi changed the title ~~MTNI-203 ⁃ Initial page visit indexing~~ Initial page visit indexing Apr 19, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial page visit indexing #268

Initial page visit indexing #268

poltak commented Jan 23, 2018 •

edited by blackforestboi

Loading

blackforestboi commented Jan 26, 2018

poltak commented Jan 29, 2018

blackforestboi commented Feb 5, 2018

poltak commented Feb 6, 2018

blackforestboi commented Feb 6, 2018

Initial page visit indexing #268

Initial page visit indexing #268

Conversation

poltak commented Jan 23, 2018 • edited by blackforestboi Loading

blackforestboi commented Jan 26, 2018

poltak commented Jan 29, 2018

blackforestboi commented Feb 5, 2018

poltak commented Feb 6, 2018

blackforestboi commented Feb 6, 2018

poltak commented Jan 23, 2018 •

edited by blackforestboi

Loading