Initial page visit indexing #268
64f5c94 to bc969c5
I tried it with this article: newyorker.com/tech/elements/the-mission-to-decentralize-the-internet. It shows up in the list of results when no query is typed, and all the data is there in the index. However, when searching for "decentralize" or "internet", no results come up.
@oliversauter looked into this; it seems to be caused by a logical bug in the existing terms search implementation, which ignored title and URL terms if there wasn't a corresponding content term entry. Can reproduce on
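To make the bug described above concrete, here is a minimal sketch of the intended behaviour. It assumes three hypothetical in-memory indexes (`contentIndex`, `titleIndex`, `urlIndex`) and a hypothetical `searchTerm` function; none of these names come from the extension's code. Every index is queried regardless of what the content lookup returns, and the page IDs are merged afterwards.

```javascript
// Hypothetical in-memory indexes: term -> Set of page IDs.
// The real extension keeps these in its search index; this only shows the merge logic.
const contentIndex = new Map([['blockchain', new Set(['page/a'])]])
const titleIndex = new Map([['decentralize', new Set(['page/b'])]])
const urlIndex = new Map([
    ['internet', new Set(['page/b'])],
    ['blockchain', new Set(['page/c'])],
])

const lookup = (index, term) => index.get(term) || new Set()

// Query all three indexes, even when the content lookup finds nothing,
// then merge the page IDs into one de-duplicated result list.
function searchTerm(term) {
    const merged = new Set()
    for (const index of [contentIndex, titleIndex, urlIndex]) {
        for (const pageId of lookup(index, term)) {
            merged.add(pageId)
        }
    }
    return [...merged]
}
```

With this shape, a term that only appears in a page's title or URL (like "decentralize" in the example data) is still found, because an empty content result no longer ends the search early.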
7564252 to 2238857
Went over it again, and a few changes are still necessary:
- will be needed for #232: when tab first loads, the title will be set to URL. We want to index right after the tab title changes
- this won't get every site. Some more dynamic sites (like YouTube) will go through multiple title changes in a load; still should cover a large majority

- terms extraction happening multiple places when it is a common behaviour
- exported term extraction + URL extraction to use outside of main pipeline

- was just weird before; passing the Promise from the pipeline to `performIndexing`
- why not wait until resolve first?
- updated JSDoc
- remove unused conc-unsafe `addPage`

- general use case until now is to not index any pages without terms content (reject at pipeline stage)
- also `store-page` always invoked analysis
- now we're re-using these modules for "stub"/initial indexing so we want to be able to skip it sometimes
1e7fe90 to 382b069
- a lot of shared code
- need to unify this stuff
- TODO: general clean up of log-page-visit and called modules (store-page, store-visit)
- huge mess
- broken state

Move init page index to `webNav.onCompleted` event
- prev was the "first title update" event, derived from `tabs.onUpdated` event
- this one seems more appropriate after learning about the webnav API
- title and URL should always be ready after the first fire

Remove pouch visit generation in log-page-visit
- completely unused in the extension
- still being created in the imports scenario but we can clean later
- now the init page + visit should happen when the page first loads, then the full page index happens later (no need to revisit or update non-terms content)

Update init indexing event trigger
- `webNav.onCompleted` event triggers for every frame including nested iframes
- filter out just the top level page and do for that (no need to worry about debounce)

Clean up log-page-visit module; working with init index
- the init and delayed page indexing now set up to work alright
- can see about optimising the delayed indexing (page content) to skip steps that are already taken care of in the init (title, url)
- should also stop the scheduled indexing if the init indexing was skipped (too recent, or error in creating stub)

Add missing docs on store-page, log-page-visit modules
- this part of codebase starting to become more understandable

Allow clearing of content index if init index fails
- or if init index skipped (because last index was too recent - currently set at 20s)

Write "add page terms" indexing method

Move index queuing HOF from tags module to search-index
- can be reused in existing search-index code, not just tags
- updated add module to use this instead of wrapping a Promise around a indexQueue.push

Write "add page terms" indexing method
- similar to other page indexing methods, but assumes existence of page, simply merging the new terms with existing terms
- reverse page and terms indexes are updated by this method
- set up as the delayed stage of page visit indexing
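The top-level frame filtering from the "Update init indexing event trigger" commit above could look roughly like the sketch below. It relies on the documented `frameId` field of webNavigation event details, where `0` identifies the tab's top-level frame. `makeNavListener` and `indexInitPage` are hypothetical names used only for illustration, not the extension's actual functions.

```javascript
// In the webNavigation API, `frameId` is 0 for the top-level frame of a tab;
// nested iframes get a positive ID.
const isTopLevelFrame = details => details.frameId === 0

// `indexInitPage` is a hypothetical callback standing in for the stage i logic.
function makeNavListener(indexInitPage) {
    return details => {
        if (!isTopLevelFrame(details)) {
            return false // ignore navigations inside iframes
        }
        indexInitPage({ tabId: details.tabId, url: details.url })
        return true
    }
}

// Only register when running inside an extension background script.
if (typeof browser !== 'undefined' && browser.webNavigation) {
    browser.webNavigation.onCompleted.addListener(
        makeNavListener(page => console.log('init index', page.url))
    )
}
```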

Clean up a lot of existing page visit logic
- mainly store-page and page-analysis modules were overly complicated with type signatures
- returning promises of promises and nested properties within objects
- simplified as much as possible to simply return or resolve to the page doc that is the main piece of data produced from this process
- added a bunch of JSDoc to interface fns

Quick fix for error spam with scrolling on untracked tabs
- if a browser tab isn't in tab state (reload bg script, for example), the content script's scroll event will still try to send data and update that tab's state
- now allow it to fail gracefully and not spam the console

Add stage i on `onHistoryStateUpdated` event
- SPAs using client side routing (History API) won't trigger the webNav.onCompleted event
- still treated the same though

Keep track of last nav event's data for each tab
- explained more in the JSDoc for the event listener
- TabManager is essentially a Map<number, Tab>
- makes sense to put own Tab state mutations in own class
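As a rough illustration of the "`Map<number, Tab>`" idea above, the sketch below keeps tab-specific mutations on a `Tab` class and leaves lookup and lifecycle to a `TabManager`. The fields and method names (`scrollState`, `trackTab`, `updateTabScroll`, and so on) are assumptions made for this example and are not taken from the extension's code. It also shows one way to skip untracked tabs quietly, in the spirit of the scroll error-spam fix.

```javascript
// Hypothetical per-tab state; the real Tab class likely tracks more than this.
class Tab {
    constructor({ isActive = false } = {}) {
        this.isActive = isActive
        this.scrollState = { pixel: 0 }
    }

    // Tab-specific mutations live on the Tab itself
    updateScroll(pixel) {
        this.scrollState.pixel = pixel
    }
}

class TabManager {
    constructor() {
        this._tabs = new Map() // Map<number, Tab>
    }

    trackTab(id, props) {
        this._tabs.set(id, new Tab(props))
    }

    getTabState(id) {
        return this._tabs.get(id)
    }

    removeTab(id) {
        return this._tabs.delete(id)
    }

    // Untracked tabs (e.g. open before the extension was reloaded) are skipped
    // quietly instead of throwing.
    updateTabScroll(id, pixel) {
        const tab = this._tabs.get(id)
        if (tab == null) {
            return false
        }
        tab.updateScroll(pixel)
        return true
    }
}
```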

Ensure init page indexing only called once
- some pages (news.google) fire off both events in their navigation process
- pass same debounced handler to both to ensure only the latest one actually invokes the log

Implement active-time-based page content indexing
- instead of simply being a ~10s delay from the time the page is opened, it now does stage ii processing (page content) after ~10s of a user being active on the page (time away from the page doesn't count)

Modularize page-analysis code
- refactoring commit
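The "pass same debounced handler to both" approach from the "Ensure init page indexing only called once" commit can be sketched as follows. This is a hand-rolled trailing-edge debounce, not necessarily the helper the extension uses; `logInitPageVisit` and the 200 ms wait are assumptions chosen only for illustration. The `timers` parameter exists purely so the sketch can be exercised without real delays.

```javascript
// Minimal trailing-edge debounce: only the last call within `wait` ms runs `fn`.
function debounce(
    fn,
    wait,
    timers = {
        setTimeout: (cb, ms) => setTimeout(cb, ms),
        clearTimeout: id => clearTimeout(id),
    }
) {
    let pending = null
    return (...args) => {
        if (pending !== null) {
            timers.clearTimeout(pending) // drop the earlier, now stale call
        }
        pending = timers.setTimeout(() => {
            pending = null
            fn(...args)
        }, wait)
    }
}

// `logInitPageVisit` is a hypothetical stand-in for the stage i handler.
const logInitPageVisit = details => console.log('stage i for', details.url)
const debouncedInitIndex = debounce(logInitPageVisit, 200)

// Both events share one debounced handler, so a page that fires both
// only triggers stage i for the latest event.
if (typeof browser !== 'undefined' && browser.webNavigation) {
    browser.webNavigation.onCompleted.addListener(debouncedInitIndex)
    browser.webNavigation.onHistoryStateUpdated.addListener(debouncedInitIndex)
}
```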

Create PauseableTimer class wrapping setTimeout
- acts like setTimeout but affords being paused and resumed, while keeping remaining time state

Move init visit indexing to tab.onUpdated event
- exactly the same spot as stage ii, but stage ii is scheduled for later execution
- will happen as soon as DOM is loaded
- the webNavigation API events proved to be quite problematic on some dynamic pages that have slightly different navigation process

Update Tab class to use PausableTimer for scheduling log
- pauses and resumes on active state changes
- meaning the timer will count down while that tab is active by the user; time in background doesn't count

Update search to handle title/URL terms w/o content terms
- if pages are indexed with title/URL terms that don't appear in the content terms (`terms/` index), search won't find them! (short circuited early if not enough results for content terms search)
- now this runs title/URL terms search regardless of outcome of content terms, merging the results later

Put a catch on missing tab errs
- too much console spam
- missing tab can occur if you have tabs not assoc. with the ext (any existing tabs when you install/reload the ext)
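For readers unfamiliar with the pausable timer idea in the commits above, here is a minimal sketch of a `setTimeout` wrapper that keeps track of the remaining time across pauses. It is an illustration written for this summary, not the class from the PR; the property and method names (`remaining`, `isRunning`, `pause`, `resume`, `clear`) are assumptions.

```javascript
// Sketch of a setTimeout wrapper that can be paused and resumed while
// remembering how much time is left.
class PausableTimer {
    constructor(callback, delay) {
        this.callback = callback
        this.remaining = delay
        this.timeoutId = null
        this.startedAt = null
        this.done = false
        this.resume()
    }

    get isRunning() {
        return this.timeoutId !== null
    }

    pause() {
        if (!this.isRunning) {
            return
        }
        clearTimeout(this.timeoutId)
        this.timeoutId = null
        // Subtract the time that elapsed during this run
        const elapsed = Date.now() - this.startedAt
        this.remaining = Math.max(0, this.remaining - elapsed)
    }

    resume() {
        if (this.isRunning || this.done) {
            return
        }
        this.startedAt = Date.now()
        this.timeoutId = setTimeout(() => {
            this.timeoutId = null
            this.remaining = 0
            this.done = true
            this.callback()
        }, this.remaining)
    }

    clear() {
        clearTimeout(this.timeoutId)
        this.timeoutId = null
        this.remaining = 0
        this.done = true
    }
}
```

In the active-time scenario described above, a tab would call `pause()` when it goes to the background and `resume()` when it becomes active again, so only time spent on the page counts towards the ~10s before stage ii.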
- this module has been cleaned up a lot now, so nice to have a brief overview of how it works to clear things up for others in the future
- also updated some comments to correspond to listed stages

Add missing docs for Tab class
- requested
ce99210 to 61018bb
Forgot this work was still pending! @oliversauter added those requested dev console logs, although there's no access to the URL at the indexing terms stage, so it logs the encoded ID version instead.

Can't reproduce 2 here. If it's reproducible for you, I'll need some info like # docs in DB. I think the main thing is the size of your terms index. Terms indexing time is described here. Shouldn't be related to this work though.

You could paste this into the background script dev console to check your terms index size, and I could try to reproduce with a similar size: `((i = 0) => index.db.createReadStream({ gte: 'term/', lte: 'term/\uffff' }).on('data', () => (i += 1)).on('end', () => console.log(i)))()`
I could reproduce it :) It was because I had Memex active twice (one prod, one testing), so it had to do the indexing on each. If only one is activated it takes 1.6 seconds. I'll merge this one in.
- (`log-page-visit`, `store-page`, `page-analysis`)
- `PausableTimer` class: state wrapper around `setTimeout` (7b8c4c9)

TODO/enhancements:
- `webNav` API's transition data stored in tabs state. This is only available from the `webNav.onCommitted` event (happens before the `webNav.onCompleted` event that we use for stage i), but it should be able to persist in tab state for later use
- `webNavigation` API events used for stage i seem to both fire on certain pages - find a better way to ensure stage i can only ever happen once per page