[WIP] Dexie-based search index implementation #322
Conversation
Wow, such great work @poltak. As already mentioned, when testing it 2 days ago I could finish 15k documents in about 1.5 hours :) But I also ran into a rather nasty bug (reproduction steps in the quoted comment below).
Also, as we talked about before: does it make sense to include a term-level concurrency mode, so that a search request never has to wait for a page to be indexed?
Addition: while the importer is going, the RAM load seems to gradually increase, so maybe some things aren't being properly released and are clogging the RAM?
Cool, thanks for the report. Will look into it.
There's no queue used for index ops currently. Dexie seems to manage this fine, but this is another thing I should try to break and get confirmation on. Will add to the list
Dexie puts us at a much higher level than the prev IndexedDB implementation, so worrying about indexing individual terms is no longer relevant
On Feb 28, 2018, at 16:41, Oliver Sauter wrote:
Wow, such great work @poltak. As already mentioned, when testing it 2 days ago I could finish 15k documents in about 1.5 hours :)
But I also ran into a rather nasty bug:
How to reproduce:
Let the importer run
After a while it hangs, the RAM shoots to almost 1 GB, and the CPU to 240%
Only way out is restarting the extension. In the latest test even that didn't help, and I could not progress with the imports anymore.
To get it working again, I had to change the blacklist to force a recalculation.
If you guys can't reproduce, I'd be happy to hop on a Skype session and test with you what is necessary.
Also what we talked about before, does it make sense to include a term level concurrency mode, so that in any case a search request does not have to wait for a page to be indexed?
Does this mean Dexie handles prioritisation between search requests and indexing itself?
Dexie handles it at another level, which it calls transactions. These are sets of DB ops that either all happen or none happen (if something goes wrong, it's rolled back). In our ext, things like adding a new page or searching are each a single transaction. Dexie will schedule transactions one after another if they write to the same table, but apparently allows them to run in parallel if they are just reading (search only needs to read). That big 6k+ terms United States wiki article now only takes ~400ms for me to index the terms. Takes > 2s for me on the
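The scheduling behaviour described above (serialised writes, parallel reads) can be sketched in miniature. This is a toy illustration only, not Dexie's actual implementation; all names here are made up:

```javascript
// Toy sketch of the scheduling idea only; NOT Dexie's internals.
// Writes to a table queue up behind one another; reads only wait for
// pending writes, so any number of searches can run in parallel.
class TableScheduler {
    constructor() {
        this.lastWrite = Promise.resolve();
    }

    // Writes chain onto the previous write so they never overlap
    write(op) {
        const run = this.lastWrite.then(() => op());
        // Keep the chain alive even if a write fails (i.e. gets "rolled back")
        this.lastWrite = run.catch(() => {});
        return run;
    }

    // Reads wait for pending writes, but not for each other
    read(op) {
        return this.lastWrite.then(() => op());
    }
}

// Example: one indexing "transaction" followed by two parallel searches
async function demo() {
    const sched = new TableScheduler();
    const log = [];
    sched.write(async () => log.push('index page'));
    const results = await Promise.all([
        sched.read(async () => { log.push('search A'); return 'A'; }),
        sched.read(async () => { log.push('search B'); return 'B'; }),
    ]);
    return { log, results };
}
```

In real code Dexie derives the read/write grouping from the transaction mode and the tables each transaction declares; the sketch just shows why reads don't block each other.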
Just tested your updates, and I want to note here that the problem with the stuck import still persists. I thought you maybe fixed it with c71ddeb. Could you reproduce it?
Haven't looked at this yet @oliversauter. It's on the radar.
Force-pushed from c71ddeb to 1071408
Force-pushed from 89c4f37 to 712fd0a
Force-pushed from b62f514 to a51d09c
@oliversauter spent a bit of time playing with imports; managed to import ~2k pages pretty quickly at 20 concurrency, and resource overhead stayed fairly constant the whole time. Didn't hit any hang or stall. We should talk about it briefly in the call tonight if it's still a problem. I found one possibly related issue: sometimes the imports progress doesn't properly stop when pause/cancel is pressed. The UI looks like it stopped fine, but in the background it's still downloading. That isn't good, as it could continue through the entire history, leaving the user confused as to why the browser is going so slow.
Another round of reviews:
@oliversauter search oddities are expected until I get around to going over all that code this week and verifying things. There are going to be lots of little things to fix up. 6 and 5 work fine for me. 6 will be greyed out if you open the popup before the initial indexing happens (DOM load + a few hundred ms); no changes there. 5 should show the correct total count as long as there are terms entered; all look correct for me with 4k pages to search through, but I'll keep an eye out for anything wrong. Planning to look into the possibility of counts for non-terms search this week. 4 and 1, like imports, aren't really related to these changes, but I'll look into including fixes for them once all the other stuff is done. One big issue to mention is that the new index currently won't run under Firefox, due to IndexedDB/Dexie transactions not working with FF's native
@oliversauter I'll let you know when I need any more manual testing or feedback, just to save time and effort.
Looks very good John, lots of great work done here! Regarding the testing, here are some thoughts:
One other question:
Found a bug with the domain filters.
@oliversauter good find! I had totally forgotten about implementing domains filter extraction from queries (different to the UI filter). Wrote a test to confirm it indeed works on the old index but fails on the new one, then wrote the code to ensure it now passes. Domains search without terms seems pretty slow right now (seems linear in something). Thanks a lot for the tests feedback @ShishKabab!
I like your points on hardcoding here instead of calling those ID-deriving fns. The URL normalization stuff should get its own tests and be considered an unrelated side-effect to the actual index tests; don't want to call it directly here. Although note that a few of the index methods still depend on it internally (anything that accepts a URL as a param will transform it into an ID). Done in
Yeah, this was something I felt was a bit yucky. Have changed those mutation tests (now calling them "read-write ops tests") so that the test data gets reset before each of them, and they don't make any assumptions about data based on other tests. Also ensured they all do assertions on results both before and after write ops. All the outer tests are now grouped under "read ops tests", which don't need to reset data each time. 35c2599
Updated names, etc. I like the idea of the separate test data modules. It may make tests less clear, but I think most editors should be able to easily show what's in the imported test data for reference. Should the expected values live with the test data? In this case we only need to assert matching IDs, so the expected values are all shared. 633f46c
Yes, this was a big code smell. It is implementation-specific to the Dexie Page model, and should be encapsulated in the new index method implementations. d2dcb45 Nice find!
Yes, this was weird. Basically it's the post-search stage, where constant-size display data fetching for results happens. Do you remember the weird way it was implemented before, where the images were fetched from Pouch in the UI layer?
Looks very cool John :)
Ah, missed that the code was there :) I've seen the code in the mapping function now, so it looks good. Very happy to see that code get out of the UI layer!
I'd say that most of the time you expect something you have put in, so it's in the data file anyway. For expected stuff that you didn't put in, I'd say to leave those things either in the tests themselves or in the test file. Mentally it makes the most sense to me that way.
:) Ok, seems to be the case for the #tags filter too.
Found another search weirdness. How to reproduce:
@oliversauter RE that error: are you sure you weren't on
Bit confused by the search bug you reported; step 2 seems to contradict step 1:
@poltak RE the bug: it also happens to me on a complete reinstall, using 'rhammer time' as the words ('rhammer' is on purpose).
That one actually was `Symbol.asyncIterator` not being defined ;)
Yeah, I was wrong. This one is saying a missing
RE search bug from @oliversauter:
Yeah, at least for now we should stay with that behaviour.
On this branch, I don't get the error: https://github.com/digi0ps/Memex/tree/search-injection/src This is my version: https://github.com/digi0ps/Memex/tree/search-injection/src
@oliversauter These are completely different branches. If it's reproducible for you on a particular Chrome/ium version, send me that number and I'll see if I can get it to reproduce.
Yeah, aware of that; just giving a reference to where it happens and where it doesn't, so the error might be easier to track. Oh, I thought I posted my Chromium version last night in a separate comment; didn't send it off.
- we haven't updated in about a year; lots of fixes made since then apparently, including some things we've seen, like disconnected port errors
- good example is overview: every time you change the search, URL state updates and tries to update a visit
Skip imports persisting any periods in history with no data
- history extracted in week periods
- if one week has no history (or all history is deduped), an empty chunk would be stored
- this may or may not be an issue for the reading end during imports progress when an empty chunk is retrieved

Update import-item-creation generators to be consistent

Update import-state's _getChunk to be an async generator
- I don't think it really makes a diff, but it seems much more natural to use generators all the way down here, rather than yielding resolved Promises

Ensure any import items finished after pause are sent to UI
- they will be finished anyway, but the user will have no idea about them as the message never got sent
- this way the message will always send, even if it arrives after the user presses the pause btn
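The two ideas in those commits (chunked, skip-empty-period item creation, expressed as an async generator) can be sketched roughly like this. All names and data shapes here are hypothetical, not the actual import-state code:

```javascript
// Hypothetical sketch: yield import items in fixed-size chunks, skipping any
// week period whose history is empty after dedup, so no empty chunk is ever
// stored or handed to the reading end.
async function* getChunks(weeks, chunkSize = 10) {
    for (const week of weeks) {
        // Dedup stand-in: drop items flagged as duplicates
        const items = week.items.filter(item => !item.duplicate);
        if (!items.length) {
            continue; // skip periods with no data entirely
        }
        for (let i = 0; i < items.length; i += chunkSize) {
            yield items.slice(i, i + chunkSize);
        }
    }
}
```

A consumer then just does `for await (const chunk of getChunks(weeks)) { ... }`, which is the "generators all the way down" shape mentioned above, rather than pulling pre-resolved Promises.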
- bit of an overview of purpose, architecture, and responsibilities of each of the parts

Update import readme diagram
- added in new web ext API abstractions
- removed "import" prefix in front of everything
- update readme text later
Fix up incorrect class prop JSDoc typings
- class props use @type rather than @property for some reason

Refactor-out cache logic from import-state class
- cache logic handles messy allocating into chunks and storing in local storage
- now import-state is just concerned with being an interface to fetch and remove import items
- also removed import-item-creation's reliance on import-state

Remove import-conn-handler dep on import-state
- it only needs to interface with it to get estimates; afford this through progress-manager

Remove import-state's rehydration of allowedTypes state
- I really don't think this does anything; it will be init'd from progress-manager

Abstract web ext API data sources behind class
- now `import-item-creation` accesses this interface rather than directly accessing the APIs
- still some cleanup and confirmation to do with this

Remove old ext migration from imports
- lots of code removal; should have gotten it all, but may be missing things

Simplify import data source abstraction
- provides interface between web ext API data sources for `import-item-creation`

Simplify import item creator
- move all data stuff to DataSources class (root ID BM tree still generated)
- DRY out the iteration code now that DataSources provides the same API for bookmarks and history

Improve way ItemCreator is passed around ImportState
- prev was creating a new instance whenever the cache was empty
- now you can pass a custom instance to ImportStateManager and it will tell that instance to reinit its data if necessary
- also fix regression with counts doubling

Revert "Ensure any import items finished after pause sent to UI"
This reverts commit 316b496.
- it would mean that XHR errors thrown on abort would be flagged as error'd items, so removed from the import item pool
- doesn't seem to be a way to differentiate the error on the XHR's error event

Ensure import state estimate counts init'd from cache
# Conflicts:
# src/background.js
Write mocks for imports cache and item-creators
- will let me test import-state without relying on those classes (item-creator to be tested separately; cache is an interface with local storage)

Update import class inputs for ease of re-use
- all constructors now accept objects
- most have defaults for the general flow, but tests can override certain things

Add mocks for hist/bms, blacklist + exist.keys lookups
- DataSources class can be mocked as a whole for hist/bms (instead of ItemCreator)
- blacklist is a separate thing, so just mock it to return false for every item
- logability check also mocked in the same way (return true for all)

Write tests for estimate counts derivation (w/wo cache)
- painful, but finally got it working

Write tests for import item iteration and marking off
- iteration used as the main progress thing -> keep requesting new chunks to go through
- marking off happens after each item is processed; they are removed from their chunk

Update tsconfig lib to es2017
- we're making use of es2017 stuff in the rest of the codebase and compiling with babel

Set up URL list for history/bookmarks test sources
- found this project which makes it simple: https://github.com/citizenlab/test-lists
- now tests have a history size of ~280; some fun URLs in there too
- updated the removal test to be more thorough: remove lots of items from each chunk and calculate the expected changes

Add mock for URL normalization module
- messes with lots of tests
- replaced extra input param on ItemCreator that was a poor work-around

Write test for error'd import items handling
- quite a big test: marks the first item of each incoming chunk as an error, making sure those error'd items don't show up in future item reads unless errors are specified
- implemented error flagging on the mock cache
- also unified the initial state calc needed for each test (put in `beforeEach`)

Rename all classes, update readme + fix TS type errors
- classes all renamed according to new README diagram
- README text updated accordingly (and to include new `Cache` and `DataSources` classes)
- some TS type issues in the state-manager tests fixed
- still some weird TS-related issue at runtime with `checkWithBlacklist` (seems to still work tho; to look into)

Diversify import item derivation test input sources
- rewrote the existing tests inside a function to be run with custom data inputs
- found a 1000+ URL list to use as an additional source (more fun in there)
- run all import tests on more diverse combinations of bm/hist input sets

- will write tests for this part next; this makes a lot of external stuff afford being mocked
- also removes coupling with local storage (moved to conn handler; doesn't belong here)

Write mock for import ItemProcessor class
- this is the main XHR + send request to search-index part of imports
- should be N of these at any one time, where N is concurrency
- trying to figure out how to replace the `process` method with setTimeout to simulate some time

Set up imports progress manager tests
- confused with getting the fake setTimeout working with jest
- `runAllTimers` doesn't seem to be working as explained here: https://facebook.github.io/jest/docs/en/timer-mocks.html
- the following expects fail as those mock cbs are never called
- obviously I'm doing something wrong; need to look into it more

Immediately resolve import processor mock
- fake timers were not working properly for whatever reason
- now Processor.process immediately resolves instead, which allows us to test the observers being called, but it's still hard to test in-progress state

Add checking of concurrent processors to progress tests
- also updated the mock cache so it's not fixed at a chunk size of 10 (which meant concurrency > 10 did nothing in tests)

Write tests for interrupting imports progress
- again not as nice, as the timers aren't working
- basically starting off the importer, then immediately stopping and making sure all the concurrent processors are set to cancelled and none of the observer cbs are called

Write imports progress restart-after-interruption tests

Fix bug with tsc transpiling async/await instead of babel
- es6 target was telling tsc to handle our async/await code; we've already got babel to do that, though, and they don't seem happy working together sometimes
- tsc now targets ESNext, so it will ignore a lot more ES features and leave them to babel

Set up separate tsconfig for jest
- we're not using Babel after tsc for ts test modules
- hence we need to target lower for jest to run it, as babel isn't going to transpile stuff

Add skipped big (4000+) imports progress test
- takes a while to run, so skipping it
- maybe look at optional tests or something (skipped tests will never run until the test is updated)
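The pattern those progress-manager tests exercise, N concurrent processors pulling items and reporting to observer callbacks, with an immediately-resolving mock `process` step in place of the real XHR, can be sketched like this. All names here are hypothetical stand-ins, not the actual Memex classes:

```javascript
// Toy sketch of the progress-manager idea (hypothetical names): N concurrent
// "processors" drain a shared queue and report each finished item to an
// observer; flipping `state.cancelled` stops further work. The real
// ItemProcessor does an XHR + search-index request; the mock "process" step
// here just resolves immediately, as described in the commits above.
async function runImport(items, concurrency, observer, state = { cancelled: false }) {
    const queue = [...items];

    async function worker() {
        while (queue.length && !state.cancelled) {
            const item = queue.shift();
            await Promise.resolve(); // mock Processor.process: resolve immediately
            if (!state.cancelled) {
                observer.next(item);
            }
        }
    }

    // Spawn `concurrency` workers that drain the queue in parallel
    await Promise.all(Array.from({ length: concurrency }, () => worker()));
    observer.complete();
}
```

Because the mock resolves immediately, tests can assert on the observer calls without fake timers, which matches the workaround described above; the trade-off, also noted above, is that in-progress state is harder to observe.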
- make sure it works for both counts and actual progressing through the created items (no URL gets iterated twice)
- should be a Set of URL strings rather than encoded page IDs
- put a simple decode call in and removed the old unused `trimPrefix` arg
- after getting the tags/domains-filtered URLs (fast), it was doing a range lookup over the visits index and only keeping those from the prev filter (slow; worst case is linear in visit index size)
- no need to do a range lookup at all; just get the latest events for those already-filtered URLs and paginate!
- for terms search, it already performs in log time
- URLs are not unique in the visits index (compound PK on time + URL, as each page can have many visits)
- this meant the existing `.eachPrimaryKey` iteration on the query result could take quite long if there were many visits to single pages (my memex page had hundreds of visits)
- seems a hell of a lot faster just doing N parallel lookups on URL and getting the first one that passes criteria (within time filters)
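The per-URL lookup strategy above can be sketched with a plain in-memory stand-in for the visits table (the real code goes through Dexie indexes; the data shape and names here are assumptions):

```javascript
// Sketch: instead of one range iteration over the whole compound [time+url]
// visits index, do one parallel lookup per already-filtered URL and keep only
// that URL's latest visit inside the time bounds. `visits` is a hypothetical
// in-memory stand-in for the Dexie visits table.
async function latestVisits(visits, urls, { startTime = 0, endTime = Infinity } = {}) {
    const lookups = urls.map(async url => {
        const matching = visits
            .filter(v => v.url === url && v.time >= startTime && v.time <= endTime)
            .sort((a, b) => b.time - a.time);
        return matching[0]; // latest visit passing criteria, or undefined
    });
    // Drop URLs with no visit in the filtered time range
    return (await Promise.all(lookups)).filter(Boolean);
}
```

With N already-filtered URLs this does N small indexed lookups instead of scanning every visit, which is why it helps most when single pages have hundreds of visits.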
- untracked tabs aren't supported yet, but in FF will throw a bunch of "Tab not found" errors; expected but should be caught
- search response shape changed slightly in new index work to be less "Pouch-like"
- previously forced this to make Dexie work nicely with FF
- it seems to mess with a lot of other stuff in FF though, like content_script no longer being able to access local storage without the Promise rejecting for "unknown reason"
- taking it out, indexing still seems to work fine (FF 59) and all the promise-related bugs go away
- minor cleanup of some derived imports UI state too (imports UI really needs a work-over; it's gotten quite bad)
- bg script logic to handle forced recalc
- still left-over stuff for old-ext-items-specific state
- start import btn disable-state derivation simplified
- this module handles differentiating the listeners of different notifications via a tiny bit of state (only a single event shared between all notifs; switch on IDs)
- refactored all existing notif creation calls to use this module
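The "single shared event, switch on IDs" idea can be sketched as below. This is only an illustration of the pattern with made-up names, not the actual module:

```javascript
// Sketch of the notifications module idea (hypothetical names): a single
// shared click listener dispatches to per-notification handlers by ID, so
// only one listener ever needs to be registered with the browser's
// notifications.onClicked event.
const handlers = new Map();

function createNotification(id, onClick) {
    handlers.set(id, onClick);
    // In the extension, this is also where the browser notifications API
    // would be called to actually display the notification.
}

// The one function registered as the shared onClicked listener
function dispatchClick(id) {
    const handler = handlers.get(id);
    if (handler) {
        handler(id);
    }
}
```

Clicks on notifications without a registered handler are silently ignored, which keeps the single listener safe to share.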
- just suggestion text for now; will change
- links to main knowledge base page until we write an article/blog post
Force-pushed from 72eed15 to f976ca4
- rewrite new notifications module in TS to take advantage of it; seems to work well
- the reverse can be replaced by simply setting the redux reducer to prepend the details rows whenever a new item is finished
- now it should show up on any search that has at least one of: terms, domains, or tags filters defined
- also made sure not to show it when loading (doesn't flash in and out now)
- removed the old `getTotalCount` search param; it's always 'true'
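The visibility condition described above amounts to a small predicate; a minimal sketch with hypothetical names and query shape:

```javascript
// Sketch of the described condition: show the total result count only when
// the query has at least one terms, domains, or tags filter defined.
// The parameter shape here is an assumption, not the actual search params.
function shouldShowTotalCount({ terms = [], domains = [], tags = [] }) {
    return terms.length > 0 || domains.length > 0 || tags.length > 0;
}
```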
- TODO: url boost
- this was working slightly differently between the implementations (rounding up vs down); now fixed
- similar to prev.: added boosted title terms search
Force-pushed from 4185616 to af6a724
Ran a few tests and it all works pretty well :) A couple of things, and then I think we can merge it (one might be a bit bigger). The list is in order of priority:
List of all the stuff that has been/needs to be redone. No user-facing behaviour or features should change.
Data model
The data model is laid out as Dexie models in here, and the Dexie index schema is defined here
missing screenshot + favicon Blobs
Adding stuff
Deleting stuff
Tags specific
Simplified this interface - omitting unused methods
Bookmark specific
Utilities
Page
model instance or undefined
Imports
Search
result count for no-terms search (filters) (not really feasible)
Known bugs
typing 1 letter in the popup search/tags inputs results in an infinite typing loop (can't see how this is related yet...) (seems to be my browser; other ext popups doing this too)
Misc TODOs
search-index-new/index
Other things?