
[WIP] Dexie-based search index implementation #322

Merged · 48 commits · Apr 11, 2018
Conversation

poltak
Member

@poltak poltak commented Feb 28, 2018

List of all the stuff that has been/needs to be redone. No user-facing behaviour or features should change.

Data model

Data model is laid out as Dexie models here, and the Dexie index schema is defined here

  • Page
    • URL PK (0-1 page per URL)
    • indexes on [title/URL/content] terms (full text search)
    • index on domain (suggestions)
    • missing screenshot + favicon Blobs
  • Visit
    • time + URL compound PK (N Visits per 1 Page)
    • index on URL (find all visits for page)
  • Bookmark
    • URL PK (0-1 per URL)
  • Tag
    • name + URL compound PK (0-N Tags per 1 Page)
    • index on name (suggestions)
    • index on URL (find all tags for page)
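The model above maps naturally onto Dexie's schema string syntax. A minimal sketch (table and index names here are illustrative, not necessarily the actual Memex code):

```javascript
// Hypothetical Dexie schema mirroring the data model above.
// The first entry of each string is the primary key; '[a+b]' declares
// a compound key, '*' a multi-entry index (one index row per array element).
const schema = {
    pages: 'url, *terms, *titleTerms, *urlTerms, domain',
    visits: '[time+url], url',
    bookmarks: 'url',
    tags: '[name+url], name, url',
}

// With Dexie available, this would be registered as:
//   const db = new Dexie('memex')
//   db.version(1).stores(schema)
```

The multi-entry `*terms` indexes are what make the full-text terms lookups possible, while `[time+url]` gives Visits their compound primary key (N visits per page).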

Adding stuff

  • add page (+ bookmark, visits) by URL
    • main page indexing method; pages should only exist with at least a visit or bookmark
  • add page terms by URL
  • add visit interaction data by URL + time
    • stage iii/final of page visit scenario
  • add visit by URL
    • used for recent revisits (no page data re-indexing)

Deleting stuff

  • delete pages by URLs
    • overview delete
  • delete pages by domain
    • opt. popup page on-blacklist-update delete hook
  • delete pages by pattern
    • opt. options page on-blacklist-update delete hook

Tags specific

Simplified this interface - omitting unused methods

  • add tag for URL
    • tags dropdowns in popup + overview pages
  • delete tag for URL

Bookmark specific

  • add bookmark by URL
    • via overview or popup
    • includes logic to extract data from tab if no page exists (popup-only)
  • delete bookmark by URL
    • unbookmarking via overview or popup
  • browser bookmark create listener
    • happens when user adds a bookmark via browser, not the ext.
    • may do a remote fetch (same as imports) if the page does not exist (user goes to bookmark mgmt, bookmarks a completely new page)

Utilities

  • get page by URL
    • takes URL, returns Page model instance or undefined

Imports

  • grab already stored history + bookmark URLs (perform set difference with browser data sources)
    • would be nice to see if we can now replace the linear in-mem diff with a Dexie query
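Whether or not this moves into a Dexie query, the diff itself amounts to a simple set difference between the browser's data sources and what's already stored; a minimal sketch (function name hypothetical):

```javascript
// Hypothetical sketch: diff browser history/bookmark URLs against the
// URLs already stored in the index, keeping only new items to import.
function findNewImportItems(browserUrls, storedUrls) {
    const stored = new Set(storedUrls)
    return browserUrls.filter(url => !stored.has(url))
}

// e.g. findNewImportItems(['a.com', 'b.com'], ['a.com']) → ['b.com']
```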

Search

  • terms search
  • filter by time (bm + visit events)
  • filter by bookmarks
  • filter by domain
  • filter by tag
  • blank search
  • blank search + filters
  • total (not current page) result count
  • result count for no-terms search (filters) (not really feasible)
  • omnibar search working
  • auto-complete/suggest domains
  • auto-complete/suggest tags
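Conceptually, the terms search above boils down to intersecting the URL sets matched by each query term via the multi-entry term indexes. A hypothetical sketch over a plain inverted index (not the actual Dexie query code):

```javascript
// Hypothetical sketch of conjunctive (AND) terms search over an
// inverted index: Map of term -> Set of page URLs containing that term.
function searchTerms(index, terms) {
    const matches = terms.map(t => index.get(t) ?? new Set())
    if (matches.length === 0) return []
    // Intersect all per-term URL sets; any unindexed term yields 0 results.
    return [...matches.reduce((acc, s) => new Set([...acc].filter(u => s.has(u))))]
}

const index = new Map([
    ['hammer', new Set(['a.com'])],
    ['time', new Set(['a.com', 'b.com'])],
])
// searchTerms(index, ['hammer', 'time']) → ['a.com']
// searchTerms(index, ['rhammer', 'time']) → []
```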

Known bugs

  • imports creates a visit for import time
  • typing 1 letter in the popup search/tags inputs results in infinite typing loop (can't see how this is related yet...) (seems to be my browser; other ext popups doing this too)
  • display times in overview are always the latest bookmark/visit even if time filter used (display issue only)
  • various imports-related issues; progress sometimes isn't halted properly, causing it to continue running lots of imports in the background (probably not related)
  • search slows down sometimes for some unknown reason; script refresh fixes it. Maybe something with Dexie
  • FF Promise breaks FF

Misc TODOs

  • update old popup code dependent on old page structure
  • go over any remaining usages of Pouch
    • lots of old WebMemex parts highly coupled to Pouch
    • need to refactor within the old search index code and wrap in some index method
  • ensure no more usages of old page IDs exist in codebase (outside of old index dir)
    • now simply using URL (which page ID was derived from in an overly complicated way)
  • cleanup the Storage class and how it relates to the interface in search-index-new/index
  • ensure all the old search-index works after refactoring
    • lots of modules highly-coupled with Pouch meant big refactoring to move it into old index dir
    • get some data on older version, then update to this branch and everything should behave the same
  • go over everything; at least this time a lot less code than prev. implementation
  • write some unit tests for index methods
  • squash this branch

Other things?

@blackforestboi
Member

Wow, such great work @poltak. As already mentioned, when testing it 2 days ago I could finish 15k documents in about 1.5 hours :)

But I also ran into a rather nasty bug:

How to reproduce:

  1. Let the importer run
  2. After a while it hangs, RAM shoots to almost 1 GB and CPU to 240%
  3. The only way out is restarting the extension. In the latest test even that didn't help and I could not progress with the imports anymore.
    To get it working again, I had to change the blacklist to force a recalculation.
    If you guys can't reproduce, I'd be happy to hop on a Skype session and test with you what is necessary.

Also, what we talked about before: does it make sense to include a term-level concurrency mode, so that a search request never has to wait for a page to be indexed?

@blackforestboi
Member

Addition: While the importer is running, RAM usage seems to increase gradually, so perhaps some things aren't being properly released and are clogging memory?

@poltak
Member Author

poltak commented Feb 28, 2018 via email

@blackforestboi
Member

so worrying about indexing individual terms is no longer relevant

Does this mean Dexie handles prioritisation between search requests and indexing itself?
Just want to prevent search slowing down under high indexing load (a very long article or anything like that), the way it does now.

@poltak
Member Author

poltak commented Mar 1, 2018

Does this mean Dexie handles prioritisation between search requests and indexing itself?

Dexie handles it at another level, which it calls transactions. These are like a set of DB ops that either all happen or none happen (if something goes wrong, it's rolled back). In our ext, things like adding a new page or searching are each a single transaction. Dexie will schedule them one after another if they write to the same table, but apparently allows them to run in parallel if they only read (search only needs to read).
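This is not Dexie's actual implementation, but the scheduling idea can be illustrated with a toy per-table scheduler, where 'rw' transactions chain sequentially and read-only ones start immediately:

```javascript
// Toy sketch of Dexie-like transaction scheduling: 'rw' transactions
// on the same table are chained sequentially, while 'r' (read-only)
// transactions run immediately, in parallel with anything else.
class TableScheduler {
    constructor() {
        this.writeTail = Promise.resolve()
    }

    transaction(mode, fn) {
        if (mode === 'r') return fn() // reads never queue
        // writes wait for the previous write to settle before starting
        const run = this.writeTail.then(fn, fn)
        this.writeTail = run.catch(() => {}) // keep the chain alive on errors
        return run
    }
}
```

So a search ('r') issued mid-indexing starts right away, while two page-add transactions ('rw') on the same table execute one after the other.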

That big 6k+ terms United States wiki article now takes only ~400ms for me to index the terms, versus >2s on the master version. Seems to be a good improvement in terms indexing. Would be good to play around with a bigger DB once the imports issues are sorted and see how it scales.

@blackforestboi
Member

Just tested your updates and wanna leave here that the problem with the stuck import still persists.

Thought you maybe have fixed it with c71ddeb

Could you reproduce it?

@poltak
Member Author

poltak commented Mar 1, 2018

Haven't looked at this yet @oliversauter. It's on the radar

@poltak poltak force-pushed the dev/dexie-search-index branch from c71ddeb to 1071408 Compare March 2, 2018 01:08
@poltak poltak force-pushed the feature/swappable-search-index branch from 89c4f37 to 712fd0a Compare March 2, 2018 06:47
@poltak poltak force-pushed the dev/dexie-search-index branch from b62f514 to a51d09c Compare March 2, 2018 06:51
@poltak
Member Author

poltak commented Mar 2, 2018

@oliversauter spent a bit of time playing with imports; managed to import ~2k pages pretty quickly at 20 concurrency and it stayed at fairly constant resource overhead the whole time. Didn't get into any hang or stop. We should talk about it briefly in call tonight if still a prob.

I found one possibly related issue: sometimes imports progress isn't properly stopped when pause/cancel is pressed. The UI looks like it stopped fine, but in the background it's still downloading. Which isn't good, as it could continue through the entire history, leaving the user confused about why the browser is going very slow.
Not reproducible yet; I noticed it happen once, spent ages trying to trigger it again, then it happened later when I wasn't expecting it. It shouldn't be related to the new index though (imports aren't really touched); probably a bug from the recent changes to imports state, but I'll see if I can get a fix in here.

@blackforestboi
Member

blackforestboi commented Mar 2, 2018

Ok, another set of clues:
The importer no longer hangs to the point of crashing the application, but at some point it does get stuck and progresses only very slowly.
The only output then is a few errored-out URLs which actually should not error out.
Most of them are "data fetch failed"
[screenshot: 2018-03-02 12:32]

I am not sure, but maybe it has something to do with us sending too many requests?
Do we handle all HTTP cases well?

EDIT: after reinstalling the extension it works again, so it's probably less to do with the number of requests we send?

@blackforestboi
Member

Another round of reviews:

  1. Clicking a tag filter on a result element while a search term is already typed does not work: it adds the #test tag to the search, and only after clicking the search field and adding a space does it put the tag into the filter.
  2. When a page has been revisited, I expect it to move to the top of the empty search. (Test: import, then visit one of the pages, reload the search overview after the visit has been logged.)

[screenshot: 2018-03-03 16:38]

  3. Scoring for titles and URLs does not work yet. Expect the Erdogan article to be on top.

[screenshot: 2018-03-03 16:39]

  4. Sometimes when I search too fast and it would result in an empty search result, it resorts to showing the results for no term entered.

  5. The results counter does not work properly; it shows only the number of results that have already streamed in.

  6. Adding tags to pages when navigating in the same tab is not possible. The first page works, everything after does not.
    Test: 1) Open this article, try to tag; should work. 2) Find a Guardian link to another article, follow it, try to tag; the tag field is greyed out. In prod it works.

@poltak
Member Author

poltak commented Mar 4, 2018

@oliversauter search oddities are expected until I get around to going over all that code this week and verifying things. Gonna be lots of little things to fix up.

6 and 5 work fine for me. 6 will be greyed out if you open the popup before the initial indexing happens (DOM load + a few hundred ms); no changes there. 5 should show the correct total count as long as there are terms entered; all looks correct for me with 4k pages to search through, but I'll keep an eye out for anything wrong. Planning to look into the possibility of counts for non-terms search this week.

4 and 1, like imports, are not really related to these changes, but I'll look into including fixes for them once all the other stuff is done.

One big issue to mention is that the new index currently won't run under Firefox, due to IndexedDB/Dexie transactions not working with FF's native Promise implementation. Should be able to get around it easily by updating our build to replace that code, at least for FF. This is the main thing I'm playing around with, along with continuing the implementation. More info upstream.

@oliversauter I'll let you know when I need any more manual testing or feedback just to save time and effort

@ShishKabab
Member

Looks very good John, lots of great work done here!

Regarding the testing here's some thoughts:

  • Usually I try to hard-code data, or write functions in the tests themselves to generate it (if those functions can be kept simple enough, since I also try to minimise logic in tests). One example of this is createId() in runSuite(). Maybe we break normalizeUrl() at some point, or we break the external API by changing the ID generation, and the tests won't catch it the way they're currently written. It's a debatable point, because writing the tests this way might need more work both up front and while maintaining a changing codebase, but it will probably increase the usefulness of the tests.
  • We cannot assume the tests in mutation tests are executed in order. Every test() is meant to be a complete scenario. For readability, you can split the tests into different functions, maybe with a certain naming convention, which will show up in the stack trace if something goes wrong.
  • The mapResultsFunc() kind of worries me. The new and old index should have the same API right? Here I may be misunderstanding something, but to my understanding the rest of the codebase should not know if the new or old index is used.
  • I'd rename expected{1,2,3} to something more descriptive.
  • The call to page.loadRels(): 1) Is this symmetric with the old index API? 2) This is an implementation detail resulting from the use of Dexie, and thus should not be part of the public API. If you want to test the internals of the new index, I'd recommend moving the test inside search-index-new. Also, I'd recommend prefixing that method with an underscore, or making an entirely new object that wraps the internal one. Maybe we should have a clear policy about these things: how to hide internals and prevent people from using APIs we do not intend to support in the long run.
  • As a naming convention, I usually write top-level constants like TEST_PAGE_1 instead of page1 and visit1. This makes it clearer that these things are shared between multiple tests.
  • What we could do as a convention is to place test data in a .test.data.js file, like I've done in the dexie-import branch. We need to decide whether we want to use it in the tests in one of two ways: 1) import * as TEST_DATA from './index.test.data'; TEST_DATA.page1 or 2) import { TEST_PAGE_1, ... } from './index.test.data'. Thoughts on this?
  • Not sure if this matters, but my mental model would say that visit1 < visit2 < visit3
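The first point (hard-coding expectations rather than deriving them) can be illustrated with a hypothetical `normalizeUrl()`, standing in for the real ID-derivation code:

```javascript
// Hypothetical stand-in for the real normalizeUrl() used in ID generation.
const normalizeUrl = url => url.replace(/^https?:\/\//, '').replace(/\/+$/, '')

// Fragile: if normalizeUrl() regresses, the expected value regresses
// identically, and the test keeps passing:
//   expect(page.url).toEqual(normalizeUrl('https://example.com/'))
//
// Robust: hard-code the value the external API promises:
//   expect(page.url).toEqual('example.com')
```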

One other question:

  • When writing the import/export functionality, I noticed the code to retrieve screenshots from Pouch was gone. Will the old index keep working as expected?

@blackforestboi
Member

Found a bug with the domain filters.

  1. When searching for nytimes.com in the address bar, it shows some results that are definitely not from nytimes.com, nor do they have this term in the text.
    Seems to happen for all domains entered.

[screenshot: 2018-03-08 20:42]

  2. Also weird: if I click on "show more results" it shows only one of the results:

[screenshot: 2018-03-08 21:17]

@poltak
Member Author

poltak commented Mar 9, 2018

@oliversauter good find! I had totally forgotten about implementing domains filter extraction from queries (different to the UI filter). Wrote a test to confirm it indeed works on old but fails on new index. Then wrote the code to ensure it now passes. Domains search without terms seems pretty slow right now (seems linear to something).
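The query-time extraction being described, pulling domain-looking tokens out of the raw query so they act as filters rather than search terms, could look roughly like this (a hypothetical sketch, not the actual Memex parser):

```javascript
// Hypothetical sketch: split a raw query into domain filters and
// plain search terms, based on a loose "looks like a domain" check.
const DOMAIN_RE = /^\w[\w.-]*\.\w{2,}$/

function extractDomainFilters(query) {
    const domains = []
    const terms = []
    for (const token of query.trim().split(/\s+/)) {
        ;(DOMAIN_RE.test(token) ? domains : terms).push(token)
    }
    return { domains, terms }
}

// extractDomainFilters('nytimes.com erdogan')
//   → { domains: ['nytimes.com'], terms: ['erdogan'] }
```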

Thanks a lot for the tests feedback @ShishKabab !

ID gen

I like your points on hardcoding here instead of calling those ID deriving fns. The URL normalization stuff should get their own tests and be considered an unrelated side-effect to the actual index tests. Don't want to call them directly here. Although note that a few of the index methods still depend on them internally (anything that accepts a URL as param will transform it into an ID). Done in
633f46c

dependent tests

Yeah, this was something I felt was a bit yucky. Have changed those mutation tests (now calling them "read-write ops tests") so that the test data gets reset before each of them, and so they don't make any assumptions about data based on other tests. Also ensured they all do assertions on results both before and after the write ops. All the outer tests are now grouped under "read ops tests", which don't need to reset data each time. 35c2599

test data

Updated names, etc. I like the idea of the separate test data modules. May make tests less clear, but I think most editors should be able to easily show what's in the imported test data for reference. Should the expected values live with test data? In this case, we only need to assert matching IDs, so the expected values are all shared. 633f46c

page.loadRels() thing

Yes, this was a big code smell. It is implementation-specific to the Dexie Page model, and should be encapsulated in the new index method implementations. d2dcb45 Nice find!

mapResultsFunc

Yes, this was weird. Basically it's the post-search stage where the constant-size display data fetching for results happens, but the input shapes differed between the old and new index (the output, what gets sent to the UI, is always the same however). Updated the old index to be in line with the new index's intermediate results shape, and now no differences are needed between the tests. 7b63ea7

I noticed the code to retrieve screenshots from Pouch was gone. Will the old index keep working as expected?

Do you remember the weird way it was implemented before where the images were fetched from Pouch in the UI layer?
This is all encapsulated behind the index interface now, and the search result objects include optional screenshot and favIcon
string properties which are just the image data URIs (displayed directly). In the old index, this is now in the map-search-to-pouch stage
(same ops as before, just not in the UI), while in the new index, it just grabs the Blobs stored with Pages in Dexie, and serializes to data URI.
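The Blob-to-data-URI step in the new index could look roughly like this. A sketch using Node's Buffer in place of the browser's FileReader; names here are illustrative, not the actual Memex code:

```javascript
// Hypothetical sketch: serialize a stored screenshot/favicon Blob into a
// data URI string usable directly as an <img> src in the UI.
async function blobToDataUri(blob) {
    const buf = Buffer.from(await blob.arrayBuffer())
    return `data:${blob.type};base64,${buf.toString('base64')}`
}

// For a PNG screenshot Blob, this yields something like
// 'data:image/png;base64,iVBORw0...'.
```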

@ShishKabab
Member

Looks very cool John :)

PouchDB attachments

Ah, missed that the code was there :) Have seen the code in the mapping function now, so looks good. Very happy to see that code get out of the UI layer!

Test data

I'd say that most of the time you expect something you have put in, so it's in the data file anyway. For expected stuff that you didn't put in, I'd say to leave those things either in the tests themselves or in the test file. Mentally it makes the most sense for me like that.

@blackforestboi
Member

good find! I had totally forgotten about implementing domains filter extraction from queries

:) Ok, seems to be the case for the #tags filter too.

@blackforestboi
Member

blackforestboi commented Mar 14, 2018

Found another search weirdness:

How to reproduce:

  1. Search for a term1 that is not indexed yet > get no results (expected) (my example: hammer)
  2. Search for term1 and term2 which is in a title > get a couple of results, but only the title matches (my example: hammer time)
  3. Search only for term2 > get all results of that term, including title matches. (my example: time)

@blackforestboi
Member

Just tested the current version on Chromium and it didn't start the downloads, showing the following error:
[screenshot: 2018-03-14 18:19]

@poltak
Member Author

poltak commented Mar 15, 2018

@oliversauter RE that error: are you sure you weren't on dev/dexie-import branch? This is a known build error @ShishKabab brought up with me yesterday, however it looks like a commit has been pushed to that branch to fix it now. If you're sure it was this branch (dev/dexie-search-index), and it's still happening, let me know. Building and installing ok here

Bit confused with the search bug you reported. step 2 seems to contradict step 1: hammer has results in step 2, but in step 1, it's yet to be indexed - is there a missing in-between step? What are the expected and actual results you have for each step? I can try writing a test to cover this case with some more specifics and see what needs to be fixed.

@blackforestboi
Member

@poltak
Re Error: Yeah sure I am on the right branch.
[screenshot: 2018-03-15 08:35]
Building and installing works, it happens when starting the import process.

RE: bug:
By now I had to delete this GitHub issue page with 'hammer' in it; trying the same search again, it didn't show any results anymore for 'hammer time', only for 'time'.
http://recordit.co/OQ44bXaby5

Also happens to me on a complete reinstall, using 'rhammer time' as the words ('rhammer' is on purpose).

@ShishKabab
Member

This is a known build error @ShishKabab brought up with me yesterday

That one actually was Symbol.asyncIterator not being defined ;)

@poltak
Member Author

poltak commented Mar 16, 2018

That one actually was Symbol.asyncIterator not being defined ;)

Yeah, I was wrong. This one is about a missing Symbol.iterator prop on an object, which is what standard for...of loops use to iterate. So here some for...of is trying to iterate a non-iterable object. This one has me confused though: I have tried Chrome 65 (stable), 66, and 67, and also tried the corresponding Chromium builds just in case, but cannot get it to reproduce. I've also gone through all the uses of for...of, including the one in that error message, but cannot see any issues.
I presume this is something to do with the Babel setup changes that were needed in this branch, as there haven't been any other related changes; should be fairly simple to look into once I can reproduce it.
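The class of error being described can be reproduced and fixed minimally (a generic illustration, unrelated to the actual Memex object involved):

```javascript
// for...of requires the object to implement Symbol.iterator;
// a plain object does not, and throws a TypeError.
const plain = { a: 1, b: 2 }

let threw = false
try {
    for (const x of plain) { /* never reached */ }
} catch (err) {
    threw = true // TypeError: plain is not iterable
}

// Fix: give the object an iterator (or iterate Object.values(plain)).
const iterable = {
    ...plain,
    *[Symbol.iterator]() { yield* Object.values(this) },
}
const collected = [...iterable] // → [1, 2]
```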
@oliversauter what is the Chrome/ium version that this is happening for you on?
@ShishKabab does this happen for you? If so, on which browser?

RE search bug from @oliversauter:
Including an unindexed term (that isn't a stop/filtered-out word) should return 0 results regardless of any other terms in the query, going by my understanding of search. Let me know what is your expected behaviour here and I'll write a test and look to see how we can change it

@blackforestboi
Member

should return 0 results regardless of any other terms in the query,

Yeah at least for now, we should stay with that behaviour.

@blackforestboi
Member

what is the Chrome/ium version that this is happening for you on?

On this branch, I don't get the error: https://github.com/digi0ps/Memex/tree/search-injection/src

This is my version: https://github.com/digi0ps/Memex/tree/search-injection/src

@poltak
Member Author

poltak commented Mar 19, 2018

@oliversauter These are completely different branches. If reproducible for you on a particular Chrome/ium version, send me that number and I'll see if I can get it to reproduce

@blackforestboi
Member

These are completely different branches.

Yeah aware of that, just giving reference to where it happens and where not, so the error might be easier to track.

Oh I thought I posted my Chromium version last night in a separate comment. Didn't send it off.
Version 62.0.3202.94 (Developer Build) (64-bit)

poltak added 18 commits April 6, 2018 13:01
- we haven't updated in about a year; lots of fixes made since then apparently, including some things we've seen, like disconnected port errors
- good example is overview: everytime you change the search, URL state updates and tries to update a visit
Skip imports persisting any periods in history with no data

- history extracted in week periods
- if one week has no history (or all history is deduped), an empty chunk would be stored
- this may or may not be an issue for the reading end during imports progress when an empty chunk is retrieved

Update import-item-creation generators to be consistent

Update import-state's _getChunk to be async generator

- I don't think it really makes a difference, but it seems much more natural to use generators all the way down here, rather than yielding resolved Promises

Ensure any import items finished after pause sent to UI

- they will be finished anyway, but the user will have no idea about them as the message never got sent
- this way the message will always send, even if it's after the time the user presses pause btn
- bit of an overview of purpose, architecture, and responsibilities of each of the parts

Update import readme diagram

- added in new web ext API abstractions
- removed "import" prefix in front of everything
- update readme text later
Fix up incorrect class prop JSDoc typings

- class props use @type rather than @property for some reason

Refactor-out cache-logic from import-state class

- cache logic handles messy allocating into chunks and storing in local storage
- now the import-state just concerned with being an interface to fetch, and remove import items
- also removed import-item-creation's reliance on import-state

Remove import-conn-handler dep on import-state

- it only needs to interface with it to get estimates; afford this through progress-manager

Remove import-state's rehydration of allowedTypes state

- i really don't think this does anything; it will be init'd from progress-manager

Abstract web ext API data sources behind class

- now `import-item-creation` accesses this interface rather than directly accesses the APIs
- still some cleanup and confirmation to do with this

Remove old ext migration from imports

- lots of code removal; should have gotten all, but may be missing things

Simplify import data source abstraction

- provides interface between web ext API data sources for `import-item-creation`

Simplify import item creator

- move all data stuff to DataSources class (root ID BM tree still generated)
- DRY out the iteration code now that DataSources provides the same API for bookmarks and history

Improve way ItemCreator is passed around ImportState

- prev was creating a new instance whenever cache was empty
- now you can pass in a custom instance to ImportStateManager and it will tell that instance to reinit its data if necessary
- also fix regression with counts doubling

Revert "Ensure any import items finished after pause sent to UI"

This reverts commit 316b496.

- it would mean that XHR errors that throw on being aborted would be flagged as error'd items, so removed from the import item pool
- doesn't seem to be a way to differentiate the error on the XHR's error event

Ensure import state estimate counts init'd from cache

# Conflicts:
#	src/background.js
Write mocks for imports cache and item-creators

- will let me test import-state without relying on those classes (item-creator to be tested separately; cache is an interface with local storage)

Update import class inputs for ease of re-use

- all constructors now accept objects
- most have defaults for the general flow, but tests can override certain things

Add mocks for hist/bms, blacklist + exist.keys lookups

- DataSource class can be mocked as a whole for hist/bms (instead of ItemCreator)
- blacklist is a separate thing, so just mock to return false for every item
- logability check also mocked in same way (return true for all)

Write tests for estimate counts derivation (w/wo cache)

- painful but finally getting it working

Write tests for import item iteration and marking off

- iteration used as main progress thing -> keep requesting new chunks to go through
- marking off happens after each item is processed; they are removed from their chunk

Update tsconfig lib to es2017

- we're making use of es2017 stuff in rest of codebase and compiling with babel

Set up URL list for history/bookmarks test sources

- found this project which makes it simple: https://github.com/citizenlab/test-lists
- now tests have history size of ~280; some fun URLs in there too
- updated the removal test to be more thorough; remove lots of items from each chunk and calculate the expected changes

Add mock for URL normalization module

- messes with lots of tests
- replaced extra input param on ItemCreator that was a poor work-around

Write test for error'd import items handling

- quite a big test, marks the first item of each incoming chunk as an error, making sure that those errord items don't show up in future item reads, unless errors specified
- implemented error flagging on the mock cache
- also unified the initial state calc needed for each test (put in `beforeEach`)

Rename all classes, update readme + fix TS type errors

- classes all renamed according to new README diagram
- README text updated accordingly (and to include new `Cache` and `DataSources` classes)
- some TS type issues in the state-manager tests fixed
- still some weird TS-related issue at runtime with `checkWithBlacklist` (seems to still work tho; to look into)

Diversify import item derivation test input sources

- rewrote the existing tests inside a function to be run with custom data inputs
- found a 1000+ URL list to use as an additional source (more fun in there)
- run all import tests on more diverse combinations of bm/hist input sets
- will write tests for this part next; this makes a lot of external stuff afford being mocked
- also removes coupling with local storage (moved to conn handler; doesn't belong here)

Write mock for import ItemProcessor class

- this is the main XHR + send request to search-index part of imports
- should be N of these at any one time, where N is concurrency
- trying to figure out how to replace the `process` method with setTimeout to simulate some time

Set up imports progress manager tests

- confused with getting the fake setTimeout working with jest
- `runAllTimers` doesn't seem to be working as explained here: https://facebook.github.io/jest/docs/en/timer-mocks.html
- the following expects fail as those mock cbs never called
- obviously I'm doing something wrong; need to look into more

Immediately resolve import processor mock

- fake timers were not working properly for whatever reason
- now Processor.process immediately resolves instead, which allows us to test the observers being called, but now still hard to test in-progress state

Add checking of concurrent processors to progress tests

- also updated the mock cache so it's not fixed at chunk size of 10 (means concurrency > 10 does nothing in tests)

Write tests for interrupting imports progress

- again not as nice as the timers aren't working
- basically starting off the importer, then immediately stopping and making sure all the concurrent processors are set to cancelled and none of the observer cbs are called

Write imports progress restart after interruption tests

Fix bug with tsc transpiling async/await instead of Babel

- the es6 target was telling tsc to handle our async/await code; we've already got Babel doing that, and they don't always seem happy to work together
- tsc now targets ESNext, so it will ignore a lot more ES features and leave them to Babel

Set up separate tsconfig for jest

- we're not using Babel after tsc for ts test modules
- hence we need to target lower for jest to run it, as babel isn't going to transpile stuff

Add skipped big (4000+) imports progress test

- takes a while to run, so skipping it
- maybe look at optional tests or something (skip will never run until test updated)
- make sure it works for both counts and actual progressing through the created items (no URL gets iterated twice)
- should be Set of URL strings rather than encoded page IDs
- put a simple decode call in and removed old unused `trimPrefix` arg
- after getting the tags/domains filtered URLs (fast) it was doing a range lookup over visits index and only keeping those from prev filter (slow; worst case is linear to visit index size)
- no need to do range lookup at all; just get the latest events for those already filtered URLs and paginate!
- for terms search, it already performs in log time
- URLs are not unique in visits index (compound PK on time + URL, as each page can have many visits)
- this meant the existing `.eachPrimaryKey` iteration on the query result could be quite long if many visits to single pages (my memex page had hundreds of visits)
- seems a hell of a lot faster just doing N parallel lookups just on URL and getting the first one that passes criteria (within time filters)
- untracked tabs aren't supported yet, but in FF will throw a bunch of "Tab not found" errors; expected but should be caught
- search response shape changed slightly in new index work to be less "Pouch-like"
- previously forced this to make Dexie work nicely with FF
- it seems like it messes with a lot of other stuff in FF though, like content_script can no longer access local storage without the Promise rejecting for "unknown reason"
- taking it out, indexing still seems to work fine (FF 59) and all the promise-related bugs go away
- minor cleanup of some derived imports UI state too (imports UI really needs a work-over; it's gotten quite bad)
- bg script logic to handle forced recalc
- still left-over stuff for old ext items-specific state
- start import btn disable state derivation simplified
- this module handles differentiating the listeners of different notifications via a tiny state (only a single event shared between all notifs; switch on IDs)
- refactored all existing notif creation calls to use this module
- just suggestion text for now; will change
- links to main knowledge base page until we write article/blog post
@poltak poltak force-pushed the dev/dexie-search-index branch from 72eed15 to f976ca4 Compare April 6, 2018 06:01
poltak added 5 commits April 6, 2018 13:51
- rewrite new notifications module in TS to take advantage of it; seems to work well
- the reverse can be replaced by simply setting the redux reducer to prepend the details rows whenever a new item is finished
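A minimal sketch of that prepend-on-finish reducer (action and state shapes are assumptions, not the real store):

```typescript
interface DetailsRow { url: string; status: string }
interface ImportsState { detailsRows: DetailsRow[] }

// Hypothetical sketch: prepending each finished item keeps the newest
// rows on top without reversing the whole list on every render.
function importsReducer(
    state: ImportsState = { detailsRows: [] },
    action: { type: string; payload?: DetailsRow },
): ImportsState {
    switch (action.type) {
        case 'imports/FINISH_ITEM':
            return {
                ...state,
                detailsRows: [action.payload!, ...state.detailsRows],
            }
        default:
            return state
    }
}
```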
- now it should show up on any search that has at least one of the terms, domains, or tags filters defined
- also made sure not to show it when loading (no longer flashes in and out)
- removed the old `getTotalCount` search param; it's always 'true'
- TODO url boost
- this was working slightly differently between the implementations (rounding up vs down) - now fixed
- similar to the previously added boosted title terms search
@poltak poltak force-pushed the dev/dexie-search-index branch from 4185616 to af6a724 Compare April 11, 2018 07:30
@poltak poltak merged commit 34d6014 into master Apr 11, 2018
@blackforestboi
Member

blackforestboi commented Apr 11, 2018

Ran a few tests and it all works pretty well :)

A couple of things and then I think we can merge it (one might be a bit bigger). The list is in order of priority:

  1. The notification at the end to re-enable the phishing warnings (as already talked about)
  • the first warning should say: "Your browser may stop imports suddenly. Find out why and how to solve it"
  • also make that notification stay as long as the user does not actively remove it.
  • The last notification should say: "In case you disabled your phishing filter, don't forget to re-enable it"
  • This is the link to the knowledge base article: https://worldbrain.helprace.com/i49-prevent-your-imports-from-stopping-midway
  2. In the popup, when I blacklist stuff, the red icon is gone (nothing to see there)
  • screen shot 2018-04-11 at 19 28 39
  3. The undo button goes to "settings"; it should go to "blacklists"
  4. When there is nothing more to download and I select "include previously failed URLs", I can't start the process.
  5. I think we need to find a better solution for the favicons, for several reasons:
  • they are constantly refetched when URLs from the same domain are loaded (even from disk this sometimes takes 200-300ms); I assume we can save a lot of time there.
    -> even though they don't take much space, we might abstract them away and keep favicons in a separate store belonging to a certain domain, then load the favicons separately once the results are loaded. What are ideas to mitigate this?
    If it is an issue to get it done quickly, how can we prevent it becoming a problem if we do change that later?

@blackforestboi
Member

And another thing that is still not working properly is the counters.
They always show something different.
I was cancelling the download with 109XX of 14XXX downloaded items.
When I got back to the overview, it still showed 14830 items. Then I pressed reload and it showed me 8XXX; I pressed reload again and it showed me 6XXX.

@poltak poltak deleted the dev/dexie-search-index branch May 8, 2018 07:13