Store unprocessed page text content in page doc #266

poltak · 2018-01-19T04:27:33Z

Page text content goes through two main stages: 1) HTML preprocessing, then 2) plain-text preprocessing.
Previously, the result of 2) was stored at path content.fullText in page docs, which is pretty much the same content that goes into the terms index.
Now the result of 1) is stored in page docs, while the source for the terms index remains unchanged.
The language, extracted from the DOM, is also used in 2) but is only available from the source data of 1), so it is now added as an additional content.lang key in the page doc for access in 2).
This seems to work in both the importing and page visit scenarios.

- just use the Set constructor rather than the manual filtering - made more generic
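
As an illustration of the Set-based approach, here is a minimal sketch. The function name `removeDupeWords` and its signature are assumptions made for this example, not the code from the commit.

```typescript
// Hypothetical helper: name and signature are illustrative only.
function removeDupeWords(text: string): string {
    // Split on runs of whitespace and drop empty strings left by
    // leading or trailing spaces.
    const words = text.split(/\s+/).filter(word => word.length > 0)
    // A Set keeps one entry per distinct value, in first-seen order,
    // so no manual filtering is needed.
    return Array.from(new Set(words)).join(' ')
}

const deduped = removeDupeWords('the cat saw the other cat')
```

Because `Set` compares strings by value, the same call works for any array of primitives, which is roughly what "more generic" could mean here.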

- the pouch doc is created straight after the content analysis result is returned to the bg script, so now it will store the result of the initial HTML preprocessing
- the index pipeline is set up to do the text preprocessing on the content stored in the input page doc
- language now also stored as page doc metadata (needed for text preprocessing)
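
The flow described above could look roughly like the sketch below. Only the `content.fullText` and `content.lang` paths come from the description; every function name, the `_id` value, and both preprocessing stand-ins are hypothetical. It also assumes the stage 1 result is stored at the same `content.fullText` path, which the PR text does not state explicitly.

```typescript
// Hypothetical page doc shape; only content.fullText and content.lang
// are named in the PR description.
interface PageDoc {
    _id: string
    content: { fullText: string; lang?: string }
}

// Stage 1 stand-in: naive tag stripping, not the real HTML preprocessing.
function preprocessHtml(html: string): string {
    return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim()
}

// Stage 2 stand-in: lowercasing, ASCII-only splitting and a toy
// language-dependent stop word list.
function preprocessText(text: string, lang?: string): string[] {
    const stopWords = lang === 'en' ? ['the', 'a', 'of'] : []
    return text
        .toLowerCase()
        .split(/[^a-z0-9]+/)
        .filter(word => word.length > 0 && stopWords.indexOf(word) === -1)
}

// The page doc stores the stage 1 result plus the detected language.
function createPageDoc(id: string, html: string, lang?: string): PageDoc {
    return { _id: id, content: { fullText: preprocessHtml(html), lang } }
}

// The index pipeline runs stage 2 on what is stored in the page doc.
function termsForIndex(doc: PageDoc): string[] {
    return preprocessText(doc.content.fullText, doc.content.lang)
}

const pageDoc = createPageDoc('page/1', '<p>The Title of a Page</p>', 'en')
const indexTerms = termsForIndex(pageDoc)
```

The point of the split is that the page doc keeps the less lossy text, so stage 2 can be re-run later without fetching the page again.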

blackforestboi · 2018-01-24T19:57:57Z

This can be merged.

Just a last question: What happens on revisit? The fullContent is entirely replaced, right? But the index is appended?

This is OK behaviour, so can be merged in. Not ideal yet, but ok to merge.

poltak · 2018-01-25T02:17:47Z

@oliversauter yes, that is how it works. Note, though, that "index is appended" still means updating the latest timestamp of each existing term as well, as that's how the scoring works in the terms indexes (#187, unrelated to this PR).
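
To make the revisit behaviour concrete, here is a small sketch using an in-memory map as a stand-in for the terms index. The `TermsIndex` type and `indexVisit` function are hypothetical and do not reflect the actual index layout.

```typescript
// Hypothetical stand-in for a terms index:
// term -> (page id -> latest visit timestamp).
type TermsIndex = Map<string, Map<string, number>>

function indexVisit(
    index: TermsIndex,
    pageId: string,
    terms: string[],
    visitTime: number,
): void {
    terms.forEach(term => {
        let postings = index.get(term)
        if (postings === undefined) {
            postings = new Map<string, number>()
            index.set(term, postings)
        }
        // New terms are added; terms seen before get their timestamp bumped.
        postings.set(pageId, visitTime)
    })
}

const termsIndex: TermsIndex = new Map()
indexVisit(termsIndex, 'page/1', ['alpha', 'beta'], 100) // first visit
indexVisit(termsIndex, 'page/1', ['beta', 'gamma'], 200) // revisit
```

After the revisit, `alpha` is still in the index with its old timestamp, while `beta` and `gamma` carry the new one, so the page stays searchable by words that are no longer on it.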

What would be the ideal behaviour for the full content in pouch on a revisit? I think appending would be pretty difficult and would probably need some smart way to understand the text content (so you're not just duplicating content each visit). Or there is the dumb route of uniqing the pouch text content by words: appending is then easy (dupe words ignored), but you would lose any possibility of deriving term frequency from that data.
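
A quick sketch of the trade-off in the "dumb route", with hypothetical helper names: appending with word-level uniqing is trivial, but every word ends up with a count of 1.

```typescript
// Hypothetical helper: count how often each word occurs.
function termFrequency(words: string[]): Map<string, number> {
    const counts = new Map<string, number>()
    words.forEach(word => counts.set(word, (counts.get(word) || 0) + 1))
    return counts
}

// Hypothetical helper: append new words to the stored ones, ignoring dupes.
function appendUnique(stored: string[], incoming: string[]): string[] {
    return Array.from(new Set(stored.concat(incoming)))
}

const firstVisit = ['cat', 'cat', 'dog']
const secondVisit = ['cat', 'bird']

// Stored content after both visits when uniqing by word.
const storedWords = appendUnique(appendUnique([], firstVisit), secondVisit)

const rawCatCount = termFrequency(firstVisit).get('cat')
const storedCatCount = termFrequency(storedWords).get('cat')
```

Here `storedWords` is `['cat', 'dog', 'bird']`: the first visit's raw words had `cat` twice, but the stored data can no longer tell that apart from a single occurrence.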

blackforestboi · 2018-01-25T12:59:37Z

> need some smart way to understand the text content

Yeah, that is what we still need to figure out before we can do appending etc. As long as the document can be searched with old words, it's OK. This is how people remember stuff again. Just in case we need to re-index/re-analyse, we would need the latest version.

So all good for now.

poltak added 2 commits January 19, 2018 10:37

Simplify remove dupe words text transform

fb4a6e9

- just use the Set constructor rather than the manual filtering - made more generic

poltak merged commit a6cfd21 into master Jan 25, 2018

poltak deleted the feature/store-raw-page-content branch February 8, 2018 05:21

blackforestboi changed the title ~~Store unprocessed page text content in page doc~~ MTNI-205 ⁃ Store unprocessed page text content in page doc Apr 19, 2018

blackforestboi changed the title ~~MTNI-205 ⁃ Store unprocessed page text content in page doc~~ Store unprocessed page text content in page doc Apr 19, 2018
