Store unprocessed page text content in page doc #266

Merged
poltak merged 2 commits into master from feature/store-raw-page-content
Jan 25, 2018

Conversation

poltak
Member

@poltak commented Jan 19, 2018

  • Page text content goes through two main stages: 1) HTML preprocessing, then 2) plain-text preprocessing.
  • Previously, the result of 2) was stored at the path content.fullText in page docs; that is pretty much the same content that goes into the terms index.
  • Now the result of 1) is stored in page docs, while the source for the terms index remains unchanged.
  • The language, extracted from the DOM, is also used in 2) but is only available from the source data of 1). It is now added as an additional content.lang key in the page doc so that 2) can access it.
  • Seems to be functional in both the importing and page-visit scenarios.
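
For readers unfamiliar with the pipeline, the two stages could be sketched roughly as below. This is a hypothetical illustration, not the code in this PR: `extractHtmlText`, `preprocessText`, `buildPageDoc`, `termsForIndex`, the `PageDoc` shape and the stop-word table are all assumed names, and storing the stage 1 result under `content.fullText` is also an assumption. Only the `content.fullText` and `content.lang` paths are mentioned in the description.

```typescript
// Hypothetical sketch; only the `content.fullText` and `content.lang`
// paths appear in the PR description, everything else is assumed.
interface PageContent {
  fullText: string; // assumed to hold the stage 1 (HTML preprocessing) result
  lang: string;     // language extracted from the DOM, needed by stage 2
}

interface PageDoc {
  _id: string;
  content: PageContent;
}

// Stage 1 (assumed): drop script/style bodies and tags, collapse whitespace.
function extractHtmlText(html: string): string {
  return html
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Tiny illustrative stop-word table keyed by language (assumed).
const STOP_WORDS: Record<string, string[]> = { en: ["the", "a", "of"] };

// Stage 2 (assumed): lowercase, split into ASCII words, dedupe,
// then remove the stop words for the given language.
function preprocessText(text: string, lang: string): string[] {
  const stop = new Set(STOP_WORDS[lang] ?? []);
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return Array.from(new Set(words)).filter((w) => !stop.has(w));
}

// The page doc keeps the stage 1 output plus the language.
function buildPageDoc(url: string, html: string, lang: string): PageDoc {
  return { _id: url, content: { fullText: extractHtmlText(html), lang } };
}

// The index pipeline runs stage 2 on what is stored in the page doc.
function termsForIndex(doc: PageDoc): string[] {
  return preprocessText(doc.content.fullText, doc.content.lang);
}
```

The point of the split is that the page doc retains more of the original text than the terms index does, so the text can be re-analysed later without refetching the page.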

- just use the Set constructor rather than the manual filtering
- made more generic
- the pouch doc is created straight after the content analysis result is returned to the bg script, so now it will store the result of the initial HTML preprocessing
- the index pipeline is set up to do the text preprocessing on the content stored in the input page doc
- language now also stored as page doc metadata (needed for text preprocessing)
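
As a rough illustration of the first commit note, deduplicating with the `Set` constructor instead of a manual filter might look like this. Both function names are hypothetical and this is not the code from the commit.

```typescript
// Manual filtering (assumed "before"): keep an item only at its first index.
// Each indexOf call scans the array, so this is quadratic in the worst case.
function uniqManual<T>(items: T[]): T[] {
  return items.filter((item, i) => items.indexOf(item) === i);
}

// Set constructor (assumed "after"): a Set keeps the first occurrence of
// each value and preserves insertion order. For arrays of string terms the
// result matches uniqManual.
function uniq<T>(items: T[]): T[] {
  return Array.from(new Set(items));
}
```

For example, `uniq(["cat", "dog", "cat"])` gives `["cat", "dog"]`.
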
@blackforestboi
Member

This can be merged.

Just a last question: What happens on revisit? The fullContent is entirely replaced, right? But the index is appended?

This is OK behaviour, so can be merged in. Not ideal yet, but ok to merge.

@poltak
Member Author

poltak commented Jan 25, 2018

@oliversauter yes, that is how it works. Note, though, that "index is appended" still means updating the latest timestamp of each existing term, since that is how scoring works in the terms indexes (#187, unrelated to this PR).
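
To make the "appended" behaviour concrete, here is a minimal in-memory sketch. The real index structure from #187 is not shown in this thread, so `TermsIndex` and `indexVisit` are assumptions made purely for illustration.

```typescript
// Hypothetical in-memory terms index: term -> (page id -> latest visit time).
type TermsIndex = Map<string, Map<string, number>>;

// On each visit, new terms are added and existing terms get their
// timestamp bumped. Terms that are no longer on the page are left alone.
function indexVisit(
  index: TermsIndex,
  pageId: string,
  terms: string[],
  visitTime: number
): void {
  for (const term of terms) {
    const pages = index.get(term) ?? new Map<string, number>();
    const previous = pages.get(pageId) ?? 0;
    pages.set(pageId, Math.max(previous, visitTime));
    index.set(term, pages);
  }
}
```

In this sketch, a term seen only on an earlier visit keeps its old timestamp, while a term seen on both visits carries the later one.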

What would be the ideal behaviour for the full content in pouch on a revisit? I think appending would be pretty difficult and would probably need some smart way to understand the text content (so you're not just duplicating content on each visit). Or there is the dumb route of uniqing the pouch text content by words: appending is then easy (duplicate words are ignored), but you would lose any possibility of deriving term frequency from that data.
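
A small sketch of that trade-off, with hypothetical helpers `appendUniqueWords` and `termFrequency` that are not part of this PR:

```typescript
// Split on whitespace and drop empty strings.
function splitWords(text: string): string[] {
  return text.split(/\s+/).filter((w) => w.length > 0);
}

// "Dumb route": append incoming text to stored text, keeping each word once.
function appendUniqueWords(stored: string, incoming: string): string {
  const words = splitWords(stored + " " + incoming);
  return Array.from(new Set(words)).join(" ");
}

// Count how often each word occurs in a text.
function termFrequency(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of splitWords(text)) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}
```

After uniqing, every word has a count of exactly one, so the counts no longer say anything about how prominent a term was on the page.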

@poltak merged commit a6cfd21 into master on Jan 25, 2018
@blackforestboi
Member

need some smart way to understand the text content

Yeah, that is what we still need to figure out before we can do appending etc. As long as the document can be searched with old words, it's OK; this is how people remember stuff again. In case we need to re-index or re-analyse, we would need the latest version.

So all good for now.

@poltak deleted the feature/store-raw-page-content branch on February 8, 2018 05:21
@blackforestboi changed the title from "Store unprocessed page text content in page doc" to "MTNI-205 ⁃ Store unprocessed page text content in page doc" on Apr 19, 2018
@blackforestboi changed the title from "MTNI-205 ⁃ Store unprocessed page text content in page doc" to "Store unprocessed page text content in page doc" on Apr 19, 2018