Simplify text pre-processing #333

poltak · 2018-03-14T06:22:10Z

Fixes #330

So far, the two stages mentioned in that issue have been removed, and the tests have been updated to verify that the change works and doesn't affect anything else.

Updated the pipeline tests in general so that both the current pipeline and the new one (slightly changed type signature) are tested. Fixed one issue found that way.
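For illustration, here is a minimal sketch of what dropping the two stages means for the output. The commit notes name them as the "words with digits" and "words with many consonants" filters; the stage names, regexes, and the `runStages` helper below are assumptions made for this example, not the project's actual code.

```javascript
// Hypothetical stage functions; names and patterns are illustrative only.
const removeWordsWithDigits = words => words.filter(w => !/\d/.test(w));
// Drops words containing a run of 5+ letters that are not a/e/i/o/u.
const removeConsonantRuns = words =>
    words.filter(w => !/[^aeiou\W\d_]{5,}/i.test(w));
const dedupe = words => [...new Set(words)];

// Runs each stage in order, feeding its output to the next one.
const runStages = (stages, words) =>
    stages.reduce((acc, stage) => stage(acc), words);

const input = ['es2015', 'rhythms', 'search', 'search'];

// With both filters in place, only 'search' survives.
const before = runStages(
    [removeWordsWithDigits, removeConsonantRuns, dedupe],
    input,
);

// With the two stages removed, the digit and consonant-heavy words are kept.
const after = runStages([dedupe], input);

console.log(before); // [ 'search' ]
console.log(after); // [ 'es2015', 'rhythms', 'search' ]
```

Inverting the earlier tests then amounts to asserting that terms like `es2015` are present in the output rather than absent.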

Some others remain that would be nice to get fixed with this work:

- empty string can sometimes get through and be indexed (sometimes happens in current index too)
- stopwords can sometimes get through and be indexed

It's pretty strange that this only seems to happen "sometimes", and I cannot reproduce it in the tests (so far; could try some more test data), so I need to look over all the pipeline-interacting code and see how these could get through. I also have the URLs, so I could manually try to reproduce with those page contents.

poltak · 2018-03-15T07:42:45Z

After trying many things, I found that a special space character (U+00A0, the no-break space) appears in a lot of pages and was not covered by our terms delimiter. This was letting things like stopwords get through, along with the empty string.
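A minimal sketch of how this kind of miss can happen, assuming a delimiter pattern that lists only ASCII whitespace. Both patterns, the stopword list, and `extractTerms` are hypothetical; the project's real delimiter is not shown in this thread.

```javascript
// Hypothetical delimiter patterns, for illustration only.
const NARROW_DELIM = /[ \t\r\n.,;:!?()"]+/; // ASCII whitespace only
const WIDE_DELIM = /[\s.,;:!?()"]+/; // in JS, \s also matches U+00A0

const STOPWORDS = new Set(['the', 'and', 'of']);

// Lowercase, split on the delimiter, then drop empty terms and stopwords.
function extractTerms(text, delim) {
    return text
        .toLowerCase()
        .split(delim)
        .filter(term => term.length > 0 && !STOPWORDS.has(term));
}

const text = 'The\u00A0quick fox and\u00A0the hound';

// 'the\u00A0quick' and 'and\u00A0the' stay glued together, so the
// stopword check never sees 'the' or 'and' on their own.
const narrowTerms = extractTerms(text, NARROW_DELIM);

// With U+00A0 treated as a delimiter, the stopwords are split off and removed.
const wideTerms = extractTerms(text, WIDE_DELIM);

console.log(narrowTerms.length); // 4
console.log(wideTerms); // [ 'quick', 'fox', 'hound' ]
```

The sketch only shows the stopword side of the problem; how the empty string got through in the real pipeline depends on details not covered here.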

Will merge this in soon, but I think the pipeline test inputs need to contain some more weird character cases to make sure we're handling it all OK.

blackforestboi · 2018-03-15T07:43:55Z

Great find :)
We might wanna cover the removal of this character in the migration process as well, so we don't carry over trashy content?
@ShishKabab

* Remove stages from text preproc
  - words with digits no longer removed
  - words with many consonants no longer removed
* Update pipeline tests to reflect new text proc behaviour
  - invert the prev tests to make sure the pipeline output includes words with numbers and many consonant words
* Set up pipeline tests to run on both old and new pipelines
  - indexes use the pipeline slightly differently
  - new one has no concern for events at that stage (separate data entities) so it ignores those as input
  - it also does not add IDB key prefixes to things like terms, domains, tags, etc.
  - it also returns a bigger final page as we now also store display data and data that was prev. in pouch
* Ensure new pipeline tests pass
  - as new pipeline doesn't deal with IDB prefix keys (like 'term/'), our prev filter condition to exclude empty terms needs to be updated
  - now tests passing for new index pipeline too
* Add timers for HTML + text proc
* Add support for special spaces in term delimiter pattern
  - this was apparently common in a lot of pages, breaking stopword removal and indexing terms like empty string
  - also changed the order of text transforms slightly to try and remove more earlier (things like dupe words, random digits)
  - also ensure text is lowercased (regression)
* Add pipeline test for space-normalization

Further simplify text pre-processing (#333)

Update behaviour of hyphen splitting in text preproc
- previously made 'dash-word' into 'dash' and 'word'
- now preserves the original joined form and creates new words
- added test to confirm

Add support for emails in text pre-proc
- added corresponding test

Remove hyphen `-` from def terms separator pattern
- we now include hyphenated terms, and derive terms from them also
- update tests to ensure this is happening

# Conflicts:
# src/search/pipeline.test.js
# src/search/util.js
# src/util/transform-page-text.js
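The hyphen behaviour described above (keep the joined form, and also derive the individual words) could look roughly like the sketch below. `expandHyphenated` is a hypothetical helper written for this example, not the function used in the codebase.

```javascript
// Hypothetical helper: keeps each hyphenated term as-is and also adds
// its non-empty parts. A Set is used so repeated terms appear only once.
function expandHyphenated(terms) {
    const out = new Set();
    for (const term of terms) {
        out.add(term);
        if (term.includes('-')) {
            for (const part of term.split('-')) {
                if (part.length > 0) {
                    out.add(part);
                }
            }
        }
    }
    return [...out];
}

const expanded = expandHyphenated(['dash-word', 'plain']);
console.log(expanded); // [ 'dash-word', 'dash', 'word', 'plain' ]
```

For this to work, the hyphen has to stay out of the term separator pattern; otherwise 'dash-word' would already be split before this step sees it.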

poltak added 6 commits March 14, 2018 11:52

Remove stages from text preproc

d0e5041

- words with digits no longer removed
- words with many consonants no longer removed

Update pipeline tests to reflect new text proc behaviour

027fc30

- invert the prev tests to make sure the pipeline output includes words with numbers and many consonant words

Ensure new pipeline tests pass

27d9aeb

- as new pipeline doesn't deal with IDB prefix keys (like 'term/'), our prev filter condition to exclude empty terms needs to be updated
- now tests passing for new index pipeline too

Add timers for HTML + text proc

93307d3

Add pipeline test for space-normalization

f5c228e
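As a rough illustration of the "Ensure new pipeline tests pass" commit: if the old pipeline emitted prefixed keys such as 'term/fox', an empty term would show up as the bare prefix 'term/', while in the new pipeline it is simply ''. The two predicates below are assumptions about what such checks might look like, not the actual conditions from the codebase.

```javascript
// Assumed shapes: old pipeline emits prefixed keys, new one emits bare terms.
const TERM_PREFIX = 'term/';

// Hypothetical old check: an empty term is a key that is only the prefix.
const isNonEmptyPrefixed = key => key !== TERM_PREFIX;

// Hypothetical updated check for bare terms.
const isNonEmptyBare = term => term.length > 0;

const oldTerms = ['term/fox', 'term/'].filter(isNonEmptyPrefixed);
const newTerms = ['fox', ''].filter(isNonEmptyBare);

// Applying the old check to bare terms lets the empty string through,
// which is why the condition has to change along with the key format.
const missed = ['fox', ''].filter(isNonEmptyPrefixed);

console.log(oldTerms); // [ 'term/fox' ]
console.log(newTerms); // [ 'fox' ]
console.log(missed); // [ 'fox', '' ]
```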

poltak merged commit 8ee1c4b into dev/dexie-search-index Mar 16, 2018

poltak deleted the feature/simplify-text-proc branch March 16, 2018 03:37
