Simplify text pre-processing #333
- words with digits no longer removed
- words with many consonants no longer removed

- invert the previous tests to make sure the pipeline output includes words with numbers and words with many consonants

- indexes use the pipeline slightly differently
- new one has no concern for events at that stage (separate data entities), so it ignores those as input
- it also does not add IDB key prefixes to things like terms, domains, tags, etc.
- it also returns a bigger final page, as we now also store display data and data that was previously in pouch

- as the new pipeline doesn't deal with IDB prefix keys (like 'term/'), our previous filter condition to exclude empty terms needs to be updated
- tests now passing for the new index pipeline too

- this was apparently common in a lot of pages, breaking stopword removal and indexing terms like the empty string
- also changed the order of text transforms slightly to try and remove more earlier (things like dupe words, random digits)
- also ensure text is lowercased (regression)
After trying many things, I found there was a special space character (U+00A0, the no-break space) appearing in a lot of pages that our terms delimiter did not cover. This let things like stopwords, along with the empty string, get through to the index. Will merge this in soon, but I think the pipeline test inputs need to contain some more weird character cases to make sure we're handling it all OK.
Great find :)
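The U+00A0 problem described above can be illustrated with a minimal sketch. The names `ASCII_DELIM`, `TERM_DELIM`, `STOPWORDS`, and `extractTerms` are hypothetical and are not taken from this repository; the actual delimiter pattern and stopword list in the codebase may look quite different.

```javascript
// Hypothetical delimiter that only knows about ASCII whitespace.
const ASCII_DELIM = /[ \t\r\n]+/

// Hypothetical delimiter that also covers the no-break space (U+00A0).
// In JavaScript `\s` already matches U+00A0; it is spelled out here only
// to make the intent explicit.
const TERM_DELIM = /[\s\u00A0]+/

// Tiny stand-in stopword list (assumption, for illustration only).
const STOPWORDS = new Set(['the', 'a', 'of'])

// Lowercase the text, split it into terms, then drop empty strings and stopwords.
function extractTerms(text, delim) {
  return text
    .toLowerCase()
    .split(delim)
    .filter(term => term.length > 0 && !STOPWORDS.has(term))
}

const input = 'The\u00A0Quick fox\u00A0'

// With the ASCII-only delimiter, "the" stays glued to "quick" via U+00A0,
// so the stopword check never sees it.
console.log(extractTerms(input, ASCII_DELIM))

// With the wider delimiter, the stopword and the trailing empty term are removed.
console.log(extractTerms(input, TERM_DELIM))
```

Filtering empty strings after the split matters as much as the delimiter itself, since any leading or trailing delimiter produces an empty term.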
* Remove stages from text preproc
  - words with digits no longer removed
  - words with many consonants no longer removed
* Update pipeline tests to reflect new text proc behaviour
  - invert the previous tests to make sure the pipeline output includes words with numbers and words with many consonants
* Set up pipeline tests to run on both old and new pipelines
  - indexes use the pipeline slightly differently
  - new one has no concern for events at that stage (separate data entities), so it ignores those as input
  - it also does not add IDB key prefixes to things like terms, domains, tags, etc.
  - it also returns a bigger final page, as we now also store display data and data that was previously in pouch
* Ensure new pipeline tests pass
  - as the new pipeline doesn't deal with IDB prefix keys (like 'term/'), our previous filter condition to exclude empty terms needs to be updated
  - tests now passing for the new index pipeline too
* Add timers for HTML + text proc
* Add support for special spaces in term delimiter pattern
  - this was apparently common in a lot of pages, breaking stopword removal and indexing terms like the empty string
  - also changed the order of text transforms slightly to try and remove more earlier (things like dupe words, random digits)
  - also ensure text is lowercased (regression)
* Add pipeline test for space-normalization
* Update behaviour of hyphen splitting in text preproc
  - previously made 'dash-word' into 'dash' and 'word'
  - now preserves the original joined form and creates new words
  - added test to confirm
* Add support for emails in text pre-proc
  - added corresponding test
* Remove hyphen `-` from default terms separator pattern
  - we now include hyphenated terms, and derive terms from them also
  - update tests to ensure this is happening

Conflicts:
  src/search/pipeline.test.js
  src/search/util.js
  src/util/transform-page-text.js
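The hyphen behaviour described in these commits (keep the joined form and also derive the individual words) could look roughly like the sketch below. `expandHyphenated` is a hypothetical helper name, not a function from this repository, and the real transform may order or dedupe terms differently.

```javascript
// Keep each hyphenated term as-is and additionally emit its parts.
// A Set is used so that derived parts do not create duplicate terms;
// insertion order is preserved.
function expandHyphenated(terms) {
  const out = new Set()
  for (const term of terms) {
    out.add(term)
    if (term.includes('-')) {
      for (const part of term.split('-')) {
        // Skip empty parts produced by leading, trailing, or doubled hyphens.
        if (part.length > 0) out.add(part)
      }
    }
  }
  return [...out]
}

console.log(expandHyphenated(['dash-word', 'test']))
// → [ 'dash-word', 'dash', 'word', 'test' ]
```

Under this approach a search for either the joined form or one of its parts can match the page, at the cost of a few extra indexed terms.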
Fixes #330
So far, the two stages mentioned in that issue have been removed, and the tests have been updated to verify that the change works and doesn't affect anything else.
Updated the pipeline tests in general so that both the current pipeline and the new one (slightly changed type signature) are tested. Fixed one issue that this uncovered.
Some other issues remain that would be nice to get fixed as part of this work:
Strangely, these only seem to get through sometimes, and I cannot reproduce them in the tests so far (could try some more test data), so I need to look over all the pipeline-interacting code to see how they could get through. I also have the URLs, so I could manually try to reproduce with those pages' contents.