Spelling checker has been modified #71
Sourcery Code Quality Report
❌ Merging this PR will decrease code quality in the affected files by 2.06%.
This PR depends on PR #69 being merged. Once that PR is merged, this one can proceed; until then, let's resolve any comments on this PR.
Please also do one last check of https://github.com/neomatrix369/nlp_profiler/blob/master/CONTRIBUTING.md to see whether any dependent files need changing, e.g. re-running notebooks. The Developer Guide is also worth reviewing as a closing action. Maybe you can enhance the existing grammar check example in the notebook(s) to illustrate the new package's features. Please re-run the notebooks in this repo on your local machine to confirm that your changes have taken effect and no issues have arisen. The markdown files in this repo may also need a touch-up due to this change; can you please check whether that's the case?
LGTM - just a few changes requested
This PR is related to #8. Have a good read of the issue to see whether all or most of the requirements there are resolved by this PR.
I checked #8 and #2, and this PR addresses both issues. The results are now computed with a fuzzy-matching algorithm that penalizes each misspelled word as well as the arrangement of tokens. See this article: https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe
One last thing to do is update the CHANGELOG.md for this change. It's very easy to do; see how the previous entries are done.
Please check the options that you have completed and strike out the options that do not apply via this pull request:
you have read
the notebooks are updated (see notebooks folder, read the Notebooks docs)
Goal or purpose of the PR
The spelling checker previously used TextBlob and required tokenization for both spell checking and spelling quality summarisation. This took significant time, and the resulting score was not satisfactory.
Changes implemented in the PR
I replaced the checker function with a package that claims to be much faster than TextBlob and jamspell, namely Symspellpy. In addition, the previous result score was based entirely on the ratio of the number of misspelled words to the total length of the string. That does not take ease of reading or "whether the phrase makes sense" into account. To resolve these issues, I used fuzzy-matching techniques that compare the original text with the corrected text and score the text accordingly.
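The scoring idea described above can be illustrated with a minimal sketch. It is not the code from this PR: the function name `fuzzy_spelling_score` is hypothetical, the corrected text is supplied by hand instead of coming from Symspellpy, and the standard library's `difflib.SequenceMatcher` stands in for whichever fuzzy-matching library the PR actually uses.

```python
from difflib import SequenceMatcher


def fuzzy_spelling_score(original: str, corrected: str) -> float:
    """Return a similarity score in [0.0, 1.0] between the original text
    and its spell-corrected version (hypothetical helper, for illustration).

    1.0 means the checker changed nothing; lower values mean more of the
    text had to be changed to reach the corrected form.
    """
    if not original and not corrected:
        # Two empty strings are treated as identical.
        return 1.0
    # ratio() is 2 * M / T, where M is the number of matching characters
    # and T is the combined length of both strings.
    return SequenceMatcher(None, original, corrected).ratio()


if __name__ == "__main__":
    corrected_text = "This is fine"  # assumed output of a spell checker
    for text in ("This is fine", "Ths is fine", "Ths iz fyne"):
        print(f"{text!r}: {fuzzy_spelling_score(text, corrected_text):.3f}")
```

Because the score compares whole strings and not just the count of misspelled words, a text with several errors scores lower than a text with a single error, and reordered tokens also reduce the score.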