Common contractions are missing from frequency_dictionary_en_82_765.txt #109
Comments
The frequency_dictionary_en_82_765.txt was created by intersecting the two lists mentioned below. Through reciprocal filtering, only those words which appear in both lists are used. Additional filters were applied and the resulting list was truncated to the ≈ 80,000 most frequent words.
Google Books Ngram data: Provides representative word frequencies
SCOWL - Spell Checker Oriented Word Lists (License): Ensures genuine English vocabulary
The Google Ngram data does not contain contractions, so they were also missing from the resulting dictionary and were manually added afterwards with artificial counts. Missing contractions can be added.
The Books Ngram Viewer, for example, replaces "didn't" with "did not" to match how the books were processed. You can look up the frequency of "did not" in the frequency_bigramdictionary_en_243_342.txt included in SymSpell and use this frequency for "didn't". That should be a better approximation than the artificial counts. |
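The bigram lookup suggested above could be sketched roughly as follows. This is only a minimal Python sketch and not part of SymSpell: the helper names and the contraction-to-expansion map are hypothetical, the sample counts are made up, and it assumes each line of the bigram dictionary has the form `word1 word2 count`.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from a contraction to the two-word form that is
# assumed to appear in the bigram dictionary. Extend as needed.
EXPANSIONS: Dict[str, Tuple[str, str]] = {
    "didn't": ("did", "not"),
    "doesn't": ("does", "not"),
    "wasn't": ("was", "not"),
}


def load_bigram_counts(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Parse 'word1 word2 count' lines into a {(word1, word2): count} map."""
    counts: Dict[Tuple[str, str], int] = {}
    for line in lines:
        parts = line.split()
        # Skip anything that does not match the assumed three-column layout.
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        counts[(parts[0], parts[1])] = int(parts[2])
    return counts


def contraction_entries(
    bigram_counts: Dict[Tuple[str, str], int],
    expansions: Dict[str, Tuple[str, str]] = EXPANSIONS,
) -> List[str]:
    """Build 'contraction count' lines, reusing the expanded bigram's count."""
    entries = []
    for contraction, pair in expansions.items():
        count = bigram_counts.get(pair)
        if count is not None:  # only emit entries with a known count
            entries.append(f"{contraction} {count}")
    return entries


# Made-up sample data standing in for the real bigram dictionary file.
sample = ["did not 1000", "does not 500", "abcs of 42"]
entries = contraction_entries(load_bigram_counts(sample))
print("\n".join(entries))
```

To run this against the real file, pass an open file object for frequency_bigramdictionary_en_243_342.txt to `load_bigram_counts` instead of `sample`, then append the resulting lines to the single-term dictionary.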
Thank you for this information! Do you know if there are any other transformations applied to the data used to create the dictionary besides "didn't" -> "did not"? Is there a list somewhere? |
Google: "How does the Ngram Viewer handle punctuation?" I guess this transformation is applied by Google to all English contractions to generate the ngram data from the corpus. The Google ngram data was then used to generate the SymSpell dictionaries (single-term dictionary and bigram dictionary). |
I added in the following contractions: |
Attached are numerous contractions, some that do not include apostrophes (such as gonna and gimme). See: https://en.wiktionary.org/wiki/Category:English_contractions |
Thank you. I'm sorry for the delay, it's still on my to-do list ... |
It looks like contractions were added as an afterthought to this list after it was constructed, as they appear at the end and have artificial counts:
There are some very common contractions missing from this list, such as "didn't". This means that when I try to correct a phrase like "I didnt want that", I get the suggestion "I didst want that", which is not ideal.
Is this a known issue? Is there a better frequency dictionary to use that includes contractions? Or should I just add more entries with artificial counts? Thank you!
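One way to see which contractions would need new entries is to check a candidate list against the dictionary file. The following is a minimal Python sketch under assumptions: the function name and candidate list are hypothetical, the sample lines and counts are made up, and it assumes each dictionary line has the form `term count`.

```python
from typing import Iterable, List


def missing_terms(dictionary_lines: Iterable[str], candidates: Iterable[str]) -> List[str]:
    """Return the candidates that have no entry in the 'term count' lines."""
    present = set()
    for line in dictionary_lines:
        parts = line.split()
        if len(parts) >= 2:  # assumed layout: term first, count second
            present.add(parts[0])
    return [term for term in candidates if term not in present]


# Made-up sample lines standing in for frequency_dictionary_en_82_765.txt.
sample_dictionary = ["the 100", "didst 5", "don't 50"]
candidates = ["don't", "didn't", "wasn't"]
print(missing_terms(sample_dictionary, candidates))
```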