Common contractions are missing from frequency_dictionary_en_82_765.txt #109

Open
rogerbock opened this issue Apr 21, 2021 · 6 comments

Comments

@rogerbock

It looks like contractions were added as an afterthought to this list after it was constructed, as they appear at the end and have artificial counts:

can't 300000
won't 300000
don't 300000
couldn't 300000
shouldn't 300000
wouldn't 300000
needn't 300000
mustn't 300000
she'll 300000
we'll 300000
he'll 300000
they'll 300000
i'll 300000
i'm 300000

There are some very common contractions missing from this list, such as "didn't". This means that when I try to correct a phrase like "I didnt want that", I get the suggestion "I didst want that", which is not ideal.

Is this a known issue? Is there a better frequency dictionary to use that includes contractions? Or should I just add more entries with artificial counts? Thank you!

@wolfgarbe
Owner

The frequency_dictionary_en_82_765.txt was created by intersecting the two lists mentioned below. Through reciprocal filtering, only words that appear in both lists are used. Additional filters were applied, and the resulting list was truncated to the ≈ 80,000 most frequent words.

Google Books Ngram data : Provides representative word frequencies
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

SCOWL - Spell Checker Oriented Word Lists (License) : Ensures genuine English vocabulary
http://wordlist.aspell.net/
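
To illustrate the intersection step, here is a minimal Python sketch. It is not the script that was actually used to build the dictionary: the function name `build_frequency_dictionary`, the toy inputs, and their counts are all made up for illustration, and the "additional filters" are not modeled.

```python
from typing import Dict, Iterable, List, Tuple


def build_frequency_dictionary(
    ngram_counts: Dict[str, int],
    valid_words: Iterable[str],
    limit: int,
) -> List[Tuple[str, int]]:
    """Keep only words found in both sources, then keep the `limit` most frequent."""
    vocabulary = {word.lower() for word in valid_words}
    shared = [(w, c) for w, c in ngram_counts.items() if w.lower() in vocabulary]
    # Sort by descending count; break ties alphabetically for a stable result.
    shared.sort(key=lambda item: (-item[1], item[0]))
    return shared[:limit]


# Toy inputs with made-up counts, for illustration only.
toy_counts = {"the": 1000, "teh": 5, "did": 400, "not": 600, "didst": 2}
toy_vocabulary = ["the", "did", "not", "didst", "didn't"]
toy_result = build_frequency_dictionary(toy_counts, toy_vocabulary, limit=3)
```

In this toy run "teh" is dropped because it is not in the word list, and "didn't" is dropped because it has no count in the frequency source, which is the same effect described in this issue.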

The Google Ngram data does not contain contractions. They were therefore also missing from the resulting dictionary and were added manually afterwards with artificial counts. Missing contractions can be added.

For example, the Books Ngram Viewer replaces "didn't" with "did not" to match how the books were processed.
https://books.google.com/ngrams/graph?content=didn%27t&year_start=1800&year_end=2019

You can look up the frequency of "did not" in the frequency_bigramdictionary_en_243_342.txt included in SymSpell and use this frequency for "didn't". That should be a better approximation than the artificial counts.
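
A rough Python sketch of that lookup follows. It assumes each line of the bigram dictionary has the layout `word1 word2 count`, separated by whitespace; please check the file before relying on this. The helper name `bigram_counts_for` and the sample lines with their counts are hypothetical.

```python
from typing import Dict, Iterable


def bigram_counts_for(
    bigram_lines: Iterable[str],
    expansions: Dict[str, str],
) -> Dict[str, int]:
    """Map each contraction to the count of its expanded two-word form."""
    wanted = {expanded: contraction for contraction, expanded in expansions.items()}
    found: Dict[str, int] = {}
    for line in bigram_lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip lines that do not match the assumed layout
        bigram = f"{parts[0]} {parts[1]}"
        if bigram in wanted:
            found[wanted[bigram]] = int(parts[2])
    return found


# Made-up sample lines; with the real file, pass an open file handle instead.
sample_lines = ["did not 1200", "does not 900", "of the 50000"]
sample_expansions = {"didn't": "did not", "doesn't": "does not", "isn't": "is not"}
sample_counts = bigram_counts_for(sample_lines, sample_expansions)
```

Contractions whose expanded form is not found (here "isn't") are simply left out of the result, so they can still be given an artificial count by hand.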

@rogerbock
Author

Thank you for this information! Do you know if there are any other transformations applied to the data used to create the dictionary besides "didn't" -> "did not"? Is there a list somewhere?

@wolfgarbe
Owner

Google: "How does the Ngram Viewer handle punctuation?
We apply a set of tokenization rules specific to the particular language. In English, contractions become two words (they're becomes the bigram they 're, we'll becomes we 'll, and so on). The possessive 's is also split off, but R'n'B remains one token. Negations (n't) are normalized so that don't becomes do not. In Russian, the diacritic ё is normalized to e, and so on. The same rules are applied to parse both the ngrams typed by users and the ngrams extracted from the corpora, which means that if you're searching for don't, don't be alarmed by the fact that the Ngram Viewer rewrites it to do not; it is accurately depicting usages of both don't and do not in the corpus. However, this means there is no way to search explicitly for the specific forms can't (or cannot): you get can't and can not and cannot all at once."
https://books.google.com/ngrams/info

I guess Google applies this transformation to all English contractions when generating the Ngram data from the corpus. The Google Ngram data was then used to generate the SymSpell dictionaries (the single-term dictionary and the bigram dictionary).
https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions (List of popular English contractions)
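
The rules quoted above could be approximated in Python roughly as follows. This is only a sketch based on the quoted text, not Google's tokenizer: the function name `ngram_form` is made up, only `'re`, `'ll`, and the possessive `'s` are named in the quote (the other suffixes are assumed to behave the same way), and irregular negations are deliberately left unresolved because the quote does not say how their stems are written.

```python
from typing import Optional

# Clitics split off as a separate token. 're, 'll and 's come from the quote;
# 've, 'd and 'm are an assumption that they are treated the same way.
SPLIT_SUFFIXES = ("'re", "'ll", "'ve", "'d", "'m", "'s")

# Negations whose stem changes; the quoted rules do not spell these out.
IRREGULAR_NEGATIONS = {"can't", "won't", "shan't", "ain't"}


def ngram_form(contraction: str) -> Optional[str]:
    """Return the assumed Ngram token form, or None if no rule applies."""
    word = contraction.lower()
    if word in IRREGULAR_NEGATIONS:
        return None
    if word.endswith("n't") and len(word) > 3:
        # Negations are normalized: "don't" -> "do not".
        return f"{word[:-3]} not"
    for suffix in SPLIT_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            # Other contractions become two tokens: "they're" -> "they 're".
            return f"{word[:-len(suffix)]} {suffix}"
    return None
```

Words such as "o'clock" match none of the rules and come back as `None`, so they would still need to be handled by hand.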

@rogerbock
Author

I added in the following contractions:
aren't could've didn't doesn't hadn't hasn't haven't he'd he's here's how'd how'll how're i'd i've isn't it'd it'll it's let's might've o'clock she'd she's should've somebody's someone's something's that's there's they'd they're they've wasn't we'd we're we've weren't what's where's who'd who'll who're who's why'd why're why's you'd you'll you're you've
I also added in the following common words that I noticed were not in the dictionary:
covid hi
I think this workaround is sufficient for my purposes, but I'll let you decide if you want to keep this issue open or not. Thank you for your help!
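
For anyone who wants to repeat this workaround, here is a small Python sketch that appends missing terms in the `term count` format shown at the top of this issue. The helper name `append_missing_entries` is hypothetical, and the 300000 count simply mirrors the artificial counts already in the file.

```python
from typing import Dict, Iterable, List


def append_missing_entries(
    dictionary_lines: Iterable[str],
    extra_counts: Dict[str, int],
) -> List[str]:
    """Return the dictionary lines plus "term count" lines for terms not yet present."""
    lines = [line.rstrip("\n") for line in dictionary_lines if line.strip()]
    existing = {line.split()[0] for line in lines}
    for term, count in extra_counts.items():
        if term not in existing:
            lines.append(f"{term} {count}")
    return lines


# Tiny stand-in for the real dictionary file; counts are illustrative.
existing_lines = ["the 1000\n", "can't 300000\n"]
extras = {"didn't": 300000, "can't": 1, "covid": 300000}
merged = append_missing_entries(existing_lines, extras)
```

Entries that already exist (here "can't") keep their original count, so the helper can be run more than once without creating duplicates.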

@ghost

ghost commented Nov 28, 2021

Attached are numerous contractions, some of which do not include apostrophes (such as gonna and gimme).

contractions.txt

See: https://en.wiktionary.org/wiki/Category:English_contractions

@wolfgarbe
Owner

Thank you. I'm sorry for the delay, it's still on my to-do list ...
