Checking the source dictionary #5

Open
jaumeortola opened this issue Feb 5, 2024 · 2 comments

jaumeortola commented Feb 5, 2024

The first version of the source dictionary is here: https://github.com/languagetool-org/english-pos-dict/tree/main/src-dict

I will be adding some comments and ideas here. We can open new issues for some parts of the work.

  • We can start by separating the entries into groups: those that need no review, those that need some manual review, and so on. For example:
    • words in all spelling dicts and tagged -> no need to review
    • words in all spelling dicts but not tagged -> maybe they can be tagged easily
    • words in US and GB and tagged -> maybe they can be accepted by all variants?
    • ...
  • Check that variant labels are coherent with en-US-GB.txt (use scripting).
  • Some sets of entries look suspicious: untagged words in GB with certain prefixes (mis-, out-, over-, re-, under-) seem to be nonsense words. The same goes for some suffixed forms (see: survivorshipably... survivorshipry).
  • Words with the tag us-large come from a Hunspell US dictionary that we haven't used until now. It is mentioned in issue #2, "Explore differences between en-US and en-US-large".
  • We are using a simplified format for regular verbs: recharge=verb=all. (We use a few rules to cover more cases of regular verbs. See here). It would be useful to have something similar for nouns: a simple and quick way to tag a noun. We would need to define the format, and ways to write exceptions.
  • Which sources do we consider authoritative for deciding whether a word is GB or US? And what about AU, CA, ZA, NZ? Are there dictionaries for those variants?
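
To make the simplified format concrete, here is a minimal sketch of reading a line such as recharge=verb=all. The field names (lemma, pos, variant) and the inflection helper are assumptions for illustration only; the helper is deliberately naive and is not the set of rules the project actually uses.

```python
from typing import NamedTuple


class SimpleEntry(NamedTuple):
    # Field names are hypothetical; the issue only shows "recharge=verb=all".
    lemma: str
    pos: str
    variant: str


def parse_simple_entry(line: str) -> SimpleEntry:
    """Split a 'lemma=pos=variant' line into its three fields."""
    parts = line.strip().split("=")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected lemma=pos=variant, got {line!r}")
    return SimpleEntry(*parts)


def naive_verb_forms(lemma: str) -> dict:
    """Very rough regular-verb inflection (assumption, not the project's rules).

    Only two cases are handled: lemmas ending in 'e' and everything else.
    Consonant doubling, '-y' verbs, '-ee' verbs and so on are ignored.
    """
    if lemma.endswith("e"):
        past, gerund = lemma + "d", lemma[:-1] + "ing"
    else:
        past, gerund = lemma + "ed", lemma + "ing"
    return {
        "base": lemma,
        "third_person": lemma + "s",
        "past": past,
        "gerund": gerund,
    }


entry = parse_simple_entry("recharge=verb=all")
forms = naive_verb_forms(entry.lemma)
```

A noun format could probably reuse the same three-field shape, but the exceptions (irregular plurals, uncountable nouns) would still need a syntax of their own, which is the open question above.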

jaumeortola commented Feb 7, 2024

Separating the dictionary into src-clean.txt (accepted entries), src-pending.txt and src-discarded.txt (commit 551de86).

Done:

  • tagged words in all spelling dicts -> moved to src-clean
  • tagged words in GB and US -> moved to src-clean, variant modified to "all"
  • untagged words in GB with some prefixes (mis-, out-, over-, re-, under-, in- + uppercase) -> moved to src-discarded
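
For reference, the three rules above can be sketched roughly as follows. This is not the script used for the commit: the Entry class, the set of spelling dictionaries in ALL_DICTS, and the plain startswith prefix test are all assumptions, and the "in- + uppercase" case is left out because its exact meaning is not spelled out here.

```python
from dataclasses import dataclass

# Assumed set of spelling dictionaries, taken from the variants named in this issue.
ALL_DICTS = frozenset({"US", "GB", "AU", "CA", "ZA", "NZ"})
# Prefixes listed in the rule above (without "in- + uppercase").
SUSPICIOUS_PREFIXES = ("mis", "out", "over", "re", "under")


@dataclass
class Entry:
    # Hypothetical in-memory shape of a source-dictionary entry.
    word: str
    tagged: bool
    dicts: frozenset
    variant: str = ""


def triage(entry: Entry) -> str:
    """Return the name of the file an entry would be moved to."""
    if entry.tagged and entry.dicts >= ALL_DICTS:
        return "src-clean"
    if entry.tagged and {"US", "GB"} <= entry.dicts:
        entry.variant = "all"  # accepted for every variant
        return "src-clean"
    if (
        not entry.tagged
        and "GB" in entry.dicts
        and entry.word.startswith(SUSPICIOUS_PREFIXES)
    ):
        return "src-discarded"
    return "src-pending"


sample = Entry("overglorb", tagged=False, dicts=frozenset({"GB"}))  # made-up word
destination = triage(sample)
```

Note that a bare prefix test like this would also catch real untagged words such as "outer", so the actual selection presumably needs a stricter criterion than the one shown.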
