Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(dict): Remove only corrections if a space could be inserted as well #792

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

not-my-profile
Copy link
Contributor

@not-my-profile not-my-profile commented Aug 7, 2023

The typo dictionary words.csv previously contained a bunch of problematic entries such as:

abouta,about
algorithmi,algorithm
attachen,attach
shouldbe,should

Which resulted in wrong corrections if the following spaces (indicated by ␣) were accidentally missed:

about␣a
algorithm␣i developed
attach␣en masse
should␣be

Many of these entries were introduced by taking entries from the codespell-dict and removing corrections containing spaces (since typos currently doesn't support them), e.g the codespell dictionary contains:

abouta->about a, about,
shouldbe->should, should be,

This commit updates tests/verify.rs to automatically remove entries in the form of {correction}{common_word},{correction}, where {common_word} is one of the 1000 most frequent English words.

The top-1000-most-frequent-words.csv file was generated by running:

curl https://norvig.com/ngrams/count_1w.txt \
  | head -n1024 \
  | awk '{print $1;}' \
  | grep -vE '^([^ia]|al|re)$' \
  > top-1000-most-frequent-words.csv

@@ -0,0 +1,1000 @@
the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this off for now because we'd need to workout cases like "extrememe"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worked these cases out by adding the check:

                        if only_correction.ends_with(suffix) {
                            // We still want to correct e.g. "extrememe" to "extreme".
                            return true;
                        }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd still prefer not be constrained by this very mechanical process. It can provide insight but I don't trust it to automatically be applied

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current process now seems to make exactly the changes we want to all our 63,200 entries, which does inspire some confidence in me. Besides the process is very easy to adapt so I think we can just do so when we figure out that it too eagerly filters out something.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current process now seems to make exactly the changes we want to all our 63,200 entries

Except this isn't exactly what I want (see the other thread). Arbitrarily combining words that don't make sense when combined leads to us losing corrections we would have otherwise.

@not-my-profile not-my-profile marked this pull request as draft August 7, 2023 20:39
@not-my-profile not-my-profile marked this pull request as ready for review August 7, 2023 21:03
@not-my-profile not-my-profile changed the title fix(dict): Remove unsure corrections fix(dict): Remove only corrections if they could contain spaces Aug 7, 2023
@not-my-profile not-my-profile changed the title fix(dict): Remove only corrections if they could contain spaces fix(dict): Remove only corrections that could contain spaces Aug 7, 2023
@not-my-profile not-my-profile changed the title fix(dict): Remove only corrections that could contain spaces fix(dict): Remove only corrections if a space could be inserted as well Aug 7, 2023
@not-my-profile
Copy link
Contributor Author

I have updated the PR to now also detect {common_word}{correction},{correction} and not just {correction}{common_word},{correction}.

aand,and
aanother,another
aapply,apply
aack
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The redundant a's is a separate thing and we should be correcting these

The challenge with blindly checking concatenated words is it doesn't filter out for when they don't make sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'd feel better if we just looked at what changed due to the spaces and applied it to those. We could then separate decide which of these changes might make sense. As is, I'm seeing a lot that don't and don't want to take the time to decide that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh the logic I had did actually already detect these with

            if only_correction.starts_with(prefix) {
                return false;
            }

I just must have omitted resetting words.csv to run SNAPSHOTS=overwrite cargo test verify again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the many cases have just become:

aequidistant
aequivalent
afor
amuch
anumber
ascripts
asudo
imakes
isimilar
itheir
itheirs
iwithout

which I think make sense to not correct automatically.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iwithout and many of these don't make sense from a "word combining" perspective. Things we can correct in master will become uncorrectable with this change.

The typo dictionary words.csv previously contained
a bunch of problematic entries such as:

    abouta,about
    algorithmi,algorithm
    attachen,attach
    shouldbe,should
    anumber,number

Which resulted in wrong automatic corrections if the following
spaces (indicated by ␣) were accidentally missed:

    about␣a
    algorithm␣i developed
    attach␣en masse
    should␣be
    a␣number

Many of these entries were introduced by taking entries from the
codespell-dict and removing corrections containing spaces (since typos
currently doesn't support them), e.g the codespell dictionary contains:

    abouta->about a, about,
    shouldbe->should, should be,

This commit updates `tests/verify.rs` to automatically remove
corrections in the form of `{correction}{common_word},{correction}`
or `{common_word}{correction},{correction}`, where `{common_word}` is
one of the 1000 most frequent English words (except if `{correction}`
also ends/starts in `{common_word}`, since we still want to correct e.g.
"extrememe" to "extreme").

The top-1000-most-frequent-words.csv file was generated by running:

    curl https://norvig.com/ngrams/count_1w.txt \
      | head -n1024 \
      | awk '{print $1;}' \
      | grep -vE '^([^ia]|al|re)$' \
      > top-1000-most-frequent-words.csv
@not-my-profile not-my-profile marked this pull request as draft August 11, 2023 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants