Support plain text .dic dictionary files #931

Open
nyurik opened this issue Feb 8, 2024 · 25 comments

@nyurik

nyurik commented Feb 8, 2024

Many projects like Chromium use standard .dic files to list all "known" words, i.e. those words that should NOT be corrected. Is it possible to add support for this? Or is this already supported? I couldn't find it in the README or via code search.

A .dic file is a simple text file with one word per line. I don't recall how capitalization is specified, i.e. whether the match must be exact, or whether a lower-cased word in the .dic file also lets its upper-case form be ignored, but not the other way around.

@nyurik nyurik changed the title from "Support simple plain text .dic dictionary files" to "Support plain text .dic dictionary files" Feb 8, 2024
@epage
Collaborator

epage commented Feb 8, 2024

A file of valid words is insufficient for typos because typos doesn't coerce code toward a set of blessed words; instead, it works from a list of cursed words with blessed candidates.

@nyurik
Author

nyurik commented Feb 8, 2024

I'm not sure what that means, please elaborate.

@epage
Collaborator

epage commented Feb 8, 2024

See https://github.com/crate-ci/typos/blob/master/crates/typos-dict/assets/words.csv for our dictionary format we use at compile time.

@nyurik
Author

nyurik commented Feb 8, 2024

@epage thx, I understand about the conversion from "bad" to "good" words. What I don't understand is the workflow for the most typical use-case:

  • A user sees some word incorrectly highlighted in their code, and clicks "add to dictionary"
  • The dictionary is an allow-list of all words that will simply be ignored, rather than analyzed/corrected

As such, the .dic files seem to be a perfect fit.

@epage
Collaborator

epage commented Feb 8, 2024

Ok, I misunderstood. You aren't asking for us to treat this as a collection of words to correct to but as a list of words we shouldn't attempt to correct. Is that right?

@nyurik
Author

nyurik commented Feb 8, 2024

> Ok, I misunderstood. You aren't asking for us to treat this as a collection of words to correct to but as a list of words we shouldn't attempt to correct. Is that right?

Exactly! Thanks :)

@nyurik
Author

nyurik commented Feb 8, 2024

P.S. And of course you may consider using these words to auto-correct INTO (e.g. if I have a custom foobar, and in my code I mistype it as fobar, you MAY want to autocorrect / suggest foobar as the "right" spelling)
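
The "auto-correct INTO" idea can be sketched with Python's standard `difflib`. This is only an illustration of the concept, not how typos works; the function name `suggest` and the `cutoff` value are assumptions made for the example.

```python
import difflib


def suggest(word, known_words, cutoff=0.8):
    """Return the closest known word to `word`, or None if nothing is similar enough."""
    if word in known_words:
        return None  # already an accepted word, nothing to suggest
    # get_close_matches ranks candidates by similarity ratio and drops those below cutoff
    matches = difflib.get_close_matches(word, known_words, n=1, cutoff=cutoff)
    return matches[0] if matches else None


known = ["foobar", "typos", "dictionary"]
print(suggest("fobar", known))  # prints: foobar
```

A high cutoff keeps suggestions conservative, which matters when the word list is project-specific and short.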

@epage
Collaborator

epage commented Feb 9, 2024

Is there a spec for this format?

Can you link to examples of where open source projects use these files with descriptions of how they are used?

@nyurik
Author

nyurik commented Feb 9, 2024

I am not certain there is an official "spec" similar to .csv (some variants, not perfectly standardized) -- i.e. it seems UTF-8 is a relatively "recent" change to it, while many programs still treat those files as being in their language's own encoding (i.e. whatever encoding was commonly used for the language of the dictionary). A quick search showed these:

@nyurik
Author

nyurik commented Feb 9, 2024

P.S. I think this is the best documentation page I found: https://proofingtoolgui.org/proofingtoolgui_files/ProofingToolGUI_manual_V30.html

@epage
Collaborator

epage commented Feb 9, 2024

Looks like .dic files are not standalone but require a .aff file to interpret them to get derived forms of words (different suffixes, prefixes).

At this point, I'm going to step back and restart the conversation. Can you describe the problem being addressed (.dic files are a solution), what your proposed solution is, and ideally prior art for that solution?

@nyurik
Author

nyurik commented Feb 9, 2024

My understanding was that .aff is "optional" - i.e. initially (from the old Lotus Notes days(?)), a .dic was a simple list of words, one word per line. Later, LibreOffice/hunspell expanded that to support optional <word>/<flag> notation. Those flags are for advanced usage, and may require additional .aff files. TBH, I never even heard of the .aff files until today - but I did see some .dic files stored in various projects a while back - as simple lists of words.
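
A minimal sketch of reading the "simple list of words" subset described above, in Python. The function name `parse_dic` is hypothetical, and the handling rules are assumptions rather than a spec: any `/<flag>` suffix is dropped without consulting an .aff file, and a first line made only of digits is skipped because Hunspell dictionaries start with an entry count.

```python
def parse_dic(text):
    """Parse .dic-style text into a set of words.

    Assumptions (not taken from a spec): one entry per line, an optional
    "/FLAGS" suffix is dropped, blank lines are skipped, and a first line
    made only of digits is treated as a Hunspell-style entry count.
    """
    words = set()
    for index, line in enumerate(text.splitlines()):
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if index == 0 and entry.isdigit():
            continue  # leading entry count, not a word
        word = entry.split("/", 1)[0].strip()  # keep only the part before any flags
        if word:
            words.add(word)
    return words


sample = "3\nfoobar\ntokio/S\n\nnyurik\n"
print(sorted(parse_dic(sample)))  # prints: ['foobar', 'nyurik', 'tokio']
```

Note that `tokio/S` is stored as `tokio` only; forms that a real .aff file would derive from the flag are not generated here.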

Now, to the main question of what I would like solved:

I would like to have a very easy, minimal, no-frills way to store a custom list of words per project. I have done many PRs for big FOSS projects doing spell checking, e.g. using IntelliJ's spellchecking tool to go through the code. As part of that process, I often have thousands (!!!) of words that are custom to each project, and I have to go through them one by one, "accepting" them into the dictionary. This is an extremely tedious and boring task. I would much rather have a tool that lists all suspicious words in a plain text file, so that I can sort it and quickly read through it to delete any words that are likely spelling mistakes. Whatever is left is my new "project dictionary", a file I can check into the project. The dictionary file should not have any structure, because unstructured files are much easier to work with when they get fairly large: no spaces or commas or quotes or escapes, no mandatory wrapping braces, easy to edit, easy to sort the whole file if needed, easy to diff between multiple files, easy to load into LibreOffice to do some multi-file meshes or lookups, etc.

P.S. A few times I even had to create this file manually out of the code by concatenating the needed code files, replacing all \s+ with \n, removing all [^a-zA-Z], and later converting this simple .dic-like file into a massively painful XML file that IntelliJ was using internally for its dictionary.
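
The manual steps in the P.S. (concatenate, split on whitespace, strip non-letters, sort) can be sketched in a few lines of Python. The function name `extract_words` is hypothetical; the two regular expressions are the ones quoted above.

```python
import re


def extract_words(sources):
    """Collect a sorted, de-duplicated candidate word list from source strings."""
    text = "\n".join(sources)  # concatenate the code files
    tokens = re.split(r"\s+", text)  # same effect as replacing \s+ with \n
    cleaned = (re.sub(r"[^a-zA-Z]", "", token) for token in tokens)  # drop non-letters
    return sorted({word for word in cleaned if word})  # skip tokens that became empty


files = ["let foo_bar = 42;", "fn main() {}"]
print(extract_words(files))  # prints: ['fn', 'foobar', 'let', 'main']
```

Because every non-letter is removed, `foo_bar` collapses into `foobar`; splitting identifiers on such characters instead would need a different rule.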

@epage
Collaborator

epage commented Feb 9, 2024

> Those flags are for advanced usage, and may require additional .aff files.

Looks like those are used by both your wooorm and LibreOffice links. This is an example of why I wanted to step back: I want to understand your request and how people are using these files today, so I can tell whether you are asking us to support LibreOffice .dic files or whether there is a commonly used subset. It also didn't help that when I searched on my own for the referenced Chromium .dic file, I accidentally ended up in a dict file which had a different format.

> but I did see some .dic files stored in various projects a while back - as simple lists of words.

Would you be able to find those and link to them? I'd like to see how projects are using them in practice.

A part of all of this is that we have a way to define blessed words, so an important part of this is "why do we need something different". Prior art / meeting existing projects where they are at is important. This also helps guide discussions on auto-discovery vs specified paths in config, single or multiple files, etc.

> P.S. A few times I even had to create this file manually out of the code by concatenating the needed code files, replacing all \s+ with \n, removing all [^a-zA-Z], and later converting this simple .dic-like file into a massively painful XML file that IntelliJ was using internally for its dictionary.

I wonder if typos --words would help :)

Speaking of, I assume we would want to support specifying these for both words and identifiers.

@nyurik
Author

nyurik commented Feb 9, 2024

@nyurik
Author

nyurik commented Feb 9, 2024

(I found it with a simple github search https://github.com/search?q=path%3A*.dic&type=code )

@epage
Collaborator

epage commented Feb 26, 2024

Looks like tokio is using cargo spellcheck which seems aimed to support some of the more advanced features of .dic files, see https://github.com/drahnr/cargo-spellcheck/blob/master/docs/remedy.md#missing-word-variants

@nyurik
Author

nyurik commented Feb 26, 2024

Sure - advanced usages are always possible -- once the simple cases are solved. They mention /S to keep the dictionary small - a nice to have but not a big deal to add both cases - singular and plural - if needed.

@ostr00000

I can confirm that the good enough solution is to provide a file with known words.

My use case: the code uses non-English "business" words. I already maintain a file with these valid words (it is in fact a .dic file). The singular and plural forms are not a problem (actually there are also dozens of grammar cases), because I can include these words several times if needed (in various grammar cases). Note that I do not use an .aff file at all.

The lack of this feature prevents me from using this tool in pre-commit checks in some of our projects. Generating the extend-words config field from a .dic file would probably also solve my problem, but that would require writing a custom script. The ability to include a simple "known words" file would be a much cleaner and more convenient solution.
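
For reference, the custom script mentioned above could be as small as the following Python sketch. It relies on the documented rule quoted later in this thread (a word whose correction is itself is always valid) and on the `default.extend-words` field named there. The function name `extend_words_toml` is hypothetical, and quoting keys with `json.dumps` assumes the words contain nothing beyond what a TOML basic string accepts.

```python
import json


def extend_words_toml(words):
    """Render a [default.extend-words] table that maps every word to itself."""
    lines = ["[default.extend-words]"]
    for word in sorted(set(words)):
        # Quote the key so that non-ASCII words remain valid TOML keys
        key = json.dumps(word, ensure_ascii=False)
        lines.append(f"{key} = {key}")
    return "\n".join(lines) + "\n"


print(extend_words_toml(["tokio", "foobar", "tokio"]), end="")
```

The output would still have to be pasted into, or merged with, the project's config file, which is the "generated content" concern raised in this thread.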

@epage
Collaborator

epage commented Mar 18, 2024

For us to say we are supporting a format and then only supporting a fraction of it feels like it would be setting invalid expectations for users.

I looked around and am not seeing other tools implement this. cspell only discusses it in passing in streetsidesoftware/cspell#4942.

codespell makes no reference to a specific format but does have an "ignore file" with a line per word and a custom dictionary format.

scspell uses a modified format with headers for saying what the "valid words" apply to, e.g. their own dict.

@epage
Copy link
Collaborator

epage commented Mar 18, 2024

With all of that said, the fact that we have native support for words makes this a lower priority for me to resolve.

@nyurik
Author

nyurik commented Mar 18, 2024

@epage I understand your desire to have an "ideal" solution (nothing wrong with that :) ) - my point with this ticket is that, in my experience, the most common need is plain text .dic files of word lists, not the fancier functionality with a significantly higher barrier to entry. Please make it simple for the common use case, and then eventually other use cases might also be implemented.

@epage
Collaborator

epage commented Mar 18, 2024

I'm not shooting for an ideal; I just don't want a lie.

@ostr00000

> With all of that said, the fact that we have native support for words makes this a lower priority for me to resolve.

So the current workaround is to place the .dic content in the default.extend-words configuration (from the docs: "When the correction is the key, the word is always valid"). Am I correct?

> "ignore file" with a line per word

Would it be possible to extend the configuration to accept a path to such a file? (I would rather not pollute my pyproject.toml with generated content.)

I think the format itself is not so important, and the solution in codespell is what I am looking for. If it were possible to use any file, that would be even better.
For example, I found that Firefox uses a .dat file for excluding custom valid words (persdict.dat):

@nyurik
Author

nyurik commented Mar 18, 2024

I agree, if you think .dic is too much of a promise, let's pick a different extension. Do note that I suspect most people are not even aware of the extra functionality beyond the simple word list -- I certainly was not before this discussion -- so I feel it would be more confusing to pick a new extension than to simply implement a subset of functionality, but whatever gets us going :)

@ccoVeille

I'm also interested in being able to provide a list of words to ignore via a simple file (no matter the extension).

I would expect to be able to provide something like this via the .toml file:

[files]
extend-ignore = ["ignore1.txt",".github/ignored.bar"]
