Support plain text .dic dictionary files #931

nyurik · 2024-02-08T20:14:38Z

Many projects, like Chromium, use standard .dic files to list all "known" words, i.e. words that should NOT be corrected. Is it possible to add support for this? Or is this already supported? (I couldn't find it in the readme or via code search.)

A .dic file is a simple text file with one word per line. I don't recall how capitalization is handled (i.e. whether the match must be exact, or whether a lower-case entry in the .dic file also lets the upper-case spelling be ignored, but not the other way around).
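The second capitalization policy mentioned above (a lower-case entry also covers capitalized spellings, but not the reverse) could look like the sketch below. The name `is_known` is hypothetical, and nothing here is confirmed to match how any .dic consumer actually behaves.

```python
from typing import Set


def is_known(word: str, dictionary: Set[str]) -> bool:
    """Return True if word should be ignored under the asymmetric case rule."""
    if word in dictionary:
        return True  # an exact match always counts
    # A lower-case entry also covers capitalized or upper-case spellings;
    # a capitalized entry does not cover the lower-case spelling.
    return word.lower() in dictionary


known = {"tokio", "Chromium"}
print(is_known("Tokio", known))     # True: covered by the lower-case entry "tokio"
print(is_known("chromium", known))  # False: the entry is capitalized
```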

epage · 2024-02-08T20:50:09Z

A file of valid words is insufficient for typos because typos doesn't coerce code to blessed words; instead it uses a list of cursed words with blessed candidates.

nyurik · 2024-02-08T20:56:33Z

I'm not sure what that means, please elaborate

epage · 2024-02-08T22:05:45Z

See https://github.com/crate-ci/typos/blob/master/crates/typos-dict/assets/words.csv for the dictionary format we use at compile time.

nyurik · 2024-02-08T22:14:59Z

@epage thx, I understand about the conversion from "bad" to "good" words. What I don't understand is the workflow for the most typical use-case:

- A user sees some word incorrectly highlighted in their code, and clicks "add to dictionary"
- The dictionary is an allow-list of all words that will simply be ignored, rather than analyzed/corrected

As such, the .dic files seem to be a perfect fit.

epage · 2024-02-08T22:44:16Z

Ok, I misunderstood. You aren't asking for us to treat this as a collection of words to correct to but as a list of words we shouldn't attempt to correct. Is that right?

nyurik · 2024-02-08T22:45:12Z

> Ok, I misunderstood. You aren't asking for us to treat this as a collection of words to correct to but as a list of words we shouldn't attempt to correct. Is that right?

Exactly! Thanks :)

nyurik · 2024-02-08T22:47:09Z

P.S. And of course you may consider using these words to auto-correct INTO (e.g. if I have a custom foobar, and in my code I mistype it as fobar, you MAY want to autocorrect / suggest foobar as the "right" spelling)
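The suggestion idea above can be sketched with Python's standard `difflib`. This only illustrates matching a mistyped word against a custom word list; it is not how typos produces corrections, and the function name `suggest` and the 0.8 cutoff are assumptions.

```python
import difflib
from typing import Iterable, Optional


def suggest(word: str, known_words: Iterable[str]) -> Optional[str]:
    """Return the closest known word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word, list(known_words), n=1, cutoff=0.8)
    return matches[0] if matches else None


print(suggest("fobar", ["foobar", "tokio"]))  # foobar
```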

epage · 2024-02-09T16:38:26Z

Is there a spec for this format?

Can you link to examples of where open source projects use these files with descriptions of how they are used?

nyurik · 2024-02-09T16:42:36Z

I am not certain there is an official "spec"; like .csv, the format has some variants and is not perfectly standardized. For example, it seems UTF-8 is a relatively "recent" change to it, while many programs still treat those files as being in their language's own encoding (i.e. they use whatever encoding was common for the language of the dictionary). A quick search showed these:

- https://github.com/wooorm/dictionaries
- Chromium: https://www.chromium.org/developers/how-tos/editing-the-spell-checking-dictionaries/
- LibreOffice: https://github.com/LibreOffice/dictionaries
  - Related wiki page with lots of info: https://wiki.documentfoundation.org/Development/Dictionaries

nyurik · 2024-02-09T16:51:36Z

P.S. I think this is the best documentation page I found: https://proofingtoolgui.org/proofingtoolgui_files/ProofingToolGUI_manual_V30.html

epage · 2024-02-09T16:56:18Z

Looks like .dic files are not standalone but require a .aff file to interpret them to get derived forms of words (different suffixes, prefixes).

At this point, I'm going to step back and restart the conversation. Can you describe the problem being addressed (.dic files are a solution), what your proposed solution is, and ideally prior art for that solution?

nyurik · 2024-02-09T17:27:52Z

My understanding was that .aff is "optional" - i.e. initially (from the old Lotus Notes days(?)), a .dic was a simple list of words, one word per line. Later, LibreOffice/hunspell expanded that to support an optional <word>/<flag> notation. Those flags are for advanced usage, and may require additional .aff files. TBH, I had never even heard of .aff files until today - but I did see some .dic files stored in various projects a while back - as simple lists of words.
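A minimal sketch of reading such a file as a plain word list, keeping only the part before an optional `/flag` suffix and ignoring the flags entirely. The function name `read_plain_dic` is hypothetical; treating a purely numeric first line as a word-count header is an assumption based on how Hunspell-style dictionaries are commonly laid out, not something stated in this thread.

```python
from typing import Iterable, Set


def read_plain_dic(lines: Iterable[str]) -> Set[str]:
    """Collect the words from .dic-style lines, ignoring any /flags suffix."""
    words: Set[str] = set()
    for index, raw in enumerate(lines):
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines
        if index == 0 and entry.isdigit():
            continue  # assumption: a leading number is a word-count header
        # Keep only the part before an optional "/flags" suffix.
        word = entry.split("/", 1)[0].strip()
        if word:
            words.add(word)
    return words


sample = ["3", "tokio", "runtime/S", "", "foobar"]
print(sorted(read_plain_dic(sample)))  # ['foobar', 'runtime', 'tokio']
```

Dropping the flags means derived forms (such as plurals implied by a flag) are not recognized unless they are listed as separate lines.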

Now, to the main question of what I would like solved:

I would like to have a very easy, minimal, no-frills way to store a custom list of words per project. I have done many PRs for big FOSS projects doing spell checking - e.g. using IntelliJ's spellchecking tool to go through the code. As part of that process, I often have thousands (!!!) of words that are custom to each project, and I have to go through them one by one, "accepting" them into the dictionary. This is an extremely tedious and boring task, and I would much rather have a tool list all suspicious words into a plain text file, sort it, and quickly read through it to delete any words that are likely spelling mistakes. Whatever is left is my new "project dictionary" - a file I can check into the project. The dictionary file should not have any structure, because unstructured files are much easier to work with when they get fairly large -- no spaces or commas or quotes or escapes, no mandatory wrapping braces, easy to edit, easy to sort the whole file if needed, easy to diff between multiple files, easy to load into LibreOffice to do some multi-file meshes or lookups, etc.

P.S. A few times I even had to create this file manually out of the code by concatenating the needed code files, replacing all \s+ with \n, removing all [^a-zA-Z], and later converting this simple .dic-like file into a massively painful XML file that IntelliJ was using internally for its dictionary.
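The manual steps above can be sketched roughly as follows, assuming the concatenated source text is already in a string. The function name `candidate_words` is hypothetical, and the steps are followed literally: split on whitespace, then strip every non-letter character from each token.

```python
import re
from typing import List


def candidate_words(text: str) -> List[str]:
    """Return the sorted, de-duplicated letter-only tokens found in text."""
    # Splitting on whitespace corresponds to the "\s+ -> \n" step;
    # the substitution corresponds to removing all [^a-zA-Z].
    tokens = (re.sub(r"[^a-zA-Z]", "", token) for token in text.split())
    return sorted({token for token in tokens if token})


print("\n".join(candidate_words("let fobar = 42; // fobar")))
```

Because punctuation is removed rather than treated as a separator, a token such as `a::b` comes out glued together as `ab`; that is a consequence of following the described steps as written.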

epage · 2024-02-09T17:48:29Z

> Those flags are for advanced usage, and may require additional .aff files.

Looks like those are used by both your wooorm and LibreOffice links. This is an example of why I wanted to step back: to understand your request and how people use these files today, so I can tell whether you are asking us to support LibreOffice dic files or whether there is a commonly used subset. It also didn't help that when I searched on my own for the referenced Chromium dic file, I accidentally ended up in a dict file which had a different format.

> but I did see some .dic files stored in various projects a while back - as simple lists of words.

Would you be able to find those and link to them? I'd like to see how projects are using them in practice.

A part of all of this is that we have a way to define blessed words, so an important part of this is "why do we need something different". Prior art / meeting existing projects where they are at is important. This also helps guide discussions on auto-discovery vs specified paths in config, single or multiple files, etc.

> P.S. A few times I even had to create this file manually out of the code by concatenating the needed code files, replacing all \s+ with \n, removing all [^a-zA-Z], and later converting this simple .dic-like file into a massively painful XML file that IntelliJ was using internally for its dictionary.

I wonder if typos --words would help :)

Speaking of, I assume we would want to support specifying these for both words and identifiers.

nyurik · 2024-02-09T17:56:12Z

Tokio project :) https://github.com/tokio-rs/tokio/blob/0fbde0e94b06536917b6686e996856a33aeb29ee/spellcheck.dic

nyurik · 2024-02-09T17:57:46Z

(I found it with a simple github search https://github.com/search?q=path%3A*.dic&type=code )

epage · 2024-02-26T22:45:09Z

Looks like tokio is using cargo spellcheck, which seems aimed at supporting some of the more advanced features of .dic files; see https://github.com/drahnr/cargo-spellcheck/blob/master/docs/remedy.md#missing-word-variants

nyurik · 2024-02-26T23:18:22Z

Sure - advanced usages are always possible -- once the simple cases are solved. They mention /S to keep the dictionary small - a nice to have but not a big deal to add both cases - singular and plural - if needed.

ostr00000 · 2024-02-27T00:16:06Z

I can confirm that a good-enough solution is to provide a file with known words.

My use case: the code uses non-English "business" words. I already maintain a file with these valid words (it is in fact a .dic file). The singular and plural forms are not a problem (actually there are also dozens of grammatical cases), because I can include these words several times if needed (in various grammatical cases). Note that I do not use an .aff file at all.

The lack of this feature prevents me from using this tool in pre-commit checks in some of our projects. Generating the extend-words config field from the .dic file would probably also solve my problem, but this would require writing a custom script. The ability to include a simple "known words" file would be a much cleaner and more convenient solution.
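The custom script mentioned above might look roughly like this sketch, which renders a word list as a `[default.extend-words]` table mapping each word to itself. The function names are hypothetical. Keys are always quoted so that non-ASCII words remain valid TOML; whether typos accepts every such key unchanged (for example, mixed-case words) is not verified here.

```python
from typing import Iterable


def toml_quote(text: str) -> str:
    """Quote text as a TOML basic string, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def extend_words_toml(words: Iterable[str]) -> str:
    """Render words as a [default.extend-words] table of word = word pairs."""
    lines = ["[default.extend-words]"]
    for word in sorted(set(words)):
        quoted = toml_quote(word)
        lines.append(f"{quoted} = {quoted}")
    return "\n".join(lines) + "\n"


print(extend_words_toml(["tokio", "foobar", "tokio"]), end="")
```

The generated text would still have to be written into the configuration file on every change to the word list, which is the maintenance cost this comment is objecting to.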

epage · 2024-03-18T19:07:48Z

For us to say we are supporting a format and then only supporting a fraction of it feels like it would be setting invalid expectations for users.

I looked around and am not seeing other tools implement this. cspell only discusses it in passing in streetsidesoftware/cspell#4942

codespell makes no reference to a specific format but does have an "ignore file" with a line per word and a custom dictionary format

scspell uses a modified format with headers saying what the valid words apply to, e.g. their own dict

epage · 2024-03-18T19:08:10Z

With all of that said, the fact that we have native support for words makes resolving this a lower priority for me.

nyurik · 2024-03-18T19:29:29Z

@epage I understand your desire to have an "ideal" solution (nothing wrong with that :) ) - my point with this ticket is that, in my experience, the most common need is plain text .dic files of word lists, not the fancier functionality with a significantly higher barrier to entry. Please make it simple for the common use case, and then eventually other use cases might also be implemented.

epage · 2024-03-18T19:35:44Z

I'm not shooting for an ideal; I just don't want a lie.

ostr00000 · 2024-03-18T22:42:22Z

> With all of that said, the fact that we have native support for words makes resolving this a lower priority for me.

So the current workaround is to place the .dic content in the default.extend-words configuration (from the docs: "When the correction is the key, the word is always valid") - am I correct?

> "ignore file" with a line per word

Would it be possible to extend the configuration to accept a path to such a file? (I would like to avoid polluting my pyproject.toml with generated content.)

I think the format itself is not so important, and the solution in codespell is what I am looking for. If it were possible to use any file, that would be even better.
For example, I found that Firefox uses a .dat file for excluding custom valid words (persdict.dat).

nyurik · 2024-03-18T23:00:43Z

I agree, if you think .dic is too much of a promise, let's pick a different extension. Do note that I suspect most people are not even aware of the extra functionality beyond the simple word list -- I certainly was not before this discussion -- so I feel it would be more confusing to pick a new extension than to simply implement a subset of functionality, but whatever gets us going :)

ccoVeille · 2024-07-09T09:25:42Z

I'm also interested in a feature that lets me provide a list of words to ignore via a simple file (no matter the extension).

I would expect to be able to provide something like this via the .toml file:

[files]
extend-ignore = ["ignore1.txt",".github/ignored.bar"]

nyurik changed the title ~~Support simple plain text .dic dictionary files~~ Support plain text .dic dictionary files Feb 8, 2024

epage mentioned this issue Oct 21, 2024

Question: Define false positives in a separate file #1128

Closed

jvacek mentioned this issue Oct 22, 2024

Extend configuration from another file #1129

Open
