Necessary for structured content support
This allows a user to install the English version and another version without cluttering their setup with duplicated information. If a user doesn't want to use the English version, they can get the "search" and "forms" terms by installing the separate jmdict_forms file.
If a term has a frequency tag, it should return higher in search results than a match which does not have a tag. For example, a search for 素性 should return すじょう rather than そせい, because the former has a "news" frequency tag.
Sense numbers start at 1, not 0
If a headword appears in multiple entries, then each entry needs a corresponding "forms" term in the output dictionary. For example, 軽卒 is the only headword in entry 2275730, but 軽卒 also appears as an irregular form in entry 1252910. If a "forms" term is not included for the former entry, then it will appear that 軽卒 is irregular for all senses in the output dictionary.
This commit ensures that terms are grouped among their entries of origin and displayed in correct sequential order in Yomichan's default result grouping mode, "Group term-reading pairs."
お疲れ様です
I can see a lot of work has gone into this. Aside from some nitpicks, I think this looks like a righteous set of changes to me. Can you speak to any backwards compatibility issues here? I'm not seeing any as fields are being added to existing structures, but wanted to make sure that this is something we are thinking about.
Additional nit: can you please switch filenames like `jmdictConstants.go` from camelCase to snake_case?
Thanks for the feedback, @FooSoft . I've updated the PR in accordance with your current notes. As for backwards compatibility, do you mean with respect to Yomichan, the Yomichan-Import user interface, old versions of the JMdict file, or something else entirely? I've added one new format handler to the command line options ( The program will no longer produce the old style of JMdict dictionary files (the ones lacking all the supplemental info). Since my new version takes ~30 minutes or so to validate during the import process, it might not be wise to publish it for the general public until a solution is worked out. So if we want Yomichan-Import to continue to be able to produce the old style dictionaries, I could add some options to do so.
@Thermospore, I just hid it using custom CSS. I like to hide the dictionary tags and distinguish different dictionaries using differently colored backgrounds, like this:
.tag[data-category="dictionary"] {
display: none
}
.definition-item[data-dictionary="JMdict"] {
background-color: rgba(255,255,0,0.02);
}
.definition-item[data-dictionary="新明解"] {
background-color: rgba(0,0,255,0.1);
}
.definition-item[data-dictionary="広辞苑"] {
background-color: rgba(0,255,255,0.4);
}
etc.
Ah, that's cool, thanks for the reference.

I just tried importing the sans-sentence one (jmdict_english_2023_01_28.zip). Step 3 finished at 4m50s. Step 6 finished 3.5 min later, at 8m22s. I think my 広辞苑 takes about 6m for me. I'm using Chrome, Win 10, and an AMD Ryzen 9 5900X.
@stephenmk if it's simple enough to add a flag for the output that can be quickly validated, we should probably do that. Folks do use this tool to build jmdict out-of-band of whenever I "officially" update the dictionaries. The backwards compatibility question is mostly directed at how loading older versions of the dictionary is handled. There are weird unofficial versions of it floating around (for languages not officially supported); hopefully the new changes don't rely on this new data being present.
Require `-language=english_extra` to produce the complete version of the new JMdict dictionary file. If and when we determine that all the new features are ready to be included in the dictionary by default, we can remove this logic.
I updated my code to require a

Regarding weird unofficial versions of JMdict, as long as they can be parsed by your JMdict library, I think there shouldn't be much of a problem. The only issue I can think of would be with the headword frequency tags ("news1", "ichi2", "spec1", etc.) and info tags ("iK", "oK", "io", etc.). My version of Yomichan-Import searches for these tags specifically, renames them, and uses them to determine things like term ranking in search results. So if the weird unofficial version contained different tags, I think Yomichan-Import would still output an otherwise-complete dictionary file, but those custom tags would be missing.

That said, I'm very surprised to hear that such unofficial versions exist. The idea of the JMdict XML schema being used by anyone other than the EDRDG sounds ill-advised. The thought of maintaining unofficial support for such files also scares me a little. The JMdict format is a moving target with many changes planned in the future, so if the weird unofficial versions are still being developed and used, I think it would be better for the users to contact us so that we may provide a dedicated module for those formats rather than for us to continue providing unofficial support.

I share your concern and think that you make a good point, but there is more information that needs to be considered. Search-only tags in JMdict are generally for forms that would not have been allowed in the dictionary prior to August 2022. This means that these forms are almost always relatively rare and also irregular (i.e. not featured in other major dictionaries). You can read more about this in the JMdict editorial policy.

My new version should continue to serve as a relatively comprehensive index as it did before. However, there are some notable cases in which it won't. For example, JMdict has an entry for the expression 「そこに山があるから」. A different form of the expression 「其処に山があるから」 is used in Daijisen, but because of the usage of a rare kanji form (其処), this form of the expression was not viable for inclusion in JMdict until the search-only policy was implemented. So while JMdict should function well as an indexing dictionary 99% of the time without the search-only forms, there are indeed some situations in which the search-only forms would be useful to group by.

For now, that's a bit outside of the scope of the dictionary file. Just as you suggest, Yomichan itself would need to be modified to hide the search-only terms in the "Group related terms" mode.
No, it should be a fairly rare occurrence. I can see how it would be a hassle if you're using Yomichan's clipboard monitor and aren't actively focused on the browser window, but otherwise it really is just one click to redirect to the standard form of the word. You also need to consider Yomichan's default "Group term-reading pairs" mode, which only displays one headword at a time. In that situation, a user wouldn't be very interested in the search-only form and would want to be redirected to a standard form of the word. The "Group related" mode removes the need for this redirect, but only in that mode.
I think the right approach is to display this information in a table, as I've done in the new dictionary. Presenting three or more forms in a flat list with various different readings and subtle differences in kanji is invariably going to be difficult to comprehend regardless of the addition of extra symbols and colors.
Custom dictionary files using the JMdict XML format may contain nonstandard frequency and information tags.
I noticed this comment in issue #30:
Having read that, I can see why someone would convert a dictionary into the JMdict XML format. I updated my code so that undocumented frequency and information tags will be included in the output dictionary files. With that completed, I don't believe there are any backwards compatibility issues with this new version. (The current production version of Yomichan-Import actually has a frequency tag whitelist, but I see no reason why unknown tags should be discarded if they're found.)
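To illustrate the pass-through behavior described above, here is a minimal sketch. It is not the code from this PR: the `knownTagNames` map, its display names, and the `outputTags` function are hypothetical and exist only to show the difference between a whitelist and a pass-through.

```go
package main

import "fmt"

// knownTagNames maps a few JMdict tag codes mentioned in this thread to
// display names. The display names are hypothetical placeholders, not the
// names actually used by Yomichan-Import.
var knownTagNames = map[string]string{
	"news1": "news-top",
	"ichi2": "ichi-second",
	"iK":    "irregular-kanji",
	"oK":    "outdated-kanji",
}

// outputTags renames recognized tags and passes unrecognized tags through
// unchanged instead of discarding them (that is, no whitelist).
func outputTags(sourceTags []string) []string {
	out := make([]string, 0, len(sourceTags))
	for _, tag := range sourceTags {
		if name, ok := knownTagNames[tag]; ok {
			out = append(out, name)
		} else {
			out = append(out, tag)
		}
	}
	return out
}

func main() {
	// "customFreq" stands in for a nonstandard tag from a custom file.
	fmt.Println(outputTags([]string{"news1", "customFreq", "iK"}))
}
```

With a whitelist, `customFreq` would be dropped; here it survives into the output untouched.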
I just noticed that non-English versions of the new JMdict dictionaries do not have part-of-speech tags, unlike the old versions. Just wanted to make a note about what's happening here. The prior version of the program would loop through all of the senses in an entry and keep a copy of the last set of part-of-speech tags that it found, even if those tags were from a different language within that entry. If it found a sense without part-of-speech tags, it would assign those previous tags to it. Lines 121 to 127 in 9222417
Strictly speaking this isn't correct, although it might produce correct information some or even most of the time. If both the English and Russian versions of an entry only have one sense each, then the part-of-speech info is most likely the same. All bets are off outside of that special case, though, so I think it's best to stick with the new behavior. However, Yomichan won't be able to deinflect verbs and such if this information is missing. I'll need to modify the program a bit to ensure that at least the appropriate grammar rules are added to these terms. Edit: I thought about this some more and changed my mind a bit. I've written about this more here: #41
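For readers who have not looked at the referenced lines, the carry-over behavior of the prior version can be sketched roughly as follows. This is a simplified reconstruction from the description above, not the original code; the `Sense` type and `carryOverPOS` function are hypothetical names.

```go
package main

import "fmt"

// Sense is a simplified, hypothetical stand-in for one sense of an entry.
type Sense struct {
	Language     string
	PartOfSpeech []string
}

// carryOverPOS mimics the old behavior: a sense without part-of-speech
// tags inherits the most recently seen tags, even when those tags came
// from a sense in a different language within the same entry.
func carryOverPOS(senses []Sense) []Sense {
	out := make([]Sense, len(senses))
	var last []string
	for i, s := range senses {
		if len(s.PartOfSpeech) > 0 {
			last = s.PartOfSpeech
		} else {
			s.PartOfSpeech = last
		}
		out[i] = s
	}
	return out
}

func main() {
	senses := []Sense{
		{Language: "eng", PartOfSpeech: []string{"n"}},
		{Language: "eng", PartOfSpeech: []string{"adv"}},
		{Language: "rus"}, // no tags of its own
	}
	for _, s := range carryOverPOS(senses) {
		fmt.Println(s.Language, s.PartOfSpeech)
	}
}
```

In this made-up entry the Russian sense ends up tagged `adv` only because that was the last English tag seen, which is the kind of guess that is right for single-sense entries and unreliable otherwise.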
This dictionary is amazing! I'm just having one little problem: my pop-up is not showing some emojis, like the country flags and the info emoji. I noticed some other people with the same problem, and we are all using Win 10 and a Chromium-based browser, so I think the cause is one of those two (or both). If there's a way I could change to a different emoji, I'd appreciate it.
@UMNV Thanks, I'm glad to hear that people are liking the new dictionary. I did a search for more info about this font issue and found this blog post: https://nolanlawson.com/2022/04/08/the-struggle-of-using-native-emoji-on-the-web/
It sounds like this is an issue specifically with Chromium-based browsers on Windows. It seems Firefox doesn't have this problem because it ships with Twitter's emoji font by default. I don't have a PC with Windows installed that I could use to help you troubleshoot the problem, unfortunately. I see that there is an extension for Chrome which adds the Twitter emoji font and claims to fix the flag issue. I can't test it myself, but maybe you could try it. https://chrome.google.com/webstore/detail/twemoji-for-chrome/fopgafjdjlongoeblobbafbnapafcicg?hl=en
@stephenmk but thanks for your help and the amazing work you're doing.
@UMNV, sorry to hear it didn't work. I imagine a large majority of Yomichan users are on Windows and Chromium-based browsers, so it would be nice if we had a better solution. It might be smart to include some default embedded fonts in the Yomichan extension itself, but implementing that kind of functionality is outside my expertise at the moment. Rather than editing the JSON files, you can set the icons using some custom CSS in your Yomichan settings. Here are the default values:
ul[data-sc-content="glossary"] {
list-style-type: circle !important;
}
ul[data-sc-content="infoGlossary"] {
list-style-type: "ℹ️ " !important;
}
ul[data-sc-content="sourceLanguages"] {
list-style-type: "🌐 " !important;
}
ul[data-sc-content="notes"] {
list-style-type: "📝 " !important;
}
ul[data-sc-content="antonyms"] {
list-style-type: "🔄 " !important;
}
ul[data-sc-content="references"] {
list-style-type: "➡️ " !important;
}
ul[data-sc-content="examples"] {
list-style-type: "🇯🇵 " !important;
}
ul[data-sc-content="examples"] > li[lang="en"] {
list-style-type: "🇬🇧 " !important;
}
The current version of JMdict for Yomichan is missing important supplemental information provided in the original JMdict file. This pull request is to update Yomichan-Import to process this information and include it in the output dictionary files.
Example Sentences
Since 2021, an English version of the JMdict file featuring example sentences from the Tanaka Corpus has been published daily. Only priority-tagged sentences are included, so the number of examples per word is not overwhelming; typically only one example is included per sense at most.
Following the style used on tatoeba.org, I have marked the English sentences with a UK flag emoji 🇬🇧 and Japanese sentences with a Japanese flag emoji 🇯🇵.
Example: 健康
Example: 現在 (includes two sentences on one sense, which is uncommon)
My updated version of Yomichan-Import can produce Yomichan dictionaries with or without these example sentences depending on which file is input (`JMdict_e_examp` or the regular `JMdict` file).
Sense Notes
These notes provide extra context on how words are used ("now mostly used in idioms", "also written as 訓む", etc.). I've updated Yomichan-Import to include this information and marked these notes with a notepad emoji 📝.
Example: に付けて
Related issue: yomichan #1165
Gloss Types
Some glosses (aka definitions) contain special type information. The current version of JMdict for Yomichan includes these glosses but does not indicate their types, which can cause some glosses to appear nonsensical.
I've marked these special glosses with an info emoji ℹ️ and prefixed the glosses with their types in italic font ("literally," "figuratively," or "trademark"). There are also "explanatory" gloss types, but I don't think those need an italic prefix.
Example: 上方絵 (explanatory)
Example: 猫の手も借りたい (literal)
Related issue: yomichan #2057
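As a rough illustration of the prefixing rule described in this section, here is a small sketch. The type codes (`lit`, `fig`, `tm`, `expl`) are assumed to correspond to the gloss type values in the JMdict XML, and the function names and the `": "` separator are hypothetical; the actual output in this PR is structured content with italic styling rather than a plain string.

```go
package main

import "fmt"

// glossPrefix returns the prefix text for a gloss type. The codes are
// assumed to match the gloss type values found in the JMdict XML.
func glossPrefix(glossType string) string {
	switch glossType {
	case "lit":
		return "literally"
	case "fig":
		return "figuratively"
	case "tm":
		return "trademark"
	default:
		// Explanatory ("expl") and untyped glosses get no prefix.
		return ""
	}
}

// formatGloss prepends the prefix, if any, to the gloss text. The real
// dictionary renders the prefix in italics; a plain string is used here.
func formatGloss(glossType, text string) string {
	if prefix := glossPrefix(glossType); prefix != "" {
		return prefix + ": " + text
	}
	return text
}

func main() {
	fmt.Println(formatGloss("lit", "placeholder gloss"))
	fmt.Println(formatGloss("expl", "placeholder gloss"))
}
```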
Source Languages
Entries for modern Japanese loanwords (外来語) contain language-of-origin information. I've updated Yomichan-Import to include this information and marked the notes with a globe emoji 🌐.
Example: ミサンガ
Example: スキンシップ (wasei)
Example: マッチポンプ (multiple languages of origin)
References
Some JMdict entries contain cross-references to other entries. I've updated Yomichan-Import to include these references and marked them with an arrow emoji ➡️.
I've formatted the referenced expressions as query links, so a user may jump to that entry by clicking on the link. The notes are also presented with a compact glossary of the referenced expression sense. For situations in which the reading of the referenced expression is ambiguous, I've also included the designated reading in parentheses.
Example: 舌の根
Example: 猛暑日 (includes two references)
Example: 脅す (references a word with an ambiguous reading)
Each reference in JMdict points to a specific numbered sense of an entry rather than the entire entry itself. However, the current version of JMdict for Yomichan does not indicate these sense numbers. For the sake of clarity, I have added numbered tags to entries containing multiple senses in order to indicate the original sense numbers.
Example: 故障 (four senses)
Antonyms
Antonyms are functionally identical to cross-references. I've marked these with a "counterclockwise arrows" emoji 🔄.
Example: 良くないね
Other Forms
JMdict's structure assumes that the reader will be able to view the various forms of an expression alongside the term glossaries. For example, the entry for もと【元・本・素・基】 includes notes specifying that sense 1 is usually written 元, sense 2 is usually written 本, sense 4 is usually 素, etc. These alternative forms are not displayed in Yomichan's default result-grouping mode, and the new inclusion of sense notes without this form information could cause confusion. Aside from that, the ability to view these alternative forms is likely to be of interest to users in general.
For entries with more than one form, I have added an extra "forms" term in the Yomichan dictionary file which contains these forms in a regular list structure.
Example: ことば典
For entries with more than one distinct reading, I have arranged the forms into a table.
Example: 魚虱
Example: 素性
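The choice between a list and a table described above comes down to counting distinct readings. Here is a minimal sketch of that decision; the `Form` type and `formsLayout` function are hypothetical, and the sample data only uses forms and readings that are written out elsewhere in this thread, so the real entries may contain more.

```go
package main

import "fmt"

// Form pairs one written form of an entry with one of its readings.
// This is a simplified, hypothetical structure for illustration.
type Form struct {
	Written string
	Reading string
}

// formsLayout decides how the extra "forms" term would be presented:
// "none" for a single form, "list" for several forms sharing one reading,
// and "table" when the entry has more than one distinct reading.
func formsLayout(forms []Form) string {
	if len(forms) <= 1 {
		return "none"
	}
	readings := make(map[string]struct{})
	for _, f := range forms {
		readings[f.Reading] = struct{}{}
	}
	if len(readings) > 1 {
		return "table"
	}
	return "list"
}

func main() {
	moto := []Form{{"元", "もと"}, {"本", "もと"}, {"素", "もと"}, {"基", "もと"}}
	sujou := []Form{{"素性", "すじょう"}, {"素性", "そせい"}}
	fmt.Println(formsLayout(moto))  // one reading, several forms
	fmt.Println(formsLayout(sujou)) // two distinct readings
}
```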
Both the list and table formats contain symbols to represent various meta information about the different forms.
In tables, the ㊒ symbol is used to indicate valid forms without any special meta information. Also, gikun (義訓) readings and ateji (当て字) kanji are presented in angle brackets 〈 〉, which is a convention used by some Japanese dictionaries to denote jukujikun (熟字訓) terms.
I hope that my symbol choices are mostly intuitive, but I'm open to suggestions for improvements. I have also updated the term tags to align with these symbols, so users will be able to hover over them to see detailed explanations.
Example: ふいんき (with alt. text on the ⚠ term tag)
Yomichan includes a "related terms" grouping mode which can be used to display much of this information, but for complicated entries the information can be difficult for users to parse.
Example: 素性 in "Group related terms" mode
I've also designed this version of Yomichan-Import to produce a standalone dictionary which only contains these forms lists and tables, so users can access this information even if they don't want to use the rest of JMdict.
Related issue: yomichan #2183
Search-Only Terms
Since August 2022, JMdict has included "search-only" terms which are meant to aid term look-ups without cluttering entries with rare, non-standard spellings of words. I've updated Yomichan-Import to produce terms which display links to the standard forms of these terms.
Example: 登り旗 (redirects to のぼり旗)
Example: のぼり旗 in "Group related terms" mode (登り旗 is not displayed)
Example: 鉤なり (redirects to a word with an ambiguous reading)
Other Improvements
Frequency tags and term ranking
I've updated the names and descriptions of JMdict frequency tags to better reflect their meanings. The ranking method for determining the search result display order has also been adjusted. I wrote about this in the "Term prioritization" section of this comment.
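One rule stated in the commit messages of this PR is that a term with a frequency tag should rank above a match without one. The sketch below shows only that single rule using a stable sort. It is an assumption-laden simplification: the `Term` type and `rankTerms` function are hypothetical, and the full ranking method in the linked comment takes more into account.

```go
package main

import (
	"fmt"
	"sort"
)

// Term is a simplified, hypothetical search result.
type Term struct {
	Expression    string
	Reading       string
	FrequencyTags []string
}

// rankTerms moves terms with at least one frequency tag ahead of terms
// without any, keeping the original relative order otherwise.
func rankTerms(terms []Term) {
	sort.SliceStable(terms, func(i, j int) bool {
		return len(terms[i].FrequencyTags) > 0 && len(terms[j].FrequencyTags) == 0
	})
}

func main() {
	terms := []Term{
		{Expression: "素性", Reading: "そせい"},
		{Expression: "素性", Reading: "すじょう", FrequencyTags: []string{"news"}},
	}
	rankTerms(terms)
	for _, t := range terms {
		fmt.Println(t.Expression, t.Reading, t.FrequencyTags)
	}
}
```

After sorting, すじょう comes first because it carries the "news" tag, matching the 素性 example given in the commit message.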
Rarely-used kanji forms
I've updated the program to produce additional kana-only headwords for kana forms which are only associated with rare, irregular, or outdated kanji forms. So for example, a user who scans "それ" will now see "それ" as the headword of the top result rather than "其れ". I wrote about this in the "Rarely-used Kanji Forms" section of this comment.
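The selection rule for these extra kana-only headwords can be sketched as below. This is an illustration rather than the PR's implementation: the `KanjiForm` type and function names are hypothetical, the set of info codes treated as "rare" (`rK`, `iK`, `oK`) is an assumption, and so is the `rK` tag on 其れ in the sample data.

```go
package main

import "fmt"

// KanjiForm is a simplified, hypothetical kanji form of an entry along
// with its info tags and the readings it is associated with.
type KanjiForm struct {
	Text     string
	Info     []string
	Readings []string
}

// rareInfo lists the info codes assumed here to mark rare, irregular, or
// outdated kanji forms.
var rareInfo = map[string]bool{"rK": true, "iK": true, "oK": true}

func isRare(f KanjiForm) bool {
	for _, tag := range f.Info {
		if rareInfo[tag] {
			return true
		}
	}
	return false
}

// kanaOnlyHeadwords returns the readings that are associated only with
// rare kanji forms, in first-seen order. These are the readings that
// would get an additional kana-only headword.
func kanaOnlyHeadwords(forms []KanjiForm) []string {
	hasStandard := make(map[string]bool)
	seen := make(map[string]bool)
	var order []string
	for _, f := range forms {
		rare := isRare(f)
		for _, r := range f.Readings {
			if !seen[r] {
				seen[r] = true
				order = append(order, r)
			}
			if !rare {
				hasStandard[r] = true
			}
		}
	}
	var out []string
	for _, r := range order {
		if !hasStandard[r] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	sore := []KanjiForm{{Text: "其れ", Info: []string{"rK"}, Readings: []string{"それ"}}}
	fmt.Println(kanaOnlyHeadwords(sore))
}
```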
Problems and Considerations
Yomichan validation
This new version of JMdict for Yomichan makes extensive use of nested data structures, so the validation step of the import process is very slow with the current version of Yomichan. On my PC (10-year-old Intel i7-3770K), the file takes 32 minutes to validate.
Related issue: yomichan #2138
Merging of terms from separate entries
Yomichan's default result-grouping mode merges terms from different JMdict entries if the terms share the same reading and expression. For example, the top search result for 元 will be a combination of the entries for もと【元・本・素・基】 (sequence 1260670) and もと【元・旧・故】 (sequence 2219590).
Example: 元 (the first 9 senses are for a priority ⭐ term, but the final 3 are not. Note also that sense #9 is hidden because it only applies to the 本 form of the word.)
Example: 軽卒 (the final sense is for an irregular ⚠️ usage of the kanji form, but the first 2 are not)
The term tags that appear next to the headword at the top of the Yomichan entry may not apply to every sense in the search results. Users can check the "forms" terms to determine which term tags apply to which senses, but this setup could cause confusion.
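To make the merging behavior concrete, here is a small sketch of grouping by expression and reading. It only models the grouping key and is not Yomichan's actual code (Yomichan itself is written in JavaScript); the `Term` and `pairKey` types and the function name are hypothetical. The sequence numbers are the ones quoted above.

```go
package main

import "fmt"

// Term is a simplified, hypothetical dictionary term with the JMdict
// sequence number of the entry it came from.
type Term struct {
	Expression string
	Reading    string
	Sequence   int
}

// pairKey is the grouping key: terms with the same expression and the
// same reading end up in one group regardless of their entry of origin.
type pairKey struct {
	Expression string
	Reading    string
}

// groupByTermReading collects the source sequence numbers for every
// expression-reading pair, in input order.
func groupByTermReading(terms []Term) map[pairKey][]int {
	groups := make(map[pairKey][]int)
	for _, t := range terms {
		k := pairKey{t.Expression, t.Reading}
		groups[k] = append(groups[k], t.Sequence)
	}
	return groups
}

func main() {
	terms := []Term{
		{"元", "もと", 1260670},
		{"本", "もと", 1260670},
		{"元", "もと", 2219590},
	}
	groups := groupByTermReading(terms)
	fmt.Println(groups[pairKey{"元", "もと"}])
	fmt.Println(groups[pairKey{"本", "もと"}])
}
```

The 元/もと group contains both sequences, which is why a headword-level tag from one entry can appear next to senses that belong to the other.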
Test Dictionary Builds
(Updated 01/29/2023)
jmdict_english_extra_with_examples_2023_01_29.zip
This is the complete new version of JMdict (English) for Yomichan with all of the new features described above.
jmdict_english_extra_2023_01_29.zip
This is the same as the above version, except without the Tanaka Corpus example sentences. I wouldn't recommend this version, personally.
jmdict_english_2023_01_29.zip
This is a "legacy" version of the dictionary which is similar to the currently published version. It does not contain the supplemental information in glossaries or the "forms" terms, but it does contain the search-only redirect terms and the new style of term tags (e.g. ⚠ tags instead of "iK" tags). This version validates very quickly during the import process. Once a solution is worked out for the validation problem, I think we can remove support for this version.
jmdict_forms_2023_01_29.zip
Contains only "forms" terms and search-only redirect terms. This is for users who don't want to use the English version of JMdict.
jmdict_german_2023_01_29.zip
This is a German version of JMdict produced by the new version of Yomichan-Import. The JMdict source file only contains the supplemental information described above (sense notes, cross references, etc.) for English language entries. Therefore the Yomichan versions for other languages remain largely the same as the old versions of the dictionaries.
I've designed the new program to build non-English dictionary files without "forms" terms or the search-only terms. This allows a user to install both the English dictionary and a dictionary for another language without cluttering their installation with duplicated terms. If a user does not want to install the English dictionary, they can acquire the "forms" terms and search-only terms by installing the standalone "jmdict_forms" dictionary file.
Since this pull request summary is already very long, I've tried to avoid going into too much extra detail. I'd be happy to elaborate on any of these topics if there are any questions.
Thanks for taking the time to review this request.