[Feature Request] add support for JMdict note, see, and ant #1165
Comments
Looks like they store the notes as |
@toasted-nutbread I am going to make a pass on yomichan-import to clean up some of the dictionary processing (especially for EPWING), let me know if you have a list of format annoyances to take care of while I am at it. |
@FooSoft https://github.com/FooSoft/yomichan/labels/dictionary format lists all the issues I am aware of. Most of them are related to Kenkyusha; some are somewhat cosmetic (due to the large size of definitions), whereas a few others impact functionality (terms not marked as verbs, or unexpected reading/expression form). Related to this issue: if we want to support see/note/ant, it may be best to present them differently than standard definitions (as seen in the images, e.g. brackets, crosslinks). This may require some metadata updates to the dictionary format Yomichan uses, perhaps something similar to how image definitions work.
E.g. {"type": "note", "content": "Note content"}
{"type": "see", "expression": "画像", "reading": "がぞう", /*...*/}
/*...*/ |
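To make the idea above concrete, here is a minimal sketch of how a renderer might treat typed entries differently from plain glosses. It is not Yomichan's actual format or code: the `render_definition` function, the bracket and parenthesis styling, and the handling of an `"ant"` type are all assumptions made for illustration; only the `"note"` and `"see"` shapes come from the example above.

```python
from typing import Union

# Hypothetical definition entry: either a plain gloss string or a dict with a
# "type" field, following the {"type": "note", ...} idea above.
Definition = Union[str, dict]


def render_definition(definition: Definition) -> str:
    """Return display text for one definition entry."""
    if isinstance(definition, str):
        return definition  # standard gloss, shown unchanged
    kind = definition.get("type")
    if kind == "note":
        # Notes are set off in brackets so they do not read as a gloss.
        return f"[{definition['content']}]"
    if kind in ("see", "ant"):
        # Cross references and antonyms name a target term. Treating "ant"
        # like "see" is an assumption; only "see" appears in the example.
        label = "see" if kind == "see" else "antonym"
        return f"({label}: {definition['expression']}【{definition['reading']}】)"
    raise ValueError(f"unsupported definition type: {kind!r}")
```

Keeping plain strings valid means existing dictionaries would still render unchanged, while unknown types fail loudly instead of being shown as if they were glosses.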
Some examples of entries where you don't really get it without the notes/refs. I also notice the entries where notes/refs are important tend to also be entries where the monolingual dicts aren't very helpful. But luckily at this point I've developed a spidey sense for when I should go check jisho/jmdictdb for them haha
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=2120780.1
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=2430230.1
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1599390.1 |
I think it might be best to just import every tag from https://www.edrdg.org/jmdictdb/cgi-bin/edhelp.py?svc=jmdict&sid=#insyntax |
#2089 adds link support, which can handle cross references to some extent. |
I've been working on a fork of yomichan-import which adds this information (usage notes, cross references, antonyms, and language-of-origin info) to the term glossaries within yomichan, and I think the results are very promising. Images are linked below. Here is a copy of the English dictionary file if you'd like to try it:
Cross references to other terms (indicated by a right arrow ➡) and references from other terms (left arrow ⬅). Terms that both reference and are referenced by the same sense will show 🔄.
Antonyms (indicated by right-left arrows ⇄)
Loanwords (language of origin information)
Words loaned from multiple languages
Wasei words
Usage notes (《now mostly used in idioms》, 《also written as 訓む》, etc.)
Gloss types ("literally," "figuratively," "explanatory," and "trademark")
Cross reference information seems to be matching up well with the information displayed in JMdictDB.
(Update) 和歌 after moving all the reference types into one gloss each. This is an extreme example; the vast majority of terms only have a few references at most.
Even for tricky cross references to kanji that belong to many different entries, everything seems to be showing up where it's supposed to be.
Issues
I think that's everything of any importance that I have for now. Let me know what you think. I don't think it would be an exaggeration to say that I would not have learned Japanese if not for yomichan (at the very least, it would have been much more difficult), so it would make me happy to be able to contribute back to the project. |
Awesome work, @stephenmk. I've merged in the changes to the jmdict library used by Yomichan-Import. I'm not too concerned about the performance implications you mention, since the dictionary processing happens offline and normal users do not have to care about it. Feel free to shoot me a PR when you feel like things are in a good state for it :) |
Thanks @FooSoft. I thought I was finished yesterday but I already thought of several things to polish that I implemented today. I'll let you know when things settle down. I'll shoot a message to Marcusjmdict, who I suspect will be interested in this and could offer some valuable feedback. |
Thank you very much @stephenmk, very interesting. I don't really have time to look into this at the moment but I'll try and remember to do it in a couple of weeks. Might be worth posting about this on the jmdict github too, or on the jmdict mailing list. |
I'm not sure if it's helpful for comparing, but hikibiki also includes each of these except cross-references from other words. (10ten/Rikaichamp includes most of this metadata but currently hides cross-references/antonyms since they're a bit less useful when you can't click them to look them up.) One minor detail we encountered is that when presenting the foreign language terms we try to mark them up with appropriate language tagging so a suitable font will be selected. To do that we end up translating JMdict's ISO-639-2 language codes into BCP-47 language codes that can be directly used in |
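A minimal sketch of the code translation described above, assuming a hand-written lookup table. The table is deliberately partial and the `to_bcp47` name is hypothetical; it is not the code used by hikibiki or 10ten. Unknown codes return `None` so the caller can omit language tagging rather than emit a wrong tag.

```python
from typing import Optional

# Partial, illustrative table from ISO 639-2 codes (as used in JMdict) to
# BCP 47 tags. BCP 47 uses the two-letter code where one exists.
ISO_639_2_TO_BCP47 = {
    "eng": "en",
    "ger": "de",
    "fre": "fr",
    "dut": "nl",
    "por": "pt",
    "spa": "es",
    "ita": "it",
    "rus": "ru",
    "chi": "zh",
    "kor": "ko",
    "ain": "ain",  # Ainu has no two-letter code, so the tag stays three letters
}


def to_bcp47(iso_639_2: str) -> Optional[str]:
    """Return the BCP 47 tag for a known code, or None if it is not in the table."""
    return ISO_639_2_TO_BCP47.get(iso_639_2.strip().lower())
```

For example, `to_bcp47("ger")` gives `"de"`, while a code missing from the table gives `None`.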
I again have some noteworthy updates to share. I hope that I can now call this finished for the near future, but we'll see if I can go more than a day without finding something else to tweak and improve. Here's the latest build of the dictionary file: jmdict_english_info_glosses_2022_04_04.zip

Structured Content Links
Structured content links for internal queries are now included in reference notes. It seems that it is only possible to query one expression at a time (i.e., either kanji or kana, but not both). It would be cool if we could query by sequence number once the JMdict file begins to include that information in the references in the future. (I have converted these references into sequence numbers using heuristics, but it is impossible for this procedure to be 100% accurate.)
Entry for 訓
⚠ Note that I have compacted these references, so both of the external links (denoted by a left arrow ⬅) appear in the same gloss rather than two glosses.
The addition of structured content has a noticeable impact on the amount of time it takes to load the dictionary file into yomichan. The validation stage now takes about as long as the data import stage, but this seems acceptable to me. We can also add html

Rarely-used Kanji Forms
A "rarely used kanji" [rK] tag has recently been added to JMdict, which the editors are using to indicate kanji forms that appear in less than 3% of all word usages (based upon corpus n-gram counts). I updated the program to de-prioritize these forms in the same way that it already treats other irregularly tagged kanji (e.g., [iK], "word containing irregular kanji usage"; [io], "irregular okurigana usage"; etc.). This should fully resolve issue #2001.
The current production version of yomichan-import creates dictionary terms for JMdict readings that are tagged with a [NoKanji] indicator. I have extended this functionality to also apply to readings that are only associated with rare kanji forms.
So for example, a user who scans "それ" will now see "それ" as the headword of the top entry rather than "其れ".
Entry for それ (ungrouped)
⚠ Note that the senses for それ (其れ) and それ (interjection) have been merged, as indicated by the sense numbers. It might be nice to make it more explicit to the user that this merging has occurred somehow, but I haven't thought of a clever solution. I could change the color of the sense tags on terms whose kanji/kana pairs also belong to a different entry, but this would look weird in grouped mode.
There are about 500 entries affected by this change. For the curious, here is a complete list.
New kana-only terms
1000320【彼処・あそこ】あそこ あすこ かしこ あしこ あこ
I also considered applying this change to kana forms that contain glosses that are all tagged with [uk] "usually kana" indicators, but I think this would be a bad idea. Unlike the rare kanji tags, which indicate less than 3% of all usages, the [uk] tag can mean that a kanji form is only used around 50% of the time. So when one of these [uk] kana forms is scanned by the user, I think it's still good for the corresponding kanji form to appear in the top result.

Term prioritization
There is an abundance of good frequency dictionaries for yomichan now, such as @toasted-nutbread's BCCWJ dictionary. So adjustments to frequency data in JMdict probably won't matter to a lot of people, especially since the data it uses seems not to be held in high regard by many. At any rate, I have made adjustments which will at least make a better out-of-the-box experience for people getting started with yomichan and JMdict. JMdict includes frequency/priority tags based upon three sources:
The "news" and "gai" tags are derived from the first source; "ichi" tags from the second; and "spec" tags from the third. I have updated the tooltip texts on these tags to better convey their meanings.
news tags ("news1k" to "news24k")
⚠ Note that the "news" tag has been split up into 24 different tags ("news1k" to "news24k") based upon the rankings indicated in the JMdict file. This ranking also now affects the order in which terms are displayed.
spec tags
⚠ The previous wording was "common words not included in frequency lists," which never made any sense to me until I read more about JMdict during the past week. "spec" probably isn't a great tag name either, but I don't have any better ideas. I'd almost like to name it "common", but then that would imply that other terms without the "common" tag are not common.
The order in which terms are extracted from JMdict is now also factored into their priority ranking (with a smaller weight than the priority tags). For example, 【本・ほん】 is the first term extracted from its JMdict entry, while 【本・もと】 is extracted second from its entry (after 【元・もと】). Both 【本・ほん】 and 【本・もと】 are tagged with "ichi" priority. Now that extraction order is taken into account, the term entries for 【本・ほん】 will show up first when the user scans "本".
I think that's all I have for now. Thanks for reading.
(Edit 2022/04/08) I don't know about anyone else, but I think I might prefer having the JMdict glosses condensed into a single, semicolon-delimited list item (see images below). Especially now that there is other information in these sense glossaries, I think having the glosses themselves on one line will improve readability. This is also how jisho.org displays this information. Here's a test build if anyone would like to try it: jmdict_english_info_glosses_2022_04_08.zip
(But regardless, my own personal preferences aren't an issue. I can always adjust my copy of yomichan-import to build the dictionary files that I want for myself.
So I'm open to feedback on how things should be set up if and when my version of yomichan-import gets merged back into the main branch.) |
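For readers who want to picture the prioritization described in this comment, here is a rough sketch of a score that combines priority tags with extraction order at a smaller weight. The weights, the `priority_score` name, and the way "news" ranks are scored are all assumptions for illustration; the thread does not give the actual values used in the fork.

```python
import re
from typing import Iterable, Optional

# Hypothetical weights, chosen only so that tags outweigh extraction order.
TAG_WEIGHTS = {"ichi": 100, "spec": 100, "gai": 100}
NEWS_BASE = 100      # assumed base score for any "newsNk" tag
ORDER_PENALTY = 1    # assumed small penalty per extraction position


def news_rank(tag: str) -> Optional[int]:
    """Return N for a tag of the form "newsNk" (1 to 24), else None."""
    match = re.fullmatch(r"news(\d+)k", tag)
    if match is None:
        return None
    rank = int(match.group(1))
    return rank if 1 <= rank <= 24 else None


def priority_score(tags: Iterable[str], extraction_index: int) -> int:
    """Higher scores sort first; extraction_index is 0 for the first term."""
    score = 0
    for tag in tags:
        rank = news_rank(tag)
        if rank is not None:
            score += NEWS_BASE + (25 - rank)  # news1k outranks news24k
        else:
            score += TAG_WEIGHTS.get(tag, 0)
    return score - ORDER_PENALTY * extraction_index
```

With these assumed weights, 【本・ほん】 (tagged "ichi", extracted first) scores higher than 【本・もと】 (tagged "ichi", extracted second), matching the ordering described above.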
These often contain critical information. I don't think JMdict should be considered complete without them.
Here are some entries that use these, for reference (though there are probably better entries to use as an example of their importance):
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1465580.1
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1456360.1
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1339460.1
Screenshots of what I'm referring to specifically
(I think this should technically be in the yomichan import github, but this seems to be where the main activity happens)