Importing a frequency list with readings #855
Comments
Are you looking for information about the data format that Yomichan uses for frequency lists, or something else?
|
Apologies, yeah in hindsight I was pretty unspecific. Thanks for the info! I'll see if I can get the list imported and report back. |
No problem. Yomichan's dictionary JSON format isn't too complicated. The only real caveat (which you already pointed out) is that the list's readings are in katakana rather than hiragana, but there are scripts/tools for that conversion, so you don't have to write it yourself; Yomichan uses https://github.com/WaniKani/WanaKana. |
Got it most of the way there! It successfully imports too. The only remaining thing is to convert the applicable katakana readings to hiragana, which is admittedly beyond my abilities. Hate to ask, but is there anyone interested in doing the conversion? :^) |
I can take a look at it sometime soon, please wait warmly. |
Here is a node script to modify the readings:
const fs = require('fs');
const wanakana = require('./wanakana.min.js');
const input = 'term_meta_bank_1.json';
const output = input.replace(/\.[^\.]*$/, '_adjusted$&');
const data = JSON.parse(fs.readFileSync(input, {encoding: 'utf8'}));
for (const item of data) {
    item[2].reading = wanakana.toHiragana(item[2].reading);
}
fs.writeFileSync(output, JSON.stringify(data, null, 0), {encoding: 'utf8'});
Depends on WanaKana; you can copy the file from here. I would also recommend adding the additional author/attribution metadata to the index file, as described in this comment: |
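For anyone following along, the attribution metadata mentioned above could be added roughly as follows. This is only a sketch: the field names (`author`, `url`, `description`, `attribution`) and `format: 3` are assumed from Yomichan's dictionary index schema and should be checked against it, `buildIndex` is a hypothetical helper, and every value is a placeholder.

```javascript
// Sketch: build the contents of index.json with attribution metadata.
// Field names are assumed from Yomichan's index schema; all values are placeholders.
function buildIndex({title, revision, author, url, description, attribution}) {
    return {
        title,
        format: 3, // assumed dictionary format version
        revision,
        author,
        url,
        description,
        attribution
    };
}

const index = buildIndex({
    title: 'Example frequency list',
    revision: 'example_rev1',
    author: 'Example author',
    url: 'https://example.com/',
    description: 'Placeholder description of the list.',
    attribution: 'Placeholder attribution for the data source.'
});

// To write it to disk: require('fs').writeFileSync('index.json', indexJson, {encoding: 'utf8'});
const indexJson = JSON.stringify(index, null, 4);
console.log(indexJson);
```

The resulting index.json goes in the root of the dictionary archive, next to term_meta_bank_1.json.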
Big thanks, I got it to work! The only caveat is that, as you mentioned, it seems Yomichan expects certain readings to remain in katakana. For example: so words like that don't show up. That's not a big deal, though, as most of those words I wouldn't be looking up the frequency for anyway (not to mention the other sans-reading frequency lists can still catch most of them), so the list is very usable in this state! Here it is in its current state, with author/attribution data added: |
This might get you better coverage; still not perfect, but better for hiragana and latin characters.
const fs = require('fs');
const wanakana = require('./wanakana.min.js');
const input = 'term_meta_bank_1.json';
const output = input.replace(/\.[^\.]*$/, '_adjusted$&');
const data = JSON.parse(fs.readFileSync(input, {encoding: 'utf8'}));
// True if at least one character of the text is Japanese
const isPartiallyJapanese = (text) => [...text].some((char) => wanakana.isJapanese(char));
for (const item of data) {
    const [expression, , {reading}] = item;
    if (expression === reading) { continue; } // Both in hiragana/katakana
    if (!isPartiallyJapanese(expression.normalize('NFKC'))) { continue; } // Latin/full width characters
    item[2].reading = wanakana.toHiragana(reading);
}
fs.writeFileSync(output, JSON.stringify(data, null, 4), {encoding: 'utf8'}); |
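To see what the two skip rules in the script above do without installing WanaKana, here is a dependency-free sketch run on a few hypothetical entries. `katakanaToHiragana`, `containsJapanese`, and `adjustReading` are illustrative names, not Yomichan or WanaKana functions. The conversion relies on katakana U+30A1..U+30F6 sitting exactly 0x60 above their hiragana counterparts, and the regex is only a rough stand-in for `wanakana.isJapanese`, so this is not a drop-in replacement.

```javascript
// Shift katakana U+30A1..U+30F6 down by 0x60 to get hiragana; leave everything else
// (including the long vowel mark and the middle dot) untouched.
function katakanaToHiragana(text) {
    let result = '';
    for (const char of text) {
        const code = char.codePointAt(0);
        result += (code >= 0x30a1 && code <= 0x30f6) ? String.fromCodePoint(code - 0x60) : char;
    }
    return result;
}

// Rough stand-in for wanakana.isJapanese: kana or common CJK ideographs only.
const containsJapanese = (text) => /[\u3040-\u30ff\u4e00-\u9fff]/.test(text);

function adjustReading(expression, reading) {
    if (expression === reading) { return reading; } // expression is itself pure kana
    if (!containsJapanese(expression.normalize('NFKC'))) { return reading; } // Latin/full-width expression
    return katakanaToHiragana(reading);
}

// Hypothetical sample entries: [expression, reading]
const samples = [
    ['勉強', 'ベンキョウ'], // kanji expression: reading is converted
    ['テレビ', 'テレビ'],   // katakana word: left alone
    ['ＡＶ', 'エイ・ヴィ']  // full-width Latin expression: left alone
];
for (const [expression, reading] of samples) {
    console.log(expression, '->', adjustReading(expression, reading));
}
```

The order of the checks matters: the equality test catches katakana loanwords before any conversion is attempted, and the NFKC normalization turns full-width Latin into ASCII so that it is not mistaken for Japanese text.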
Wow, what a legend. I think the vast majority of entries are covered at this point! Anyone feel free to use it: BCCWJ_short_freq_v1.zip Just as a reference for anyone who wants to give it a go, here are two examples of entries that won't show properly:
Lastly, something to keep in mind is that the data source for this frequency list occasionally distinguishes between different parts of speech. For example if you search |
I think the real solution will be to eventually implement #461, so that readings are normalized before Yomichan adds them to the internal database. That will likely affect many other things as well, so it's not a simple change, but it should be the most effective one. |
Nice, I'll check em out! I'm curious, how did you handle the fact that in the long list they split up when a noun is used standalone vs when it is used as a suru verb? (Eg there would be an entry for 勉強 and 勉強する, iirc) There might be other stuff the long list splits up as well that makes it trickier to reference words as simply as with the short list |
The long list is not handled any differently; both entries are included in the dictionary. |
I'll enable all 3 (the old short list, the new short list, and the new long list) for a week or so and look out for any discrepancies! I suspect the way the long list splits things up / over-specifies might be problematic, as it won't always cross-reference properly with the way dictionaries format their entries. Some words will probably appear much less frequent than they actually are, or might not show up at all.
For example, if you ctrl+f the lists for
Short list
Long list
Even if you are to search |
That is the expected behaviour when the dictionary doesn't contain an entry for
Yes, since there is no support for part-of-speech disambiguation currently. Having multiple entries would likely be confusing to anyone who doesn't know why there are multiple values due to how the source information is presented.
This is likely because they have readings that are fully in katakana, whereas the readings in the dictionary are using hiragana. In general, I don't think there's a way to know that this is the case for the base dictionary. For example, it seems that JMDict stores the entries made up exclusively with full width characters in katakana, but partial entries may still use hiragana. Furthermore, other readings may have non-readable characters in them.
{expression: "AID人工授精", reading: "ひはいぐうしゃかんじんこうじゅせい"}
{expression: "ABC順", reading: "エービーシーじゅん"}
{expression: "AV", reading: "エイ・ヴィ"}
Again, this is likely an issue that would need to be resolved at some point on the Yomichan side rather than the dictionary side. The main change is that there is better (but not perfect) coverage for words like |
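One way to spot entries that are likely to mismatch is to classify each reading by the scripts it contains. The sketch below does this for the three readings quoted above; `classifyReading` is a hypothetical helper, and treating the long vowel mark (ー) as katakana is an assumption made here for convenience.

```javascript
// Classify a reading by which scripts appear in it.
function classifyReading(reading) {
    const isHiragana = (code) => code >= 0x3041 && code <= 0x3096;
    // Assumption: the long vowel mark U+30FC counts as katakana here.
    const isKatakana = (code) => (code >= 0x30a1 && code <= 0x30fa) || code === 0x30fc;
    const result = {hiragana: false, katakana: false, other: false};
    for (const char of reading) {
        const code = char.codePointAt(0);
        if (isHiragana(code)) {
            result.hiragana = true;
        } else if (isKatakana(code)) {
            result.katakana = true;
        } else {
            result.other = true; // e.g. the middle dot in エイ・ヴィ
        }
    }
    return result;
}

for (const reading of ['ひはいぐうしゃかんじんこうじゅせい', 'エービーシーじゅん', 'エイ・ヴィ']) {
    console.log(reading, classifyReading(reading));
}
```

Readings flagged with both `hiragana` and `katakana`, or with `other`, are the ones a blanket katakana-to-hiragana conversion is most likely to get wrong.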
Gotcha, thanks for the responses. Interestingly, it looks like they got rid of the |
I've actually come to find the long unit list very useful. By comparing the frequency from the short and long unit lists you can infer information, such as whether a word tends to be used in isolation or as part of a compound.
Taking
Initially I assumed the short list was simply an abbreviated version of the long list, but that was obviously missing the point.
So yeah, I agree it would have been a mistake to try and edit the long unit list to split things up, since that's just what the short unit list is... Thanks! |
https://pj.ninjal.ac.jp/corpus_center/bccwj/en/freq-list.html
The Long Unit Word list is a bit too, well, long. But I think the Short Unit Word list would be a great addition to the Yomichan suggested dictionaries list, as there is not yet a frequency list there with reading data.
It seems support was added for this here: #450
I looked into importing the data myself, but unfortunately I'm not very familiar with the formatting or import process for Yomichan. Reading through the pull request, it also seems like the readings in the list would need to be converted from katakana to hiragana. But there seem to be edge cases for words that have a combination of katakana and hiragana/kanji