This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Importing a frequency list with readings #855

Closed
Thermospore opened this issue Sep 20, 2020 · 19 comments

Comments

@Thermospore (Contributor)

https://pj.ninjal.ac.jp/corpus_center/bccwj/en/freq-list.html

The Long Unit Word list is a bit too, well, long. But I think the Short Unit Word list would be a great addition to the Yomichan suggested dictionaries list, as there is not yet a frequency list there with reading data.

It seems support was added for this here: #450

I looked into importing the data myself, but unfortunately I'm not very familiar with the formatting or import process for Yomichan. Reading through the pull request, it also seems like the readings in the list would need to be converted from katakana to hiragana, though there appear to be edge cases for words that mix katakana with hiragana/kanji.

@toasted-nutbread (Collaborator)

Are you looking for information about the data format that Yomichan uses for frequency lists, or something else?

@Thermospore (Contributor, Author)

Apologies, yeah, in hindsight I was pretty unspecific. Thanks for the info! I'll see if I can get the list imported and report back.

@toasted-nutbread (Collaborator)

No problem. Yomichan's dictionary JSON format isn't too complicated; the only real caveat (which you already pointed out) is that the source readings use katakana instead of hiragana. There are scripts/tools to do those conversions so you don't have to write them yourself; Yomichan uses https://github.com/WaniKani/WanaKana.
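
Roughly, each entry in a term_meta_bank_*.json file is a small array pairing an expression with its frequency info; a minimal sketch, with made-up values (the {reading, frequency} object form is the one added in #450):

[
  ["勉強", "freq", {"reading": "べんきょう", "frequency": 1234}]
]

If I remember right, entries without reading data can use a bare number in place of the object.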

@Thermospore (Contributor, Author)

BCCWJ_short_freq_v0.zip

Got it most of the way there! It successfully imports too. The only remaining thing is to convert the applicable katakana readings to hiragana, which is admittedly beyond my abilities.

Hate to ask, but is there anyone interested in doing the conversion? :^)

@toasted-nutbread (Collaborator)

I can take a look at it sometime soon, please wait warmly.

@toasted-nutbread (Collaborator)

Here is a Node.js script to modify the readings:

const fs = require('fs');
const wanakana = require('./wanakana.min.js');

const input = 'term_meta_bank_1.json';
// e.g. term_meta_bank_1.json -> term_meta_bank_1_adjusted.json
const output = input.replace(/\.[^\.]*$/, '_adjusted$&');

const data = JSON.parse(fs.readFileSync(input, {encoding: 'utf8'}));
for (const item of data) {
  // Each item has the form [expression, "freq", {reading, frequency}]
  item[2].reading = wanakana.toHiragana(item[2].reading);
}
fs.writeFileSync(output, JSON.stringify(data, null, 0), {encoding: 'utf8'});

Depends on WanaKana; you can copy wanakana.min.js from the repository linked above.
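
For example, a single (made-up) entry would go from this to this:

// before
["勉強", "freq", {"reading": "ベンキョウ", "frequency": 1234}]
// after
["勉強", "freq", {"reading": "べんきょう", "frequency": 1234}]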

I would also recommend adding the additional author/attribution metadata to the index file, as described in this comment:
#834 (comment)
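
For reference, an index.json with that metadata might look something like this; a rough sketch with placeholder values (title/format/revision are the required fields, author/url/description/attribution the optional ones described in that comment):

{
  "title": "BCCWJ-SUW",
  "format": 3,
  "revision": "frequency1",
  "author": "(who compiled the dictionary)",
  "url": "(where the source data lives)",
  "description": "(what the list contains)",
  "attribution": "(who to credit for the data)"
}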

@Thermospore (Contributor, Author)

Big thanks, I got it to work! The only caveat is that, as you mentioned, it seems Yomichan expects certain readings to remain in katakana.

For example:
キリスト教 -> キリストきょう
AC -> エーシー

so words like that don't show up. That's not a big deal, though, as most of those words I wouldn't be looking up the frequency for anyway (not to mention the other sans-reading frequency lists can still catch most of them), so the list is very usable in this state!

Here it is in its current state, with author/attribution data added:
BCCWJ_short_freq_v0p5.zip

@toasted-nutbread (Collaborator)

toasted-nutbread commented Sep 23, 2020

This might get you better coverage; still not perfect, but better for hiragana and Latin characters.

const fs = require('fs');
const wanakana = require('./wanakana.min.js');

const input = 'term_meta_bank_1.json';
const output = input.replace(/\.[^\.]*$/, '_adjusted$&');
const data = JSON.parse(fs.readFileSync(input, {encoding: 'utf8'}));

// True if at least one character is Japanese (kana, kanji, etc.)
const isPartiallyJapanese = (input) => [...input].some((char) => wanakana.isJapanese(char));

for (const item of data) {
  const [expression, , {reading}] = item;
  if (expression === reading) { continue; } // Expression is already pure kana; leave the reading alone
  if (!isPartiallyJapanese(expression.normalize('NFKC'))) { continue; } // NFKC folds full-width Latin to ASCII, so Latin-only expressions keep their katakana reading
  item[2].reading = wanakana.toHiragana(reading);
}
fs.writeFileSync(output, JSON.stringify(data, null, 4), {encoding: 'utf8'});
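
To illustrate the two guards, here is a quick standalone sketch (example words from this thread; ラジオ stands in for any made-up entry whose expression and reading are identical):

const wanakana = require('./wanakana.min.js');
const isPartiallyJapanese = (input) => [...input].some((char) => wanakana.isJapanese(char));

// 'キリスト教' -> true: katakana + kanji, so its reading gets converted to hiragana
// 'AC'       -> false after NFKC folding: Latin only, so its katakana reading is kept
// 'ラジオ'    -> true, but expression === reading skips it before this check anyway
for (const expression of ['キリスト教', 'AC', 'ラジオ']) {
  console.log(expression, isPartiallyJapanese(expression.normalize('NFKC')));
}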

@Thermospore (Contributor, Author)

Wow, what a legend. I think the vast majority of entries are covered at this point!

Anyone feel free to use it: BCCWJ_short_freq_v1.zip

Just as a reference for anyone who wants to give it a go, here are two examples of entries that won't show properly:

  • [ "キングマン・アンド・アイブズ", "freq", { "reading": "きんぐまんあんどあいぶず", "frequency": 152442 } ],

  • [ "ラジアン毎秒", "freq", { "reading": "らじあんまいびょう", "frequency": 152442 } ],

Lastly, something to keep in mind is that the data source for this frequency list occasionally distinguishes between different parts of speech. For example, if you search 切り you will see two frequency entries in Yomichan; checking the original BCCWJ list shows that one instance is for its use as a suffix and the other for its use as a noun. Just something to be aware of!
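
In other words, the generated bank ends up with two entries for the same expression, something like this (frequencies made up for illustration):

["切り", "freq", {"reading": "きり", "frequency": 5200}],   // suffix usage
["切り", "freq", {"reading": "きり", "frequency": 21000}]   // noun usage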

@toasted-nutbread (Collaborator)

I think the real solution will be to eventually implement #461, so that readings are normalized before Yomichan adds them to the internal database. That will likely affect many other things as well, so it's not a simple change, but it should be the most effective one.

@Thermospore (Contributor, Author)

Nice, I'll check 'em out! I'm curious, how did you handle the fact that the long list splits up a noun used standalone vs. used as a suru verb? (e.g. there would be separate entries for 勉強 and 勉強する, IIRC)

There might be other stuff the long list splits up as well that makes it trickier to reference words as simply as with the short list.

@toasted-nutbread (Collaborator)

The long list is not handled any differently; both entries are included in the dictionary.

@Thermospore (Contributor, Author)

I'll enable all 3 (the old short list, the new short list, and the new long list) for a week or so and look out for any discrepancies!

I suspect the way the long list splits things up / over-specifies might be problematic, as it won't always cross-reference properly with the way dictionaries format their entries. Some words will probably appear much less frequent than they actually are, or might not show up at all.

For example, if you Ctrl+F the lists for 席捲 you get the following results:

Short list
20962 席捲 (182 hits)

Long list
22186 席捲する (154 hits)
122640 席捲 (16 hits)
282791 席捲し始める (5 hits)

Even if you search 席捲する in Yomichan, all the dictionaries will have their entry as 席捲, with the する dropped. The long list would then return a rank of 122640, making the word look considerably rarer than it actually is.

@Thermospore (Contributor, Author)

Thermospore commented Dec 23, 2020

Yeah, that seems to be the case.

[screenshot comparing frequency ranks across the lists]

AJ = anime & J-drama frequency list
W = Wikipedia
IC = Innocent Corpus
Bs = the old short list version from this thread

@Thermospore (Contributor, Author)

Thermospore commented Dec 23, 2020

  • キリスト教 shows up! nice!

  • I assume it is intentional that only the first instance on the list is included (looking at アラビア数字 and 切り for example)?

  • words like AIDS, HACCP, AC show up on the old version, but not on the two new ones
    [screenshot of the missing lookups]

@toasted-nutbread (Collaborator)

> Even if you search 席捲する in Yomichan, all the dictionaries will have their entry as 席捲, with the する dropped. The long list would then return a rank of 122640, making the word look considerably rarer than it actually is.

That is the expected behaviour when the dictionary doesn't contain an entry for 席捲する; the lookup presumably matches the case where it is used as a noun or something without -suru. I'm not saying the long unit word version is as useful as the short one, since I don't think most of the available dictionaries are in the same format or have all the same compound words, but generating a dictionary from that data is supported.

> I assume it is intentional that only the first instance on the list is included (looking at アラビア数字 and 切り for example)?

Yes, since there is currently no support for part-of-speech disambiguation. Having multiple entries would likely confuse anyone who doesn't know why there are multiple values, given how the source information is presented.

> words like AIDS, HACCP, AC show up on the old version, but not on the two new ones

This is likely because they have readings that are fully in katakana, whereas the readings in the dictionary use hiragana. In general, I don't think there's a way to know that this is the case from the base dictionary. For example, it seems that JMDict stores readings in katakana for entries made up exclusively of full-width characters, but partial entries may still use hiragana. Furthermore, other readings may have non-readable characters in them:

{expression: "AID人工授精", reading: "ひはいぐうしゃかんじんこうじゅせい"}
{expression: "ABC順", reading: "エービーシーじゅん"}
{expression: "AV", reading: "エイ・ヴィ"}

Again, this is likely an issue that would need to be resolved at some point on the Yomichan side rather than the dictionary side. The main change here is better (but not perfect) coverage for words like キリスト教 that mix katakana and kanji.

@Thermospore (Contributor, Author)

Gotcha, thanks for the responses. Interestingly, it looks like they got rid of the ・ in AV's reading recently:
http://www.edrdg.org/jmdictdb/cgi-bin/entr.py?svc=jmdict&sid=&q=1958780.1

[screenshot of the JMDict entry]

@Thermospore (Contributor, Author)

I've actually come to find the long unit list very useful. By comparing the frequencies from the short and long unit lists you can infer information, such as whether a word tends to be used in isolation or as part of a compound.

Taking 席捲 above as an example, the fact that the long unit list returns a significantly lower rank than the other lists indicates this word tends to be used in a compound, not by itself.

Initially I assumed the short list was simply an abbreviated version of the long list, but that was obviously missing the point

So yeah, I agree it would have been a mistake to try to edit the long unit list to split things up, since that's just what the short unit list is... Thanks!
