This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Yomichan shouldn't prioritize exact match over frequency. #1669

Open
epistularum opened this issue May 12, 2021 · 5 comments

@epistularum

epistularum commented May 12, 2021

  1. The frequency of de-inflected verbs/adjectives should be properly taken into account. When looking up 歩き, chances are you actually want to see the definition of 歩く first. The frequency of 歩く is far higher than that of 歩き, and that order should be respected when results are displayed in Yomichan.
  2. It gets quite difficult (or nearly impossible) to find the deconjugated match under multiple "exact matches", such as names. In my case, I have to go through 29 entries to finally find 落ちる when looking up おち.

Here are some examples. I've excluded the more extreme ones, which would result in ridiculously long images:
image
image

image
image

@toasted-nutbread
Collaborator

You seem to be having two different issues here:

  1. Exact matches appearing before deinflected matches (you would see the same thing if the score was identical for 歩き and 歩く).
  2. Names appearing before "more meaningful" definitions.

I would argue that 1 is the correct behaviour, because how do we know the user doesn't want to see 歩き instead of 歩く? 歩き has the additional noun meaning which could be correct for the context. Compare vs Jisho, which also doesn't list 歩く at the top. And while maybe this is a contrived example, a learner should also be able to intuit that 歩き is a form of 歩く from both the raw text and the definition.

2 is probably the same issue as #105, and you can improve this by decreasing the priority of the names dictionary.

@Thermospore
Contributor

Yeah, I just moved jmnedict to a separate profile so I didn't have to flip through stacks of names when looking for a word.

@ttu-ttu

ttu-ttu commented May 14, 2021

I was thinking we could provide an option in the settings to prioritize the deinflected form over the inflected one. I think it makes sense because J-J dictionaries, 90% of the time, ask us to refer to the base (deinflected) form.

Another way to deal with this is to place the deinflected form right below the exact match, also controlled by a setting of course, since I believe it's more of a user preference.
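A minimal sketch of what the first suggestion (a settings toggle) could look like. The option name `inflectionOrder`, its two values, and both helper functions are hypothetical and not part of Yomichan; only the `inflections` field is taken from the entry objects Yomichan sorts, and the sample data is illustrative.

```javascript
// Hypothetical setting: 'exact-first' mirrors the current behaviour,
// 'deinflected-first' reverses the inflection-count comparison.
function createInflectionComparer(inflectionOrder) {
    const sign = (inflectionOrder === 'deinflected-first') ? -1 : 1;
    return (v1, v2) => sign * (v1.inflections.length - v2.inflections.length);
}

// Illustrative entries; only the fields used here are filled in.
const toggleSample = [
    {term: '歩き', inflections: []},
    {term: '歩く', inflections: ['masu stem']},
];

// Sorts a copy of the entries and returns just the terms, in display order.
function sortTerms(entries, inflectionOrder) {
    return [...entries].sort(createInflectionComparer(inflectionOrder)).map((e) => e.term);
}

console.log(sortTerms(toggleSample, 'exact-first'));       // [ '歩き', '歩く' ]
console.log(sortTerms(toggleSample, 'deinflected-first')); // [ '歩く', '歩き' ]
```

A real implementation would presumably need the remaining sort keys as tie-breakers; this only isolates the one comparison the toggle would affect.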

@epistularum
Author

> I would argue that 1 is the correct behaviour, because how do we know the user doesn't want to see 歩き instead of 歩く?

I believe this should be handled by the freq information. For instance, 歩き has a freq of 2 while 歩く has a freq of 601. This freq information is taken from the provided jmdict dictionary. In most instances I believe it makes more sense to show the de-inflected form, but it is true that sometimes the conjugated form is far more frequent than the unconjugated one, e.g. 物思い vs 物思う.
That is why I think we should rely on the freq indicator, since it can differentiate between the two cases.
Having a toggle like the one ttu-ttu described is another idea worth looking into, but it is not as granular as what I explained above.
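To make the proposal concrete, here is a rough sketch of a comparator that checks frequency before inflection count. It assumes "freq" corresponds to the entry's `score` field, which is not confirmed in this thread; `compareByScoreFirst` is a hypothetical name, and the two sample entries use the 2 and 601 figures quoted above.

```javascript
// Hypothetical variant: score is compared before the number of inflections.
function compareByScoreFirst(v1, v2) {
    // Longer matched source text still wins first.
    let i = v2.maxTransformedTextLength - v1.maxTransformedTextLength;
    if (i !== 0) { return i; }
    // Higher score (assumed here to be the "freq" value) comes first.
    i = v2.score - v1.score;
    if (i !== 0) { return i; }
    // Fewer inflection steps only breaks ties.
    return v1.inflections.length - v2.inflections.length;
}

// Illustrative entries for a lookup of 歩き.
const freqSample = [
    {term: '歩き', maxTransformedTextLength: 2, inflections: [], score: 2},
    {term: '歩く', maxTransformedTextLength: 2, inflections: ['masu stem'], score: 601},
];

const freqOrder = [...freqSample].sort(compareByScoreFirst).map((e) => e.term);
console.log(freqOrder); // [ '歩く', '歩き' ]
```

With this ordering a case like 物思い vs 物思う would also resolve by whichever entry carries the higher score, without a separate toggle.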

On another note, where does this freq info come from? I can't seem to find it in the jmdict file itself.

> 2 is probably the same issue as #105, and you can improve this by decreasing the priority of the names dictionary.

I already have my names dictionary on the lowest priority compared to my other dictionaries. That is why I believe Yomichan displays direct matches higher than deconjugated matches. In this example, all the names are considered direct matches since the looked-up text is phonetic, while 食べる needs to be de-conjugated and would be considered an indirect match. At least, that is my understanding of the behaviour.

@toasted-nutbread
Collaborator

> Another way to deal with this is to place the deinflected form right below the exact match

This information isn't stored in the dictionaries that Yomichan imports, and I'm not sure it would be safe in the general case to assume what is and isn't an inflection.

> That is why I think we should rely on the freq indicator since it can differentiate between the two.

To clarify: by "freq" do you mean the score for a definition, the green frequency tags, or something else?

> On another note, where does this freq info come from? I can't seem to find it in the jmdict file itself.

https://github.com/FooSoft/yomichan-import/blob/83e3e44f46e344bfe66d9c7181caa5b113f8fb2a/edict.go#L160
https://github.com/FooSoft/yomichan-import/blob/83e3e44f46e344bfe66d9c7181caa5b113f8fb2a/edict.go#L48-L65

> I already have my name dictionary on the lowest priority compared to my other dicts. That is why I believe yomichan displays direct matches higher than deconjugated matches.

Yeah, I see what you mean now; this issue affects kana-only searches more so than kanji definitions. There is also some discussion in #1539 about updating how dictionary priority is handled internally, and this may fall into that category as well.


For reference, this is the current code for sorting dictionary entries:

_sortTermDictionaryEntries(dictionaryEntries) {
    const stringComparer = this._stringComparer;
    const compareFunction = (v1, v2) => {
        // Sort by length of source term
        let i = v2.maxTransformedTextLength - v1.maxTransformedTextLength;
        if (i !== 0) { return i; }

        // Sort by the number of inflection reasons
        i = v1.inflections.length - v2.inflections.length;
        if (i !== 0) { return i; }

        // Sort by how many terms exactly match the source (e.g. for exact kana prioritization)
        i = v2.sourceTermExactMatchCount - v1.sourceTermExactMatchCount;
        if (i !== 0) { return i; }

        // Sort by dictionary priority
        i = v2.dictionaryPriority - v1.dictionaryPriority;
        if (i !== 0) { return i; }

        // Sort by term score
        i = v2.score - v1.score;
        if (i !== 0) { return i; }

        // Sort by headword term text
        const headwords1 = v1.headwords;
        const headwords2 = v2.headwords;
        for (let j = 0, jj = Math.min(headwords1.length, headwords2.length); j < jj; ++j) {
            const term1 = headwords1[j].term;
            const term2 = headwords2[j].term;

            i = term2.length - term1.length;
            if (i !== 0) { return i; }

            i = stringComparer.compare(term1, term2);
            if (i !== 0) { return i; }
        }

        // Sort by dictionary order
        i = v1.dictionaryIndex - v2.dictionaryIndex;
        return i;
    };
    dictionaryEntries.sort(compareFunction);
}
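
As a rough illustration of why lowering the names dictionary's priority does not help for kana lookups, the sketch below runs a trimmed copy of the first five sort keys above on two made-up entries for a lookup of おち. The function name `compareFirstKeys`, the `label` field, and every value in the sample entries are assumptions for demonstration only.

```javascript
// Trimmed copy of the first five sort keys from the comparator above.
function compareFirstKeys(v1, v2) {
    let i = v2.maxTransformedTextLength - v1.maxTransformedTextLength;
    if (i !== 0) { return i; }
    i = v1.inflections.length - v2.inflections.length;
    if (i !== 0) { return i; }
    i = v2.sourceTermExactMatchCount - v1.sourceTermExactMatchCount;
    if (i !== 0) { return i; }
    i = v2.dictionaryPriority - v1.dictionaryPriority;
    if (i !== 0) { return i; }
    return v2.score - v1.score;
}

// Hypothetical entries; all numbers are invented for this example.
const kanaSample = [
    {
        label: '落ちる (word dictionary)',
        maxTransformedTextLength: 2,
        inflections: ['masu stem'], // needs one deinflection step
        sourceTermExactMatchCount: 0,
        dictionaryPriority: 0,
        score: 100,
    },
    {
        label: 'name entry (names dictionary)',
        maxTransformedTextLength: 2,
        inflections: [], // direct match on the reading
        sourceTermExactMatchCount: 0,
        dictionaryPriority: -1, // lowest priority
        score: 0,
    },
];

const kanaOrder = [...kanaSample].sort(compareFirstKeys).map((e) => e.label);
console.log(kanaOrder);
// [ 'name entry (names dictionary)', '落ちる (word dictionary)' ]
```

The inflection count is compared before `dictionaryPriority` and `score`, so the name entry wins even though it loses on both of the later keys.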
