
Some Japanese text returns language as null #2

Open · mikiex opened this issue Jan 3, 2024 · 3 comments

mikiex commented Jan 3, 2024

```python
from eld import LanguageDetector

ELDdetector = LanguageDetector()
print(ELDdetector.detect("終了"))
# {"": {"language": null, "scores()": {}, "is_reliable()": false}}
```

nitotm (Owner) commented Jan 4, 2024

Hi, this would get solved by using the largest database.

```python
ELDdetector = LanguageDetector('ngramsL60')
```

In the next version of ELD, the large database will be the default database.

Still, this could happen with other combinations of Japanese/Chinese characters in very short text. There is a possible improvement for a future version of ELD, as stated at the end of the README:
"The tokenizer could separate characters from languages that have their own alphabet, potentially improving accuracy and reducing the N-grams database. Retraining and testing is needed."

So I leave it up to you whether to close the issue: close it if you think using ngramsL60 is a good enough fix, or keep it open if you would like to see the further improvements mentioned above.

mikiex (Author) commented Jan 4, 2024

This makes sense; I just wasn't expecting it to return null with no guess. Switching to the largest database does work for "終了". You are correct that there are still examples, such as "拒否", where it returns null. Testing across all my data with ngramsL60, I also found examples in other languages, such as "undo". So there are certain words it won't detect on their own (or, combined with others, it cannot detect).

nitotm (Owner) commented Jan 4, 2024

That is interesting feedback, thanks.
I am surprised that "undo" is not detected, but it is true that it is not in the database.
From what I can see, it did not make the cut because it did not appear enough times in my training data. A future solution would be to increase the size of the training data, lower the cut-off, and build an extra-large database.

A shorter-term solution, for single words that go undetected, would be to do an internal re-detect, searching for the word as a prefix, suffix, or infix of other words; for example, in the current database "undo" appears as a suffix and infix for English. A language guess with "is_reliable()": false might be better than null.
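That fallback could be sketched roughly as follows. This is only an illustration of the idea: the word list, scoring, and function names here are hypothetical and are not ELD's actual n-gram database or API.

```python
# Hypothetical sketch of the substring-fallback idea described above;
# the toy word "database" and voting scheme are illustrative only.
from collections import Counter

# Toy database: known words with language labels.
WORD_DB = {
    "undoing": "en",
    "undone": "en",
    "redundant": "en",
    "deshacer": "es",
}

def fallback_detect(word):
    """For an undetected word, search it as a prefix, suffix, or infix
    of known words and vote on the language of the matches."""
    votes = Counter()
    for known, lang in WORD_DB.items():
        if word in known:  # substring check covers prefix, suffix, infix
            votes[lang] += 1
    if not votes:
        return {"language": None, "is_reliable": False}
    language, _ = votes.most_common(1)[0]
    # Return a guess, but flag it as unreliable since it came from
    # substring matches rather than a direct hit.
    return {"language": language, "is_reliable": False}

print(fallback_detect("undo"))  # matched inside "undoing" and "undone"
```

Here "undo" gets an English guess (from "undoing" and "undone") with is_reliable False, instead of null.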

I want to ask: do you have any suggestions about what to return in case of no detection? Maybe the returned object could have an error variable that, in case of no detection, carries a message. In that case, would you be happy with "language": null, or do you believe that language should return something other than null?
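One possible shape for such a result object, purely as a sketch (the field names are hypothetical, not ELD's actual API):

```python
# Hypothetical result shape with an error field populated on failed
# detection; illustrative only, not ELD's real return type.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DetectionResult:
    language: Optional[str]      # language code, or None when undetected
    is_reliable: bool
    error: Optional[str] = None  # message set only when detection failed

def make_result(language, is_reliable):
    """Build a result, attaching an error message when nothing was detected."""
    if language is None:
        return DetectionResult(None, False, error="No language detected")
    return DetectionResult(language, is_reliable)

print(make_result(None, False))
print(make_result("ja", True))
```

Callers could then check the error field explicitly instead of inferring failure from a null language alone.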
