
Some Japanese text returns language as null #2

Open · mikiex opened this issue Jan 3, 2024 · 3 comments

mikiex commented Jan 3, 2024

```python
from eld import LanguageDetector

ELDdetector = LanguageDetector()
print(ELDdetector.detect("終了"))
# {"": {"language": null, "scores()": {}, "is_reliable()": false}}
```

nitotm (Owner) commented Jan 4, 2024

Hi, this would get solved by using the largest database.

```python
ELDdetector = LanguageDetector('ngramsL60')
```

In the next version of ELD, the large database will be the default database.

Still, this could happen with other combinations of Japanese/Chinese characters in very short text. There is a possible improvement for a future version of ELD, as stated at the end of the README:
"The tokenizer could separate characters from languages that have their own alphabet, potentially improving accuracy and reducing the N-grams database. Retraining and testing is needed."

So I leave it up to you whether to close the issue: close it if you think using ngramsL60 is a good enough fix, or keep it open if you would like to see the further improvements mentioned above.

mikiex (Author) commented Jan 4, 2024

This makes sense; I just wasn't expecting it to return null with no guess. Switching to the largest database does work for "終了". You are correct that there are still examples, such as "拒否", where it returns null. Testing across all my data with ngramsL60, I also found examples in other languages, such as "undo". So there are certain words it won't detect on their own (or, combined with others, it cannot detect).

nitotm (Owner) commented Jan 4, 2024

That is interesting feedback, thanks.
I am surprised that "undo" is not detected, but it is true that it is not in the database.
From what I can see, it did not make the cut because it did not appear enough times in my training data. A future solution would be to increase the size of the training data, lower the cut-off, and build an extra-large database.

A shorter-term solution, for single words that go undetected, would be to do an internal re-detect, searching for the word as a prefix, suffix, or infix of other words; for example, in the current database "undo" appears as a suffix and infix for English. A language guess with "is_reliable()": false might be better than null.
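That fallback could be sketched roughly as follows. This is only an illustration of the idea: the word list, scoring, and function names here are hypothetical and are not ELD's actual n-gram database or API.

```python
# Hypothetical sketch of the substring-fallback idea described above;
# the toy word "database" and voting scheme are illustrative only.
from collections import Counter

# Toy database: known words with language labels.
WORD_DB = {
    "undoing": "en",
    "undone": "en",
    "redundant": "en",
    "deshacer": "es",
}

def fallback_detect(word):
    """For an undetected word, search it as a prefix, suffix, or infix
    of known words and vote on the language of the matches."""
    votes = Counter()
    for known, lang in WORD_DB.items():
        if word in known:  # substring check covers prefix, suffix, infix
            votes[lang] += 1
    if not votes:
        return {"language": None, "is_reliable": False}
    language, _ = votes.most_common(1)[0]
    # Return a guess, but flag it as unreliable since it came from
    # substring matches rather than a direct hit.
    return {"language": language, "is_reliable": False}

print(fallback_detect("undo"))  # matched inside "undoing" and "undone"
```

Here "undo" gets an English guess (from "undoing" and "undone") with is_reliable False, instead of null.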

I want to ask: do you have any suggestions about what to return in case of no detection? Maybe the returned object could have an error variable that, in case of no detection, carries a message. In that case, would you be happy with "language": null, or do you believe that language should return something other than null?
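One possible shape for such a result object, purely as a sketch (the field names are hypothetical, not ELD's actual API):

```python
# Hypothetical result shape with an error field populated on failed
# detection; illustrative only, not ELD's real return type.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DetectionResult:
    language: Optional[str]      # language code, or None when undetected
    is_reliable: bool
    error: Optional[str] = None  # message set only when detection failed

def make_result(language, is_reliable):
    """Build a result, attaching an error message when nothing was detected."""
    if language is None:
        return DetectionResult(None, False, error="No language detected")
    return DetectionResult(language, is_reliable)

print(make_result(None, False))
print(make_result("ja", True))
```

Callers could then check the error field explicitly instead of inferring failure from a null language alone.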
