Add a language-detection processor to Ingest Node #29094
Comments
Note that Tika is also providing a different library for lang auto detection:
While we are building an official lang-detect plugin, I think we should evaluate the pros and cons of the two libraries (I have no idea TBH).
These aren't the only two libraries. There's also CLD2 and CLD3 (though the existing Java bindings aren't great from what I've seen) and others. I think we should consider low heap utilization and language detection accuracy as the top two metrics, with detection speed third, since the types of documents that really need language detection tend to have a relatively low index rate compared to the other types of documents we index. I'm a big fan of this capability living in an ingest node.
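To make the comparison above concrete, here is a minimal sketch of a harness that scores any detector on accuracy and throughput. Everything in it is an assumption for illustration: `toy_detect`, `STOPWORDS`, `SAMPLES`, and `evaluate` are hypothetical names, and the stopword-counting detector is only a stand-in for a real library such as CLD or Optimaize. Heap utilization is not measured here; for a JVM library that would have to be measured on the JVM side.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical stand-in detector: counts stopword hits per language.
# A real evaluation would wrap an actual detection library instead.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "das", "nicht"},
    "fr": {"le", "et", "est", "les", "des"},
}


def toy_detect(text: str) -> str:
    """Return the language whose stopword list matches the most tokens."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def evaluate(
    detect: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Measure accuracy and documents per second over labeled samples."""
    labeled = list(samples)
    start = time.perf_counter()
    correct = sum(detect(text) == lang for text, lang in labeled)
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(labeled),
        "docs_per_sec": len(labeled) / elapsed if elapsed > 0 else float("inf"),
    }


# Tiny illustrative corpus of (text, expected language) pairs; not real benchmark data.
SAMPLES = [
    ("the cat is on the roof of the house", "en"),
    ("der Hund ist nicht das Problem", "de"),
    ("le chat et les chiens", "fr"),
]

if __name__ == "__main__":
    print(evaluate(toy_detect, SAMPLES))
```

Passing each candidate library's detect function through the same `evaluate` call over the same labeled corpus keeps the accuracy and speed numbers comparable across libraries.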
It seems https://mvnrepository.com/artifact/com.youcruit.com.cybozu.labs/langdetect isn't maintained anymore. CLD seems to be somewhat superior (performance and accuracy) to Tika, judging by a quick Google search. It does add a native/JNI dependency, though. So Tika seems like the safest bet in terms of maintenance to me, but others probably know more here.
Please don't use Tika's built-in language detection. See https://issues.apache.org/jira/browse/TIKA-1723 for @kkrugler's work integrating Optimaize and why we prefer it to our own built-in language detection. Given the other options available, our goal now is to make it easier to integrate other libraries. Right, cybozu isn't maintained. Optimaize, IIRC, is a fork of cybozu and is somewhat more recent, but it has had no activity in two years. Yes, you're right, CLD looks great, but JNI... It would be interesting to see a replication of Mike McCandless's evaluation with updated versions: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html As a side note, @kkrugler has his own language detector: https://github.com/kkrugler/yalder :D
We (ML team) are currently investigating this as part of a deployment of supervised models into the ingest pipeline. I'll add more details as we move this forward.
This is resolved by #50292. We (ML team) will also be publishing guidance on using the model in search use-cases (blog post pending).
Resolved by #50292
There are requests and existing solutions for providing language-detection
support within the ingest pipelines.
Request: #23246
Alex's external plugin: https://github.com/spinscale/elasticsearch-ingest-langdetect
It would be nice to provide this as a separate processor within Elasticsearch, as a module or plugin.