Add a language-detection processor to Ingest Node #29094

Closed
talevy opened this issue Mar 15, 2018 · 7 comments
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature help wanted adoptme Team:Data Management Meta label for data/management team

Comments

@talevy
Contributor

talevy commented Mar 15, 2018

There are requests and existing solutions for providing language-detection
support within the ingest pipelines.

request: #23246,
Alex's external plugin: https://github.com/spinscale/elasticsearch-ingest-langdetect

It would be nice to provide this as a separate processor within Elasticsearch, either as a module or as a plugin.

@dadoonet
Member

Note that Tika also provides a different library for automatic language detection:

  • Alex's plugin uses com.youcruit.com.cybozu.labs:langdetect:1.1.2-20151117
  • Tika uses org.apache.tika:tika-langdetect, which behind the scenes uses com.optimaize.languagedetector:language-detector:jar:0.5

While we are building an official lang-detect plugin, I think we should evaluate the pros and cons of the 2 libs (I have no idea TBH).

@eskibars
Contributor

These aren't the only 2 libraries. There's also CLD2 and CLD3 (though existing java bindings aren't really great from what I've seen) and others. I think we should consider low heap utilization and language detection accuracy as the top 2 metrics to look into and detection speed third, since the types of documents that really need language detection tend to have a relatively low index rate compared to the other types of documents we index.

I'm a big fan of this capability lying in an ingest node.
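
The three metrics proposed above (memory use, accuracy, then speed) could be compared across candidate libraries with a small harness. The following is only a rough sketch under stated assumptions: `evaluate`, `Report`, `toy_detect`, `STOPWORDS`, and the `LABELLED` samples are hypothetical names and data invented for illustration, the stopword-counting detector is a stand-in rather than any of the libraries discussed here, and `tracemalloc` only tracks Python-level allocations, so it is a proxy for, not a measurement of, JVM heap utilization.

```python
import time
import tracemalloc
from typing import Callable, Iterable, NamedTuple


class Report(NamedTuple):
    accuracy: float          # fraction of samples labelled correctly
    peak_bytes: int          # peak Python-level allocations during detection
    docs_per_second: float   # rough throughput (tracing adds overhead)


def evaluate(detect: Callable[[str], str],
             samples: Iterable[tuple[str, str]]) -> Report:
    """Run `detect` over (text, expected_language) pairs and collect metrics."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one labelled sample")
    tracemalloc.start()
    started = time.perf_counter()
    correct = sum(1 for text, expected in samples if detect(text) == expected)
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return Report(correct / len(samples), peak, len(samples) / max(elapsed, 1e-9))


# Hypothetical stand-in detector: counts stopword hits per language.
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "de": {"der", "und", "ist", "das"},
    "fr": {"le", "et", "est", "les"},
}


def toy_detect(text: str) -> str:
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words)
              for lang, stop in STOPWORDS.items()}
    return max(scores, key=scores.get)


# Tiny invented sample set; a real comparison needs a proper labelled corpus.
LABELLED = [
    ("the cat is on the roof", "en"),
    ("der Hund ist im Garten und schläft", "de"),
    ("le chat est sur le toit", "fr"),
    ("das ist gut", "de"),
]

report = evaluate(toy_detect, LABELLED)
print(report)
```

Any real detector could be dropped in by wrapping it in a function with the same `str -> str` shape as `toy_detect`; the numbers are only comparable when every candidate is run over the same samples on the same machine.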

@original-brownbear
Member

It seems https://mvnrepository.com/artifact/com.youcruit.com.cybozu.labs/langdetect isn't maintained anymore.
Maybe not the best idea to start depending on that?

CLD seems to be somewhat superior (performance and accuracy) to Tika judging by a quick Google search. It does add a native/JNI dependency though.

=> Tika seems like the safest bet in terms of maintenance to me, but others probably know more here.

@tballison

Please don't use Tika's builtin language detection. See https://issues.apache.org/jira/browse/TIKA-1723 for @kkrugler 's work integrating Optimaize and why we prefer it to our own built-in language detection.

Given the other options available, our goal now is to make it easier to integrate other libraries. Right, cybozu isn't maintained. Optimaize, IIRC, is a fork of cybozu and is somewhat more recent, but no activity in 2 years.

Yes, you're right, CLD looks great, but JNI... It would be interesting to see a replication of Mike McCandless's evaluation with updated versions: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

As a side note @kkrugler has his own language detector: https://github.com/kkrugler/yalder :D

@stevedodson

We (ML team) are currently investigating this as part of a deployment of supervised models into the ingest pipeline. I'll add more details as we move this forward.

@joshdevins
Member

joshdevins commented Jan 30, 2020

This is resolved by #50292 . We (ML team) will also be publishing guidance on using the model in search use-cases (blog post pending).

@rjernst added the Team:Data Management label May 4, 2020

@dakrone
Member

dakrone commented May 8, 2024

Resolved by #50292

@dakrone closed this as completed May 8, 2024
9 participants