Add a language-detection processor to Ingest Node #29094

talevy · 2018-03-15T15:54:33Z

There are requests and existing solutions for providing language-detection
support within the ingest pipelines.

request: #23246,
Alex's external plugin: https://github.com/spinscale/elasticsearch-ingest-langdetect

It would be nice to provide this as a separate processor within Elasticsearch, either as a module or as a plugin.

dadoonet · 2018-03-16T07:50:00Z

Note that Tika is also providing a different library for lang auto detection:

Alex's plugin uses com.youcruit.com.cybozu.labs:langdetect:1.1.2-20151117
Tika uses org.apache.tika:tika-langdetect, which uses com.optimaize.languagedetector:language-detector:jar:0.5 behind the scenes

While we are building an official lang-detect plugin, I think we should evaluate the pros/cons of the 2 libs (I have no idea TBH).

eskibars · 2018-05-31T20:59:48Z

These aren't the only 2 libraries. There's also CLD2 and CLD3 (though the existing Java bindings aren't really great from what I've seen) and others. I think we should consider low heap utilization and language detection accuracy as the top 2 metrics to look into, with detection speed third, since the types of documents that really need language detection tend to have a relatively low index rate compared to the other types of documents we index.

I'm a big fan of this capability lying in an ingest node.

original-brownbear · 2018-07-02T12:06:31Z

It seems https://mvnrepository.com/artifact/com.youcruit.com.cybozu.labs/langdetect isn't maintained anymore.
Maybe not the best idea to start depending on that?

CLD seems to be somewhat superior (performance and accuracy) to Tika, judging by a quick Google search. It does add a native/JNI dependency though.

=> Tika seems like the safest bet in terms of maintenance to me, but others probably know more here.

tballison · 2018-09-18T21:37:33Z

Please don't use Tika's built-in language detection. See https://issues.apache.org/jira/browse/TIKA-1723 for @kkrugler's work integrating Optimaize and why we prefer it to our own built-in language detection.

Given the other options available, our goal now is to make it easier to integrate other libraries. Right, cybozu isn't maintained. Optimaize, IIRC, is a fork of cybozu and is somewhat more recent, but has had no activity in 2 years.

Y, you're right, CLD looks great, but JNI... It would be interesting to see a replication of Mike McCandless's evaluation with updated versions: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

As a side note @kkrugler has his own language detector: https://github.com/kkrugler/yalder :D

stevedodson · 2019-02-06T09:19:48Z

We (ML team) are currently investigating this as part of a deployment of supervised models into the ingest pipeline. I'll add more details as we move this forward.

joshdevins · 2020-01-30T13:59:57Z

This is resolved by #50292. We (ML team) will also be publishing guidance on using the model in search use-cases (blog post pending).

dakrone · 2024-05-08T21:53:57Z

Resolved by #50292

talevy added the >feature, help wanted, adoptme, and :Data Management/Ingest Node labels Mar 15, 2018

talevy mentioned this issue Mar 15, 2018

Provide the ability to supply detect_language as a setting via the ingest-attachment pipeline API #23246

Closed

original-brownbear assigned original-brownbear and unassigned original-brownbear Jul 2, 2018

martijnvg mentioned this issue Nov 12, 2019

New processors to expand ingest node's capabilities #48986

Open

9 tasks

rjernst added the Team:Data Management label May 4, 2020

dakrone closed this as completed May 8, 2024
