
Question regarding benchmark Lingua comparison #9

Marcono1234 opened this issue Jul 7, 2024 · 4 comments
Marcono1234 commented Jul 7, 2024

Hello,
in your benchmark in the README you got pretty bad performance for Lingua. How exactly are you executing Lingua?
Lingua uses quite large models which have to be loaded once (or lazily during usage), but afterwards detection speed should be quite fast if you keep reusing the same detector, which I think is the intended usage. However, if you create a new detector instance for every detection, performance will be rather bad. Also, Lingua requires a lot of memory at runtime, so if you are running it in a memory-constrained environment, its performance might not be that good either.
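
Roughly what I mean by reusing the detector (a minimal sketch using the lingua Python package; the sentences are just placeholders):

```python
from lingua import LanguageDetectorBuilder

# Build the detector once; loading the models is the expensive part.
detector = (
    LanguageDetectorBuilder.from_all_languages()
    .with_preloaded_language_models()  # load all models up front instead of lazily
    .build()
)

# Reuse the same detector instance for every line that is classified.
for sentence in ["languages are awesome", "les langues sont géniales"]:
    print(sentence, "->", detector.detect_language_of(sentence))
```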

Have you tried Lingua version 2 as well?¹ It is based on the Rust implementation, so its performance will likely be better. For measuring performance it might also be useful to:

Thanks for doing this benchmark in the first place though!

Footnotes

  1. That version might also cover more than the 54 languages you mention in the README.

nitotm commented Jul 8, 2024

I'm going to redo all the benchmarks soon, for ELD v3, so it's a good opportunity to fix anything that might be incorrect.

For Lingua I use the same detector for each line, so that is not the problem. I did the benchmarks on a 16 GB machine (now I have 32 GB), on Windows 10. I don't see any problem with memory; it uses ~400 MB, which is really not too much.
I was surprised at how slow it was. I tried different things, but I also saw that others had the same problem.

Have you tried it yourself? That is, Lingua < 2.0 against any of the other detectors I tested, to see if the performance difference matches?

I have not tried Lingua v2, I guess I will for the new benchmarks.

Marcono1234 (Author) commented:

> I did the benchmarks on a 16 GB machine, now I have 32 GB. I don't see any problem with memory; it uses ~400 MB, which is really not too much.

Yes you are right, that should be more than enough.

> Have you tried it yourself? That is, Lingua < 2.0 against any of the other detectors I tested, to see if the performance difference matches?

Sorry, I hadn't actually tried Lingua < 2.0 yet. But I have compared Lingua 1.3.5 and 2.0.2 now:

| Lingua version | Loading all models¹ | Detection² |
| --- | --- | --- |
| 1.3.5 | 29.02s | 233.68s |
| 2.0.2 | 8.43s | 21.97s |

So it seems you are right: the performance of Lingua < 2.0 is really not that great. It would really be worth giving Lingua 2 a try.

Footnotes

  1. Using `LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models()`

  2. I was testing detection of 16 sentences in different languages, repeated 1000 times; the absolute time values are probably not that interesting here, it is rather the ratio between the Lingua versions that matters (rough sketch below).
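
In case it's useful, a rough sketch of how such a timing comparison could look (illustrative only; the two sentences below stand in for the 16 test sentences, and the lingua Python package is assumed):

```python
import time
from lingua import LanguageDetectorBuilder

# Placeholder for the 16 test sentences in different languages.
sentences = ["languages are awesome", "les langues sont géniales"]

# Time the model loading separately from detection.
start = time.perf_counter()
detector = (
    LanguageDetectorBuilder.from_all_languages()
    .with_preloaded_language_models()
    .build()
)
load_time = time.perf_counter() - start

# Detect each sentence 1000 times, reusing the same detector instance.
start = time.perf_counter()
for _ in range(1000):
    for sentence in sentences:
        detector.detect_language_of(sentence)
detect_time = time.perf_counter() - start

print(f"Loading all models: {load_time:.2f}s, detection: {detect_time:.2f}s")
```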

nitotm commented Aug 16, 2024

I'm redoing the benchmarks for v3 and I'm trying Lingua 2.0.2. What a difference really; compared with my installation of 1.3.2 I'm seeing a huge improvement. I'm also using `with_preloaded_language_models()` and it is reasonably fast now.
I will close the issue when I publish v3.

nitotm commented Sep 5, 2024

I uploaded ELD v3-beta with the new benchmarks; Lingua is reasonably fast now.

I still find discrepancies in their benchmarks. According to them, Lingua-low is 2x slower than fastText, which is fine; I measured 2x-5x depending on the benchmark. But in their test CLD2 is very similar in speed to fastText, and I think CLD2 should be at least 2x faster than fastText.
(Also, their benchmark for CLD2 is unfair, as they are not using bestEffort = True, which would improve its accuracy considerably.)
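
For reference, this is roughly the kind of call I mean (a sketch assuming the pycld2 binding; the example text is arbitrary):

```python
import pycld2 as cld2

text = "Je ne sais pas"  # arbitrary short example

# Default mode: short or ambiguous inputs are often reported as unreliable/unknown.
is_reliable, _, details = cld2.detect(text)
print(is_reliable, details[0])

# bestEffort=True makes CLD2 return its best guess even when it is unsure,
# which improves measured accuracy on short sentences.
is_reliable, _, details = cld2.detect(text, bestEffort=True)
print(is_reliable, details[0])
```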

Discussion for v3-beta at: #10
