Update of the latest usability changes
Nito committed Aug 19, 2023
1 parent 78b33b9 commit 84f30f8
Showing 2 changed files with 46 additions and 62 deletions.
59 changes: 24 additions & 35 deletions README.md
ELD is also available in [Javascript](https://github.com/nitotm/efficient-language-detector-js).
```bash
$ pip install eld
```
Alternatively, downloading or cloning the files works too; just update the import path.

## How to use?

```python
# from src.eld.languageDetector import LanguageDetector # To load ELD without install. Update path.
from eld import LanguageDetector
detector = LanguageDetector()

print(detector.detect('Hola, cómo te llamas?'))
```

`detect()` expects a UTF-8 string and returns an object with a `language` attribute, which is either an *ISO 639-1 code* or `None`
```python
print(detector.detect('Hola, cómo te llamas?'))
# Object { language: "es", scores(): {"es": 0.53, "et": 0.21, ...}, is_reliable(): True }
# Object { language: None|str, scores(): None|dict, is_reliable(): bool }

print(detector.detect('Hola, cómo te llamas?').language)
# "es"

# With clean_text(True), detect() removes URLs, domains, emails, alphanumerical tokens & numbers
detector.clean_text(True) # Default is False
```
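To picture what that cleaning step does, here is a rough, self-contained sketch of clean_text-style filtering. The regexes are hypothetical and for illustration only, not ELD's actual implementation:

```python
import re

def clean_text_sketch(text):
    """Strip URLs, emails and number-bearing tokens, as clean_text(True) is described to do."""
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # URLs / domains
    text = re.sub(r'\S+@\S+', ' ', text)                 # emails
    text = re.sub(r'\b\w*\d\w*\b', ' ', text)            # alphanumerical tokens & numbers
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text_sketch('Write to name@example.com or visit https://example.com in 2023'))
# 'Write to or visit in'
```

Stripping such tokens before detection makes sense because URLs, emails and numbers carry little language signal.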
- To reduce the languages to be detected there are 3 options; each only needs to be executed once. (Check the available [languages](#languages) below)
```python
lang_subset = ['en', 'es', 'fr', 'it', 'nl', 'de']

# Option 1
# with dynamic_lang_subset(), detect() executes normally, and then filters excluded languages
detector.dynamic_lang_subset(lang_subset)
# Returns an object with a 'languages' list of the validated languages, or 'None'

# Option 2. lang_subset() first removes the excluded languages from the n-grams database
# For a single detection it is slower than dynamic_lang_subset(), but for several it will be faster
# If the save option is True (default), the new n-grams subset is stored and loaded on the next call
detector.lang_subset(lang_subset) # lang_subset(langs, save=True)
# Returns object {success: True, languages: ['de', 'en', ...], error: None, file: 'ngramsM60...'}

# To remove either dynamic_lang_subset() or lang_subset(), call the methods with None as argument
detector.lang_subset(None)

# Finally, the optimal way to regularly use a language subset: create the instance with a file
# The file argument can be a subset created by lang_subset() or another database like 'ngramsL60'
langSubsetDetect = LanguageDetector('ngramsL60')
```
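Conceptually, the dynamic subset (Option 1) is just a post-filtering of the score table; a minimal sketch of that idea (illustrative only, not ELD's internal code):

```python
def pick_from_subset(scores, lang_subset):
    """Drop languages outside the subset, then return the top remaining language."""
    filtered = {lang: s for lang, s in scores.items() if lang in lang_subset}
    if not filtered:
        return None, {}
    best = max(filtered, key=filtered.get)
    return best, filtered

best, filtered = pick_from_subset(
    {'es': 0.53, 'pt': 0.40, 'et': 0.21},
    ['en', 'es', 'fr', 'it', 'nl', 'de'],
)
print(best)  # 'es'
```

This is why Option 1 costs the same per detection as a full run, while Option 2 shrinks the database itself and so pays off over many detections.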

## Benchmarks
These are the results: first accuracy, then execution time.
| **CLD3** | 92.2% | 95.8% | 94.7% | 69.0% | 51.5% |
| **franc** | 89.8% | 92.0% | 90.5% | 65.9% | 52.9% |
-->
<img alt="accuracy table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/table_accuracy_py.svg">

<!--- Time table
| | Tweets | Big test | Sentences | Word pairs | Single words |
| **franc** | 1.2" | 8" | 7.8" | 2.8" | 2" |
| **Nito-ELD-php** | 0.31" | 2.5" | 2.2" | 0.66" | 0.48" |
-->
<img alt="time table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/table_time_py.svg">

<sup style="color:#08e">1.</sup> <sup style="color:#777">Lingua could have a small advantage as it participates with 54 languages, 6 fewer.</sup>
<sup style="color:#08e">2.</sup> <sup style="color:#777">CLD2 and CLD3 return a list of languages; the ones not included in this test were discarded. But since they usually return a single language, I believe they are at a disadvantage.</sup>
I added *ELD-L* for comparison, which has a 2.3x bigger database, but only increases accuracy slightly.

Here is the average, per benchmark, of Tweets, Big test & Sentences.

![Sentences tests average](https://raw.githubusercontent.com/nitotm/efficient-language-detector-py/main/misc/sentences_avg_py.png)
<!--- Sentences average
| | Time | Accuracy |
|:--------------------|:------------:|:------------:|
These are the *ISO 639-1 codes* of the 60 supported languages for *Nito-ELD* v1:

> 'am', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'is', 'it', 'ja', 'ka', 'kn', 'ko', 'ku', 'lo', 'lt', 'lv', 'ml', 'mr', 'ms', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tl', 'tr', 'uk', 'ur', 'vi', 'yo', 'zh'

> Amharic, Arabic, Azerbaijani (Latin), Belarusian, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Basque, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Icelandic, Italian, Japanese, Georgian, Kannada, Korean, Kurdish (Arabic), Lao, Lithuanian, Latvian, Malayalam, Marathi, Malay (Latin), Dutch, Norwegian, Oriya, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Albanian, Serbian (Cyrillic), Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Chinese
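The two lists above are aligned index-for-index, so a code-to-name lookup can be built by zipping them; a short sketch using just the first pairs (the full 60 would be handled the same way):

```python
codes = ['am', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es']
names = ['Amharic', 'Arabic', 'Azerbaijani (Latin)', 'Belarusian', 'Bulgarian',
         'Bengali', 'Catalan', 'Czech', 'Danish', 'German', 'Greek', 'English', 'Spanish']

# Build the ISO 639-1 code -> language name lookup
code_to_name = dict(zip(codes, names))
print(code_to_name['es'])  # Spanish
```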
## Future improvements

49 changes: 22 additions & 27 deletions demo.py
See the License for the specific language governing permissions and
limitations under the License.
"""

from eld import LanguageDetector

detector = LanguageDetector()

# detect() expects a UTF-8 string and returns an object with a 'language' attribute: an ISO 639-1 code or None
print(detector.detect('Hola, cómo te llamas?'))

# Object { language: "es", scores(): {"es": 0.53, "et": 0.21, ...}, is_reliable(): True }
# Object { language: None|str, scores(): None|dict, is_reliable(): bool }
print(detector.detect('Hola, cómo te llamas?').language)
# "es"

# clean_text(True) removes URLs, domains, emails, alphanumerical tokens & numbers
detector.clean_text(True) # Default is False

# To reduce the languages to be detected there are 3 options; each only needs to be executed once.
# This is the complete list of languages for ELD v1, using ISO 639-1 codes:
"""
lang_subset = ['en', 'es', 'fr', 'it', 'nl', 'de']

# Option 1. With dynamic_lang_subset(), detect() executes normally, then filters out the excluded languages at the end.
detector.dynamic_lang_subset(lang_subset)
# Returns an object with a 'languages' list of the validated languages, or 'None'

# to remove the subset
detector.dynamic_lang_subset(None)

""" Option 2. lang_subset(langs, save=True) first removes the excluded languages from the n-grams database; for a
single detection it might be slower than dynamic_lang_subset(), but for several strings it will be faster. If the
'save' option is True (default), the new n-grams subset is stored and cached for next time.
"""
detector.lang_subset(lang_subset)
# Returns object {success: True, languages: ['de', 'en', ...], error: None, file: 'ngramsM60...'}

# to remove the subset
detector.lang_subset(None)

print(detector.VERSION)

""" Finally, the optimal way to regularly use the same language subset is to pass the file stored (and returned) by
lang_subset() as an argument when creating an instance of the class. In this case the subset n-grams database will
be loaded directly instead of the default database. This option can also be used to load different n-gram databases
stored at eld/resources/ngrams
"""
langSubsetDetect = LanguageDetector('ngramsM60-6_5ijqhj4oecs310zqtm8u9pgmd9ox2yd')
