Add more languages #3

Open
rashmiranjanrrs opened this issue Dec 22, 2019 · 2 comments

Comments

@rashmiranjanrrs

Hey, can you add more Indic languages, or can you share the pattern or structure of a subset so that I can add new languages as needed? How do I add a new subset?

@landrok
Owner

landrok commented Jan 28, 2020

Hey,

How to add a new subset?

  • clone the repository
  • create your subset file in the src/LanguageDetector/subsets/ folder
  • write at least one test in the tests/LanguageDetector/LanguageDetectionTest.php file to validate your subset
  • then push with a commit message of the form Add new language {the new language}

Subset structure

A subset file is a JSON encoded file with the following structure:

{
  "freq":{"D":662077, [...], "tha":240340},
  "n_words":[260942223,308553243,224934017],
  "name":"en"
}
  • freq contains a list of key => value pairs, where key is the ngram and value is an integer giving the number of occurrences found in the source files.
    LanguageDetector accepts unigrams, bigrams and trigrams.
  • n_words is a series of 3 integers giving the total number of occurrences, ordered by ngram size (1, 2, 3)
  • name is the name of the language
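
To sanity-check a hand-written subset against the structure described above, something like the following Python sketch could be used. It is not part of the library: `validate_subset` is a hypothetical helper, and the rule that every key is 1 to 3 characters long is only inferred from the statement that unigrams, bigrams and trigrams are accepted.

```python
import json


def validate_subset(subset: dict) -> list:
    """Return a list of problems found; an empty list means the shape looks right.

    Hypothetical helper, not part of LanguageDetector. The checks mirror
    the three fields described in the thread: freq, n_words and name.
    """
    problems = []

    freq = subset.get("freq")
    if not isinstance(freq, dict) or not freq:
        problems.append("freq must be a non-empty object")
    else:
        for ngram, count in freq.items():
            # Assumption: ngram size is measured in Unicode code points.
            if not 1 <= len(ngram) <= 3:
                problems.append(f"ngram {ngram!r} is not 1 to 3 characters long")
            if not isinstance(count, int) or count < 0:
                problems.append(f"count for {ngram!r} must be a non-negative integer")

    n_words = subset.get("n_words")
    if not (
        isinstance(n_words, list)
        and len(n_words) == 3
        and all(isinstance(n, int) for n in n_words)
    ):
        problems.append("n_words must be a list of 3 integers")

    name = subset.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")

    return problems


if __name__ == "__main__":
    # Shortened version of the sample from the thread (the elided entries are left out).
    sample = json.loads(
        '{"freq": {"D": 662077, "tha": 240340},'
        ' "n_words": [260942223, 308553243, 224934017], "name": "en"}'
    )
    print(validate_subset(sample))  # prints []
```

This only checks the shape of the file. The test in LanguageDetectionTest.php is still what tells you whether detection actually works with the new subset.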

More

As you may guess, a "learning" tool has to be written to generate a subset. It's not yet packaged with the library but might be in the future.
A piece of advice: to generate a reliable subset file, you have to collect a large number of files in the desired language and, if possible, from several variants of that language.
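
Since no learning tool ships with the library, here is a rough Python sketch of what one could look like. It is a guess at the general idea, not the method used to build the bundled subsets: `build_subset` is a hypothetical function, and splitting on whitespace and counting character ngrams inside each word is an assumption. The real tool may normalize text, treat word boundaries differently, or drop rare ngrams.

```python
import json
from collections import Counter
from typing import Iterable


def build_subset(texts: Iterable[str], name: str) -> dict:
    """Count character unigrams, bigrams and trigrams over a corpus.

    Hypothetical sketch: the tokenization below (whitespace split,
    ngrams taken inside each word) is an assumption, not the
    library's documented behaviour.
    """
    freq = Counter()
    n_words = [0, 0, 0]  # total occurrences for ngram sizes 1, 2 and 3

    for text in texts:
        for word in text.split():
            for size in (1, 2, 3):
                # Slide a window of `size` characters across the word.
                for start in range(len(word) - size + 1):
                    freq[word[start:start + size]] += 1
                    n_words[size - 1] += 1

    return {"freq": dict(freq), "n_words": n_words, "name": name}


if __name__ == "__main__":
    # Tiny Georgian sample; a usable subset needs a far larger corpus.
    subset = build_subset(["გამარჯობა"], "ka")
    # ensure_ascii=False keeps non-Latin ngrams readable in the output.
    print(json.dumps(subset, ensure_ascii=False))
```

In practice `texts` would be the contents of many files in the target language, and the resulting JSON would be saved under src/LanguageDetector/subsets/.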

Hope this helps

@devope

devope commented Aug 25, 2024

@landrok hey! Can you give any advice on how to extend your library to support the Georgian (ka) language (https://en.wikipedia.org/wiki/Georgian_language)?
