Distinguishes between English, French, and Italian text with 99% accuracy. Uses language modeling techniques including Laplace and Good-Turing smoothing.

Coder1400/LanguageIdentification

Language Modeling: letter & word bi-grams for language identification.



========================== SETUP INSTRUCTIONS ===============================

1.) Clone this repository from https://github.com/Arken94/LanguageIdentification.git

2.) Within the newly cloned repository on your local machine, there should be three Python files: “letterLangId.py”, “wordLangId.py”, and “wordLangId2.py”

3.) letterLangId.py is the letter bigram implementation. wordLangId.py is the word bigram implementation. wordLangId2.py is the word bigram implementation with an advanced smoothing technique (extra credit). 

4.) To run any of these Python files, invoke it with the python command, for example: 

	“python wordLangId.py”

The code opens the appropriate training and test data files (the file names are hardcoded, so the program takes no arguments) and runs the language model for that implementation. 

5.) NOTE: When you run any of the Python files, MAKE SURE that the training files and the test files are in the directory you run the Python file from. They are included in the GitHub repository, so they should already be there. 

6.) The output of each program is written to a file with the same name as the Python file, but with a “.out” extension. For example, wordLangId.py writes its output to wordLangId.out. The repository already includes these files with the output of each program. 
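
For readers who want a feel for the technique before running the scripts, here is a minimal sketch of letter-bigram language identification with Laplace (add-one) smoothing. It is not the code in letterLangId.py: the function names (train, log_prob, identify) and the tiny in-memory training sentences are assumptions made for illustration, whereas the actual scripts read the training and test files shipped with the repository.

```python
import math
from collections import Counter

# Toy training data (hypothetical; the real scripts read training files).
TRAINING = {
    "English": "the quick brown fox jumps over the lazy dog and the cat",
    "French": "le renard brun saute par dessus le chien paresseux et le chat",
    "Italian": "la volpe marrone salta sopra il cane pigro e il gatto",
}

def train(text):
    """Return (bigram counts, context counts) for one language."""
    text = text.lower()
    bigrams = Counter(zip(text, text[1:]))  # counts of (previous, current) letters
    contexts = Counter(text[:-1])           # how often each letter starts a bigram
    return bigrams, contexts

def log_prob(sentence, model, vocab_size):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    bigrams, contexts = model
    s = sentence.lower()
    return sum(
        # Laplace smoothing: (count(a, b) + 1) / (count(a) + V)
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(s, s[1:])
    )

def identify(sentence, models, vocab_size):
    """Pick the language whose model gives the sentence the highest score."""
    return max(models, key=lambda lang: log_prob(sentence, models[lang], vocab_size))

# One shared vocabulary, so every model smooths over the same set of symbols.
VOCAB = set("".join(TRAINING.values()).lower())
MODELS = {lang: train(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    for line in ["the lazy dog", "le chien paresseux", "il cane pigro"]:
        print(line, "->", identify(line, MODELS, len(VOCAB)))
```

Sharing a single vocabulary size across the three models is a deliberate choice in this sketch: if each model used only its own alphabet, a language with fewer distinct letters would assign unseen bigrams a larger smoothed probability and gain an unfair advantage.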







