Kudryavets/typesafer-nlp-r

TSafer (TypeSafer)

A Natural Language Processing app which predicts the next word you want to enter.

The model was trained on 50% of the blogs corpus, 40% of the news corpus, and 60% of the Twitter corpus. This sample covers 75% of the unique words across all corpora.


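TSafer uses interpolated Kneser-Ney smoothing over 4-, 3-, 2-, and 1-grams, with a back-off model for unseen words. The coefficients for the higher-order n-grams are computed with the formula:

[formula image: higher-order Kneser-Ney coefficients]

and for the lowest order:

[formula image: lowest-order Kneser-Ney coefficients]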
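
For reference, the textbook form of interpolated Kneser-Ney smoothing is sketched below. This is the standard formulation, not a transcription of the formula images, so TSafer's exact coefficients may differ. The symbols are assumptions: `c(·)` is an n-gram count, `d` is a discount between 0 and 1, and `w_{a}^{b}` denotes the word sequence from position a to b. In the usual formulation, the orders below the highest one use continuation counts in place of raw counts.

```latex
% Highest order: discounted count plus interpolation with the lower order
P_{KN}(w_i \mid w_{i-n+1}^{i-1}) =
  \frac{\max\bigl(c(w_{i-n+1}^{i}) - d,\ 0\bigr)}{c(w_{i-n+1}^{i-1})}
  + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})

% Interpolation weight: the probability mass freed by discounting
\lambda(w_{i-n+1}^{i-1}) =
  \frac{d}{c(w_{i-n+1}^{i-1})}\,
  \bigl|\{\, w : c(w_{i-n+1}^{i-1} w) > 0 \,\}\bigr|

% Lowest order: continuation probability of a single word
P_{KN}(w_i) =
  \frac{\bigl|\{\, w' : c(w' w_i) > 0 \,\}\bigr|}
       {\bigl|\{\, (w', w'') : c(w' w'') > 0 \,\}\bigr|}
```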

TSafer uses

  • precomputed Kneser-Ney coefficients for every word,
  • stored in an R data.table with a hashed index,
  • with each query processed by a recursive function.

Together, these make it really fast.
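
The lookup idea described above can be illustrated with a small sketch. It is written in Python rather than R and is not TSafer's code: the table contents, the scores, and the names `SCORES`, `MAX_CONTEXT`, and `predict` are all hypothetical. A dictionary stands in for the hashed data.table, and the function recurses to a shorter context whenever the current one is unseen.

```python
from typing import Dict, List, Tuple

Context = Tuple[str, ...]

# Hypothetical precomputed scores: context -> {candidate word: score}.
# The numbers are made up for illustration and are not taken from TSafer.
SCORES: Dict[Context, Dict[str, float]] = {
    ("thanks", "for", "the"): {"follow": 0.41, "help": 0.22},
    ("for", "the"): {"first": 0.12, "follow": 0.10},
    ("the",): {"same": 0.05, "first": 0.04},
    (): {"the": 0.06, "to": 0.03},  # unigram scores, used as the last resort
}

MAX_CONTEXT = 3  # a 4-gram model conditions on at most 3 previous words


def predict(context: Context, top_n: int = 1) -> List[str]:
    """Return the top_n candidates for the longest known suffix of context."""
    context = context[-MAX_CONTEXT:]
    candidates = SCORES.get(context)  # hashed lookup, no scan over the table
    if candidates is None:
        if not context:
            return []  # nothing stored at all, not even unigram scores
        # Unseen context: drop the oldest word and retry with a shorter one.
        return predict(context[1:], top_n)
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:top_n]
```

Because every score is computed ahead of time, a query costs at most four dictionary lookups plus a sort over the candidates for one context.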

Text is processed with regular expressions, and n-grams are generated with RWeka.
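
As a rough illustration of this step, the sketch below tokenizes text with a regular expression and counts 1- to 4-grams. It is a Python stand-in, not the RWeka pipeline; the tokenization rule (lowercase, letters and apostrophes only) and the function names are assumptions.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters and apostrophes (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text: str, max_n: int = 4) -> Counter:
    """Count all n-grams of order 1..max_n in a single piece of text."""
    tokens = tokenize(text)
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts
```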

Training and text processing are parallelized with doParallel.
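
The general pattern is to split the corpus into chunks, count each chunk in a worker, and merge the partial counts. The sketch below shows that pattern in Python under stated assumptions: it uses threads from the standard library to stay short, whereas doParallel distributes work to worker processes, and the function names and whitespace tokenization are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def count_bigrams(lines: List[str]) -> Counter:
    """Count bigrams in one chunk of lines (whitespace tokenization, assumed)."""
    counts: Counter = Counter()
    for line in lines:
        tokens = line.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def parallel_bigram_counts(lines: List[str], workers: int = 2) -> Counter:
    """Split lines into chunks, count each chunk in a worker, merge the results."""
    chunks = [lines[i::workers] for i in range(workers)]
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_bigrams, chunks):
            total.update(partial)
    return total
```

Since n-grams never span two lines here, the merged counts match a single sequential pass over the same lines.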
