
Predicting Language From Text

This repo contains the scripts and notebooks used to train a convolutional neural network (+LSTM) on a corpus of text and predict the language of that text with 99% accuracy.

Training & Architecture

Initially, I based the neural network on a WildML blog post that aims to predict sentiment from movie reviews; that post references a 2014 paper by Yoon Kim. I found this approach only reached ~98% accuracy and took 11+ hours to train, so I chose to try out other architectures. The final architecture I settled on was a combination of CNN + LSTM, which achieved 99.7% accuracy on a held-out set of 21k records provided by Startup.ml. Interestingly, the second architecture only required about 58 minutes to train.

Layers used in training the model (a rough code sketch follows the list):

  • Embedding layer: generates vector representations of individual words (+ dropout).
  • A single 1D convolutional layer: convolve the input using a 5x64 filter with valid padding, apply a ReLU for non-linearity, then max-pool over the ReLU output (stride of 4).
  • An LSTM layer
  • Finally, a fully connected layer with sigmoid activation.
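A minimal sketch of this layer stack in Keras is below. Hyperparameters not stated above (vocabulary size, embedding dimension, number of language classes, LSTM units, optimizer, and loss) are assumptions for illustration, not values taken from the repo:

```python
# Hedged sketch of the CNN + LSTM stack described above (Keras).
# VOCAB_SIZE, EMBED_DIM, NUM_LANGUAGES, and LSTM units are assumptions.
from tensorflow.keras import layers, models

VOCAB_SIZE = 50000    # assumed vocabulary size
EMBED_DIM = 128       # assumed embedding dimension
SEQ_LEN = 70          # sentences are padded/truncated to 70 words
NUM_LANGUAGES = 21    # assumed number of language classes

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Dropout(0.5),
    # 1D convolution: 64 filters of width 5, valid padding, ReLU non-linearity
    layers.Conv1D(filters=64, kernel_size=5, padding="valid", activation="relu"),
    # Max pooling over the ReLU output (pool size / stride of 4)
    layers.MaxPooling1D(pool_size=4),
    layers.LSTM(64),
    # Fully connected output layer with sigmoid activation, as in the write-up above
    layers.Dense(NUM_LANGUAGES, activation="sigmoid"),
])
# Optimizer and loss are assumptions; the repo may use different choices.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```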

Results

Overall accuracy on the test set was 99.7%. The normalized confusion matrix from the CNN + LSTM architecture is shown below, which highlights some areas where the model is weaker. In particular, the model has a hard time with Finnish and Romanian. This suggests some interesting next steps: pursuing more training data for those two languages, or investigating peculiarities of their structure.
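For reference, a row-normalized confusion matrix of this kind can be computed from the held-out predictions roughly as follows (a sketch assuming scikit-learn; the labels and predictions here are illustrative, not the repo's actual test data):

```python
# Sketch: row-normalized confusion matrix from held-out labels and predictions.
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels and predictions; in practice these come from the 21k-record test set.
y_true = np.array(["en", "es", "fi", "fi", "ro", "en"])
y_pred = np.array(["en", "es", "fi", "ro", "ro", "en"])

labels = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Normalize each row so entries are the fraction of each true language
# that the model assigned to every predicted language.
cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True)
print(labels)
print(cm_norm)
```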

Below is a plot demonstrating the plateauing training results from the first architecture. Its training time was 11+ hours, compared to 58 minutes for the second (CNN + LSTM) architecture:

Info on the raw data

The data is pulled from statmt.org and comes as a zipped directory containing many text files grouped into subdirectories by language.

The following is an example of the original data layout:

├── en
|   ├── ep-00-01-17.txt
|   └── ep-00-01-18.txt
|   └──...
├── es
|   ├── ep-00-01-17.txt
|   └── ep-00-01-18.txt
|   └──...
├── ...
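Given that layout, the raw text can be read into (line, language) pairs by labeling each file with the name of its parent directory. The sketch below assumes an unzipped corpus at a root path called txt and UTF-8 encoding; both are assumptions for illustration:

```python
# Sketch: walk the unzipped statmt.org corpus and label each line by its
# parent directory (the language code). The root path is an assumption.
from collections import Counter
from pathlib import Path

DATA_ROOT = Path("txt")  # assumed location of the unzipped corpus

def iter_labeled_lines(root: Path):
    """Yield (line, language_code) pairs from every .txt file under root."""
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for txt_file in sorted(lang_dir.glob("*.txt")):
            with txt_file.open(encoding="utf-8", errors="ignore") as handle:
                for line in handle:
                    line = line.strip()
                    if line:
                        yield line, lang_dir.name

# Example usage: count how many lines each language contributes
counts = Counter(lang for _, lang in iter_labeled_lines(DATA_ROOT))
print(counts)
```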

Preparation Steps

Before running the CNN against any labeled data (for training or testing), we need to perform some cleaning steps. These steps are carried out in the "exploratory work" notebook; a rough sketch of them follows the list below.

  • Strip XML from the raw data since many lines in the raw data weren't meant to be used in training.
  • Truncate each sentence at 70 words. Pad any sentence shorter than 70 words with a special padding token so that all sentences are exactly 70 words long. Padding sentences to the same length lets us batch our data efficiently, since every example in a batch must be the same length.
  • Build a vocabulary index and map each word to an integer between 0 and N (the vocabulary size).
  • Convert each sentence into a vector of integers.
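A minimal sketch of the padding, vocabulary-indexing, and vectorization steps is below. The padding token, example sentences, and variable names are illustrative assumptions, not the repo's actual identifiers:

```python
# Sketch of the preparation steps: truncate/pad to 70 tokens, build a
# vocabulary index, and convert each sentence into a vector of integers.
from itertools import chain

MAX_LEN = 70
PAD_TOKEN = "<PAD>"  # assumed padding token

# Illustrative sentences; in practice these come from the cleaned corpus lines.
sentences = [
    "resumption of the session".split(),
    "reanudación del período de sesiones".split(),
]

# Truncate at MAX_LEN words and pad shorter sentences with the padding token.
padded = [s[:MAX_LEN] + [PAD_TOKEN] * max(0, MAX_LEN - len(s)) for s in sentences]

# Build a vocabulary index mapping each word to an integer in [0, N).
vocab = {PAD_TOKEN: 0}
for word in chain.from_iterable(padded):
    vocab.setdefault(word, len(vocab))

# Convert each sentence into a vector of integers.
vectors = [[vocab[word] for word in sentence] for sentence in padded]
print(vectors[0][:10])
```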

Links to Files to Reproduce (too large to host here)
