A repository describing the construction of a unigram language model from the Fisher corpus

emeinhardt/fisher-lm-srilm

fisher-lm-srilm

The notebook in this repository documents how to use SRILM to construct an add-1 smoothed unigram model from the Fisher corpus. It exists principally because kenlm does not estimate unigram models at all.

The variation among the models tested here serves more as documentation of SRILM's options and usage than as meaningful testing, especially given how little the test-set results vary.
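
For readers unfamiliar with the term, add-1 (Laplace) smoothing of a unigram model can be sketched in a few lines of pure Python. This is only an illustration of the general technique, P(w) = (count(w) + 1) / (N + V), and not the notebook's code or SRILM's implementation; SRILM's own bookkeeping (for example sentence-boundary and unknown-word tokens) may differ. The function name `add_one_unigram` and the toy utterances are hypothetical.

```python
from collections import Counter


def add_one_unigram(utterances, vocab=None):
    """Return a dict mapping each word to its add-1 smoothed unigram probability.

    utterances: iterable of whitespace-tokenized strings, one utterance each.
    vocab: optional fixed vocabulary; defaults to the words seen in the data.
    (Hypothetical helper for illustration only.)
    """
    counts = Counter(word for utt in utterances for word in utt.split())
    vocab = set(counts) if vocab is None else set(vocab)
    # N = total tokens that fall inside the vocabulary, V = vocabulary size.
    total = sum(counts[w] for w in vocab)
    denom = total + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}


# Toy data, not taken from the Fisher corpus.
toy = ["uh huh yeah", "yeah right"]
model = add_one_unigram(toy)
```

Passing a `vocab` larger than the observed word set gives unseen words a small non-zero probability, which is the point of the smoothing.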

Dependencies

  • You need a Unix-like environment that supports basic shell/bash commands, sed, and awk. All of this functionality could easily (but tediously) be replaced by pure Python.
  • The most essential parts of the notebook are bash cells that call SRILM shell commands, so you need SRILM installed.
  • You need your own copy of the Fisher corpus, processed analogously to my other Fisher corpus language model repository: a text file with one utterance per line (fisher_utterances_main.txt). The notebook currently assumes that you have also divided this file into training and test sets (fisher_training_utterances.txt, fisher_test_utterances.txt).
  • At the bottom of the notebook, you'll need three modules to process and export SRILM's results:
    • funcy and numpy
    • probdist, a module from another repository of mine, currently assumed to be located at the path ../wr/ relative to this repository's directory.
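
If you have only fisher_utterances_main.txt and still need the training and test files, one possible way to produce them is sketched below. The split ratio, the random seed, and the function names `split_utterances` and `split_file` are assumptions made for illustration; the README does not say how the original split was made.

```python
import random


def split_utterances(lines, test_fraction=0.1, seed=0):
    """Shuffle utterances reproducibly and return (train, test) lists.

    test_fraction and seed are illustrative defaults, not the author's values.
    """
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def split_file(main_path, train_path, test_path, **kwargs):
    """Read one utterance per line from main_path and write the two subsets."""
    with open(main_path, encoding="utf-8") as f:
        # Drop blank lines so every output line is one utterance.
        lines = [line.rstrip("\n") for line in f if line.strip()]
    train, test = split_utterances(lines, **kwargs)
    for path, subset in ((train_path, train), (test_path, test)):
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(utt + "\n" for utt in subset)
    return len(train), len(test)
```

A call such as `split_file("fisher_utterances_main.txt", "fisher_training_utterances.txt", "fisher_test_utterances.txt")` would then write the two files the notebook expects.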
