forked from ai-ku/fastsubs
-
Notifications
You must be signed in to change notification settings - Fork 1
Generate most likely substitutes for words in a given text based on an n-gram language model.
License
orenmel/fastsubs
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
FASTSUBS Copyright (c) 2012-2014 Deniz Yuret Usage: fastsubs [-n <n> | -p <p>] model.lm[.gz] < input.txt Fastsubs is a program that finds the most likely substitutes for words in input.txt using the language model in model.lm (ARPA format). The number of substitutes is controlled by two parameters: -n specifies the number of substitutes for each word, -p specifies a threshold on the sum of the substitute probabilities. The program stops generating substitutes for a word when either limit is satisfied. The output has an input word and its substitutes with the logarithm (base 10) of their unnormalized probabilities in the following format: word <tab> sub1 <space> lprob1 <tab> sub2 <space> lprob2 <tab> ... Please see the file LICENSE for terms of use. A paper with the description of the algorithm and other related resources can be found at http://goo.gl/jzKH0. The code is mostly C99 compliant and has been compiled and tested on a linux system with gcc. The few GNU extensions used are optional and can be turned off by editing the beginning of dlib.h. Update of Jan 9, 2014: The latest version gets rid of all glib dependencies (glib is broken, it blows up your code without warning if your arrays or hashes get too big). It also fixes a bug: The previous versions of the code and the paper assumed that the log back-off weights were upper bounded by 0 which they are not. In my standard test of generating the top 100 substitutes for all the 1,222,974 positions in the PTB, this caused a total of 66 low probability substitutes to be missed and 31 to be listed out of order. Finally a multithreaded version, fastsubs-omp is implemented. The number of threads can be controlled using the environment variable OMP_NUM_THREADS. fastsubs-omp can be used with a few additional run-time options: Usage: fastsubs-omp [-n <n> | -p <p> | -m <max-threads> | -t | -z ] model.lm[.gz] < input.txt -m specifies the maximum number of threads that can be used -z specifies that the output probability values would be normalized probabilities that sum up to 1 (instead of the default unnormalized log probabilities) -t specifies a one-target-per-text-line input format (instead of the default free text format). In this format every input line is: target_name <tab> target_id <tab> target_index <tab> text_line Then for every input line fastsubs-omp outputs two lines: 1) the input line 2) substitutes for the text_line[target_index] (i.e. the word in the target_index position in text_line) in the format: sub1 <space> lprob1 <tab> sub2 <space> lprob2 <tab> ... Other test and utility executables that can be built using the Makefile: * fastsubs-omp: Multithreaded version. * wordsub: Takes the output of fastsubs and samples random substitutes for each word. * normalize-subs.pl: Takes fastsubs output and converts the unnormalized log10(p) entries to normalized p entries that add up to 1.0 (which may not be accurate if fastsubs did not output the whole vocabulary.) * fastsubs-test: Takes a model file and interactively runs fastsubs on sentences typed by the user. * lmheap-test: Takes a model file and tests the heaps in the lm data structure by interactively reading n-grams from the user and printing out the contents of the heap for each position. * sentence-test: Takes a model file and tests the sentence and lm data structures by reading sentences from the user and printing out logp for each word. * lm-test: Takes a model file and tests the lm data structure by reading n-grams from the user and printing out their logp and back-off weights.
About
Generate most likely substitutes for words in a given text based on an n-gram language model.
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published
Languages
- C 91.9%
- C++ 4.2%
- Makefile 3.3%
- Perl 0.6%