Random utilities for NLP. Many of them were designed for MT (Machine Translation) experiments, but they can still be used for general purposes.
Name | Script | Description |
---|---|---|
Word Count | word-count.py | Count (OOV/IV) words |
Probability Histogram | probability-histogram.py | Generate a probability histogram |
Vertical Statistics | vertical-statistics.py | Calculate statistics vertically for values (with fixed patterns) |
Sequence Diff | sequence-diff.py | Compare sequences and display diffs |
Name | Script | Description |
---|---|---|
Bitext Identical Pairs | bitext-identical-pairs.py | Detect (and remove) identical pairs from bitext |
Bitext Cleaning | bitext-cleaning.py | Clean bitext by heuristic rules |
Corpus Name | Script | Description |
---|---|---|
ICWSM 2009 Spinn3r Blog Dataset | Spinn3r-2009-extract.py | Extract select (and clean) text |
MSLT (Microsoft Speech Language Translation) | MSLT-repack.sh, MSLT-extract.py | Extract monolingual/parallel data |
PPDB (Paraphrase Database) | PPDB-extract.py | Extract select paraphrases |
XLIFF files | XLIFF-extract.py | Extract bitext from XLIFF files |
Besides counting words, word-count.py can also count or list OOV (out-of-vocabulary) and IV (in-vocabulary) words.
Sample output:
, 63751725
. 61725497
the 60873114
to 35743675
and 34360371
a 29438769
of 28104862
i 27116174
in 20234743
" 16950626
Usage: word-count.py [-i INPUT] [-w WHITE_LIST] [-b BLACK_LIST] [-s]
Example: cat file | python word-count.py -w list -s > output
cat file | python word-count.py -b vocabulary > oov
cat file | python word-count.py -w vocabulary > iv
Optional arguments:
-i INPUT, --input INPUT
input file(s) (glob patterns are supported)
-w WHITE_LIST, --white-list WHITE_LIST
only count words in the white list
-b BLACK_LIST, --black-list BLACK_LIST
ignore words in the black list
-s, --statistics print statistics (default: False)
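The white-list and black-list filtering described above can be sketched as follows. This is a minimal illustration, not the code of word-count.py: the function name `count_words` and the whitespace tokenization are assumptions.

```python
from collections import Counter


def count_words(lines, white_list=None, black_list=None):
    """Count whitespace-separated words, optionally filtered by word lists.

    Hypothetical helper: `white_list` keeps only listed words (IV counting),
    `black_list` drops listed words (OOV counting).
    """
    counts = Counter()
    for line in lines:
        for word in line.split():
            if white_list is not None and word not in white_list:
                continue
            if black_list is not None and word in black_list:
                continue
            counts[word] += 1
    return counts


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog"]
    vocabulary = {"the", "cat"}
    # Passing the vocabulary as a black list leaves only OOV words.
    for word, count in count_words(corpus, black_list=vocabulary).most_common():
        print(word, count)
```

Passing the same vocabulary as `white_list` instead yields the IV counts, mirroring the `-w vocabulary` and `-b vocabulary` examples above.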
- Dependency: NumPy
Sample output:
-1.0 0.015589
-0.8 0.047416
-0.6 0.077869
-0.4 0.137002
-0.2 0.195826
0.0 0.102647
0.2 0.114418
0.4 0.134427
0.6 0.131316
0.8 0.043490
1.0
Usage: probability-histogram.py [-i INPUT] [-c COLUMN] [-n] [-l LOWER] [-u UPPER] [-b BINS] [-p]
Example: cat file | python probability-histogram.py -c 1 -n -p
Optional arguments:
-i INPUT, --input INPUT
input file(s) (glob patterns are supported)
-c COLUMN, --column COLUMN
the index of column that contains values (default: 0)
-n, --normalize normalize scores to [-1,1] (default: False)
-l LOWER, --lower LOWER
the lower range of bins
-u UPPER, --upper UPPER
the upper range of bins
-b BINS, --bins BINS the number of bins (default: 10)
-p, --plot plot the histogram (default: False)
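As an illustration of the binning behind the options above, here is a rough sketch in plain Python. The script itself depends on NumPy; this standard-library version, the name `probability_histogram`, and the choice to put values equal to the upper bound into the last bin are assumptions. Normalization (`-n`) is left out because its exact formula is not documented here.

```python
def probability_histogram(values, lower=None, upper=None, bins=10):
    """Return (bin_edges, probabilities) for equal-width bins.

    Hypothetical helper: values outside [lower, upper] are ignored, and a
    value equal to `upper` is counted in the last bin.
    """
    lower = min(values) if lower is None else lower
    upper = max(values) if upper is None else upper
    width = (upper - lower) / bins
    counts = [0] * bins
    total = 0
    for value in values:
        if value < lower or value > upper:
            continue
        index = min(int((value - lower) / width), bins - 1)
        counts[index] += 1
        total += 1
    edges = [lower + i * width for i in range(bins + 1)]
    probabilities = [count / total for count in counts] if total else counts
    return edges, probabilities


if __name__ == "__main__":
    edges, probs = probability_histogram([0.0, 0.1, 0.5, 0.9, 1.0],
                                         lower=0.0, upper=1.0, bins=2)
    # One line per bin edge; the last edge has no probability, as in the
    # sample output above.
    for edge, prob in zip(edges, probs + [None]):
        print(edge if prob is None else f"{edge} {prob:.6f}")
```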
- Dependency: NumPy
Sample input:
BLEU = 33.99, 64.8/42.0/30.6/23.3 (BP=0.911, ratio=0.915, hyp_len=22925, ref_len=25061)
BLEU = 32.78, 65.5/40.9/28.2/20.2 (BP=0.933, ratio=0.935, hyp_len=55947, ref_len=59823)
BLEU = 37.29, 68.7/44.5/31.8/23.2 (BP=0.963, ratio=0.963, hyp_len=76162, ref_len=79064)
Sample output:
mean BLEU = 34.69, 66.3/42.5/30.2/22.2 (BP=0.936, ratio=0.938, hyp_len=51678, ref_len=54649)
median BLEU = 33.99, 65.5/42.0/30.6/23.2 (BP=0.933, ratio=0.935, hyp_len=55947, ref_len=59823)
Usage: vertical-statistics.py [-i INPUT] [-l] [-c COLUMN]
[-m {mean,min,max,range,median,sum,std,var,sub} [{mean,...,sub} ...]]
Example: cat file1 file2 file3 | python vertical-statistics.py -l -m mean median > output
Optional arguments:
-i INPUT, --input INPUT
input file(s) (glob patterns are supported)
-m, --metrics {mean,min,max,range,median,sum,std,var,sub} [{mean,min,max,range,median,sum,std,var,sub} ...]
statistic metrics (default: ['mean'])
-l, --label print metrics labels (default: False)
-c COLUMN, --column COLUMN
analyse a specified whitespace-split column (c-th) (default: None)
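The idea of computing statistics "vertically" over lines that share a fixed pattern can be sketched as below. This is an assumption-laden illustration rather than the script's implementation: the regular expression, the name `vertical_mean`, and the rule of reusing the first line's decimal places are all hypothetical. Only the mean is shown.

```python
import re
from statistics import mean

# Unsigned integers or decimals; everything else is treated as the fixed pattern.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def vertical_mean(lines):
    """Average the numbers found at the same position in every line.

    The first line serves as the template, and each mean is printed with
    as many decimals as the corresponding number in that line.
    """
    columns = [[float(token) for token in NUMBER.findall(line)] for line in lines]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("lines do not share the same pattern")
    means = iter(mean(values) for values in zip(*columns))

    def fill(match):
        decimals = len(match.group(0).partition(".")[2])
        return f"{next(means):.{decimals}f}"

    return NUMBER.sub(fill, lines[0])


if __name__ == "__main__":
    scores = [
        "BLEU = 33.99, 64.8/42.0/30.6/23.3 (BP=0.911, ratio=0.915, hyp_len=22925, ref_len=25061)",
        "BLEU = 32.78, 65.5/40.9/28.2/20.2 (BP=0.933, ratio=0.935, hyp_len=55947, ref_len=59823)",
        "BLEU = 37.29, 68.7/44.5/31.8/23.2 (BP=0.963, ratio=0.963, hyp_len=76162, ref_len=79064)",
    ]
    print("mean", vertical_mean(scores))
```

On the sample input above, this reproduces the `mean` line of the sample output.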
Sample output:
1 CONST-1 you can remove it .
....................................................................................................
1 SEQUE-B you can take it off .
1 SEQUE-1 you can withdraw .
====================================================================================================
3 CONST-1 but , let 's face it , underachiever , dead @-@ end life , okay ?
....................................................................................................
3 SEQUE-B let us be frank . he 's got a lousy job , he ain 't got no prospects .
^ ---- ----------- -
3 SEQUE-1 let us be frank . he has a lousy job , he no longer has any prospect .
^^ +++++++++++++++
====================================================================================================
Usage: sequence-diff.py -f FILE [FILE ...] [-ft FILE_TAG [FILE_TAG ...]]
[-c CONST [CONST ...]] [-ct CONST_TAG [CONST_TAG ...]] [-d] [-m {char,token}] [-v]
Example: python sequence-diff.py -c source_file -f reference_file hypothesis_file
Optional arguments:
-f FILE [FILE ...], --file FILE [FILE ...]
input files of sequences to be compared (the first file is the base to be compared with,
such as reference translations) (default: None)
-c CONST [CONST ...], --const CONST [CONST ...]
files of sequences not participating in the comparison,
such as source sentences to be translated (default: [])
-ft FILE_TAG [FILE_TAG ...], --file-tag FILE_TAG [FILE_TAG ...]
tags of input files (default: None)
-ct CONST_TAG [CONST_TAG ...], --const-tag CONST_TAG [CONST_TAG ...]
tags of const files (default: None)
-d, --condense condense the comparison of multiple sequences without showing diffs (default: False)
-m {char,token}, --mode {char,token}
compute diffs at character level or token level (default: char)
-v, --verbose print all sequences in the condense mode (default: False)
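A character-level or token-level diff against a base sequence can be approximated with Python's standard `difflib`. Whether sequence-diff.py uses `difflib` internally is not stated here, so treat this as a sketch; the function name `diff_sequences` and its return format are assumptions.

```python
from difflib import SequenceMatcher


def diff_sequences(base, other, mode="token"):
    """List the differing spans between a base sequence and another one.

    Returns (tag, base_span, other_span) tuples, where tag is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    a = base.split() if mode == "token" else list(base)
    b = other.split() if mode == "token" else list(other)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    reference = "you can take it off ."
    hypothesis = "you can withdraw ."
    for tag, old, new in diff_sequences(reference, hypothesis):
        print(tag, old, "->", new)
```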
Sample output:
19056 inclusion=True
FILE-1 We' re on our way , way , way , we' re on our way
FILE-2 ♪ We 're on our way , way , way ♪ ♪ We 're on our way , way , way , we 're on our way ... ♪
====================================================================================================
21584 similarity=0.68
FILE-1 We want to make a place we can learn to love , anywhere we can be proud of .
FILE-2 ♪ We wanna make a place where we can learn to love ♪ ♪ Build a world that we can be proud of ♪
====================================================================================================
27541623 bitext pairs were read
770532 pairs (2.80%) were identical with inclusion and threshold=0.50
Usage: bitext-identical-pairs.py [-f FILE [FILE ...]] [-o OUTPUT [OUTPUT ...]] [-i]
[-t THRESHOLD] [-c] [-p] [-l] [-u] [-v]
Example: python bitext-identical-pairs.py -f file1 file2 -o output1 output2 -i -t 0.5 -p -l -v
Optional arguments:
-f FILE [FILE ...], --file FILE [FILE ...]
input bitext file(s) to be compared (default: None)
-o OUTPUT [OUTPUT ...], --output OUTPUT [OUTPUT ...]
output bitext file(s) without identical pairs (default: None)
-i, --inclusion treat inclusion as identity (default: False)
-t THRESHOLD, --threshold THRESHOLD
similarity threshold to determine identity ([0,1]) (default: 0.9)
-c, --character calculate character-level similarity (default: False)
-p, --punc-digit exclude punctuations and digits from comparison (default: False)
-l, --lowercase compare lowercased sequences (default: False)
-u, --capitalized compare capitalized sequences (default: False)
-v, --verbose print identical pairs (default: False)
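How the inclusion and threshold options might interact can be sketched as follows. The similarity measure used by bitext-identical-pairs.py is not documented in this README, so the `difflib` ratio below is a stand-in, and the names `normalize` and `is_identical` are hypothetical.

```python
import string
from difflib import SequenceMatcher

# Characters removed when punctuation and digits are excluded (ASCII only here).
PUNC_DIGIT = str.maketrans("", "", string.punctuation + string.digits)


def normalize(text, lowercase=True, punc_digit=True):
    """Tokenize on whitespace, optionally lowercasing and dropping punctuation/digits."""
    if lowercase:
        text = text.lower()
    tokens = text.split()
    if punc_digit:
        tokens = [token.translate(PUNC_DIGIT) for token in tokens]
    return [token for token in tokens if token]


def is_identical(source, target, threshold=0.9, inclusion=False):
    """Decide whether two sides of a bitext pair count as identical."""
    a, b = normalize(source), normalize(target)
    if not a or not b:
        return False
    if inclusion:
        # Treat the pair as identical if the shorter side occurs in the longer one.
        short, long_ = sorted((" ".join(a), " ".join(b)), key=len)
        if f" {short} " in f" {long_} ":
            return True
    return SequenceMatcher(None, a, b, autojunk=False).ratio() >= threshold


if __name__ == "__main__":
    print(is_identical("Hello , world !", "hello world"))
    print(is_identical("on our way", "we are on our way now", inclusion=True))
```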
Sample output:
3 length-ratio=2.85
FILE-1 Саvеndіѕh , mais la totalité s' élève à ... 2,343 livres et 16 cts .
FILE-2 2,343 pounds and 16 pence .
====================================================================================================
15 uppercase=True
FILE-1 ЅΑΝ FRΑΝСΙЅСΟ , 1973
FILE-2 SAN FRANCISCO , 1973
====================================================================================================
27541623 bitext pairs were read
2960667 pairs (10.75%) were filtered out
- 2944687 pairs (10.69%) were imbalanced with length-ratio >= 2.00
- 19386 pairs (0.07%) were uppercased (both source and target)
233423 pairs (0.85%) have been capitalized
Usage: bitext-cleaning.py [-f FILE [FILE ...]] [-o OUTPUT [OUTPUT ...]] [-r RATIO] [-i] [-u] [-v]
Example: python bitext-cleaning.py -f file1 file2 -o output1 output2 -r 2.0 -u -v
Optional arguments:
-f FILE [FILE ...], --file FILE [FILE ...]
input bitext file(s) (default: None)
-o OUTPUT [OUTPUT ...], --output OUTPUT [OUTPUT ...]
output bitext file(s) (default: None)
-r RATIO, --ratio RATIO
remove pairs whose length ratios are no less than a threshold (default: None)
-i, --incomplete remove pairs if they contain incomplete sentences,
i.e. no .!?" at the end (default: False)
-u, --uppercase remove pairs if both source and target are uppercased,
otherwise capitalize uppercase strings (default: False)
-v, --verbose print identified pairs (default: False)
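The three heuristics above (length ratio, incomplete sentences, uppercase) could be combined roughly as in this sketch. It is not the script's code: the name `clean_pair`, measuring length in characters, and requiring both sides to end in a closing mark are assumptions.

```python
END_MARKS = (".", "!", "?", '"')


def clean_pair(source, target, ratio=None, incomplete=False, uppercase=False):
    """Return the (possibly capitalized) pair, or None if it should be removed."""
    if ratio is not None:
        longer = max(len(source), len(target))
        shorter = min(len(source), len(target))
        # Remove pairs whose length ratio is no less than the threshold.
        if shorter == 0 or longer / shorter >= ratio:
            return None
    if incomplete:
        complete = (source.rstrip().endswith(END_MARKS)
                    and target.rstrip().endswith(END_MARKS))
        if not complete:
            return None
    if uppercase:
        # Drop the pair if both sides are uppercased; otherwise capitalize
        # whichever side is.
        if source.isupper() and target.isupper():
            return None
        if source.isupper():
            source = source.capitalize()
        if target.isupper():
            target = target.capitalize()
    return source, target


if __name__ == "__main__":
    print(clean_pair("HELLO WORLD .", "bonjour le monde .", uppercase=True))
    print(clean_pair("SAN FRANCISCO , 1973", "SAN FRANCISCO , 1973", uppercase=True))
```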
Usage: Spinn3r-2009-extract.py -f FILE [FILE ...] [-l LANGUAGES [LANGUAGES ...]]
-e ELEMENTS [ELEMENTS ...] [-u] [-c]
Examples: python Spinn3r-2009-extract.py -f BLOGS-tiergroup-1.tar.gz -e title description -l en -u -c > output.en
Optional arguments:
-f FILE [FILE ...], --file FILE [FILE ...]
Spinn3r tar.gz file(s)
-l LANGUAGES [LANGUAGES ...], --languages LANGUAGES [LANGUAGES ...]
language(s) to be extracted (e.g. en)
-e ELEMENTS [ELEMENTS ...], --elements ELEMENTS [ELEMENTS ...]
element(s) to be extracted (e.g. title, description)
-u, --unescape unescape text (e.g. "&amp;"->"&") (default: False)
-c, --clean clean text (drop <*>/URLs, condense spaces) (default: False)
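The unescape and clean steps can be illustrated with the standard library. The regular expressions and the function name `clean` below are assumptions, not the patterns used by Spinn3r-2009-extract.py.

```python
import html
import re

TAG = re.compile(r"<[^>]*>")       # anything that looks like a markup tag
URL = re.compile(r"https?://\S+")  # simple http(s) URL pattern


def clean(text):
    """Drop tags and URLs, then condense runs of whitespace."""
    text = TAG.sub(" ", text)
    text = URL.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    raw = "Tom &amp; Jerry <b>rock</b>   see http://example.com/x now"
    # Unescape first so that escaped entities become plain characters.
    print(clean(html.unescape(raw)))
```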
- Repack the MSLT text first (Python has an issue handling the original zip file).
bash MSLT-repack.sh /absolute/path/to/MSLT_Corpus.zip
- Extract parallel or monolingual data from MSLT_Corpus.tgz
Usage: MSLT-extract.py -f FILE -s SOURCE [-t TARGET] [-c CATEGORY] [-o OUTPUT]
Examples: python MSLT-extract.py -f MSLT_Corpus.tgz -s fr -t en -c dev -o MSLT.fr-en
python MSLT-extract.py -f MSLT_Corpus.tgz -s fr > MSLT.fr
Optional arguments:
-f FILE, --file FILE input repacked tgz file
-s SOURCE, --source SOURCE
source language (e.g. fr)
-t TARGET, --target TARGET
target language (e.g. en)
-c CATEGORY, --category CATEGORY
dev or test? (default: dev)
-o OUTPUT, --output OUTPUT
output file (used for parallel data)
Usage: PPDB-extract.py [-f FILE] [-a FEATURE] [-t THRESHOLD] [-e ENTAILMENT]
Examples: gzip -dc ppdb-2.0-s-lexical.gz | python PPDB-extract.py -e Equivalence > output
Optional arguments:
-f FILE, --file FILE unzipped input file(s) (glob patterns are supported)
-a FEATURE, --feature FEATURE
the feature used for filtering
-t THRESHOLD, --threshold THRESHOLD
the threshold used for filtering (entries with feature value >= threshold are kept)
-e ENTAILMENT, --entailment ENTAILMENT
the entailment type(s) used for filtering (regular expression)
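Filtering by feature threshold and entailment type might look like the sketch below. It assumes the PPDB 2.0 text layout of six `|||`-separated fields (label, phrase, paraphrase, features, alignment, entailment); the name `filter_ppdb` and the sample lines are made up for illustration.

```python
import re


def filter_ppdb(lines, feature=None, threshold=None, entailment=None):
    """Yield (phrase, paraphrase) pairs that pass the given filters."""
    pattern = re.compile(entailment) if entailment else None
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 6:
            continue  # skip lines that do not follow the assumed layout
        _, phrase, paraphrase, features, _, relation = fields[:6]
        if pattern is not None and not pattern.search(relation):
            continue
        if feature is not None and threshold is not None:
            values = dict(item.split("=", 1) for item in features.split() if "=" in item)
            # Keep entries whose feature value is >= threshold.
            if feature not in values or float(values[feature]) < threshold:
                continue
        yield phrase, paraphrase


if __name__ == "__main__":
    # Invented lines in the assumed format, not real PPDB entries.
    sample = [
        "[NN] ||| car ||| automobile ||| PPDB2.0Score=5.2 Abstract=0 ||| 0-0 ||| Equivalence",
        "[NN] ||| car ||| vehicle ||| PPDB2.0Score=3.1 Abstract=0 ||| 0-0 ||| ForwardEntailment",
    ]
    for pair in filter_ppdb(sample, entailment="Equivalence"):
        print("\t".join(pair))
```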
Usage: XLIFF-extract.py [-h] -f FILE [-s {source,target,both,reverse}]
Examples: python XLIFF-extract.py -f RAPID_2019.de-en.xlf > output
Optional arguments:
-f FILE, --file FILE XLIFF file (default: None)
-s {source,target,both,reverse}, --side {source,target,both,reverse}
side(s) of the bitext to be extracted (default: both)
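Extracting the selected side(s) from XLIFF `trans-unit` elements can be sketched with `xml.etree.ElementTree`. The tab-separated output and the reading of `reverse` as "target, then source" are assumptions, as is the function name `extract_bitext`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{urn:...}source' down to 'source'."""
    return tag.rsplit("}", 1)[-1]


def extract_bitext(xml_text, side="both"):
    """Return one output line per trans-unit for the requested side(s)."""
    order = {
        "source": ("source",),
        "target": ("target",),
        "both": ("source", "target"),
        "reverse": ("target", "source"),
    }[side]
    lines = []
    for unit in ET.fromstring(xml_text).iter():
        if local_name(unit.tag) != "trans-unit":
            continue
        texts = {
            local_name(child.tag): "".join(child.itertext()).strip()
            for child in unit
            if local_name(child.tag) in ("source", "target")
        }
        lines.append("\t".join(texts.get(name, "") for name in order))
    return lines


if __name__ == "__main__":
    sample = (
        '<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">'
        "<file><body><trans-unit id=\"1\">"
        "<source>Guten Tag</source><target>Good day</target>"
        "</trans-unit></body></file></xliff>"
    )
    print("\n".join(extract_bitext(sample, side="both")))
```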