We compare and contrast the performance of two part-of-speech (POS) taggers, HMM and Brill, on in-domain and out-of-domain text samples.
Input data: POS-tagged sentences from the Georgetown University Multilayer Corpus (GUM).
The training and test files are plain .txt files. Each line contains a word and its POS tag, and sentences are separated by an empty line. Below is an example of the structure:
Always RB
wear VB
ballet NN
slippers NNS
. .

Stretch VB
your PRP$
...
The training data is under data/train.txt
The in-domain test data is under data/test.txt
The out-of-domain test data is under data/test_ood.txt
The POS tags follow the Penn Treebank (PTB) tagging scheme, described here
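The file format described above can be read with a few lines of Python. The following is a minimal sketch, not the code in main.py; the names `parse_tagged_lines` and `read_tagged_file` are hypothetical, and it assumes the word and tag on each line are separated by whitespace.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (word, tag) pairs.
Sentence = List[Tuple[str, str]]


def parse_tagged_lines(lines: Iterable[str]) -> List[Sentence]:
    """Group 'word TAG' lines into sentences, splitting on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence (if any).
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        # Assumption: every non-blank line has exactly a word and a tag;
        # a line with a single field raises ValueError here.
        word, tag = line.rsplit(maxsplit=1)
        current.append((word, tag))
    # Keep the last sentence when the file does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


def read_tagged_file(path: str) -> List[Sentence]:
    """Read a tagged .txt file such as data/train.txt."""
    with open(path, encoding="utf-8") as handle:
        return parse_tagged_lines(handle)
```

For example, `read_tagged_file("data/train.txt")` would return one list of (word, tag) pairs per sentence.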
- We trained the HMM and Brill tagger on the training set and tuned each to find the best performance.
- We measured the performance of the taggers on in-domain and out-of-domain test sets.
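One common way to measure tagger performance is token-level accuracy: the fraction of tokens whose predicted tag matches the gold tag. The sketch below assumes that metric, which this README does not state explicitly, and the function name `token_accuracy` is hypothetical.

```python
from typing import List, Tuple

# A sentence is a list of (word, tag) pairs.
Sentence = List[Tuple[str, str]]


def token_accuracy(gold: List[Sentence], predicted: List[Sentence]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted differ in number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("a sentence differs in length between gold and predicted")
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            if gold_tag == pred_tag:
                correct += 1
            total += 1
    # Avoid dividing by zero on empty input.
    return correct / total if total else 0.0
```

Running the same function on the in-domain and out-of-domain test sets gives two directly comparable numbers per tagger.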
The program’s output file is a .txt file in the same format as the input training file.
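Writing tagged sentences back out in that format is the inverse of reading them. This is a sketch under the assumption that a single space separates word and tag; `format_tagged` and `write_tagged_file` are hypothetical names, not functions from main.py.

```python
from typing import List, Tuple

# A sentence is a list of (word, tag) pairs.
Sentence = List[Tuple[str, str]]


def format_tagged(sentences: List[Sentence]) -> str:
    """Render sentences as 'word TAG' lines with a blank line between sentences."""
    blocks = [
        "\n".join(f"{word} {tag}" for word, tag in sentence)
        for sentence in sentences
    ]
    # Join sentences with one empty line and end the file with a newline.
    return "\n\n".join(blocks) + "\n"


def write_tagged_file(path: str, sentences: List[Sentence]) -> None:
    """Write tagged sentences to a .txt file; the target directory must exist."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_tagged(sentences))
```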
Further details and results can be found here
Leen Alzebdeh @Leen-Alzebdeh
Sukhnoor Khehra @Sukhnoor-K
- https://gist.github.com/blumonkey/007955ec2f67119e0909
- https://stats.stackexchange.com/questions/366552/nlp-various-probabilities-estimators-in-nltk
- https://www.nltk.org/_modules/nltk/tag/hmm.html
- https://gist.github.com/h-alg/4ec991f90a682c6d0a0b
- https://www.nltk.org/_modules/nltk/tag/brill.html
- https://www.nltk.org/api/nltk.tag.brill_trainer.html
- GitHub Copilot
- main.py L:4, 13: used argparse for extracting command-line arguments.
- main.py L:8, 104: used os for creating the output directory.
- Ensure Python is installed, along with the Python Standard Library. If Python is not already installed, follow the instructions at https://www.python.org/downloads/.
- Ensure the training and test input data are in the format outlined above and located in a data/ directory.
Example usage: run the following commands from the current directory.
For using the HMM tagger on in-domain data:
python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt
For using the HMM tagger on out-of-domain data:
python3 src/main.py --tagger hmm --train data/train.txt --test data/test_ood.txt --output output/test_ood_hmm.txt
For using the Brill tagger on in-domain data:
python3 src/main.py --tagger brill --train data/train.txt --test data/test.txt --output output/test_brill.txt
For using the Brill tagger on out-of-domain data:
python3 src/main.py --tagger brill --train data/train.txt --test data/test_ood.txt --output output/test_ood_brill.txt
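The flags used in the commands above could be parsed with argparse, and the output directory created with os, roughly as follows. This is an illustrative sketch rather than the contents of main.py; the helper names `build_parser` and `ensure_output_dir` and the help strings are assumptions.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags shown in the usage commands."""
    parser = argparse.ArgumentParser(description="Train a POS tagger and tag a test file.")
    parser.add_argument("--tagger", choices=["hmm", "brill"], required=True,
                        help="which tagger to train")
    parser.add_argument("--train", required=True, help="path to the training file")
    parser.add_argument("--test", required=True, help="path to the test file")
    parser.add_argument("--output", required=True, help="path of the output file")
    return parser


def ensure_output_dir(output_path: str) -> None:
    """Create the directory that will hold the output file, if it is missing."""
    directory = os.path.dirname(output_path)
    if directory:  # An empty string means the current directory.
        os.makedirs(directory, exist_ok=True)


if __name__ == "__main__":
    args = build_parser().parse_args()
    ensure_output_dir(args.output)
    print(f"tagger={args.tagger} train={args.train} test={args.test} output={args.output}")
```

Restricting `--tagger` with `choices` makes argparse reject any other value with a usage message before training starts.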