NLP-Taggers

We compare and contrast the performance of two part-of-speech (POS) taggers, an HMM tagger and a Brill tagger, on in-domain and out-of-domain text samples.

Data

Input data: POS-tagged sentences from the Georgetown University Multilayer Corpus (GUM).

The training and test files are plain-text .txt files. Each line contains a word and its POS tag, and sentences are separated by an empty line. Below is an example of the structure:

Always RB
wear VB
ballet NN
slippers NNS
. .

Stretch VB
your PRP$
...

The training data is under data/train.txt
The in-domain test data is under data/test.txt
The out-of-domain test data is under data/test_ood.txt
The POS tags follow the Penn Treebank (PTB) tagging scheme, described here
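
The file format described above can be read with a few lines of Python. The following is a minimal sketch, not the repository's own loader: the function name `read_tagged_sentences` is hypothetical, and it assumes the word and tag on each line are separated by whitespace (spaces or tabs).

```python
from typing import Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> List[TaggedSentence]:
    """Group 'word tag' lines into sentences; an empty line ends a sentence."""
    sentences: List[TaggedSentence] = []
    current: TaggedSentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Empty line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is the final field.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'word tag', got: {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        # The file may not end with an empty line.
        sentences.append(current)
    return sentences
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `read_tagged_sentences(open("data/train.txt", encoding="utf-8"))`.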

Tasks

Task 1: Train and Tune the Taggers

We trained the HMM and Brill taggers on the training set and tuned each one for its best performance.
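
To illustrate what training an HMM tagger involves, here is a minimal sketch of a bigram HMM with add-one smoothing and Viterbi decoding. It is an illustration under stated assumptions, not the code in src/: the function names, the smoothing choice, and the `<s>` start symbol are all hypothetical, and the Brill tagger and the tuning step are not covered.

```python
import math
from collections import Counter, defaultdict


def train_hmm(sentences):
    """Count tag transitions and word emissions from (word, tag) sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sentence in sentences:
        prev = "<s>"  # assumed sentence-start symbol
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` given the counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most likely tag sequence for `words` under the model."""
    if not words:
        return []
    transitions, emissions, vocab = model
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserves mass for unseen words
    # best[i][tag] = (log score of best path ending in tag at i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )
    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))
```

Working in log space avoids numeric underflow on long sentences, and the smoothing constant is the kind of setting that a tuning step would typically vary.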

Task 2: Compare results

We measured the performance of the taggers on in-domain and out-of-domain test sets.
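
The README does not name the metric used; a common choice for POS tagging is token-level accuracy, the fraction of tokens whose predicted tag matches the gold tag. The sketch below assumes that metric and assumes both inputs are lists of sentences, each a list of (word, tag) pairs; the function name `tagging_accuracy` is hypothetical.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted have different sentence counts")
    total = 0
    correct = 0
    for gold_sentence, predicted_sentence in zip(gold, predicted):
        if len(gold_sentence) != len(predicted_sentence):
            raise ValueError("sentence lengths differ")
        for (_, gold_tag), (_, predicted_tag) in zip(gold_sentence, predicted_sentence):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    # Report 0.0 for empty input rather than dividing by zero.
    return correct / total if total else 0.0
```

Computing this once on the in-domain output and once on the out-of-domain output gives two directly comparable numbers per tagger.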

Output

The program’s output file is a .txt file in the same format as the input training file.
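
Writing predictions back in the input format could look like the sketch below. It is an illustration rather than the repository's code: the function name `write_tagged_sentences` is hypothetical, and a single space between word and tag is an assumption.

```python
import io


def write_tagged_sentences(sentences, stream):
    """Write one 'word tag' pair per line, with an empty line between sentences."""
    for index, sentence in enumerate(sentences):
        if index:
            # Separate consecutive sentences with an empty line.
            stream.write("\n")
        for word, tag in sentence:
            stream.write(f"{word} {tag}\n")


# Example: render two short sentences into an in-memory buffer.
buffer = io.StringIO()
write_tagged_sentences(
    [[("Always", "RB"), ("wear", "VB")], [("Stretch", "VB")]],
    buffer,
)
rendered = buffer.getvalue()
```

Passing a stream rather than a path keeps the function easy to test; a real file works the same way via `open(path, "w", encoding="utf-8")`.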

Report and Results

Further details and results can be found in REPORT.md.

Contributors

Leen Alzebdeh @Leen-Alzebdeh

Sukhnoor Khehra @Sukhnoor-K

Resources Consulted

Libraries

main.py (lines 4, 13): uses argparse to parse command-line arguments.
main.py (lines 8, 104): uses os to create the output directory.
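
For readers unfamiliar with these two modules, the sketch below shows one way they can be combined for this kind of command-line interface. The flag names match the usage commands in this README, but everything else is an assumption: the function names, marking the flags as required, and restricting `--tagger` to two choices are illustrative and may differ from main.py.

```python
import argparse
import os


def build_parser():
    """Build a parser for the four flags used in this README's commands."""
    parser = argparse.ArgumentParser(description="Train a POS tagger and tag a test file.")
    # Restricting choices and requiring every flag are assumptions.
    parser.add_argument("--tagger", choices=["hmm", "brill"], required=True)
    parser.add_argument("--train", required=True, help="path to training data")
    parser.add_argument("--test", required=True, help="path to test data")
    parser.add_argument("--output", required=True, help="path of the output file")
    return parser


def ensure_output_dir(output_path):
    """Create the parent directory of the output file if it is missing."""
    directory = os.path.dirname(output_path)
    if directory:
        # exist_ok avoids an error when the directory already exists.
        os.makedirs(directory, exist_ok=True)
```

Calling `build_parser().parse_args()` with no arguments reads from `sys.argv`, so the same parser serves both the command line and tests that pass an explicit list.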

Instructions to execute code

Ensure Python and its standard library are installed. If Python is not already installed, follow the instructions at https://www.python.org/downloads/.
Ensure the training and test input data are in the format outlined above and stored in a 'data/' directory.
Example usage: run the following commands from the current directory.

For using the HMM tagger on in-domain data: python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt

For using the HMM tagger on out-of-domain data: python3 src/main.py --tagger hmm --train data/train.txt --test data/test_ood.txt --output output/test_ood_hmm.txt

For using the Brill tagger on in-domain data: python3 src/main.py --tagger brill --train data/train.txt --test data/test.txt --output output/test_brill.txt

For using the Brill tagger on out-of-domain data: python3 src/main.py --tagger brill --train data/train.txt --test data/test_ood.txt --output output/test_ood_brill.txt
