This program extracts bigrams that serve as domain-relevant terminology from a given corpus. For each candidate term, it weighs domain relevance against domain consensus and adds the term to the terminology if the combined score exceeds a threshold.
NLTK's Reuters corpus is used as a neutral reference corpus.
All files should be UTF-8 encoded.
- Python 3.8.5
- NLTK (Natural Language Toolkit) - See installation instructions here
NLTK Data:
- Reuters Corpus - See here for more information
>>> import nltk
>>> nltk.download('reuters')
- NLTK's Averaged Perceptron Tagger - See here for more information
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')
- NLTK's Word Punctuation Tokenizer - See here for more information
>>> import nltk
>>> nltk.download('punkt')
Move the domain corpus (default: `acl_texts`) to this directory. The corpus should be a directory of text files.
To extract terminology for a domain, you have to choose possible candidates first.
A predefined list of candidates can be found in the file `data/candidates1.txt`.
To generate your own list run:
main.py candidates [--stops <stopword file>] [--min_count <integer>] <domain dir> <output file> [<tag> [<tag> ...]]
Explanation:
- `--stops <stopword file>`: A file with stopwords that must not occur in a candidate. Bigrams that contain a word from this file are filtered out. If the argument is left out, no stopwords are used.
- `--min_count <integer>`: The minimum absolute frequency a bigram must have to be considered a candidate. The default is 4.
- `<domain dir>`: The directory of the domain corpus.
- `<output file>`: The name of the output file containing the candidates.
- `[<tag> [<tag> ...]]`: Any number of Penn Treebank tags. A tagged bigram must contain at least one of these tags to be considered a candidate. If the argument is left out, no tagging is used.
To reproduce the candidates in `data/candidates.txt`, run:
main.py candidates --stops data/stops_en.txt --min_count 3 acl_texts/ <your file name> NNS NN NNP
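The filtering steps described above (stopword check, minimum count, POS-tag check) can be sketched as follows. This is an illustrative sketch, not the code in `main.py`: the function name `filter_candidates` is hypothetical, and it operates on an already POS-tagged token list (as produced by `nltk.pos_tag`) rather than reading the corpus directory itself.

```python
from collections import Counter


def filter_candidates(tagged_tokens, stops, min_count, allowed_tags):
    """Select candidate bigrams from a POS-tagged token list.

    tagged_tokens: list of (word, tag) pairs, e.g. from nltk.pos_tag.
    stops: set of stopwords that must not occur in a candidate.
    min_count: minimum absolute frequency of a bigram.
    allowed_tags: set of Penn Treebank tags; if non-empty, a bigram
        must contain at least one of them.
    """
    # Count each adjacent (tagged) word pair in the corpus.
    counts = Counter(zip(tagged_tokens, tagged_tokens[1:]))
    candidates = set()
    for ((w1, t1), (w2, t2)), n in counts.items():
        if n < min_count:
            continue  # too infrequent
        if w1 in stops or w2 in stops:
            continue  # contains a stopword
        if allowed_tags and not ({t1, t2} & allowed_tags):
            continue  # neither token carries an allowed tag
        candidates.add((w1, w2))
    return candidates
```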
Use a file with candidates and the domain corpus to extract relevant terminology. The results are saved to a csv file with `;` as the delimiter. The first two lines contain the values for alpha and theta. After that, each line has three columns, `<term>;<value>;<True/False>`: the term, the value of the decision function, and whether the term is considered terminology. Run:
main.py extract -a <value for alpha> -t <value for theta> <domain dir> <candidates file> <output file>
Explanation:
- `-a <value for alpha>`: A float between 0 and 1, used to weigh domain consensus against domain relevance. If greater than 0.5, domain relevance has more weight; if less than 0.5, domain consensus has more weight.
- `-t <value for theta>`: A positive float, used as a threshold when determining terminology.
- `<domain dir>`: The directory of the domain corpus, by default `acl_texts`.
- `<candidates file>`: A file with candidates, generated by `main.py candidates`.
- `<output file>`: The name of the output file where extracted terms are stored.
Example:
main.py extract -a 0.5 -t 2 acl_texts/ data/candidates1.txt output/output1.csv
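Given the roles of alpha and theta above, a plausible reading of the decision function is a convex combination of the two measures checked against the threshold. Note this exact formula is an assumption for illustration, not taken from `main.py`, and the relevance/consensus values themselves are computed elsewhere:

```python
def is_terminology(relevance, consensus, alpha, theta):
    """Sketch of the decision function (assumed form, not main.py's code).

    alpha in [0, 1] weighs domain relevance against domain consensus;
    a term counts as terminology if the combined score reaches theta.
    Returns (score, decision) to match the csv columns <value>;<True/False>.
    """
    score = alpha * relevance + (1 - alpha) * consensus
    return score, score >= theta
```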
Compare extracted terminology to a gold standard by computing recall, precision and F1-score. To evaluate extracted terms run:
main.py evaluate --extracted <term file> --gold <gold file> [--high <int>] [--low <int>]
Explanation:
- `--extracted <term file>`: A file with extracted terms, generated by `main.py extract`.
- `--gold <gold file>`: A file with gold standard terminology, by default `gold_terminology.txt`. Each line should contain one term.
- `--high <int>`: Optionally, define an integer n and print out the n highest-scored terms.
- `--low <int>`: Optionally, define an integer n and print out the n lowest-scored terms.
Example:
main.py evaluate --extracted output/output1.csv --gold data/gold_terminology.txt --high 30
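The three metrics can be sketched as set operations over the extracted and gold terms. This is illustrative only (`main.py evaluate` reads the csv and gold files itself; the function name here is hypothetical):

```python
def scores(extracted, gold):
    """Precision, recall and F1 for extracted terms vs. a gold standard.

    extracted, gold: sets of terms.
    """
    tp = len(extracted & gold)               # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)    # harmonic mean
    return precision, recall, f1
```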
To get a demo of the functionalities run:
main.py demo
To get a demo of the different classes and their key methods, run the respective file. For example, to get a demo of the `Evaluation` class, run `evaluation.py`.
Run unit tests for a class by running the respective test file. For example, to run the tests for the `Terminology` class, run `test_terminology.py`.
Katja Konermann
A project for the course Computerlinguistische Techniken, winter semester 2020/21