TermExtractor

This is a term extraction tool for Icelandic, written in Python. It implements three term extraction methods: the RAKE algorithm, PoS pattern matching (NP chunking) and tf-idf (term frequency * inverse document frequency). It produces a candidate list of terms from a given text document.
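
To illustrate the tf-idf idea mentioned above, here is a minimal sketch in plain Python. It is not the code in tf_idf.py: the function name `tf_idf`, the whitespace tokenisation and the example sentences are all assumptions made for illustration, and the weighting shown is the textbook form (relative term frequency times the natural log of the inverse document frequency).

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Hypothetical helper for illustration only: tokens are lowercased
    and split on whitespace, with no stop word removal or lemmatisation.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = count / total, idf = log(N / df)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up example documents, not taken from the project's corpus.
docs = [
    "hesturinn hleypur hratt",
    "hesturinn sefur",
    "kötturinn sefur lengi",
]
scores = tf_idf(docs)

# Terms of the first document, highest score first.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)
print(top_terms)
```

In this sketch, a word that appears in several documents ("hesturinn") scores lower than words that appear in only one, which is the property that makes tf-idf useful for ranking term candidates.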

Dependencies

The program runs in Python 3 and makes use of two natural language processing libraries, NLTK and IceNLP. It also uses the rake-nltk implementation of the RAKE algorithm. It is recommended to run the project in a virtual environment with Python 3.

To install the needed packages, run:

pip install -r requirements.txt

For IceNLP, download the .zip archive from https://github.com/hrafnl/icenlp/releases. You also need to install a JDK, since IceNLP runs on Java. To prepare the data, the program expects the IceNLP project folder to be located in the same directory as the TermExtractor project folder. If it is located elsewhere, the shell script prepare.sh needs to be adjusted accordingly.

Running the program

To extract terms, run main.sh in the console and pass it a .txt file containing the text to extract terms from:

$ ./main.sh <filename.txt>

The script executes a series of scripts and produces four term lists in the folder output/<filename>/termlists: one for each of the three approaches, and a combined list built from those three.
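
How the combined list is built is not described here. As a rough sketch of one possible approach, assuming a plain union of the three lists with duplicates removed (the function name `combine_termlists` and the sample terms are hypothetical, not taken from the project):

```python
def combine_termlists(*termlists):
    """Merge term lists into one, keeping first-seen order.

    Assumption for illustration: terms are compared case-insensitively
    and the first spelling encountered is the one kept.
    """
    seen = set()
    combined = []
    for termlist in termlists:
        for term in termlist:
            cleaned = term.strip()
            key = cleaned.lower()
            if key and key not in seen:
                seen.add(key)
                combined.append(cleaned)
    return combined


# Made-up candidate lists standing in for the three approaches.
rake_terms = ["íslensk málfræði", "orðflokkur"]
pos_terms = ["orðflokkur", "nafnliður"]
tfidf_terms = ["Nafnliður", "íðorð"]

combined = combine_termlists(rake_terms, pos_terms, tfidf_terms)
print(combined)
```

A union like this favours recall. Keeping only terms found by at least two approaches would be a stricter alternative.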

Each Python script can also be run independently using Python 3.
