Information-Retrieval-HW-1

Description: A dataset (614 KB on disk) of 404 transcripts is available on the Canvas course page (Modules/Week3/transcripts.zip).

Write a program to gather information about word tokens in the sample database. You may use any programming language.

Which text processing steps should you perform?

Remove stopwords
Remove special characters
Use Porter Stemming
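
The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the project's actual code: the class name `TokenPipeline`, the method `process`, and the choice to lowercase and strip special characters before the stop-word lookup are all assumptions. The stemmer is passed in as a function, so the Porter stemmer from Apache OpenNLP (or any other implementation) could be plugged in; the demo uses the identity function as a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper class; the name and structure are illustrative only.
class TokenPipeline {

    // Lowercases the text, replaces special characters with spaces,
    // drops stop words, and applies the supplied stemmer to each remaining token.
    static List<String> process(String text, Set<String> stopWords, UnaryOperator<String> stemmer) {
        // Anything that is not a lowercase letter, digit, or whitespace becomes a space.
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ");
        List<String> tokens = new ArrayList<>();
        for (String raw : cleaned.trim().split("\\s+")) {
            if (raw.isEmpty() || stopWords.contains(raw)) {
                continue; // skip empty fragments and stop words
            }
            tokens.add(stemmer.apply(raw));
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "a");
        // Identity function used as a stand-in; replace with a real Porter stemmer.
        UnaryOperator<String> placeholderStemmer = w -> w;
        System.out.println(process("The cats chased a ball!", stopWords, placeholderStemmer));
        // prints [cats, chased, ball]
    }
}
```

Note that stripping punctuation before the stop-word lookup splits contractions such as "don't" into "don" and "t", so a stop-word list containing contractions may need the lookup to happen first.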

Use your program to generate the following information:

The number of word tokens in the database (after all text processing steps).

The number of unique words in the database.

The number of words that occur only once in the database.

The average number of word tokens per document.
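
One way to compute the four counts above, assuming each document has already been reduced to a list of processed tokens. The class name `CorpusStats` and its field names are hypothetical, and "unique words" is interpreted here as distinct stemmed tokens.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for the corpus-level counts; names are illustrative only.
class CorpusStats {
    final long totalTokens;            // number of word tokens after processing
    final int uniqueWords;             // number of distinct tokens
    final long wordsOccurringOnce;     // tokens with a collection count of exactly 1
    final double avgTokensPerDocument; // totalTokens divided by the number of documents

    CorpusStats(List<List<String>> documents) {
        Map<String, Integer> counts = new HashMap<>();
        long total = 0;
        for (List<String> doc : documents) {
            total += doc.size();
            for (String token : doc) {
                counts.merge(token, 1, Integer::sum); // increment the token's count
            }
        }
        totalTokens = total;
        uniqueWords = counts.size();
        wordsOccurringOnce = counts.values().stream().filter(c -> c == 1).count();
        avgTokensPerDocument = documents.isEmpty() ? 0.0 : (double) total / documents.size();
    }

    public static void main(String[] args) {
        CorpusStats stats = new CorpusStats(List.of(
                List.of("cat", "dog", "cat"),
                List.of("dog", "bird")));
        System.out.println("tokens=" + stats.totalTokens
                + " unique=" + stats.uniqueWords
                + " once=" + stats.wordsOccurringOnce
                + " avg=" + stats.avgTokensPerDocument);
    }
}
```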

For the 30 most frequent words in the database, provide TF, IDF, TF*IDF, and probabilities in a tabular format (rows = terms, columns = values).
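
The assignment text does not define these quantities, so the sketch below assumes one common set of definitions: TF is the total number of occurrences of a term in the whole collection, IDF is log10(N / df) where N is the number of documents and df is the number of documents containing the term, and probability is TF divided by the total number of tokens. The class `TermWeights`, its `Row` type, and `topTerms` are hypothetical names, and other TF or IDF variants would be equally defensible.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper; definitions of TF, IDF, and probability are assumptions.
class TermWeights {

    // One table row: a term and its computed values.
    static final class Row {
        final String term;
        final long tf;
        final double idf;
        final double tfIdf;
        final double probability;

        Row(String term, long tf, double idf, double tfIdf, double probability) {
            this.term = term;
            this.tf = tf;
            this.idf = idf;
            this.tfIdf = tfIdf;
            this.probability = probability;
        }
    }

    // Returns the k most frequent terms, ordered by TF descending (ties broken alphabetically).
    static List<Row> topTerms(List<List<String>> documents, int k) {
        Map<String, Long> tf = new HashMap<>();
        Map<String, Integer> df = new HashMap<>();
        long totalTokens = 0;
        for (List<String> doc : documents) {
            totalTokens += doc.size();
            for (String t : doc) {
                tf.merge(t, 1L, Long::sum);            // collection frequency
            }
            for (String t : new HashSet<>(doc)) {
                df.merge(t, 1, Integer::sum);          // document frequency
            }
        }
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, Long> e : tf.entrySet()) {
            long count = e.getValue();
            double idf = Math.log10((double) documents.size() / df.get(e.getKey()));
            rows.add(new Row(e.getKey(), count, idf, count * idf, (double) count / totalTokens));
        }
        rows.sort(Comparator.comparingLong((Row r) -> r.tf).reversed().thenComparing(r -> r.term));
        return new ArrayList<>(rows.subList(0, Math.min(k, rows.size())));
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("cat", "dog", "cat"),
                List.of("dog", "bird"),
                List.of("cat"));
        System.out.printf("%-10s %5s %8s %8s %8s%n", "term", "TF", "IDF", "TF*IDF", "P");
        for (Row r : topTerms(docs, 30)) {
            System.out.printf("%-10s %5d %8.4f %8.4f %8.4f%n", r.term, r.tf, r.idf, r.tfIdf, r.probability);
        }
    }
}
```

With these definitions, a term that appears in every document has an IDF of 0, so its TF*IDF is 0 regardless of how frequent it is.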

What to submit

A document with answers to the 5 questions above (PDF only).

GitHub code (share the repository with shanusushmita on GitHub)

Porter Stemmer Algorithm

The Porter stemmer used in this project is the implementation provided by Apache OpenNLP.

Here is the link for more information: https://opennlp.apache.org/

Stop Words

The stop words file used in this project was created by Alir3z4.

Here is the link for more information: https://github.com/Alir3z4/stop-words/blob/master/english.txt

Author

Zelun Jiang 04/2018
