jimzqw/Information-Retrieval-HW-1

Information-Retrieval-HW-1

Description: A dataset (614 KB on disk) of 404 transcripts is available on the Canvas course page (Modules/Week3/transcripts.zip).

Write a program to gather information about word tokens in the sample database. You may use any programming language.

Which text processing steps should you perform?

  1. Remove stopwords

  2. Remove special characters

  3. Use Porter Stemming
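The three steps above could be chained roughly as in the following Python sketch. It is only an illustration and not the code in this repository: the names `STOPWORDS`, `naive_stem`, and `preprocess` are hypothetical, the stop list is a tiny stand-in for the full English list, and `naive_stem` is a crude suffix stripper standing in for a real Porter stemmer, which can be passed in through the `stem` parameter.

```python
import re

# Tiny illustrative stop list (assumption); a real run would load a full list.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def naive_stem(word):
    """Placeholder suffix stripper. This is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stopwords=STOPWORDS, stem=naive_stem):
    """Return the list of processed tokens for one document."""
    # Lowercase, then turn every run of non-alphanumeric characters into a space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    tokens = cleaned.split()
    # Drop stop words, then stem what is left.
    return [stem(t) for t in tokens if t not in stopwords]


if __name__ == "__main__":
    print(preprocess("The cats are running, and the dog barked!"))
```

This sketch strips special characters before filtering stop words, which is a design choice: a stop list that contains contractions such as "don't" would need to be matched before punctuation is removed.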

Use your program to generate the following information:

The number of word tokens in the database (after all text processing steps).

The number of unique words in the database.

The number of words that occur only once in the database.

The average number of word tokens per document.
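As a rough sketch of how these four counts could be computed, assume each document has already been reduced to a list of processed tokens. The function name `corpus_stats` and the dictionary keys are hypothetical and chosen only for this example.

```python
from collections import Counter


def corpus_stats(docs):
    """docs: list of token lists, one list per preprocessed document."""
    # Frequency of every term across the whole collection.
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {
        "total_tokens": total,
        "unique_words": len(counts),
        # Terms whose collection frequency is exactly 1.
        "words_occurring_once": sum(1 for c in counts.values() if c == 1),
        "avg_tokens_per_doc": total / len(docs) if docs else 0.0,
    }


if __name__ == "__main__":
    sample = [["a", "b", "a"], ["b", "c"]]
    print(corpus_stats(sample))
```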

For the 30 most frequent words in the database, provide TF, IDF, TF*IDF, and probabilities in a tabular format (rows = terms, columns = values).
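The assignment text does not define these quantities, so the sketch below makes explicit assumptions: TF is the total number of occurrences of a term in the collection, IDF is log10(N / df) where N is the number of documents and df is the number of documents containing the term, and the probability is TF divided by the total number of tokens. The names `top_term_table` and `print_table` are hypothetical.

```python
import math
from collections import Counter


def top_term_table(docs, k=30):
    """Return rows (term, tf, idf, tf*idf, probability) for the k most frequent terms."""
    tf = Counter(tok for doc in docs for tok in doc)        # collection frequency
    df = Counter(tok for doc in docs for tok in set(doc))   # document frequency
    total = sum(tf.values())
    n_docs = len(docs)
    rows = []
    for term, freq in tf.most_common(k):
        idf = math.log10(n_docs / df[term])  # assumed IDF definition
        rows.append((term, freq, idf, freq * idf, freq / total))
    return rows


def print_table(rows):
    """Print one row per term and one column per value."""
    print(f"{'term':<15}{'TF':>8}{'IDF':>10}{'TF*IDF':>12}{'P(term)':>10}")
    for term, freq, idf, tfidf, prob in rows:
        print(f"{term:<15}{freq:>8}{idf:>10.4f}{tfidf:>12.4f}{prob:>10.4f}")


if __name__ == "__main__":
    print_table(top_term_table([["a", "b", "a"], ["b", "c"], ["a"]], k=3))
```

Under these definitions, a term that appears in every document has an IDF of 0, so its TF*IDF is 0 regardless of how frequent it is.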

What to submit

A document with answers to the above 5 questions (PDF only).

GitHub code (share it with shanusushmita on GitHub)

Porter Stemmer Algorithm

The Porter stemmer implementation used in this project is provided by Apache OpenNLP.

Here is the link for more information: https://opennlp.apache.org/

Stop Words

The stop words file used in this project was created by Alir3z4.

Here is the link for more information: https://github.com/Alir3z4/stop-words/blob/master/english.txt

Author

Zelun Jiang 04/2018
