Capstone ML ZoomCamp: Sentiment analysis of IMDb movie reviews

This project uses sentiment analysis to classify documents by their polarity. It works with a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb) and builds a predictor that distinguishes positive reviews from negative ones.

IMDb Dataset

This project uses a dataset with 50,000 reviews provided by Maas and others.

Large Movie Review Dataset v1.0

Overview

This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.

Dataset

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.
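The labeling rule above can be written as a small function. This is only an illustrative sketch: the name `label_from_rating` is hypothetical and not part of the dataset tooling, and the 1 to 10 rating range is an assumption based on the "out of 10" scores mentioned above.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[int]:
    """Map a star rating to a binary label using the dataset's thresholds.

    Returns 1 (positive) for ratings >= 7, 0 (negative) for ratings <= 4,
    and None for neutral ratings (5 or 6), which the labeled sets exclude.
    """
    # Assumption: ratings are integers from 1 to 10.
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    return None
```

A review rated 5 or 6 returns `None`, which mirrors the fact that such reviews do not appear in the labeled train/test sets.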

The raw text files are converted to CSV with the script __txt2csv.py__, which encodes the labels as pos (positive) = 1 and neg (negative) = 0.
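A conversion of this kind could look roughly like the sketch below. It is not the repository's txt2csv.py; the function name `reviews_to_csv`, the CSV column names, and the directory layout `<root>/<split>/<pos|neg>/*.txt` are assumptions made for illustration. Only the label mapping (pos = 1, neg = 0) comes from the text above.

```python
import csv
from pathlib import Path

# Label mapping stated in the text: pos -> 1, neg -> 0.
LABELS = {"pos": 1, "neg": 0}


def reviews_to_csv(root: Path, out_path: Path, splits=("train", "test")) -> int:
    """Collect review text files under `root` into one CSV; return the row count."""
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Hypothetical column names.
        writer.writerow(["review", "sentiment", "split"])
        for split in splits:
            for folder, label in LABELS.items():
                # Assumed layout: <root>/<split>/<pos|neg>/<file>.txt
                for path in sorted((root / split / folder).glob("*.txt")):
                    text = path.read_text(encoding="utf-8")
                    writer.writerow([text, label, split])
                    rows += 1
    return rows
```

Keeping the split name as a column preserves the original train/test partition, which matters because the two sets cover disjoint movies.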

Exploratory Data Analysis

See the IMDB_EDA.ipynb notebook for this task.

Wordcloud

Number of Characters in Corpus

Number of Words in Corpus
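Both corpus-level counts (characters and words) can be computed in a few lines. The sketch below assumes the reviews are available as a list of strings and treats a "word" as any whitespace-separated token; the notebook may tokenize differently. The name `corpus_size` is hypothetical.

```python
def corpus_size(reviews):
    """Return (total characters, total whitespace-separated words) for the corpus."""
    n_chars = sum(len(text) for text in reviews)
    n_words = sum(len(text.split()) for text in reviews)
    return n_chars, n_words


sample = ["A great movie.", "Terrible plot, bad acting."]
print(corpus_size(sample))  # -> (40, 7)
```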

Distribution of number of words per review

Distribution of average word length in each review
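The two per-review distributions can be derived as sketched below, again assuming whitespace tokenization and a list of review strings. The helper names (`words_per_review`, `avg_word_length`, `histogram`) are hypothetical, and the binning is a plain-Python stand-in for whatever plotting the notebook uses.

```python
from collections import Counter
from statistics import mean


def words_per_review(reviews):
    """Number of whitespace-separated words in each review."""
    return [len(text.split()) for text in reviews]


def avg_word_length(text):
    """Mean length of the words in one review (0.0 for an empty review)."""
    words = text.split()
    return mean(len(w) for w in words) if words else 0.0


def histogram(counts, bin_width):
    """Bucket integer counts; keys are the lower edge of each bin."""
    return Counter((c // bin_width) * bin_width for c in counts)


reviews = ["a bb ccc", "dddd"]
print(words_per_review(reviews))
print([avg_word_length(r) for r in reviews])
print(histogram(words_per_review(reviews), bin_width=10))
```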

Most common words in Corpus
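Word frequencies can be counted with `collections.Counter`. The sketch below is an assumption about how such a count might be done, not the notebook's code: the stop-word list is a tiny illustrative one, the tokenizer is a simple regular expression, and HTML tags are stripped first in case the raw reviews contain markup such as line breaks.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real analyses use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "it", "this", "to", "in", "was"}


def most_common_words(reviews, k=10):
    """Return the k most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for text in reviews:
        text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags, if any
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(k)


print(most_common_words(["The movie was great", "A great great film"], k=1))
# -> [('great', 3)]
```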

N-grams (n = 1, 2, 3)
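An n-gram is a run of n consecutive tokens, so unigrams, bigrams, and trigrams can all be counted with one helper. This is a minimal sketch assuming lowercase whitespace tokenization; the names `ngrams` and `top_ngrams` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(reviews, n, k=10):
    """Return the k most frequent n-grams across all reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(ngrams(text.lower().split(), n))
    return counts.most_common(k)


reviews = ["not good", "not good at all"]
for n in (1, 2, 3):
    print(n, top_ngrams(reviews, n, k=3))
```

N-grams are counted within each review, so no n-gram spans the boundary between two reviews.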

Modeling

See the IMDB_Modeling.ipynb notebook for this task.
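This page does not say which models the notebook trains. As one possible baseline for a binary sentiment task like this, the sketch below implements a multinomial Naive Bayes classifier over bag-of-words counts in plain Python. The class name, tokenizer, and toy training data are all assumptions for illustration; libraries such as scikit-learn provide tested equivalents.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        # Words never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[c] / total_docs)
            for t in tokens:
                score += math.log(
                    (self.word_counts[c][t] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_class, best_score = c, score
        return best_class


# Toy data for illustration only; 1 = positive, 0 = negative.
train_texts = [
    "great wonderful film",
    "loved it great acting",
    "terrible boring film",
    "awful waste boring",
]
train_labels = [1, 1, 0, 0]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great acting"), model.predict("boring and awful"))
```

Because the train and test sets cover disjoint movies, a model like this should be evaluated on the held-out test split rather than on reviews it was trained on.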

ayoub-berdeddouch/capstone-mlzoomcamp

Folders and files

Latest commit

History

Repository files navigation

Capstone ML ZoomCamp: Sentiment analysis of IMDb movie reviews

IMDb Dataset

Overview

Dataset

Exploratory Data Analysis

Wordcloud

Number of Caracters in Corpus

Number of Words in Corpus

Distribution of number of words per reviews

Distribution of AVG word length in each review

Most common words in Corpus

Ngram {1,2,3}

Modeling

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages