To install the required packages for this project, create and activate a virtual environment, then run the following command:
pip install -r requirements.txt
Note: The model was trained on Google Colab with GPU support.
After installing the necessary packages, run the following command from the project root directory to start the app:
uvicorn app.main:app
Open http://127.0.0.1:8000/docs in your browser to access the Swagger UI. You can upload an image through the predict endpoint and receive a JSON response. During development, add the --reload flag (uvicorn app.main:app --reload) so the server restarts automatically whenever the code changes.
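The predict endpoint can also be called from a script instead of the Swagger UI. The following is a minimal sketch using only the Python standard library. The route `/predict` and the form field name `file` are assumptions, as are the helper names `build_multipart` and `predict`; check the Swagger page for the actual route and parameter name.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field, filename, data, content_type="image/jpeg"):
    """Encode one file as a multipart/form-data body; return (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def predict(image_path, url="http://127.0.0.1:8000/predict", field="file"):
    """POST an image to the (assumed) predict route and return the parsed JSON."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart(
            field, os.path.basename(image_path), fh.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `predict("cat.jpg")` would return the JSON response as a Python dictionary, provided the assumed route and field name match the app.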
Ensure you are in the project root directory and Docker is running. Use the following command to create a Docker image:
docker build -t image-classifier-api .
Once the image is built successfully, run the container with the following command, which publishes container port 8000 on host port 8000:
docker run -p 8000:8000 image-classifier-api
Visit http://127.0.0.1:8000/docs from your browser to access Swagger. You can upload an image through the predict endpoint and receive a JSON response.
The project involves sentiment analysis on the IMDb dataset, which comprises 50,000 movie reviews. The dataset is structured for binary sentiment classification, with 25,000 reviews designated for training and an additional 25,000 for testing. The objective is to predict the sentiment of movie reviews, distinguishing between positive and negative sentiments.
- Exploratory Data Analysis (EDA): We analyzed the dataset to understand its characteristics. Positive and negative reviews were equally represented, so the dataset is balanced. Most reviews were relatively short, with frequency falling off as review length increased. A WordCloud analysis highlighted the terms most commonly associated with positive and negative sentiments.
- Text Preprocessing: We cleaned and prepared the text data for modeling. This included removing unwanted characters and URLs, converting text to lowercase, tokenizing, removing stopwords, and lemmatizing. We also generated unigram, bigram, trigram, and four-gram features from the text data.
- Feature Engineering: We used TF-IDF vectorization to convert the text data into numerical features while preserving the importance of words in the documents. The target variable was encoded using label encoding.
- Model Training and Evaluation: We trained logistic regression and multinomial naive Bayes models on the preprocessed data. Both models were evaluated using classification reports, which provided insights into precision, recall, and F1-score for each class (positive and negative sentiments). Based on the results, logistic regression performed better and was chosen as the final model.
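The preprocessing and feature-engineering steps listed above can be illustrated with a small, dependency-free sketch. It is not the project's code: the stopword list is a tiny placeholder, lemmatization is omitted because it needs an NLP library, the sample reviews are made up, and the TF-IDF formula shown (raw term count times a smoothed inverse document frequency) is one common variant that may differ from the vectorizer actually used.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def preprocess(text):
    """Lowercase, strip URLs and non-letter characters, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)  # remove unwanted characters
    return [tok for tok in text.split() if tok not in STOPWORDS]


def ngrams(tokens, max_n=4):
    """Return all 1- to max_n-grams as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs):
    """Weight each term by count * idf, with idf = ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in Counter(doc).items()
        }
        for doc in docs
    ]


def encode_labels(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Made-up sample reviews, for illustration only.
reviews = [
    "This movie was great! See http://example.com for more.",
    "The plot was dull and the acting was dull.",
]
features = tfidf([ngrams(preprocess(review)) for review in reviews])
targets, mapping = encode_labels(["positive", "negative"])
```

Each entry in `features` is a sparse mapping from n-gram to weight, which is the kind of input a logistic regression or multinomial naive Bayes classifier would be trained on after conversion to a matrix.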
The project produced a model that predicts the sentiment of movie reviews. The logistic regression model achieved an accuracy of approximately 89%, with balanced performance across positive and negative sentiments. The result shows that standard natural language processing techniques combined with a simple linear classifier are effective for this task. More broadly, it illustrates how sentiment analysis can help in understanding audience opinions and preferences.
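For reference, the accuracy and per-class precision, recall, and F1-score mentioned above can be computed as in the sketch below. The labels and predictions are made-up values for illustration, not the project's results, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_report(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels: 1 = positive, 0 = negative.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]

print("accuracy:", accuracy(y_true, y_pred))
for label in (0, 1):
    print(label, class_report(y_true, y_pred, label))
```

Comparing the per-class numbers side by side is what makes it possible to say that performance is balanced rather than driven by one class.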