This project implements a machine learning pipeline that classifies IMDB movie reviews as positive or negative. The pipeline covers data preprocessing, model training, evaluation, and a web-based interface for predictions. The project is fully containerized with Docker and uses Neptune.ai for experiment tracking and visualization.
- Data Preprocessing: Cleaning and vectorizing movie reviews using TF-IDF.
- Model Training: Logistic Regression with hyperparameter tuning using Random Search.
- Evaluation Metrics: Accuracy, F1-score, Confusion Matrix, and ROC-AUC curve.
- Neptune.ai Integration: Logs experiments, metrics, and visualizations.
- Web Interface: Simple frontend (HTML, CSS, JS) for users to input reviews and get predictions.
- Containerization: Backend and frontend are containerized with Docker and orchestrated using Docker Compose.
- Python 3.9: Main programming language.
- FastAPI: Backend framework for serving predictions.
- Scikit-learn: ML library for Logistic Regression and TF-IDF.
- Neptune.ai: For experiment tracking.
- Docker & Docker Compose: Containerization of the application.
- HTML, CSS, JavaScript: Frontend interface.
- Matplotlib & Seaborn: Visualization tools.
- Pandas & NumPy: Data handling and processing.
IMDB-Review-Classifier/
│
├── data/
│   ├── raw/                       # Raw dataset (IMDB Dataset.csv)
│   └── processed/                 # Processed TF-IDF data and labels
│       ├── X_train_tfidf.npz
│       ├── X_test_tfidf.npz
│       ├── y_train.csv
│       ├── y_test.csv
│       └── tfidf_vectorizer.pkl
│
├── model/
│   └── best_sentiment_model.pkl   # Trained Logistic Regression model
│
├── backend/
│   ├── requirements.txt           # Python dependencies
│   ├── api.py                     # FastAPI backend for predictions
│   ├── train_model.py             # Model training with Random Search and Neptune logging
│   └── data_processing.py         # Data cleaning and TF-IDF processing
│
├── frontend/
│   ├── index.html                 # Web interface
│   ├── style.css                  # Styling for the web interface
│   └── static/                    # Static assets like images (snowflakes, icons)
│
├── docker/
│   ├── Dockerfile.backend         # Dockerfile for the backend
│   ├── Dockerfile.frontend        # Dockerfile for the frontend
│   └── docker-compose.yml         # Docker Compose configuration
│
└── README.md                      # Project documentation
- Dataset: IMDB Dataset of 50K Movie Reviews
- Size: 50,000 rows (Positive and Negative reviews).
- Format: CSV
git clone https://github.com/himarygr/IMDB-Review-Classifier.git
cd IMDB-Review-Classifier
cd backend
pip install -r requirements.txt
No installation is required for the static frontend.
From the repository root, run the following script to clean and vectorize the data:
python backend/data_processing.py
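The cleaning half of this step might look roughly like the sketch below. It is not the code in data_processing.py: the function name `clean_review` and the specific rules (strip HTML tags such as the `<br />` breaks found in the IMDB reviews, lowercase, keep letters only) are assumptions chosen for illustration.

```python
import html
import re

# Hypothetical cleaning rules; the project's actual rules may differ.
_TAG_RE = re.compile(r"<[^>]+>")        # HTML tags such as <br />
_NON_LETTER_RE = re.compile(r"[^a-z\s]")  # anything that is not a-z or whitespace
_SPACE_RE = re.compile(r"\s+")          # runs of whitespace


def clean_review(text):
    """Return a lowercase, letters-only version of a raw review."""
    text = html.unescape(text)           # "&amp;" -> "&"
    text = _TAG_RE.sub(" ", text)        # drop HTML tags
    text = text.lower()
    text = _NON_LETTER_RE.sub(" ", text)  # drop digits and punctuation
    return _SPACE_RE.sub(" ", text).strip()


print(clean_review("Great movie!<br /><br />Loved it &amp; the cast."))
```

Note that this simple rule set splits contractions ("don't" becomes "don t"), which is usually acceptable for a bag-of-words model such as TF-IDF.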
From the repository root, train the model with Random Search and log metrics to Neptune.ai:
python backend/train_model.py
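As an illustration of what Random Search over a Logistic Regression looks like in Scikit-learn, here is a minimal sketch. The search space, the function name `tune_logistic_regression`, and the toy reviews are all assumptions; the real script works on the saved TF-IDF matrices and may use different ranges and scoring.

```python
from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV


def tune_logistic_regression(X, y, n_iter=10, cv=3, seed=42):
    """Sample n_iter hyperparameter combinations and keep the best by CV accuracy."""
    # Hypothetical search space over C, solver and max_iter.
    param_distributions = {
        "C": loguniform(1e-2, 1e2),      # inverse regularization strength
        "solver": ["liblinear", "lbfgs"],
        "max_iter": [200, 500, 1000],
    }
    search = RandomizedSearchCV(
        LogisticRegression(),
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring="accuracy",
        cv=cv,
        random_state=seed,
    )
    search.fit(X, y)
    return search


# Toy data standing in for the cleaned IMDB reviews (1 = positive, 0 = negative).
reviews = [
    "absolutely fantastic film with great acting",
    "loved every minute wonderful direction",
    "a brilliant and moving story",
    "great cast and a fantastic script",
    "wonderful funny and heartwarming",
    "one of the best movies this year",
    "terrible plot and awful acting",
    "boring slow and a waste of time",
    "the worst movie i have seen",
    "awful script and dull characters",
    "a boring mess with terrible pacing",
    "dull predictable and badly directed",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(reviews)
search = tune_logistic_regression(X, labels, n_iter=5)
print(search.best_params_)
```

Sampling `C` from a log-uniform distribution is a common choice because useful regularization strengths tend to span several orders of magnitude.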
To build and run the project using Docker:
cd docker
docker-compose up --build
- Backend will run on: http://localhost:8000
- Frontend will run on: http://localhost:8501
Method | Endpoint | Description |
---|---|---|
POST | /predict/ | Predict sentiment of a review |
Example Request:
{
"review": "The movie was absolutely fantastic! Great acting and direction."
}
Example Response:
{
"sentiment": "positive"
}
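For calling the endpoint from a script rather than the web page, a small client could be sketched as follows. It uses only the Python standard library and assumes the backend is reachable at the Docker address given above; the function names are hypothetical.

```python
import json
import urllib.request

# Assumes the backend from the Docker setup is listening on localhost:8000.
API_URL = "http://localhost:8000/predict/"


def build_predict_request(review, url=API_URL):
    """Build the POST request carrying the review as a JSON body."""
    body = json.dumps({"review": review}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_sentiment(review, url=API_URL, timeout=10):
    """Send the review to the API and return the value of the "sentiment" field."""
    request = build_predict_request(review, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["sentiment"]


# With the backend running:
# print(predict_sentiment("The movie was absolutely fantastic!"))
```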
The frontend provides a simple interface where users can:
- Enter a movie review.
- Click the "Analyze Sentiment" button.
- See whether the review is classified as Positive or Negative.
All experiments, metrics, and visualizations are logged to Neptune.ai.
- Hyperparameters: C, solver, max_iter.
- Metrics: Accuracy, F1-score.
- Confusion Matrix: Uploaded as an image.
- ROC-AUC Curve: Uploaded as an image.
- CPU & Memory Usage: System resource monitoring.
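Logging of the items above could be organized along these lines. This is a sketch, not the code in train_model.py: the helper `log_experiment` and the field paths (`parameters`, `metrics/...`, `plots/...`) are hypothetical, and the commented usage assumes the Neptune 1.x Python client, where a run supports item assignment and file upload.

```python
def log_experiment(run, params, metrics, images):
    """Write hyperparameters, scalar metrics and image files to a Neptune run.

    `run` is expected to behave like a Neptune run object:
    run["path"] = value stores a value, run["path"].upload(file) stores a file.
    """
    run["parameters"] = params                 # e.g. {"C": 1.0, "solver": "lbfgs"}
    for name, value in metrics.items():
        run[f"metrics/{name}"] = value         # e.g. metrics/accuracy
    for name, path in images.items():
        run[f"plots/{name}"].upload(path)      # e.g. plots/confusion_matrix


# Usage with the Neptune client (project name is a placeholder;
# the API token is read from the NEPTUNE_API_TOKEN environment variable):
#
# import neptune
# run = neptune.init_run(project="my-workspace/imdb-sentiment")
# log_experiment(
#     run,
#     params={"C": 1.0, "solver": "lbfgs", "max_iter": 500},
#     metrics={"accuracy": 0.89, "f1": 0.89},   # illustrative numbers only
#     images={"confusion_matrix": "confusion_matrix.png", "roc_curve": "roc_curve.png"},
# )
# run.stop()
```

Passing the run in as an argument keeps the helper easy to exercise without network access, since any object with the same two operations can stand in for it.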
- Confusion Matrix
- ROC-AUC Curve
- Accuracy and F1-Score
- Hyperparameter values
- CPU/Memory usage during training
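To make the reported numbers concrete, the sketch below computes the confusion-matrix counts, accuracy, and F1-score by hand for a binary task. The project itself relies on Scikit-learn for this; the dependency-free functions and the small label lists here are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy_and_f1(y_true, y_pred):
    """Accuracy is the share of correct predictions; F1 is the harmonic mean
    of precision and recall for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Made-up labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
print(accuracy_and_f1(y_true, y_pred))   # (0.75, 0.75)
```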
- Add more classifiers (e.g., SVM, Random Forest) for comparison.
- Integrate Grid Search for exhaustive hyperparameter tuning.
- Deploy the project to a cloud service (AWS, GCP, etc.).
- Enhance the frontend with a modern framework (React or Vue.js).
Feel free to fork the repository, create a branch, and submit pull requests for new features or bug fixes!
This project is licensed under the MIT License.
For any questions or suggestions:
- Email: lilley@ya.ru
- GitHub: himarygr