This project performs sentiment analysis on YouTube video comments collected from the most liked, most disliked, and randomly selected videos. Regression models such as Random Forest, Ridge, K-Nearest Neighbors, and Gradient Boosting then use the sentiment scores to predict video performance metrics and views. The project also includes hyperparameter tuning and model stacking to improve predictive accuracy.
The dataset is collected from YouTube using the Google API. Comments are extracted for three categories:
- Most Liked Videos: The 8,000 videos with the most likes.
- Most Disliked Videos: The 8,000 videos with the most dislikes.
- Random Videos: 8,000 videos selected at random from the dataset.
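The comment extraction step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `make_client` and `fetch_comments`, the `max_comments` cap, and the choice to keep only top-level comments are all assumptions. The `commentThreads().list` call and the response fields are part of the YouTube Data API v3.

```python
def make_client(api_key):
    """Build a YouTube Data API v3 client from an API key (hypothetical helper)."""
    # Imported here so fetch_comments can be exercised without the library or network.
    from googleapiclient.discovery import build

    return build("youtube", "v3", developerKey=api_key)


def fetch_comments(youtube, video_id, max_comments=200):
    """Return up to max_comments top-level comment texts for one video.

    The cap of 200 is an illustrative assumption, not a value from this project.
    """
    comments = []
    page_token = None
    while len(comments) < max_comments:
        params = {
            "part": "snippet",
            "videoId": video_id,
            "maxResults": 100,  # largest page size the endpoint accepts
            "textFormat": "plainText",
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube.commentThreads().list(**params).execute()

        for item in response.get("items", []):
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            comments.append(snippet["textDisplay"])

        # Follow pagination until the API stops returning a token.
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return comments[:max_comments]
```

With a real key, usage would look like `fetch_comments(make_client(API_KEY), video_id)`, where `API_KEY` and `video_id` are placeholders you supply.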
We use the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis tool to score the extracted comments. The sentiment scores are then used as inputs for predicting video performance in the machine learning phase.
Several regression models are applied to predict video metrics (likes/dislikes):
- Random Forest Regression
- Ridge Regression
- K-Nearest Neighbors Regression
- Gradient Boosting Regression
These models are trained and tested on the dataset, with metrics such as R² scores used to evaluate their performance.
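The training and evaluation loop might look like the sketch below. The data is synthetic (three made-up sentiment-like features and a made-up target), and the train/test split ratio and model settings are assumptions, so the printed scores say nothing about the project's real results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for per-video sentiment features and a target such as likes.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
y = 1000 * X[:, 0] + 200 * X[:, 1] ** 2 + rng.normal(0, 50, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Ridge": Ridge(alpha=1.0),
    "K-Nearest Neighbors": KNeighborsRegressor(n_neighbors=5),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2_scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R² = {r2_scores[name]:.3f}")
```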
To improve model performance, the hyperparameters of the Random Forest and Gradient Boosting models are tuned. Stacking is also implemented to combine the strengths of multiple models.
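One way to combine tuning and stacking in scikit-learn is sketched below, using `GridSearchCV` and `StackingRegressor`. The search method, the parameter grids, the Ridge meta-learner, and the synthetic data are all assumptions for illustration; the notebook's actual choices may differ.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; replace with the real feature matrix and target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 3))
y = 1000 * X[:, 0] + 200 * X[:, 1] ** 2 + rng.normal(0, 50, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def tune(estimator, param_grid):
    """Grid-search an estimator with 3-fold CV and return (best model, best params)."""
    search = GridSearchCV(estimator, param_grid, cv=3, scoring="r2")
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_


# Small illustrative grids; real searches usually cover more values.
rf_best, rf_params = tune(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [None, 5]},
)
gb_best, gb_params = tune(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
)

# Base models feed out-of-fold predictions to a Ridge meta-learner.
stack = StackingRegressor(
    estimators=[("rf", rf_best), ("gb", gb_best), ("ridge", Ridge(alpha=1.0))],
    final_estimator=Ridge(alpha=1.0),
    cv=3,
)
stack.fit(X_train, y_train)
stack_r2 = stack.score(X_test, y_test)  # R² on the held-out split
print(f"Stacked model R² = {stack_r2:.3f}")
```

Using cross-validated predictions to train the final estimator keeps the meta-learner from simply memorizing base models that have already seen the training targets.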
Stacking, an ensemble technique, generalized well in this project, as shown by its performance on the test set. It aggregates the predictions of multiple base regression models so that their complementary strengths offset the limitations of each individual model, which yielded better accuracy than any single model. This makes it well suited to applications that prioritize precision.
Several directions remain for future work. One is to reduce the Mean Squared Error (MSE) by refining feature selection or introducing new features that capture additional relevant information. Another is to improve the interpretability of the stacked model, which would show how each base model contributes to the predictions and make the results easier to trust in real-world use.
To run this notebook, you will need the following libraries installed:
- pandas
- vaderSentiment
- matplotlib
- google-api-python-client
- scikit-learn
You can install them using pip:
pip install pandas vaderSentiment matplotlib google-api-python-client scikit-learn