This repository contains machine learning projects implemented in Python. The projects demonstrate the application of ML algorithms to tasks such as classification, regression, and clustering.
Throughout my research, I explored online communities and the challenges of tackling toxic behavior. This project summarizes my dissertation, where I addressed the problem of toxicity in Reddit gaming communities using Python, NLP, machine learning, Gephi, and Polinode.
The rise of online gaming communities has brought both joy and challenges. One significant challenge is the prevalence of toxic behavior, which can negatively impact the gaming experience. My dissertation aimed to identify and understand this toxic behavior, with the goal of developing effective strategies for fostering healthier and more inclusive gaming communities.
To tackle this problem, I carried out an extensive data collection process. I gathered a large number of Reddit posts and comments from the Activision subreddit, creating a dataset that captures the diverse interactions and behaviors in that community.
Using the power of Python, I employed sophisticated techniques to analyze the collected data and identify negative content and toxic users. I leveraged natural language processing (NLP) algorithms and machine learning models to distinguish toxic behavior, such as hate speech, harassment, and disrespectful language, from regular interactions.
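The paragraph above does not name the model that was used, so the following is only a minimal sketch of one common approach: a bag-of-words multinomial Naive Bayes classifier written with the standard library. The class name, the labels `toxic` and `ok`, and the four training comments are all hypothetical and exist purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy comments, not taken from the dissertation dataset.
train_texts = [
    "you are trash uninstall the game",
    "shut up idiot nobody wants you here",
    "great match thanks for the carry",
    "anyone want to squad up tonight",
]
train_labels = ["toxic", "toxic", "ok", "ok"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
toxic_guess = model.predict("shut up you idiot")
ok_guess = model.predict("thanks for the great match")
print(toxic_guess, ok_guess)
```

A real pipeline would need far more labeled data and a held-out test set. The sketch only shows the shape of the train-then-predict step.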
In order to gain deeper insights into the emotional undercurrents of the gaming communities, I conducted sentiment analysis on the collected data. By employing sentiment analysis algorithms, I was able to discern the prevailing sentiments within the community, ranging from positive and neutral to negative emotions, shedding light on the overall tone and atmosphere of these communities.
One of the highlights of my research was the development of a predictive model using Python. By integrating the insights obtained from the previous stages, I created a robust model capable of predicting the likelihood of toxic behavior within Reddit gaming communities.
We analyzed a vast amount of data and constructed a comprehensive sentiment score network map. It reveals the interconnectedness of sentiments expressed across various topics and provides a holistic view of public opinion. The map showcases the intricate relationships between positive, negative, and neutral sentiments, giving us valuable insights into the underlying sentiment landscape.
To further enhance our analysis, we segmented comments based on sentiment scores. This allowed us to categorize comments as positive, negative, or neutral. By understanding the distribution and intensity of sentiments, we can gain a deeper understanding of the sentiments expressed by users, enabling us to address specific areas for improvement or commendation.
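The segmentation step can be sketched as a simple thresholding function. The score range of -1 to 1 and the cutoff of 0.05 are assumptions made for this example; the source does not state which sentiment scorer or thresholds were used.

```python
from collections import Counter


def label_sentiment(score, threshold=0.05):
    """Map a sentiment score in [-1, 1] to a category (threshold is assumed)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores for five comments.
scores = [0.82, -0.41, 0.0, 0.03, -0.77]
labels = [label_sentiment(s) for s in scores]
distribution = Counter(labels)
print(labels)
print(distribution)
```

Counting the labels gives the distribution of sentiment categories, while the raw scores retain the intensity information.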
We examined the posting activity of our top 50 positive users and top 50 negative users across different hours of the day. This analysis offers valuable insights into the temporal patterns of posting behavior. By understanding when users are most active and when positive or negative sentiments are prevalent, we can tailor our engagement strategies to effectively target and engage with our audience.
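As a rough illustration of the hourly breakdown, the sketch below counts comments per UTC hour for a chosen set of users. It assumes each comment is a dict with an `author` name and a `created_utc` Unix timestamp (the field Reddit uses for creation time). The user names and timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_hour(comments, users):
    """Return a 24-element list of comment counts per UTC hour for `users`."""
    counts = Counter()
    for comment in comments:
        if comment["author"] in users:
            posted = datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc)
            counts[posted.hour] += 1
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical comments; timestamps are seconds since the Unix epoch.
comments = [
    {"author": "alice", "created_utc": 0},                   # 00:00 UTC
    {"author": "alice", "created_utc": 13 * 3600 + 5},       # 13:00 UTC
    {"author": "bob", "created_utc": 3600},                  # not in the user set
    {"author": "alice", "created_utc": 86400 + 13 * 3600},   # 13:00 UTC, next day
]

hourly = posts_per_hour(comments, users={"alice"})
print(hourly)
```

Running the same function once with the top positive users and once with the top negative users gives two 24-value series that can be plotted side by side.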
We tracked the sentiment score trends of our top 50 positive and negative users over a specified period. By analyzing the temporal dynamics, we can identify shifts in sentiment patterns, track the impact of certain events, and assess the overall sentiment trajectory of influential users. These insights empower us to adapt our strategies and cultivate a positive online environment while addressing potential issues head-on.
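One simple way to view such a trend is to average sentiment scores per calendar day. The sketch below assumes each comment carries a `created_utc` Unix timestamp and a precomputed `sentiment` value. Both field names and the sample values are illustrative assumptions, and the source does not say which aggregation window was actually used.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def daily_mean_sentiment(comments):
    """Average the sentiment score of all comments posted on each UTC day."""
    by_day = defaultdict(list)
    for comment in comments:
        posted = datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc)
        by_day[posted.date().isoformat()].append(comment["sentiment"])
    # Sort by date so the result reads as a time series.
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical comments spanning two days.
comments = [
    {"created_utc": 0, "sentiment": 0.5},
    {"created_utc": 7200, "sentiment": -0.5},
    {"created_utc": 86400, "sentiment": 0.25},
]

trend = daily_mean_sentiment(comments)
print(trend)
```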
By closely examining the sentiment data, we constructed a network map specifically focusing on negative users. This map unveils the connections and interactions between individuals expressing negative sentiments, shedding light on influential users and potential clusters. Understanding the network dynamics of negative sentiment can help us identify areas for intervention and address concerns promptly and effectively.
In parallel to the negative sentiment network map, we also built a network map dedicated to positive users. This map uncovers the relationships and conversations among individuals who consistently express positive sentiments. By studying the positive sentiment network, we gain insights into the key influencers, supportive communities, and content that resonates with positivity. This knowledge empowers us to nurture a culture of optimism and enhance user experiences.
To bring these network maps to life, we utilized the powerful visualization tool, Gephi. These visualizations enable us to perceive the intricate connections between users, identify clusters, and detect central nodes. Through interactive exploration, we can better comprehend the sentiment landscape and devise strategies that foster positive engagement.
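Gephi can import an edge list from a CSV file with `Source`, `Target`, and `Weight` columns. The sketch below shows one way such a file could be produced from reply pairs. Treating a reply as a directed edge from the replying user to the parent author is an assumption for this example, and the user names are hypothetical.

```python
import csv
import io
from collections import Counter


def build_edges(replies, users):
    """Count directed reply edges whose endpoints are both in `users`."""
    edges = Counter()
    for author, parent_author in replies:
        if author in users and parent_author in users and author != parent_author:
            edges[(author, parent_author)] += 1
    return edges


def write_edge_csv(edges, file_handle):
    """Write the edges as a Source,Target,Weight table."""
    writer = csv.writer(file_handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])


# Hypothetical (replying user, parent comment author) pairs.
replies = [
    ("user_a", "user_b"),
    ("user_a", "user_b"),
    ("user_b", "user_c"),
    ("user_c", "user_z"),  # user_z is outside the selected group
]
negative_users = {"user_a", "user_b", "user_c"}

edges = build_edges(replies, negative_users)
buffer = io.StringIO()  # swap for open("edges.csv", "w", newline="") to save a file
write_edge_csv(edges, buffer)
csv_rows = buffer.getvalue().splitlines()
print(csv_rows)
```

The same two functions apply unchanged to the positive-user group by passing a different user set.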
In addition to Gephi, we also employed Polinode to visualize our sentiment networks. Polinode provides a user-friendly interface and intuitive visualizations that enhance our understanding of the sentiment patterns. With its advanced features, we can easily identify influential users, analyze sentiment flows, and delve deeper into the dynamics of positive and negative sentiment networks.
This project involved analyzing Formula 1 race data from the 2020 season using Pandas, Matplotlib, Seaborn, and Scikit-Learn in Python.
- Explore and visualize race statistics such as lap times, points, and positions
- Identify insights such as top performers, team-wise comparisons, and lap time distributions
- Build a regression model to predict race points based on grid position
- Apply clustering to segment drivers based on performance metrics
- Classify race finish positions using machine learning models
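To illustrate the regression item in the list above, here is a minimal ordinary least squares fit of points against grid position, written without external libraries. The project itself used Scikit-Learn, so this is a stand-in that shows the underlying calculation rather than the project's code. The five data rows are made up and are not 2020 season results.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical rows: starting grid position and points scored in a race.
grid_positions = [1, 2, 3, 4, 5]
race_points = [25, 18, 15, 12, 10]

slope, intercept = fit_line(grid_positions, race_points)
predicted_p3 = slope * 3 + intercept  # predicted points when starting third
print(slope, intercept, predicted_p3)
```

A negative slope means that starting further back on the grid is associated with fewer points in this toy data.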
The analysis involved data preprocessing techniques like handling missing values, converting formats, and feature engineering. Visualizations included histograms, boxplots, scatterplots, and bar charts.
- Points scored by drivers showed high variance, partly due to retirements
- Mercedes had the highest total points scored among all teams
- Grid position had a strong negative correlation with final race position
- Logistic regression achieved the best accuracy in classifying race finish categories
This project involved building a machine learning model to classify breast cancer tumors as benign or malignant based on cell measurements. The dataset was obtained from the UCI Machine Learning Repository.
- Importing and exploring the breast cancer dataset
- Visualizing the feature distributions using histograms, pie charts, and boxplots
- Identifying correlations between features using a heatmap
- Preprocessing the data by handling missing values and encoding categorical labels
- Splitting the data into training and test sets
- Training and evaluating SVM and Random Forest classification models
- Comparing the precision of the models to identify the better performer
- Making predictions on a new sample data point
- The Random Forest model achieved the highest precision, with a score of 95%. Assuming malignant is treated as the positive class, this means that 95% of the tumors the model predicted as malignant in the test data were actually malignant.
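Precision and recall are easy to confuse, so a short worked example may help. Precision is the share of predicted positives that are truly positive, while recall is the share of actual positives that the model found. The confusion counts below are hypothetical and chosen only so that precision comes out at 95%; they are not the project's results.

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of actual positives that were detected."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts with malignant as the positive class.
tp = 38  # malignant tumors predicted as malignant
fp = 2   # benign tumors predicted as malignant
fn = 4   # malignant tumors predicted as benign

model_precision = precision(tp, fp)
model_recall = recall(tp, fn)
print(round(model_precision, 3), round(model_recall, 3))
```

For a screening task, recall often matters as much as precision, because a false negative means a malignant tumor was missed.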
The Jupyter Notebook and Python files for each project are in their respective folders. To run them, install the required libraries and execute the scripts.
Machine learning is the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.
This repository showcases my hands-on application of ML techniques to solve real-world problems using Python. The projects cover algorithms and techniques such as supervised learning, unsupervised learning, dimensionality reduction, and neural networks.