This repository includes the assignment notebooks, final project, and final exam work created as part of the Ben-Gurion University of the Negev course "The Art of Analyzing Big Data - The Data Scientist’s Toolbox".
The course covers principles, methodologies, and techniques for mining massive datasets. In it, we learned how to perform common data mining tasks, such as classification, clustering analysis, and recommendation, on large datasets using principles of parallel and distributed processing such as MapReduce. During the course, I had the opportunity to use state-of-the-art technologies for massive data mining. Each of the tasks tackles different machine learning and big data issues. Course website.
In the notebook, I tackled four different tasks.
It includes visualizing the daily Covid-19 cases in Ohio over time, which gives a high-level perspective of the infection rate and the so-called "waves". There are further in-depth analyses of hospitalization, vaccination, unemployment, and school data.
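A minimal sketch of that kind of "waves" chart: a bar plot of daily counts with a 7-day rolling mean on top to smooth day-to-day noise. The data below is synthetic (a simple increasing series), not the real Ohio case counts used in the notebook.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Synthetic placeholder for daily case counts
dates = pd.date_range("2020-03-01", periods=120, freq="D")
daily_cases = pd.Series(range(120), index=dates)

# 7-day rolling mean smooths reporting noise and exposes the "waves"
rolling = daily_cases.rolling(window=7).mean()

fig, ax = plt.subplots()
ax.bar(dates, daily_cases.values, color="lightgray", label="daily cases")
ax.plot(dates, rolling.values, color="crimson", label="7-day average")
ax.set(title="Daily Covid-19 cases (synthetic data)", ylabel="cases")
ax.legend()
fig.savefig("waves.png")
```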
Identifying teams of 3 members that competed together in more than 10 Kaggle competitions. For each community, multiple centrality measures were calculated: degree centrality (average degree and most central node), PageRank (average PageRank and most central node), and closeness centrality (average score and most central node).
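The centrality step can be sketched with networkx, assuming each team has already been extracted as a small subgraph. The member names and edge weights below are invented for illustration; in the real analysis, edge weights would come from the number of competitions two members entered together.

```python
import networkx as nx

# Hypothetical 3-member team; weight = competitions entered together
team = nx.Graph()
team.add_weighted_edges_from([
    ("alice", "bob", 12),
    ("bob", "carol", 11),
    ("alice", "carol", 14),
])

degree = nx.degree_centrality(team)
pagerank = nx.pagerank(team, weight="weight")
closeness = nx.closeness_centrality(team)

# Average score and most central node per measure
for name, scores in [("degree", degree), ("pagerank", pagerank),
                     ("closeness", closeness)]:
    avg = sum(scores.values()) / len(scores)
    central = max(scores, key=scores.get)
    print(f"{name}: avg={avg:.3f}, most central={central}")
```

In a complete 3-node graph every node ties on degree and closeness, so the weighted PageRank is what distinguishes the most central member.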
In this section, a model is trained to predict an athlete's achievement based on physical features, sport type, and the athlete's country. If a sufficiently trained model predicts that an athlete will lose, and yet the athlete wins a gold medal, that is an extraordinary achievement. The same holds in the other direction: if the model predicts that the athlete will win a gold medal, and the athlete wins no medal at all, that is a disappointing loss.
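The surprise logic above can be sketched by comparing a classifier's prediction with the actual outcome. Everything here is illustrative: the tiny dataset, the two physical features, and the choice of logistic regression are assumptions, not the notebook's actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: columns are height_cm, weight_kg
X_train = np.array([[190, 95], [170, 60], [185, 90], [165, 55],
                    [188, 92], [172, 62]])
y_train = np.array([1, 0, 1, 0, 1, 0])  # 1 = won gold

model = LogisticRegression().fit(X_train, y_train)

def surprise(features, won_gold):
    """Label an outcome relative to the model's expectation."""
    predicted_win = bool(model.predict([features])[0])
    if won_gold and not predicted_win:
        return "extraordinary achievement"
    if predicted_win and not won_gold:
        return "disappointing loss"
    return "expected"
```

For example, an athlete the model expects to lose who nevertheless takes gold would be flagged as an extraordinary achievement.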
Assignment 1 - DB, SQL, various datasets, sqlite3 package
Assignment 2 - Scraping with Beautiful Soup, working with APIs, pandas, and networkx.
Assignment 3 - Data visualization using turicreate, pandas, and seaborn.
Assignment 4 - Working with graphs.
Assignment 5 - Link prediction and graph analysis.
Assignment 6 - NLP, sentiment analysis, and classification.
Assignment 7 - From Unstructured Text to Structured Data - NLP, entity extraction, networks, and visualization.
Assignment 8 - GeoPandas, Plotly Express, and folium.
Assignment 9 - Extracting Data from Images and Sounds - working with pySpark, classifiers, map visualization, and more.
Assignment 10 - pySpark, heatmap visualizations, and folium.