"Big Data II" subject of the master's degree in data science at the University of Granada.

AlArgente/BigData-II

BigData-II

In this assignment I used Spark to solve several Big Data problems. The dataset we worked with is the Higgs dataset, restricted to a subset of 1 million instances so as not to overload the university's servers.

This dataset is imbalanced: 90% of the instances belong to the majority class and the remaining 10% to the minority class.
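To make the imbalance concrete, here is a minimal sketch in plain Python (not the Spark code from this repository) that computes the class counts and the imbalance ratio, i.e. majority size divided by minority size. The function name `imbalance_ratio` and the toy labels are assumptions made for illustration; only the 90% / 10% split comes from the description above.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (class counts, majority size / minority size) for an iterable of labels."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Toy labels mirroring the 90% / 10% split (0 = majority, 1 = minority).
labels = [0] * 90 + [1] * 10
counts, ir = imbalance_ratio(labels)
print(dict(counts), ir)
```

With a 90% / 10% split the ratio is 9, meaning there are nine majority instances for every minority instance.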

To deal with the imbalance, we had to apply preprocessing algorithms that rebalance the classes. I used both Random Oversampling (ROS), which creates more data for the minority class, and Random Undersampling (RUS), which reduces the amount of data from the majority class. I also used HMe_BD, which relies on Random Forest, to remove noise from the data, and FCNN_MR for instance selection.
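The idea behind ROS and RUS can be sketched in a few lines of plain Python. This is an illustration only, not the Spark implementation used in this repository: the function names, the toy rows, and the choice to balance the classes to a 1:1 ratio are all assumptions, and the sketch assumes the majority class is at least as large as the minority class.

```python
import random


def random_oversample(majority, minority, seed=0):
    """ROS: draw minority rows with replacement until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_undersample(majority, minority, seed=0):
    """RUS: keep a random subset of majority rows (no replacement) matching the minority size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority


# Toy rows with a 90 / 10 split; each row is (class name, row id).
majority = [("maj", i) for i in range(90)]
minority = [("min", i) for i in range(10)]

ros_maj, ros_min = random_oversample(majority, minority)
rus_maj, rus_min = random_undersample(majority, minority)
print(len(ros_maj), len(ros_min))  # both classes now have 90 rows
print(len(rus_maj), len(rus_min))  # both classes now have 10 rows
```

ROS keeps every original row but duplicates minority rows, so the training set grows. RUS shrinks the training set and discards majority rows, which is cheaper but loses information.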

The baseline classifiers I used were Decision Tree and Random Forest. To try another tree ensemble, I also used PCARD. I could have used kNN as the final method, but I chose Naïve Bayes instead, because we had already covered kNN in the practical classes.

The results are in the ResultFiles folder, while the code used is in the ScriptUsed folder.
