BigData-II

"Big Data II" subject of the master's degree in data science at the University of Granada.

In this practice I had to use Spark to solve some problems of Big Data. The dataset we had to work with is the Higg's dataset, but using 1 million of the data to don't overload the servers from the university.

This dataset is unbalanced, containing 90% of the data for the mayority class, and the remaining 10% of the minority class.

To solve this problem we have to use some preprocessing algorithms to oversample the minority class. I used both Random Oversampling (ROS) to create more data for the minority class, and Random Undersampling (RUS) to decrease the amount of data from the mayority class. Also I used HMe_BD to remove noise from the data by using Random Forest, and I used FCNN_MR for feature selection.

The base methods I used were Decision Tree and Random Forest. With the idea of using more tree ensembles, I used PCARD. I could use kNN as the final method, but I chose Naïve Bayes, because we saw knn in the practical classes.

The results are in the ResultFiles folder, while the code used is in the ScriptUsed folder.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
ResultFiles		ResultFiles
ScriptsUsed		ScriptsUsed
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BigData-II

About

Releases

Packages

Languages

AlArgente/BigData-II

Folders and files

Latest commit

History

Repository files navigation

BigData-II

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages