Data Analysis and Machine Learning on Enron Dataset

This notebook demonstrates a data analysis of the Enron dataset. The goal is to find the machine learning algorithm that achieves the best precision and recall. Each algorithm's job is to correctly classify the POIs (persons of interest) in the dataset. POIs are the people I am interested in because I believe they are strongly connected to the Enron scandal. The POIs were chosen manually and are provided by Udacity's "Intro to Machine Learning" course. You can think of this notebook as part of the final project assignment for that course.

The way it is organized

Choose features of my interest
Perform basic data analysis
Find outliers, and remove them when needed
Run various machine learning algorithms
Compare the results
Confirm the best result
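The outlier step in the list above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `find_outliers` helper, the Tukey-fence rule (1.5 times the interquartile range), the record names, and the sample values are all assumptions made for the example, as is the treatment of the string 'NaN' as a missing value.

```python
from statistics import quantiles


def find_outliers(records, feature, k=1.5):
    """Return the keys whose `feature` value lies outside the Tukey fences."""
    # Keep only numeric values; anything else (e.g. the string 'NaN') is
    # treated as missing. That convention is an assumption of this sketch.
    values = {
        name: rec[feature]
        for name, rec in records.items()
        if isinstance(rec.get(feature), (int, float))
    }
    q1, _, q3 = quantiles(values.values(), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [name for name, value in values.items() if value < low or value > high]


# Synthetic records, invented for illustration only.
records = {
    "person_a": {"bonus": 100},
    "person_b": {"bonus": 120},
    "person_c": {"bonus": 110},
    "person_d": {"bonus": 130},
    "person_e": {"bonus": 90},
    "person_f": {"bonus": "NaN"},
    "aggregate_row": {"bonus": 5000},
}

outliers = find_outliers(records, "bonus")
# Remove the flagged rows only after checking them by hand.
cleaned = {name: rec for name, rec in records.items() if name not in outliers}
print(outliers)
```

Flagged rows are candidates rather than automatic deletions: a genuine but extreme value may be exactly the signal a POI classifier needs, which is why the list says "when needed".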

Machine Learning Part

Run a basic DecisionTree classifier on the raw data
Run a basic DecisionTree classifier on the data with outliers removed
Define a function to measure accuracy, precision, and recall metrics
Define a function to run a Pipeline with SelectKBest and GridSearchCV
Run different kinds of ML algorithms with a number of different parameters
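The two helper functions mentioned above might look roughly like this with scikit-learn. It is a sketch under stated assumptions, not the notebook's code: the names `evaluate` and `run_pipeline`, the `f_classif` scoring function, the 3-fold stratified split, the parameter grid, and the synthetic data are all hypothetical choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall for binary predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }


def run_pipeline(clf, param_grid, X, y, k_values=(1, 2, 3)):
    """Grid-search feature count and classifier parameters together."""
    pipe = Pipeline([("select", SelectKBest(f_classif)), ("clf", clf)])
    # Prefix classifier parameters with the pipeline step name.
    grid = {"select__k": list(k_values)}
    grid.update({f"clf__{name}": values for name, values in param_grid.items()})
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    search = GridSearchCV(pipe, grid, scoring="f1", cv=cv)
    search.fit(X, y)
    return search


# Synthetic, imbalanced data standing in for the Enron features (assumption).
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=3, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

search = run_pipeline(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 4], "min_samples_split": [2, 5]},
    X_train,
    y_train,
)
scores = evaluate(y_test, search.predict(X_test))
print(search.best_params_)
print(scores)
```

Putting SelectKBest inside the Pipeline means feature selection is refit on each training fold, so the cross-validated score is not inflated by information from the held-out fold.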

Decision Tree Classifier
AdaBoost Classifier
Random Forest Classifier
Support Vector Machine Classifier
Gaussian Naive Bayes Classifier
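As a rough illustration of comparing the five classifiers listed above on equal footing, the sketch below scores each one with cross-validated F1. The default hyperparameters, the 3-fold split, and the synthetic data are assumptions for the example; the notebook itself tunes each model over a range of parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One untuned instance per model family (hyperparameters are placeholders).
classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Support Vector Machine": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Synthetic, imbalanced data standing in for the Enron features (assumption).
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=3, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {
    name: cross_val_score(clf, X, y, scoring="f1", cv=cv).mean()
    for name, clf in classifiers.items()
}

# Print the models from best to worst mean F1.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Using the same folds for every model keeps the comparison fair, since each classifier sees identical training and validation splits.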

Result

F1 Score Result

Accuracy Score Result

Conclusion

The best model I could find is AdaBoost. It performed best with the parameters below.

feature list: 'poi', 'bonus'
algorithm: 'SAMME.R'
learning rate: 0.05
number of estimators: 30

The scores are:

accuracy: 0.827
f1: 0.7159
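The configuration above could be reproduced along these lines with scikit-learn's AdaBoostClassifier. This is a sketch under assumptions: it takes 'poi' to be the label and 'bonus' the single input feature, and it uses synthetic one-feature data, so its scores will not match the ones reported here. The `algorithm='SAMME.R'` setting is omitted because that option was the default in older scikit-learn releases and has been deprecated in newer ones; on an older release it can be passed explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Parameters taken from the conclusion; `algorithm` is left at the
# library default (see the note above about 'SAMME.R').
clf = AdaBoostClassifier(n_estimators=30, learning_rate=0.05, random_state=42)

# Synthetic single-feature data standing in for 'bonus' (assumption).
X, y = make_classification(
    n_samples=200,
    n_features=1,
    n_informative=1,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, zero_division=0)
print(f"accuracy: {accuracy:.3f}")
print(f"f1: {f1:.3f}")
```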

This model achieved the best F1 score compared to the other models: Decision Tree, Gaussian Naive Bayes, Random Forest, and Support Vector Machine. It did so while using only 2 features, which I think suggests that the model is not heavily overfitted. Furthermore, it also achieved the best accuracy among the models using the same number of features.

Reference

conceptual study

https://www.udacity.com

programming reference

error reference
