First project for Introduction to ML with TensorFlow Nanodegree Program at Udacity
Optimized several supervised learners to predict the highest donation yield: 3.7x F-score (0.75 vs. 0.2 for the naive predictor) and +15% accuracy (0.869 vs. 0.752 for the naive predictor).
- Business Understanding
- Data Understanding: Explored data collected from the 1994 US Census, with 45,222 observations and 13 variables plus the target. This dataset is a modified version of the one published in the paper "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid" by Ron Kohavi. The paper is available online, and the original dataset is hosted on the UCI Machine Learning Repository.
- Data Preparation: Normalized numerical features, transformed skewed continuous features, and one-hot encoded categorical variables.
- Data Modeling: Compared and optimized different ensemble methods using GridSearchCV.
- Results Evaluation: Discussed effects of feature selection.
- Python Version: 3.8.5
- Packages: pandas, numpy, sklearn, matplotlib, seaborn
- visuals.py: A few auxiliary plot functions.
- finding_donors.ipynb: Main and only notebook.
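
The data preparation and modeling steps listed above can be illustrated with a minimal sketch. It is not the notebook's code: the tiny data frame, its column names, the choice of `np.log1p` for the skewed features, the random forest, the parameter grid, and the F-beta scorer with `beta=0.5` are all assumptions made for illustration, and the grid search runs on synthetic data rather than the census data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

# --- Data preparation on a made-up frame (column names are hypothetical) ---
df = pd.DataFrame({
    "age": [25.0, 40.0, 58.0, 33.0],
    "capital_gain": [0.0, 0.0, 15000.0, 2000.0],  # right-skewed feature
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
})
skewed = ["capital_gain"]
numerical = ["age", "capital_gain"]

prepared = df.copy()
# Compress the long tail of skewed features; log1p is an assumed choice
# that also handles zero values.
prepared[skewed] = np.log1p(prepared[skewed])
# Rescale every numerical feature to the [0, 1] range.
prepared[numerical] = MinMaxScaler().fit_transform(prepared[numerical])
# One-hot encode the categorical column.
prepared = pd.get_dummies(prepared, columns=["workclass"])

# --- Grid search on synthetic data (estimator and grid are assumptions) ---
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
scorer = make_scorer(fbeta_score, beta=0.5)  # beta=0.5 is an assumed setting
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    scoring=scorer,
    cv=3,
)
grid.fit(X, y)

print(list(prepared.columns))
print(grid.best_params_, round(grid.best_score_, 3))
```

In a real run, the scaler would be fit on the training split only and the grid would be searched on the prepared census features; the sketch skips the train/test split to stay short.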