Context
Consider the situation where you are working for Zillow as a data scientist
Housing pricing predictions is the goal.
We know 80 things about each house to use as inputs to be able to predict the price of a house.
Your goal is to isolate the important features from the dataset and build a model which can be used to predict the price of the houses.
Since there are too many features, PCA can be applied to reduce the number of features used for the actual prediction model, without any loss of information.
Data Cleaup and Exploratory Data Analysis
- Explore Basic Statistics of each feature
- Outlier Detection
- Missing value imputations
- Correlation Analysis
- Drop unnecessary Columns
- Apply Scaling to dataset to bring all variables to the same scale
- Feature Selection for isolating final set of variables for PCA
- Threshold for Variance
- Balance the number of features selected
- Fit model to cleaned-up dataset
- Comparative Study of with and without PCA