In this project I analysed a data sample of house sale prices in King County, Washington.
King_County_House_prices.csv contains the raw data, which serves as the basis for the analysis.
column_names.md contains explanations and details for the columns.
The notebook 200303_Karsten_Yan_Project1.ipnyb contains the main analysis of the data. I focused mainly on newly renovated or newly built houses (<= 5 years) and built my predictions and visualisations for that subset.
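
As a rough illustration of how such a subset could be selected, here is a minimal pandas sketch. The column names `yr_built`, `yr_renovated` and `sale_year`, the convention that `0` means "never renovated", and the toy rows are all assumptions made for this example; check column_names.md and the notebook for the actual definitions.

```python
import pandas as pd

# Toy rows standing in for the raw data. The column names yr_built,
# yr_renovated and sale_year are assumptions, not taken from column_names.md.
houses = pd.DataFrame({
    "price": [450000, 720000, 310000, 560000],
    "sale_year": [2015, 2014, 2015, 2014],
    "yr_built": [2012, 1965, 1980, 1950],
    "yr_renovated": [0, 2011, 0, 1990],  # 0 is assumed to mean "never renovated"
})


def add_years_since_update(df):
    """Return a copy with the years since the last renovation or construction."""
    out = df.copy()
    # Take the more recent of construction year and renovation year;
    # a renovation year of 0 can never be the larger value.
    last_update = out[["yr_built", "yr_renovated"]].max(axis=1)
    out["years_since_update"] = out["sale_year"] - last_update
    return out


def select_recent(df, max_age=5):
    """Keep houses built or renovated at most max_age years before the sale."""
    return df[df["years_since_update"] <= max_age]


recent = select_recent(add_years_since_update(houses))
print(recent[["price", "years_since_update"]])
```

Keeping the age computation and the filter in separate functions makes it easy to reuse the same feature for the model on all homes.
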
You can find the non-technical presentation in king_county_presentation.pdf or directly on Google Presentations.
The notebook is split into 9 sections:
- Business understanding:
  - formulation of target
- Data mining:
  - import of necessary modules and loading of the raw data into a pandas data frame
- Data cleaning:
  - removal of unnecessary columns (view and id) and conversion of sqft_basement to numerical values
- Feature engineering:
  - years since last renovation or construction
  - bathroom/bedroom ratio
  - zip code price ranks
  - quality
  - dummy variables
  - cleanup
  - definition of parameters
- Data exploration:
  - correlation heatmap
  - pairplots
- Statistical modeling:
  - brute-force forward selection: each pass tries every remaining variable, keeps the one that gives the highest adjusted R squared, and then restarts the loop with the previously chosen variables included
  - modeling for all homes according to search parameters
  - modeling for newly constructed homes (less than 5 years)
- Visualisation:
  - visualisations for newly constructed homes
  - comparison of quality, quantity and some basic features
- Summary
- Future work
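
The brute-force variable selection listed under the modeling step can be sketched roughly as follows. This is not the notebook's code: it fits ordinary least squares with NumPy instead of whatever modeling library the notebook uses, the feature names and the synthetic data are placeholders, and the rule of stopping once the adjusted R squared no longer improves is an assumption.

```python
import numpy as np


def adjusted_r2(X, y):
    """Adjusted R squared of an ordinary least squares fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def forward_select(features, y):
    """Greedy forward selection on adjusted R squared.

    features maps a name to a 1-D array. Returns (chosen names, best score).
    """
    chosen, best = [], -np.inf
    remaining = list(features)
    while remaining:
        # Score every remaining variable together with those already chosen.
        scores = {}
        for name in remaining:
            X = np.column_stack([features[f] for f in chosen + [name]])
            scores[name] = adjusted_r2(X, y)
        candidate = max(scores, key=scores.get)
        # Assumed stopping rule: quit when no variable improves the score.
        if scores[candidate] <= best:
            break
        chosen.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return chosen, best


# Synthetic stand-in data; names and coefficients are placeholders.
rng = np.random.default_rng(0)
n = 200
features = {
    "sqft_living": rng.normal(size=n),
    "bath_bed_ratio": rng.normal(size=n),
    "noise": rng.normal(size=n),
}
price = (3.0 * features["sqft_living"]
         + 1.5 * features["bath_bed_ratio"]
         + 0.5 * rng.normal(size=n))

chosen, best = forward_select(features, price)
print(chosen, round(best, 3))
```

Because each pass refits one model per remaining variable, the cost grows quadratically with the number of candidate variables, which is manageable for a feature set of this size.
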