
This repository contains a Phase 2 Project for the Data Science Flex Program at the Flatiron School. This project uses linear regression, pandas, numpy, and exploratory data analysis with matplotlib and seaborn to predict and analyze home prices in the King County dataset.


dataeducator/real_estate_linear_regression


Real Estate Linear Regression

Image by Avi Waxman on Unsplash

Business Understanding

LandingPad Realtors is a real estate business that helps families with school-aged children relocate to King County and find a home that meets their family's needs. LandingPad provides potential homeowners with home purchase options within their ideal budget.

  • Stakeholder: LandingPad Realtors
  • Business Case: I have been hired by LandingPad to accurately predict housing prices within the King County housing market. Executives at LandingPad want to launch a multimedia campaign to reach their target audience of young families moving to the King County area, and they want a reliable model that can be refined over time as more information becomes available.

I will start by identifying the characteristics of homes that increase housing costs. The effect of each relevant feature will then be identified and communicated to the team at LandingPad. This project is grounded in a statistical analysis of house prices in the King County House Sales dataset and in building a multiple linear regression model that accurately predicts the sale price of a house in King County.

Data Understanding

In this project I will use the CRISP-DM method. The data selected for this project come from the:

  • King County House Sales Dataset found in kc_house_data.csv

The dataset can be found in the data folder of this repository, along with a file called column_names.md that describes the features within the dataset. More information about the features is available on the King County Assessor's website.

The King County House Sales Dataset includes sales data for 21,597 homes with 20 features including but not limited to:

| Name | Description | Final Datatype | Numeric or Categorical | Target or Feature |
|---|---|---|---|---|
| id | Unique identifier for a house | int | Numeric | Feature |
| date | Date house was sold | datetime | Numeric | Feature |
| price | Sale price (prediction target) | int | Numeric | Target |
| bedrooms | Number of bedrooms | int | Numeric | Feature |
| bathrooms | Number of bathrooms | float | Numeric | Feature |
| sqft_living | Square footage of living space in the home | int | Numeric | Feature |
| sqft_lot | Square footage of the lot | int | Numeric | Feature |
| floors | Number of floors (levels) in house | float | Numeric | Feature |
| waterfront | Whether the house is on a waterfront | float | Categorical | Feature |
| view | Quality of view from house | float | Categorical | Feature |
| condition | How good the overall condition of the house is. Related to the maintenance of house | int | Numeric | Feature |
| grade | Overall grade of the house. Related to the construction and design of the house | int | Numeric | Feature |
| yr_built | Year when house was built | int | Numeric | Feature |
| yr_renovated | Year when house was renovated | int | Numeric | Feature |
| lat | Latitude coordinate | float | Numeric | Feature |
| long | Longitude coordinate | float | Numeric | Feature |

Data Preparation

Import libraries and Visualization Packages

Importing libraries at the start of the notebook makes their modules and tools available throughout the project. The main libraries used in this project are:

  • pandas: a data analysis and manipulation library which allows for flexible reading, writing, and reshaping of data
  • numpy: a key library that brings the computational power of languages like C to Python
  • matplotlib: a comprehensive visualization library
  • seaborn: a data visualization library based on matplotlib

Select Data

Read in data from kc_house_data.csv using .read_csv() from the pandas library.
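As a minimal sketch of this step, the snippet below reads a CSV with pandas and parses the sale date. The helper name `load_house_data` and the two sample rows are assumptions made for illustration; they are not taken from the notebook.

```python
import io

import pandas as pd


def load_house_data(source):
    """Read house sales data from a path or file-like object and parse the date column."""
    return pd.read_csv(source, parse_dates=["date"])


# Tiny made-up sample standing in for kc_house_data.csv (values are illustrative only).
sample_csv = io.StringIO(
    "id,date,price,bedrooms,sqft_living\n"
    "1,2014-10-13,221900,3,1180\n"
    "2,2014-12-09,538000,3,2570\n"
)

houses = load_house_data(sample_csv)
print(houses.dtypes)
```

With the repository checked out, the same helper could presumably be pointed at the file in the data folder, for example `load_house_data("data/kc_house_data.csv")`.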

Clean the Data

To clean the data, I address missing data, placeholders, and datatypes. This is the most important step of the project: if the data is not appropriate for the model, the results will be inaccurate and the model will produce lackluster predictions.

To dig deeper into the data, I will:

  • Review the datatypes found within the entire dataframe
  • Address duplicates, missing and placeholder data
  • Address incorrect or incongruous datatypes for the model
  • Explore correlation between features
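The steps above can be sketched on a small made-up DataFrame. The "?" placeholder, the choice to fill missing waterfront values with 0, and the sample rows are all assumptions for illustration; the notebook may handle these cases differently.

```python
import numpy as np
import pandas as pd

# Made-up rows: one exact duplicate, one "?" placeholder, and missing waterfront values.
raw = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 4],
        "price": [300000, 450000, 450000, 600000, 520000],
        "sqft_living": ["1200", "1800", "1800", "?", "2100"],
        "waterfront": [0.0, np.nan, np.nan, 1.0, 1.0],
    }
)

# Review the datatypes: sqft_living is read as text because of the placeholder.
print(raw.dtypes)

# Drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Convert text to numbers; anything non-numeric (such as "?") becomes NaN.
clean["sqft_living"] = pd.to_numeric(clean["sqft_living"], errors="coerce")

# Assumed rule: treat a missing waterfront flag as "not on the waterfront".
clean["waterfront"] = clean["waterfront"].fillna(0.0)

# Drop rows that still lack a usable living-area value.
clean = clean.dropna(subset=["sqft_living"])

# Correlation of each numeric column with the target.
price_corr = clean.corr(numeric_only=True)["price"].sort_values(ascending=False)
print(price_corr)
```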

Build a Simple Linear Regression model

First, I set the dependent variable (y) to be the price. Then I chose the feature most highly correlated with price, sqft_living, as the baseline independent variable (X). Finally, I followed this methodology:

  • Build a linear regression using statsmodels
  • Describe the overall model performance
  • Interpret its coefficients

[Image: Simple_Linear_model_trendline]

Evaluation

This simple linear regression model is statistically significant overall, and explains 36.5% of the variance in house price. Both the intercept and the coefficient for sqft_living are statistically significant.

The intercept is a small negative number, so the model predicts a price of roughly $0 for a home with 0 square feet of living space. This is an extrapolation outside the observed data and has no practical meaning on its own.

The coefficient for sqft_living is about 157, which means that for each additional square foot of living space, I expect the price to increase by about $157.
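As a short worked example of this interpretation, the snippet below multiplies the reported slope by a change in living area. The function name is hypothetical, and the intercept is left out because only differences in predicted price are computed.

```python
SQFT_LIVING_COEF = 157  # approximate slope reported above, in dollars per square foot


def expected_price_change(extra_sqft, coef=SQFT_LIVING_COEF):
    """Expected change in predicted price for a given change in living area."""
    return coef * extra_sqft


print(expected_price_change(100))  # 15700
print(expected_price_change(500))  # 78500
```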

The results summary from the statsmodels ordinary least squares (OLS) model shows:

[Image: OLS_regression_simple]

Insights

Q1. Which neighborhoods have the highest average home price?


For the first question, I looked for correlations between attributes and used price as my target variable. I explored data related to this question using visualizations created with seaborn, plotly express and matplotlib.

[Image: Housing Prices in King County]

Q2. How does the number of bedrooms affect the sale price of a home?


For the second question, I removed features with high p-values and features that were highly correlated with one another, and truncated the data so that it better satisfied the assumptions of a linear regression model:

  • linearity: the predictor features have a linear relationship with the target
  • normality: the model residuals are approximately normally distributed (a bell-shaped curve)
  • homoscedasticity: the residuals have roughly constant variance across the range of predictions
  • no multicollinearity: the predictor features are not highly correlated with one another

I explored data related to this question using visualizations created with seaborn, plotly express and matplotlib.
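One way to flag highly correlated predictors before removing them is sketched below. The helper `correlated_pairs`, the 0.75 threshold, and the synthetic data are assumptions for illustration; the notebook may use a different cutoff or method.

```python
import numpy as np
import pandas as pd


def correlated_pairs(features, threshold=0.75):
    """Return (feature_a, feature_b, r) for feature pairs with |r| above threshold."""
    corr = features.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic predictors: bathrooms tracks sqft_living closely, sqft_lot is independent.
rng = np.random.default_rng(1)
sqft_living = rng.uniform(500, 4000, size=300)
features = pd.DataFrame(
    {
        "sqft_living": sqft_living,
        "bathrooms": sqft_living / 1000 + rng.normal(0, 0.3, size=300),
        "sqft_lot": rng.uniform(2000, 20000, size=300),
    }
)

for a, b, r in correlated_pairs(features):
    print(f"{a} and {b} are highly correlated (r = {r:.2f})")
```

Keeping only one feature from each flagged pair is a common way to reduce multicollinearity, though which one to keep is a modeling judgment.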

[Image: Q2_visualization]

Q3. How does proximity to a highly rated school affect the sale price of a home?


For the third question, I used data from the greatschools website and created a function that calculated the distance to the closest school rated Above Average or higher (between 7 and 10, inclusive).

[Image: Q3_visualization]
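A function like the one described could look roughly like the sketch below. It assumes great-circle (haversine) distance and a list of school records with `lat`, `long`, and `rating` keys; the school entries and coordinates are made up, and the notebook's actual function may differ.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_to_nearest_top_school(home_lat, home_long, schools, min_rating=7):
    """Miles to the closest school rated min_rating to 10, or None if there is none."""
    distances = [
        haversine_miles(home_lat, home_long, s["lat"], s["long"])
        for s in schools
        if min_rating <= s["rating"] <= 10
    ]
    return min(distances) if distances else None


# Hypothetical schools; names, coordinates, and ratings are illustrative only.
schools = [
    {"name": "School A", "lat": 47.60, "long": -122.33, "rating": 8},
    {"name": "School B", "lat": 47.61, "long": -122.20, "rating": 5},
    {"name": "School C", "lat": 47.70, "long": -122.30, "rating": 9},
]

nearest = distance_to_nearest_top_school(47.61, -122.20, schools)
print(f"{nearest:.1f} miles to the nearest highly rated school")
```

School B sits at the home's own coordinates but is skipped because its rating is below 7, which shows why the rating filter has to be applied before taking the minimum.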

Recommendations

  • Use the interactive map to curate a set of listings with between 2 and 5 bedrooms.

  • Use the interactive map to narrow down homes that have a minimum sale price of $470,000.

  • Show families homes that are within 10 miles of a school rated Above Average or higher. This will allow LandingPad to reach a broader set of homeowners within its target market.
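The three recommendations can be combined into a single listing filter, sketched below with pandas. The column `miles_to_top_school`, the helper `shortlist`, and the sample listings are hypothetical and used only to illustrate the thresholds above.

```python
import pandas as pd


def shortlist(listings, min_bedrooms=2, max_bedrooms=5,
              min_price=470_000, max_school_miles=10):
    """Keep listings that meet the bedroom, price, and school-distance criteria."""
    mask = (
        listings["bedrooms"].between(min_bedrooms, max_bedrooms)  # inclusive bounds
        & (listings["price"] >= min_price)
        & (listings["miles_to_top_school"] <= max_school_miles)
    )
    return listings[mask]


# Made-up listings for illustration.
listings = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "bedrooms": [3, 1, 4, 5, 6, 3],
        "price": [500_000, 600_000, 450_000, 470_000, 700_000, 800_000],
        "miles_to_top_school": [4.0, 2.0, 3.0, 10.0, 1.0, 12.5],
    }
)

print(shortlist(listings))
```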

Future Work

Moving forward, I would like to explore the effect of distance to local attractions (e.g., parks, third places, places of worship) on sale price in the King County dataset.

Repository Structure

.
└── real_estate_linear_regression/
    ├── README.md
    ├── final_project_phase_2.ipynb
    ├── notebook.pdf
    ├── presentation.pdf
    ├── Images/
    └── .gitignore
