Real Estate Linear Regression

Business Understanding

LandingPad Realtors is a real estate business that helps families with school-aged children relocate to King County and find the perfect home to meet their family's needs. LandingPad provides potential homeowners with home purchase options within their ideal budget.

Stakeholder: LandingPad Realtors
Business Case: I have been hired by LandingPad to accurately predict housing prices within the King County housing market. Executives at LandingPad want to launch a multimedia campaign to reach their target audience of young families moving to the King County area, and they want a reliable model that can be refined over time as more information becomes available.

First, I will identify the characteristics of homes that increase housing costs. The effect of each relevant feature will then be estimated and communicated to the team at LandingPad. This project is grounded in a statistical analysis of house prices in the King County House Sales dataset and in building a multiple linear regression model that accurately predicts the sale price of a house in King County.

Data Understanding

This project follows the CRISP-DM process. The data used in this project comes from:

King County House Sales Dataset found in kc_house_data.csv

The dataset can be found in the data folder of this repository, along with a file called column_names.md that describes the features in the dataset. More information about the features is available on the King County Assessor's website.

The King County House Sales Dataset includes sales data for 21,597 homes with 20 features including but not limited to:

Name	Description	Final Datatype	Numeric or Categorical	Target or Feature
`id`	Unique identifier for a house	`int`	Numeric	Feature
`date`	Date house was sold	`datetime`	Numeric	Feature
`price`	Sale price (prediction target)	`int`	Numeric	Target
`bedrooms`	Number of bedrooms	`int`	Numeric	Feature
`bathrooms`	Number of bathrooms	`float`	Numeric	Feature
`sqft_living`	Square footage of living space in the home	`int`	Numeric	Feature
`sqft_lot`	Square footage of the lot	`int`	Numeric	Feature
`floors`	Number of floors (levels) in house	`float`	Numeric	Feature
`waterfront`	Whether the house is on a waterfront	`float`	Categorical	Feature
`view`	Quality of view from house	`float`	Categorical	Feature
`condition`	How good the overall condition of the house is. Related to the maintenance of house	`int`	Numeric	Feature
`grade`	Overall grade of the house. Related to the construction and design of the house	`int`	Numeric	Feature
`yr_built`	Year when house was built	`int`	Numeric	Feature
`yr_renovated`	Year when house was renovated	`int`	Numeric	Feature
`lat`	Latitude coordinate	`float`	Numeric	Feature
`long`	Longitude coordinate	`float`	Numeric	Feature

Data Preparation

Import libraries and Visualization Packages

Importing libraries at the start makes their modules and tools available throughout the project and keeps the later tasks manageable to implement. The main libraries used in this project are:

pandas: a data analysis and manipulation library which allows for flexible reading, writing, and reshaping of data
numpy: a key library that brings the computational power of languages like C to Python
matplotlib: a comprehensive visualization library
seaborn: a data visualization library based on matplotlib

Select Data

Read in data from kc_house_data.csv using .read_csv() from the pandas library.
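A minimal sketch of this step, assuming the file sits at data/kc_house_data.csv as described above. The helper name load_housing_data is hypothetical, and parsing the date column at read time is one option rather than necessarily what the notebook does.

```python
import pandas as pd


def load_housing_data(csv_source):
    """Read the King County sales file into a DataFrame.

    `csv_source` may be a file path or any file-like object that
    pandas.read_csv accepts. The `date` column is parsed as a datetime.
    """
    return pd.read_csv(csv_source, parse_dates=["date"])


# Typical usage from the repository root (path taken from the text above):
# df = load_housing_data("data/kc_house_data.csv")
# df.info()
```

Calling df.info() afterwards is a quick way to confirm the row count and the datatype of each column before cleaning.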

Clean the Data

To clean the data, I address missing data, placeholders and datatypes. This is the most important step of the project: if the data is not appropriate for the model, the results will be inaccurate and the model will produce lackluster predictions.

To dig deeper into the data, I will:

Review the datatypes found within the entire dataframe
Address duplicates, missing and placeholder data
Address incorrect or incongruous datatypes for the model
Explore correlation between features
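The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the function names, the placeholder strings ("?" and the empty string), and the policy of dropping rows with a missing price are all hypothetical choices.

```python
import numpy as np
import pandas as pd


def clean_housing_data(df, placeholders=("?", "")):
    """Return a cleaned copy: duplicates dropped, placeholders as NaN, columns typed."""
    cleaned = df.drop_duplicates().copy()
    # Treat placeholder strings as missing values (the placeholder list is an assumption).
    cleaned = cleaned.replace(list(placeholders), np.nan)
    # Convert the sale date to a datetime if the column is present.
    if "date" in cleaned.columns:
        cleaned["date"] = pd.to_datetime(cleaned["date"])
    # Example policy only: a row without a target value cannot be used for fitting.
    cleaned = cleaned.dropna(subset=["price"])
    cleaned["price"] = cleaned["price"].astype(float)
    return cleaned


def price_correlations(df, target="price"):
    """Pearson correlation of every numeric column with the target, highest first."""
    corr = df.select_dtypes(include="number").corr()[target]
    return corr.drop(target).sort_values(ascending=False)
```

Running df.dtypes and df.isna().sum() before and after clean_housing_data is a simple way to review the datatypes and confirm that the missing values were handled.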

Build a Simple Linear Regression model

First, I set the dependent variable (y) to be the price. Then I chose the feature most highly correlated with price as the baseline independent variable (X). Finally, I followed this methodology:

Build a linear regression using statsmodels
Describe the overall model performance
Interpret its coefficients

Evaluation

This simple linear regression model is statistically significant overall, and explains 36.5% of the variance in house price. Both the intercept and the coefficient for sqft_living are statistically significant.

The intercept is a small negative number, meaning the model predicts a price of around $0 for a home with 0 square feet of living space.

The coefficient for sqft_living is about 157, which means that for each additional square foot of living space, I expect the price to increase by about $157.
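As a worked example of reading the coefficient, the short sketch below multiplies the reported slope of about 157 by a change in living area. The function name is hypothetical, and the intercept is left out because it cancels when comparing two homes under the same model.

```python
SLOPE_USD_PER_SQFT = 157  # approximate sqft_living coefficient reported above


def expected_price_change(extra_sqft, slope=SLOPE_USD_PER_SQFT):
    """Expected change in predicted price for a change in living area."""
    return slope * extra_sqft


# A home with 500 more square feet of living space than an otherwise
# comparable home is predicted to sell for 157 * 500 dollars more.
print(expected_price_change(500))  # 78500
```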

The results summary from the statsmodels ordinary least squares (OLS) model shows:

Insights

Q1. Which neighborhoods have the highest average home price?

For the first question I looked for correlations between attributes and used price as my target variable. I explored data related to this question using visualizations created with seaborn, plotly express and matplotlib.
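One way to rank areas by average sale price is a simple group-and-average. The sketch below assumes a zipcode column is used as a stand-in for neighborhood; that column is not in the feature table above, so treat both the column name and the function name as assumptions.

```python
import pandas as pd


def average_price_by_area(df, area_col="zipcode", top_n=10):
    """Mean sale price per area, highest first, limited to the top_n areas."""
    return (
        df.groupby(area_col)["price"]
        .mean()
        .sort_values(ascending=False)
        .head(top_n)
    )
```

The resulting Series can be passed directly to a seaborn or matplotlib bar plot to visualize the ranking.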

Q2. How does the number of bedrooms affect the sale price of a home?

For the second question, I removed features with high p-values and features that were highly correlated with one another, then truncated the data so that it better satisfied the assumptions of a linear regression model:

linearity : the predictor features have a linear relationship with the target
normality : the model residuals are approximately normally distributed (a bell-shaped curve)
homoscedasticity : the residuals have roughly constant variance across the range of predicted values
no multicollinearity : the predictor features are not highly correlated with one another

I explored data related to this question using visualizations created with seaborn, plotly express and matplotlib.
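A quick screen for highly correlated predictors can be sketched with pandas alone. The 0.75 cutoff and the function name are assumptions chosen for illustration, not values taken from the notebook; the input should contain predictor columns only, with price left out.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(features, threshold=0.75):
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = features.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
```

For each flagged pair, one reasonable choice is to keep the feature that correlates more strongly with price and drop the other, then refit and re-check the p-values.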

Q3. How does proximity to a highly rated school affect the sale price of a home?

For the third question, I used data from the GreatSchools website and created a function that calculates the distance from each home to the closest school rated Above Average or higher (between 7 and 10, inclusive).
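A function of this kind could look like the sketch below. It uses the straight-line (haversine) distance between the lat and long coordinates, which is an assumption; the notebook may measure distance differently. The function names and the (lat, lon, rating) tuple layout for schools are hypothetical.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


def nearest_top_school_miles(home_lat, home_lon, schools, min_rating=7):
    """Distance to the closest school rated min_rating or higher.

    `schools` is an iterable of (lat, lon, rating) tuples (hypothetical layout).
    Returns NaN when no school meets the rating cutoff.
    """
    eligible = [(lat, lon) for lat, lon, rating in schools if rating >= min_rating]
    if not eligible:
        return float("nan")
    lats, lons = np.array(eligible).T
    return float(haversine_miles(home_lat, home_lon, lats, lons).min())
```

Applying nearest_top_school_miles to each row's lat and long values would produce a new distance feature that can be added to the regression.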

Recommendations

Use the interactive map to curate a set of listings with between 2 and 5 bedrooms.
Use the interactive map to narrow the listings to homes with a minimum sale price of $470,000.
Show families homes that are within 10 miles of an Above Average school. This will allow LandingPad to reach a broader set of potential homeowners within its target market.
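The three recommendations can be combined into one filter over the listings. This is a sketch only: the function name and the school_distance_miles column are hypothetical, and the column is assumed to hold each home's distance to the nearest Above Average school.

```python
import pandas as pd


def curate_listings(
    df,
    min_bedrooms=2,
    max_bedrooms=5,
    min_price=470_000,
    max_school_miles=10,
    distance_col="school_distance_miles",  # hypothetical column name
):
    """Keep listings that satisfy all three recommendations at once."""
    mask = (
        df["bedrooms"].between(min_bedrooms, max_bedrooms)  # inclusive on both ends
        & (df["price"] >= min_price)
        & (df[distance_col] <= max_school_miles)
    )
    return df[mask]
```

Keeping the thresholds as parameters makes it easy to adjust the curated set as LandingPad refines its target market.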

Future Work

Moving forward, I would like to explore the effect of distance to local attractions (e.g., parks, third places, places of worship) on sale price in the King County dataset.

Repository Structure

.
└── real_estate_linear_regression/
    ├── README.md
    ├── final_project_phase_2.ipynb
    ├── notebook.pdf
    ├── presentation.pdf
    ├── Images/
    └── .gitignore
