Drivers of HDB Resale Price

Author: TeYang, Lau
Last Updated: 26 May 2023

Please refer to this notebook for a more detailed analysis of the project. If it takes a long time to load, the html file can also be downloaded.

Check out the interactive web app for Singapore HDB resale price prediction here!

Project Goals

To complete an end-to-end project, from scraping data to cleaning, modelling, and deploying the model
To identify the drivers of HDB resale prices in Singapore
To scrape and engineer additional features from online public datasets that might also influence resale prices
To deploy the model to a web app that predicts HDB resale prices for different flat features

About the Data

The HDB resale price data was downloaded from Data.gov.sg and contains ~800k resale transactions from 1990 to 2020.

Data Scraping and Feature Engineering

The names of schools, supermarkets, hawker centres, shopping malls, parks and MRT stations were downloaded or scraped from Data.gov.sg and Wikipedia, then passed to a function that uses the OneMap.sg API to get their coordinates (latitude and longitude). These coordinates were then passed to other functions that use the geopy package to compute the distance between locations. This gives, for each flat, the distance to the nearest amenity of each type, as well as the number of amenities of each type within a 2km radius.

The script for this can be found here.
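
The two amenity features described above can be sketched as follows. This is a minimal illustration and not the project's script: it substitutes a plain haversine formula for the geopy distance call so that it runs without dependencies, and the function names and coordinates are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def amenity_features(flat, amenities, radius_km=2.0):
    """Return (distance to the nearest amenity in km, count within radius_km)."""
    distances = [haversine_km(flat, amenity) for amenity in amenities]
    nearest = min(distances)
    count = sum(d <= radius_km for d in distances)
    return nearest, count


# Made-up coordinates for illustration only (not taken from the project's data)
flat = (1.3000, 103.8000)
hawkers = [(1.3000, 103.8090), (1.3100, 103.8000), (1.3500, 103.9000)]
nearest_km, n_within_2km = amenity_features(flat, hawkers)
print(f"nearest: {nearest_km:.2f} km, within 2km: {n_within_2km}")
```

Haversine treats the Earth as a sphere, so its distances can differ slightly from an ellipsoidal geodesic distance. Over a few kilometres the difference is small.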

EDA

From 2015 to 2019, 4 Room, 3 Room, 5 Room and Executive flat types made up the majority of resales, and their prices did not change much over those years. Resale price increased with the number of rooms and with floor area.

The changes in median price amongst the towns are not very large from 2018 to 2019, although prices for Toa Payoh and Central Area 4-room flats dropped by about 20%. Other factors might also influence the resale price in addition to the neighborhood/town location of the flats.

Unsurprisingly, flat models also have an effect on the resale price. Special models like the Type S1S2 (The Pinnacle@Duxton) and Terraces tend to fetch higher prices, while the older models from the 1960s to 1970s (Standard and New Generation models) tend to fetch lower prices.

The median distance of each town from the most frequented MRT station in Singapore appears to be negatively correlated with the town's median resale price, suggesting that this distance is a likely driver of how much people pay for HDB flats. Distances to the nearest amenities, such as hawker centres and malls, also appear to have a small relationship with price.

Linear Regression and Random Forest Performance

Linear regression was fitted using a statistical approach with no train-test split, and the model achieved an adjusted R² of 0.90. For the random forest, the data was split 9:1 into training and test sets, and the model was validated using both out-of-bag (OOB) and K-fold cross-validation. Both achieved a test R² of 0.96 and a mean absolute error of ~$20,000.
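
For readers unfamiliar with the metrics quoted above, here is a small dependency-free sketch of how R², adjusted R² and mean absolute error are defined. The helper names and the four prices are made up for illustration and are not the project's results or code.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """R² penalised for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up resale prices in SGD, for illustration only
actual = [300_000, 400_000, 500_000, 600_000]
predicted = [310_000, 390_000, 520_000, 580_000]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_features=1)
mae = mean_absolute_error(actual, predicted)
print(f"R2={r2:.2f}, adjusted R2={adj_r2:.2f}, MAE=${mae:,.0f}")
```

Adjusted R² is the natural choice for a regression fitted on the full dataset because it discounts the fit for every extra predictor, while MAE reads directly in dollars.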

Feature Importance

Feature importances from the two models differ slightly. Linear regression showed that region and floor area are the best predictors of resale price, while for random forest, floor area and distance from Dhoby Ghaut MRT are the best predictors.

SHAP values also provide local interpretability for individual predictions. Below are the SHAP force plots for flats with low, medium and high predicted prices, showing how much each feature contributes to the prediction for each flat.

Model Deployment to Web App

The random forest model was deployed to a web app using Streamlit. Try out the app here. Users can enter HDB flat features and get a predicted resale price. The app shows a map of Singapore with the location of the flat and the amenities within a 2km radius. It also displays a user-controlled interactive map of median HDB resale prices from 1990 to 2020.

Conclusion

In this project, linear regression and random forest were used to look at the drivers of HDB resale prices. Linear regression is powerful because it allows one to interpret the model by looking at its coefficient for every feature. However, it assumes a linear relationship between the features and the outcome, which isn't always the case in real life. It also tends to suffer from bias due to its parametric nature. Conversely, non-parametric methods do not assume any function or shape, and random forest is a powerful non-linear machine learning model that uses bootstrap aggregating (bagging) and ensembling. A single decision tree has high variance because it tends to overfit the data. Bagging and ensembling reduce this variance by averaging the predictions of many trees, each trained on a different bootstrap sample.
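
The variance-reduction effect of bagging can be seen in a toy simulation. This is a sketch under stated assumptions, not the project's model: a 1-nearest-neighbour predictor stands in for a fully grown tree, the linear signal and noise level are made up, and the per-split feature subsampling that random forest adds on top of bagging is left out.

```python
import random
import statistics


def true_signal(x):
    return 3.0 * x  # hypothetical underlying relationship


def draw_training_set(rng, n=30):
    """Sample noisy (x, y) pairs from the toy relationship."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_signal(x) + rng.gauss(0, 1)) for x in xs]


def nearest_neighbour_predict(train, x0):
    """High-variance learner: copy the target of the closest training point,
    much like a fully grown regression tree with one sample per leaf."""
    return min(train, key=lambda pair: abs(pair[0] - x0))[1]


def bagged_predict(train, x0, rng, n_models=50):
    """Average the learner over bootstrap resamples of the training set."""
    preds = []
    for _ in range(n_models):
        resample = [rng.choice(train) for _ in train]  # sample with replacement
        preds.append(nearest_neighbour_predict(resample, x0))
    return sum(preds) / n_models


rng = random.Random(0)
single, bagged = [], []
for _ in range(200):  # repeat over fresh training sets to estimate variance
    train = draw_training_set(rng)
    single.append(nearest_neighbour_predict(train, 0.5))
    bagged.append(bagged_predict(train, 0.5, rng))

var_single = statistics.variance(single)
var_bagged = statistics.variance(bagged)
print(f"single learner variance: {var_single:.3f}")
print(f"bagged learner variance: {var_bagged:.3f}")
```

Across repeated training sets, the bagged prediction at the same point should vary less than the single learner's prediction, which is the property the paragraph above describes.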

Looking at the output of the models, linear regression showed that region, floor area, flat model, lease commencement date and distance from hawker are the top 5 drivers of HDB prices. Random forest gave a slightly different result: floor area, lease commencement date and distance from hawker remained in the top 5, while distance from Dhoby Ghaut MRT and flat type also came out on top. This could be because tree-based models tend to give lower importance to categorical variables (region and flat model) due to the way they compute importance.

Nevertheless, the size of the flat, lease date, and certain aspects of location appear to be consistently the most important drivers of HDB resale prices.
