Photo credit: Yacine Petitprez
This project predicts house prices in Pasig City, Philippines. The training data comes from real estate listings in the Philippines (via Kaggle). The model is built using two main approaches, TensorFlow Decision Forests and Simple ML for Sheets, with the Gradient Boosted Trees algorithm as a key component. A web application was also developed so that users can enter a property's details and receive an estimate of its price. This project is inspired by California House Prices (Kaggle).
To use the model, load it with Google's Yggdrasil Decision Forests (YDF), the library that powers TensorFlow Decision Forests, and call its predict method. The Jupyter notebook in the notebooks directory contains the code for training and testing with TensorFlow. Here's an example of loading the model and making a prediction:
# Load the model with YDF
import ydf
model = ydf.from_tensorflow_decision_forests("../models/pasig-model")
# Make predictions with the model
examples = {
    "Bedrooms": [2],
    "Bath": [2],
    "Floor_area_sqm": [104],
    "Latitude": [14.575822],
    "Longitude": [121.064324],
}
model.predict(examples)
# Output: array([32212794.], dtype=float32)
See the flask_app directory for the code that deploys the model as a web application.
To set up the project on your local machine, follow these steps:
- Clone the repository:
git clone https://github.com/ralphcajipe/pasig-house-prices-prediction.git
- Navigate to the project directory:
cd pasig-house-prices-prediction
- Install the required dependencies:
pip install -r requirements.txt
The project uses the following data source:
- Philippine Real Estate (Last updated 2022, 2 years ago)
The data for this project comes from the PH_houses_v2.csv file, which contains information about house prices in the Philippines. The dataset includes the following columns:
- Description: A brief description of the house.
- Location: The city where the house is located.
- Price (PHP): The price of the house in Philippine Pesos.
- Bedrooms: The number of bedrooms in the house.
- Bath: The number of bathrooms in the house.
- Floor_area (sqm): The floor area of the house in square meters.
- Land_area (sqm): The land area of the house in square meters.
- Latitude: The latitude coordinate of the house.
- Longitude: The longitude coordinate of the house.
- Link: A link to the online listing of the house.
The data for this project was extracted from a Kaggle dataset and saved as a CSV file, which was then split into training and testing sets (70% and 30%).
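The 70%-30% split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual script; the function name `split_rows`, the fixed seed, and the sample rows are assumptions made for the example.

```python
import random

def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows standing in for the cleaned listings.
rows = [{"id": i} for i in range(10)]
train, test = split_rows(rows)
print(len(train), len(test))  # prints: 7 3
```

Shuffling before cutting avoids a split that simply follows the order of the listings in the file.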
Before using the data to train our model, we performed extensive data cleaning with pandas. We removed features that are not usable for training the model and kept only listings in the city of Pasig, because the dataset also contains listings outside Pasig and outside Metro Manila.
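The two cleaning steps, filtering to Pasig and dropping unused columns, might look roughly like this. The project itself used pandas; this sketch swaps in plain Python lists and dicts so it runs without extra dependencies. The `KEPT_COLUMNS` selection, the `clean_listings` name, the substring match on Location, and the sample rows are all assumptions for illustration.

```python
# Columns kept for training. This selection is an assumption inferred from the
# features used in the prediction example, not copied from the project's notebook.
KEPT_COLUMNS = [
    "Price (PHP)", "Bedrooms", "Bath", "Floor_area (sqm)", "Latitude", "Longitude",
]

def clean_listings(rows, city="Pasig"):
    """Keep only listings whose Location mentions `city`, with unused columns dropped."""
    cleaned = []
    for row in rows:
        # Case-insensitive substring match (assumes values such as "Pasig" or "Pasig City").
        if city.lower() not in row.get("Location", "").lower():
            continue
        cleaned.append({col: row.get(col) for col in KEPT_COLUMNS})
    return cleaned

# Hypothetical rows for illustration only; they are not taken from the dataset.
raw_rows = [
    {"Description": "Townhouse", "Location": "Pasig", "Price (PHP)": 9500000,
     "Bedrooms": 3, "Bath": 2, "Floor_area (sqm)": 120, "Land_area (sqm)": 80,
     "Latitude": 14.57, "Longitude": 121.06, "Link": "https://example.com/1"},
    {"Description": "Bungalow", "Location": "Cebu", "Price (PHP)": 4200000,
     "Bedrooms": 2, "Bath": 1, "Floor_area (sqm)": 70, "Land_area (sqm)": 100,
     "Latitude": 10.31, "Longitude": 123.89, "Link": "https://example.com/2"},
]
pasig_rows = clean_listings(raw_rows)
print(len(pasig_rows))  # prints: 1
```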
We used the cleaned data to train our house price prediction model. The model was built using TensorFlow Decision Forests and Simple ML for Sheets, with the Gradient Boosted Trees algorithm being a key component.
To compare current property prices listed on Lamudi for Pasig City with our model's predictions, you'll need the latitude and longitude of the specific location you're interested in. Here's how you can obtain these coordinates:
- Identify the address of the property you're interested in.
- Enter this address into Google Maps.
- Right-click on the red pin that marks the location of the address.
- The latitude and longitude of the location will be displayed. Note that the latitude is the first number (on the left), and the longitude is the second number (on the right).
You can then input these coordinates into our project to perform the comparison.
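As a small sketch of that last step, the coordinate pair copied from Google Maps can be parsed and placed into the same input format used in the prediction example. The helper name `parse_coordinates` is hypothetical, and the sketch assumes the pair arrives as a single "latitude, longitude" string.

```python
def parse_coordinates(text):
    """Parse a 'latitude, longitude' string into a (lat, lon) tuple of floats."""
    lat_text, lon_text = text.split(",")
    lat, lon = float(lat_text), float(lon_text)  # float() ignores surrounding spaces
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {text!r}")
    return lat, lon

# Latitude comes first, longitude second, as noted in the steps above.
lat, lon = parse_coordinates("14.575822, 121.064324")

# Same feature names as the prediction example; the other values are placeholders.
examples = {
    "Bedrooms": [2],
    "Bath": [2],
    "Floor_area_sqm": [104],
    "Latitude": [lat],
    "Longitude": [lon],
}
```

The range check catches the most common mistake, swapping the two numbers, only when the swapped value falls outside the valid latitude range, so it is worth double-checking the order by eye as well.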
The project is organized as follows:
.
├── data/ - Contains the raw data files.
├── models/ - Contains the trained models.
├── notebooks/ - Contains the Jupyter notebooks.
├── scripts/ - Contains the Python scripts.
├── requirements.txt - Lists the Python dependencies.
└── README.md - The file that you are currently reading.
The project achieves a Root Mean Squared Error (RMSE) of 1162.05 on the training set. RMSE is used as the evaluation metric because the label (Price_PHP) is a numerical column and the model is trained to do regression. See the Jupyter notebook in the notebooks directory for the full details of the evaluation.
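For readers unfamiliar with the metric, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal sketch of the calculation follows; the prices are hypothetical and do not come from the project's evaluation.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices in PHP; each prediction is off by 300,000.
actual = [5_000_000, 8_000_000]
predicted = [5_300_000, 7_700_000]
print(rmse(actual, predicted))  # prints: 300000.0
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same unit as the label.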
1. Go to the Google Workspace Marketplace, search for Simple ML for Sheets, and install it.
a. After installing, open the Google Sheet that contains your data (you can download train.csv and test.csv for this).
b. Click on "Extensions" in the menu, then select "Simple ML for Sheets" > "Start".
c. In the Simple ML for Sheets sidebar, select the range of cells that contains your data.
d. Choose the column that you want to predict, then click "Train Model". Simple ML for Sheets will automatically choose a model based on your data. Make sure you are working in train.csv, and uncheck or remove the Location column, since it is not needed for training the model. The label is also unchecked, since it is the column the model is trained to predict.
e. After the model has been trained, switch to test.csv to make predictions. Remove the Price_PHP column from test.csv before making predictions, because leaving it in would cause data leakage.
f. The predictions are then added to your sheet (generally in the rightmost part of your dataset).
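If you prefer to remove the label column before uploading the test data rather than deleting it by hand in Sheets, a short script can do it. This is a sketch using Python's standard csv module on in-memory text; the `drop_column` helper and the sample rows are hypothetical.

```python
import csv
import io

def drop_column(csv_text, column):
    """Return csv_text with `column` removed from the header and every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = [name for name in reader.fieldnames if name != column]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the dropped column when writing rows.
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()

# Hypothetical test data; the label column is removed before prediction.
test_csv = "Price_PHP,Bedrooms,Bath\n32000000,2,2\n"
print(drop_column(test_csv, "Price_PHP"))
```

To apply it to a real file, read the file's contents into a string, pass it to the function, and write the result to a new file so the original test set with labels is preserved for evaluation.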
To run the web application server locally:
python server.py
If you want to run the server in production, you can use the following command:
waitress-serve --host 127.0.0.1 server:app
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.