Supervised Regression Machine Learning Model Building #2

b1llywitant0/Philadelphia-Property-Value-Prediction

AlphaEngineer_JC_DS_FT_BSD_JKT_15_FinalProject

Philadelphia Property Value Prediction

By: Billy Witanto, Muhammad Rivaldi Prabowo, Vinsensia Fresian Meiliana

These notebooks serve as the final project for the Job Connector-Data Science and Machine Learning program at Purwadhika Start-up and Coding School.

Background

Philadelphia is one of the hottest real estate markets in the US. Although prices rise sharply every year, its property prices are still surprisingly low compared with other US cities. With the home demand index lower in 2022 than in 2021, this is the best time for property agents to actively promote their remaining listings. In this project, we support that promotion by providing predicted market values of properties obtained from a machine learning model. We hope this will reduce the cost of property appraisal. In addition, a market value justified by property characteristics and location benefits both buyer and seller, which we hypothesize will lead to more successful property transactions.

Dataset Source

This is real-world data on properties in Philadelphia. You can download it from Kaggle or from the City of Philadelphia: Metadata Catalog.

Methodology

Data Understanding

Because the dataset is very large, with many columns and rows, we first need to know what information it contains. After a thorough investigation, we summarized the columns in a Spreadsheet.

Data Cleaning

First, based on the column descriptions, we keep only the columns that are informative enough to help us fill missing values and correct anomalies in the dataset, which reduces the number of columns. Then we perform an initial exploratory data analysis (EDA) to check that the data match the column descriptions and to detect and correct missing values and anomalies. This section results in a clean dataset, free of missing values. Jupyter

Detailed EDA

We perform a detailed EDA to further understand the characteristics and correlations of the features and the label, in order to determine the proper preprocessing of the data. This section also gives us insights for feature engineering. Jupyter Tableau Visualization

Feature Selection

We select the features used in model building based on evidence from the detailed EDA and on domain knowledge.

Modeling

In the model-building process, we consider linear regression, random forest regression, and XGBoost regression as candidate models. First, we build a base model without any feature engineering, handling only contextual outliers. In the second, improved model, we include feature engineering. At each stage, we select the best model based on its MAPE (mean absolute percentage error) and the standard deviation of that score in cross-validation. We then use the best model to predict the test data and compare the predicted values with the actual property values to evaluate the model. The model is improved as necessary.
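The selection rule described above (lowest mean MAPE across folds, with the standard deviation as a secondary criterion) can be sketched with the standard library alone. This is a minimal illustration, not the notebooks' code: the two stand-in models, the helper names (`k_fold_indices`, `cross_validate`, `fit_median`, `fit_price_per_area`), and the synthetic data are all assumptions made for the example. In practice the candidates would be the linear regression, random forest, and XGBoost regressors.

```python
import random
import statistics

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.14 means 14%)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, X, y, k=5):
    """Return the per-fold MAPE for a model given as fit(X, y) -> predict(x)."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(mape([y[i] for i in fold], [predict(X[i]) for i in fold]))
    return scores

# Two stand-in models (hypothetical, NOT the models used in the notebooks).
def fit_median(X, y):
    m = statistics.median(y)          # always predict the median training value
    return lambda x: m

def fit_price_per_area(X, y):
    rate = statistics.median(v / a for a, v in zip(X, y))
    return lambda x: rate * x         # value = median price per unit area * area

# Synthetic data for illustration: value roughly proportional to area, plus noise.
rng = random.Random(42)
areas = [rng.uniform(600, 3000) for _ in range(200)]
values = [150 * a * rng.uniform(0.8, 1.2) for a in areas]

results = {}
for name, fit in {"median": fit_median, "price_per_area": fit_price_per_area}.items():
    scores = cross_validate(fit, areas, values)
    results[name] = (statistics.mean(scores), statistics.stdev(scores))

# Lowest mean MAPE wins; the standard deviation breaks ties.
best = min(results, key=lambda name: results[name])
print(best, results[best])
```

Comparing `(mean, std)` tuples keeps the rule explicit: a model is preferred first for accuracy and then for stability across folds.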

Results

Base Model

Jupyter

Model Selection with Cross-Validation

From the result above, it is not surprising that linear regression has a MAPE of around 85%, since the data do contain outliers. XGBoost regression, with a slightly lower standard deviation than linear regression, has a MAPE of around 34%, which is still not good enough (Reference). The random forest regressor proved to be the best model, with a good MAPE (14.475%) and the lowest standard deviation (0.00447) of the three models. Thus, we use the random forest model to predict the test data.

Predicting Test Data

From the result above, the model predicted the test data well, with a MAPE of 13.18%. A MAPE below 10% indicates excellent accuracy of a prediction model, while 10-20% indicates good accuracy (Reference). Can we improve the model?

Model Improvement with Feature Engineering

Jupyter

Model Selection with Cross-Validation

After adding and removing certain features, the MAPE of the random forest regressor increased slightly (14.58%) but remains comparable to the previous model.

Predicting Test Data

From the result above, based on the MAPE value (13.35%), feature engineering did not improve the model. However, it is important to note that in this step we dropped the building code description, which has more than 400 unique values; that probably explains why the previous model performs slightly better. Since the difference is only 0.17 percentage points, we can say that the quality of the model is unchanged. It is also worth noting that this model's goodness of fit is better than the previous model's, so it is more representative of the test data. There is also a reduction in MSE, probably because the improved features reduce the effect of outliers.
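One way to see how MSE can fall while MAPE stays the same is a small numeric sketch. The values below are made up for illustration and are not taken from the dataset or the notebooks. Because MSE squares absolute errors, a single expensive property dominates it, whereas MAPE weights every property by its relative error.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error in squared value units."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values for illustration only.
actual = [100_000, 200_000, 10_000_000]
pred_a = [90_000, 180_000, 7_000_000]   # misses the expensive property by 30%
pred_b = [80_000, 160_000, 9_000_000]   # misses it by only 10%

print("MAPE:", mape(actual, pred_a), mape(actual, pred_b))
print("MSE: ", mse(actual, pred_a), mse(actual, pred_b))
```

Both prediction sets have the same MAPE, but the second has a much smaller MSE because it is closer on the one high-value property. This is consistent with the explanation offered above, though it does not by itself confirm it.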

Limitations and Suggestions

1. Since the dataset is enormous, random forest regression took too much time. When we tried to improve the model with hyperparameter tuning, our computational power was not enough. Our suggestion is either to explore other, unconventional models that may run faster with performance comparable to random forest, or to use a device with more computational power.

2. Our model can only predict market values of properties in the range 6800 to 1.4799e8, and only for numerical features within the following ranges: a.) Total livable area: 0 - 798189; b.) Total area: 600 - 99964; c.) Number of stories: 0 - 40; d.) Number of rooms: 0 - 154; e.) Number of bedrooms: 0 - 93; f.) Number of bathrooms: 0 - 84; g.) Year built: 1652 - 2020; h.) Sale year: 1918 - 2020.
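The ranges in limitation 2 could be enforced with a simple check before a prediction is requested. The sketch below is an assumption about how such a guard might look. The dictionary keys and the function name `out_of_range_features` are hypothetical and do not correspond to the dataset's actual column names.

```python
# Ranges quoted in limitation 2; the keys are hypothetical names chosen
# for this sketch, not the dataset's column names.
TRAINING_RANGES = {
    "total_livable_area": (0, 798189),
    "total_area": (600, 99964),
    "number_stories": (0, 40),
    "number_rooms": (0, 154),
    "number_bedrooms": (0, 93),
    "number_bathrooms": (0, 84),
    "year_built": (1652, 2020),
    "sale_year": (1918, 2020),
}

def out_of_range_features(record):
    """Return the names of features whose values fall outside the training range."""
    flagged = []
    for name, value in record.items():
        if name in TRAINING_RANGES:
            low, high = TRAINING_RANGES[name]
            if not low <= value <= high:
                flagged.append(name)
    return flagged

# Example: a listing built after the last year seen in training.
listing = {"total_livable_area": 1450, "number_bedrooms": 3, "year_built": 2023}
print(out_of_range_features(listing))  # ['year_built']
```

A non-empty result means the model would be extrapolating for that property, so its prediction should be treated with caution.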

Conclusion

From the model performance, we conclude that property agents can use this model to predict the market value of properties in Philadelphia based on their characteristics and location. Even though this model cannot determine the sentimental or historical value of a property, we hope that it can objectively and closely match the market value set by a professional property appraiser, thus reducing the cost of property appraisal. We also hope that a justified market value can improve the success of property transactions between seller and buyer.
