Skip to content

ekapope/Predicting-condominium-price-using-data-from-webscraping

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Predicting condominium price using data from web scraping

1. Data set and explanation web scraping process

This project uses Selenium library to firstly obtain all condominiums listed on the https://www.hipflat.com/ website, and extracts information for each page using BeautifulSoup package. Hipflat is one of the biggest property listing website in Thailand. This project is focused on condominium listings in Bangkok, both new and resale. Refer to below links for Python scripts.

001_Retrieve_all_urls.py

002_Scrape_info_for_each_condo.py

2. Pre-processing and data cleaning

Check NAs and data types for each column. Perform data manipulation by clean each column using regex, change numbers from strings to numeric, impute missing values, and convert lists of strings into columns. Refer to the link below.

003_data_cleaning_pred_current_price.py

3. Data scaling and hyperparameter tuning & ML

Robust Scaler is used in the pipeline before passing through the ML models. It uses a similar method to the Min-Max scaler but it instead uses the interquartile range, rather than the min-max, so that it is robust to outliers.

004_ML_pred_current_price.py

Three machine learning algorithms were used in the project

  1. Ridge
  2. RandomForestRegressor
  3. GradientBoostingRegressor

The results are shown below. Result table

Scatter plot the result of Gradient Boosting Regressor

Summary and suggestion for future improvement

Even this dataset is quite small with lots of features and we can only predict the price per square meters for each condo, however, this study is very useful for buyers, resellers, agents and even developers to justify the 'fair price' as a starting point based on the current actual market data.

In the web scraping step, we should acquire all listings available in each condo, not only average price per sqm. This should increase numerous numbers of records and it would be very useful to estimate the price for every single room in the future.

We dropped the name of public transports, supermarkets, restaurants, schools, hospitals from the basetable before feeding data to the models. With finer feature engineering and variable selections, it could help improve the predicting performance in the future.

Finally, we have scraped some quarterly historical prices but still did not use in this project since there were some unreliability issues in the data. It required some more detailed verification and data cleaning. This historical data can be really useful to visualize the trends for each condo/area (which areas are growing rapidly, which area are reaching plateau stage or declining).

For detail explanation, please refer to the PDF report.

About

Scrape Bangkok condominium listing from hipflat.com and compare ML performance

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published