This project uses the Selenium library to first collect all condominiums listed on the https://www.hipflat.com/ website, and then extracts information from each condominium page using the BeautifulSoup package. Hipflat is one of the biggest property listing websites in Thailand. This project focuses on condominium listings in Bangkok, both new and resale. Refer to the links below for the Python scripts.
002_Scrape_info_for_each_condo.py
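The extraction step can be sketched as below. This is only an illustration, not the project's actual script: the HTML fragment, the tag and class names, and the `parse_condo` helper are all hypothetical stand-ins for a page source that Selenium would supply (for example through `driver.page_source`).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a condo page fetched with Selenium.
# The tag and class names are made up and do not reflect the real Hipflat markup.
SAMPLE_HTML = """
<div class="listing">
  <h1 class="condo-name">Example Condo</h1>
  <span class="price-per-sqm">120,000 THB/sqm</span>
  <ul class="amenities"><li>Pool</li><li>Gym</li></ul>
</div>
"""


def parse_condo(html):
    """Extract a few fields from one condo page into a flat dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("h1", class_="condo-name").get_text(strip=True)
    price = soup.find("span", class_="price-per-sqm").get_text(strip=True)
    # Keep list-like fields as Python lists; they are expanded later in cleaning.
    amenities = [li.get_text(strip=True) for li in soup.select("ul.amenities li")]
    return {"name": name, "price_per_sqm": price, "amenities": amenities}


record = parse_condo(SAMPLE_HTML)
print(record)
```

Values are kept as raw strings at this stage so that all parsing decisions stay in the cleaning step.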
Check missing values (NAs) and data types for each column. Clean each column using regex, convert numbers stored as strings to numeric types, impute missing values, and convert lists of strings into columns. Refer to the link below.
003_data_cleaning_pred_current_price.py
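A minimal sketch of these cleaning steps with pandas follows. The column names, the sample values, the `clean` helper, and the choice of median imputation are assumptions made for illustration; the actual script may use different columns and a different imputation strategy.

```python
import pandas as pd

# Hypothetical raw records as they might come out of the scraper.
raw = pd.DataFrame({
    "price_per_sqm": ["120,000 THB", "95,500 THB", None],
    "floors": ["32 floors", "8 floors", "25 floors"],
    "amenities": [["Pool", "Gym"], ["Gym"], []],
})


def clean(df):
    out = pd.DataFrame(index=df.index)

    # Regex: strip everything except digits and the decimal point,
    # then convert the remaining string to a numeric type.
    for col in ["price_per_sqm", "floors"]:
        digits = df[col].str.replace(r"[^\d.]", "", regex=True)
        out[col] = pd.to_numeric(digits, errors="coerce")

    # Impute missing prices with the column median (an assumed strategy).
    out["price_per_sqm"] = out["price_per_sqm"].fillna(out["price_per_sqm"].median())

    # Convert the list of strings into one 0/1 indicator column per item.
    all_items = sorted({item for items in df["amenities"] for item in items})
    for item in all_items:
        out[f"amenity_{item}"] = df["amenities"].apply(
            lambda items, item=item: int(item in items)
        )
    return out


cleaned = clean(raw)
print(cleaned)
```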
A Robust Scaler is applied in the pipeline before the data is passed to the ML models. It works like the Min-Max scaler, but it centers each feature on its median and scales by the interquartile range rather than the min-max range, so it is robust to outliers.
- Ridge
- RandomForestRegressor
- GradientBoostingRegressor
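The scaling and modelling setup described above can be sketched with scikit-learn as follows. The synthetic data, the hyperparameters, and the 3-fold R² scoring are assumptions chosen only to keep the example self-contained; they are not the project's actual data or settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the cleaned base table (not the real condo data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[0, 0] = 50.0  # a deliberate outlier in the first feature
y = 3.0 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=200)

# RobustScaler subtracts the median and divides by the interquartile range,
# so the single outlier barely affects how the other rows are scaled.
scaled = RobustScaler().fit_transform(X)

# Hyperparameters here are illustrative defaults, not tuned values.
models = {
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    # The scaler sits inside the pipeline so it is refit on each training fold.
    pipe = make_pipeline(RobustScaler(), model)
    scores[name] = cross_val_score(pipe, X, y, cv=3, scoring="r2").mean()

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation fold into training.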
Although this dataset is quite small relative to its many features, and we can only predict the price per square meter for each condo, this study is still useful for buyers, resellers, agents and even developers to establish a 'fair price' as a starting point based on current actual market data.
In the web scraping step, we should acquire all listings available for each condo, not only the average price per sqm. This should greatly increase the number of records, and it would be very useful for estimating the price of every single room in the future.
We dropped the names of public transport, supermarkets, restaurants, schools, and hospitals from the base table before feeding the data to the models. Finer feature engineering and variable selection could help improve predictive performance in the future.
Finally, we scraped some quarterly historical prices but did not use them in this project because of reliability issues in the data, which needs more detailed verification and cleaning. This historical data could be really useful for visualizing the trends for each condo/area (which areas are growing rapidly, and which are reaching a plateau or declining).
For a detailed explanation, please refer to the PDF report.