This analysis was completed as part of the Udacity Data Science Nanodegree program. The dataset and business problems were chosen based on past experiences of visiting the city of Istanbul.
For a full list of all requirements to run the code in this repository, please see the requirements.txt file.
The CRISP-DM framework was adopted for this analysis project.
The project centered on using Airbnb data to analyse listings in the city of Istanbul. Three questions were identified that could be answered using data science methods:
- Which side of the city has cheaper accommodation?
- Which month of the year has the best deals on accommodation?
- What do customers say in accommodation reviews?
To answer the business questions above, open-source Airbnb data was extracted from http://insideairbnb.com. Listings data, calendar data and review data were taken from the latest version of the Istanbul data available on the website.
For the first two questions above, the raw data was prepared into a dataset called Data.Price_Data.csv. Likewise, the raw data was prepared for the final question and saved as Review_Data.csv.
Data analysis and modeling were applied to answer the business questions listed above. The relevant code files for these can be found in the Code folder of this project. Each file has been documented to show the steps involved in obtaining the end results.
The results of the analysis were evaluated and discussed in a Medium blog post, the link of which can be found in the Summary section below.
A small fraction of rows in the dataset had missing values either for the price of a listing or for review comments. These rows were removed from the dataset before the respective analysis, as they could not have contributed to its results.
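The row removal described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the repository's own code: the sample data, the column names `listing_id` and `price`, and the helper `drop_rows_missing` are all hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for the prepared price data. The column
# names "listing_id" and "price" are assumptions, not the real schema.
SAMPLE_CSV = """listing_id,price
101,250
102,
103,410
104,
"""


def drop_rows_missing(rows, column):
    """Return only the rows whose value in `column` is present and non-blank."""
    return [row for row in rows if (row.get(column) or "").strip()]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clean_rows = drop_rows_missing(rows, "price")

# Share of rows that were dropped, useful for confirming it is small.
removed_fraction = 1 - len(clean_rows) / len(rows)
print(f"kept {len(clean_rows)} of {len(rows)} rows")
```

The same helper could be pointed at a review-comment column for the review data. Checking `removed_fraction` before continuing is one way to confirm that only a small share of rows is lost.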
The first two questions were analysed and answered using statistical analysis and data visualisation methods. The final question was analysed and answered using NLP and unsupervised learning methods, in particular TF-IDF, K-means and Word2Vec.
Analysis was performed on open-source Airbnb data for the city of Istanbul. Three questions were considered as part of the analysis:
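To give a feel for the TF-IDF step mentioned above, here is a small sketch that weights the terms in a few invented review sentences. It uses the plain textbook formula, term frequency times log(N / document frequency). The function `tf_idf` and the sample reviews are assumptions made for illustration, and the repository's code may well rely on a library instead. Libraries such as scikit-learn apply a smoothed IDF by default, so their numbers would differ. K-means and Word2Vec are not shown.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented review snippets, not taken from the Istanbul dataset.
reviews = [
    "great location great host",
    "clean flat great location",
    "noisy street but clean flat",
]
scores = tf_idf(reviews)

# Highest-weighted term of the first review.
top_term_first_review = max(scores[0], key=scores[0].get)
print(top_term_first_review)
```

Terms that occur in many reviews are down-weighted, so a word unique to one review can outrank a word that is repeated within it but also common elsewhere. Vectors built this way are the kind of input a clustering method such as K-means could then group.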
- Which side of the city has cheaper accommodation?
- Which month of the year has the best deals on accommodation?
- What do customers say in accommodation reviews?
Findings from the analysis can be found in the Medium blog post linked below.
This repository contains the code that was used to develop the analysis for the Medium blog post https://medium.com/@d.grewal1503/how-to-choose-your-accommodation-on-your-next-airbnb-getaway-7e0711eae51b.
Some pieces of code have been copied and adapted for use from other published resources such as https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483 and https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34. Where this is the case, it has been stated in the docstring of the relevant functions.