Predicting Next Booking Destinations for Airbnb Users

A Classification Project

Note: The business problem is fictitious, although both the company and the data are real.

The in-depth Python code explanation is available in this Jupyter Notebook.

1. Airbnb and Business Problem

Airbnb is an online marketplace for short-term homestays, and its business model consists of charging a commission on each booking. To better understand its customers' behavior and their most desired booking locations, Airbnb hired a Data Scientist to predict the five most likely countries in which a USA user will make their next booking. Airbnb provided data from over 200 thousand users, split into two datasets (more information in Section 2), so that predictions could be made for around 61 thousand users. There are 12 possible outcomes for the destination country: 'USA', 'France', 'Canada', 'Great Britain', 'Spain', 'Italy', 'Portugal', 'Netherlands', 'Germany' and 'Australia', as well as 'NDF' (which means there wasn't a booking) and 'other countries'.

2. Data Overview

The data was split into users data and sessions data, the latter being web browsing information. The initial feature descriptions are available below:

Users

Feature	Definition
id	user id
date_account_created	the date of account creation
timestamp_first_active	timestamp of the first activity
date_first_booking	date of first booking
gender	user's gender
age	user's age
signup_method	method of signing up e.g. facebook, google
signup_flow	the page a user came to sign up from
language	international language preference
affiliate_channel	what kind of paid marketing
affiliate_provider	where the marketing is e.g. google, craigslist
first_affiliate_tracked	first marketing the user interacted with
signup_app	signup app e.g. Web, Android
first_device_type	first device type used e.g. Windows, iPhone, Android
first_browser	first browser used e.g. Chrome, Firefox, Safari
country_destination	target variable

Sessions

Feature	Definition
user_id	same as 'id' in users table
action	action performed e.g. show, search_results
action_type	action type performed e.g. view, click
action_detail	action detail e.g. confirm_email_link
device_type	device used on each action
secs_elapsed	the time between two recorded actions
The data was collected from Kaggle.

3. Assumptions

Out of 'action', 'action_type' and 'action_detail', only 'action_type' was kept, since the three are highly correlated and seem to represent similar events. 'action_type' was chosen because it has only 28 unique values, unlike 'action' and 'action_detail', which have hundreds; this made encoding easier later.
Missing values in 'first_affiliate_tracked' were replaced with 'untracked', the most logical replacement in this case.
Missing values in 'age' were replaced with the median age.
'date_first_booking' was dropped, since it doesn't exist in the new users dataset.
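
The missing-value rules above can be sketched in pandas. This is only a minimal illustration on a toy table, not the notebook's actual code: the function name `apply_assumptions` and the toy values are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def apply_assumptions(users: pd.DataFrame) -> pd.DataFrame:
    """Apply the missing-value rules from Section 3 to a users table."""
    out = users.copy()
    # A missing marketing source becomes the explicit 'untracked' category.
    out["first_affiliate_tracked"] = out["first_affiliate_tracked"].fillna("untracked")
    # Missing ages are replaced with the median of the observed ages.
    out["age"] = out["age"].fillna(out["age"].median())
    # 'date_first_booking' is absent from the new users data, so drop it if present.
    out = out.drop(columns=["date_first_booking"], errors="ignore")
    return out


# Toy data (hypothetical values, for illustration only).
toy = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "age": [30.0, np.nan, 40.0],
        "first_affiliate_tracked": ["linked", None, "omg"],
        "date_first_booking": ["2014-01-02", None, None],
    }
)
clean = apply_assumptions(toy)
print(clean)
```

In a real pipeline the median would presumably be computed on the training data and reused for new users, so that both datasets are imputed with the same value.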

4. Solution Plan

4.1. How was the problem solved?

To predict the five most likely countries for a USA user to make their next booking, the following steps were performed:

Understanding the Business Problem: Understanding the main objective we are trying to achieve and planning the solution to it.
Collecting Data: Collecting data from Kaggle.
Data Cleaning: Checking data types and NaNs, plus other tasks such as renaming columns, dealing with outliers, fixing missing values and changing data types.
Feature Engineering: Creating new features from the original ones, so that they could be used in the ML model. The full list of new features with their definitions is available here.
Exploratory Data Analysis (EDA): Exploring the data in order to gain business experience, look for data inconsistencies, find useful business insights and identify important features for the ML model. This process is split into Univariate, Bivariate (Checking Hypotheses) and Multivariate Analysis. The univariate analysis was done using the Pandas Profiling library. The report is available for download here. The top business insights found are available in Section 5.
Data Preparation: Applying Rescaling Techniques to the data, as well as Encoding Methods to deal with categorical variables.
Feature Selection: Selecting the best features to use in the ML model by using Random Forest.
Machine Learning Modeling and Model Evaluation: Training Classification Algorithms. The best model was selected to be improved via Bayesian Optimization with Optuna. More information in Section 6.
Model Deployment and Results: Providing a list of the five most likely destination predictions for 61 thousand USA Airbnb users, as well as graphical analysis of the predictions by age, by gender and overall. This is the project's Data Science Product, and it can be accessed from anywhere in a Streamlit App. In addition, if data from new users comes in, it's easy to get new predictions, as a Flask application was built and deployed on Render Cloud. More information in Section 7.
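
The last step turns a classifier's per-class probabilities into a ranked list of five destinations per user. A minimal sketch of that conversion is shown below, assuming the probabilities come as a matrix with one row per user and one column per class (as scikit-learn style `predict_proba` methods return). The helper `top_k_labels`, the class labels and the probability values are hypothetical.

```python
import numpy as np


def top_k_labels(proba: np.ndarray, classes: np.ndarray, k: int = 5) -> np.ndarray:
    """Return, for each row, the k class labels with the highest probability, best first."""
    # Sorting the negated matrix in ascending order yields descending probabilities.
    order = np.argsort(-proba, axis=1, kind="stable")[:, :k]
    return classes[order]


# Hypothetical class labels and probabilities for two users.
classes = np.array(["NDF", "US", "other", "FR", "IT", "GB"])
proba = np.array(
    [
        [0.50, 0.30, 0.08, 0.06, 0.04, 0.02],
        [0.10, 0.60, 0.05, 0.15, 0.03, 0.07],
    ]
)
top5 = top_k_labels(proba, classes)
print(top5)
```

Each output row is ordered from most to least likely, which is the ordering a ranking metric such as NDCG evaluates.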

4.2. Tools and techniques used:

5. Top Business Insights

1st - Users take less than 2 days, on average, from their first activity on the platform to creating an account, considering all destinations.

2nd - The number of accounts created goes up during the spring.

3rd - Women made over 15% more bookings to countries other than the USA than men did.

6. Machine Learning Models

Initially, seven models were trained using cross-validation, so that we could provide predictions of the five most likely countries for a US Airbnb user's next booking: Logistic Regression, Decision Tree, Random Forest, Extra Trees, AdaBoost, XGBoost and Light GBM.

The initial cross-validation performance of all seven algorithms is displayed below:

Model	NDCG at K
Light GBM	0.8496 +/- 0.0006
XGBoost	0.8482 +/- 0.0004
Random Forest	0.8451 +/- 0.0006
AdaBoost	0.8429 +/- 0.0019
Extra Trees	0.8390 +/- 0.0008
Logistic Regression	0.8377 +/- 0.0010
Decision Tree	0.7242 +/- 0.0023

Where K is equal to 5, given our business problem.

Light GBM was chosen as the final model, since it's fast to train and tune, while also being the one with the best result without any tuning. In addition, it's much better suited for deployment, as it's much lighter than XGBoost or Random Forest, for instance, which matters especially because we're using a free deployment cloud. More information in Section 7.

Instead of using cross-validation, which uses only the training dataset, we tuned the model's hyperparameters by comparing its performance on the test dataset, which was split off before Data Preparation to avoid Data Leakage. After tuning LGBM's hyperparameters using Bayesian Optimization with Optuna, the model's performance improved, as expected:

Before Tuning

Model	NDCG at K
Light GBM	0.8514

Final Model

Model	NDCG at K
Light GBM	0.8542

Metrics Definition and Interpretation

As the goal of this project is to predict not only the most likely next booking destination for each user but the five most likely ones, the Normalized Discounted Cumulative Gain (NDCG) at rank K was chosen as the metric.

NDCG at K “measures the performance of a recommendation system based on the graded relevance of the recommended entities. It varies from 0.0 to 1.0, with 1.0 representing the ideal ranking of the entities.” Therefore, in this case (where K equals 5), it measures not only how well we can predict the five most likely next booking locations for each user, but also how well we can rank them from the most likely to the least.
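
To make the metric concrete, here is a minimal sketch of NDCG at K under the assumption that each user has exactly one true destination. In that case the ideal DCG is 1, so the score for a user reduces to 1 / log2(position + 1), where position is the 1-based rank of the true destination in the predicted list, and 0 if it is not among the top K. The function names and example labels are hypothetical and not taken from the notebook.

```python
import math
from typing import Sequence


def ndcg_at_k(true_label: str, ranked_preds: Sequence[str], k: int = 5) -> float:
    """NDCG@k for one user with a single relevant label (so the ideal DCG is 1)."""
    for position, label in enumerate(ranked_preds[:k], start=1):
        if label == true_label:
            # The gain of 1 is discounted by the log of the rank at which it appears.
            return 1.0 / math.log2(position + 1)
    return 0.0  # The true label is not among the top k predictions.


def mean_ndcg_at_k(
    true_labels: Sequence[str], ranked_preds: Sequence[Sequence[str]], k: int = 5
) -> float:
    """Average the per-user NDCG@k over all users."""
    scores = [ndcg_at_k(t, p, k) for t, p in zip(true_labels, ranked_preds)]
    return sum(scores) / len(scores)


# Hypothetical example: hits at rank 1, rank 2 and a miss.
truth = ["US", "FR", "IT"]
preds = [
    ["US", "NDF", "other", "FR", "GB"],
    ["NDF", "FR", "US", "other", "IT"],
    ["NDF", "US", "other", "FR", "GB"],
]
print(mean_ndcg_at_k(truth, preds))
```

A correct destination in first place scores 1.0, the same destination in second place scores about 0.63, and a destination missing from the top five scores 0, which is how the metric rewards good ordering as well as coverage.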

7. Model Deployment and Results

The model deployment was performed in three steps:

Step 1: The original data (both datasets in Section 2) was saved in a PostgreSQL Database from Neon.tech.
Step 2: A Flask application was built and deployed on Render Cloud. It extracts the original data from that PostgreSQL Database, cleans and transforms the data, loads the saved ML model, creates predictions for each user and writes these predictions to a different table in the same Database. Let's name this table 'df_pred' for the sake of the explanation.
Step 3: Streamlit retrieves the df_pred data from the Database and displays it in a filterable table, where you can find the five most likely destination predictions for the 61 thousand USA Airbnb users. In addition, graphical analyses of the predictions were built, split by age, by gender and overall. This is the project's Data Science Product, and it can be accessed from anywhere in a Streamlit App.
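
The read, predict and write-back flow of Step 2 can be sketched as follows. This is a rough illustration only: an in-memory SQLite database stands in for the PostgreSQL Database, `fake_top5` is a placeholder for the real cleaning, transformation and model steps, and the `users` table name and `pred_1` to `pred_5` column names are assumptions. Only the 'df_pred' table name comes from the description above.

```python
import sqlite3

import pandas as pd


def fake_top5(users: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for 'clean, transform, load model, predict' (hypothetical)."""
    preds = pd.DataFrame({"id": users["id"]})
    # A real model would rank destinations per user; here every user gets the same list.
    for rank, country in enumerate(["NDF", "US", "other", "FR", "IT"], start=1):
        preds[f"pred_{rank}"] = country
    return preds


def refresh_predictions(conn: sqlite3.Connection) -> int:
    """Read the raw users, predict, and overwrite the df_pred table. Returns rows written."""
    users = pd.read_sql("SELECT * FROM users", conn)
    preds = fake_top5(users)
    preds.to_sql("df_pred", conn, if_exists="replace", index=False)
    return len(preds)


# In-memory SQLite database standing in for the PostgreSQL Database.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": ["a", "b"], "age": [30, 41]}).to_sql("users", conn, index=False)

n_written = refresh_predictions(conn)
stored = pd.read_sql("SELECT * FROM df_pred", conn)
print(n_written, list(stored.columns))
```

Keeping the predictions in their own table means the dashboard only has to read precomputed results, which fits the observation that the Streamlit App stays fast even when the prediction service is slow.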

Click on the respective icon to access the link

Streamlit App	Flask App

The Flask App is particularly useful when new data comes in, as we can get new predictions with the click of a button, which can later be retrieved by the Streamlit App. The Streamlit App code is available here and the Flask App code can be seen here.

Because the deployment was made on a free cloud (Render Cloud), the Flask App may be slow. On the other hand, the main deployment product, the Streamlit App, should work quickly.

8. Conclusion

In this project the main objective was accomplished:

We managed to provide a list of the five most likely destination predictions for 61 thousand USA Airbnb users, as well as graphical analysis of the predictions by age, by gender and overall. All of this can be found in a Streamlit App, for better visualization. A Flask application was also built for when new data comes in, making it possible to get new predictions easily. In addition, three interesting and useful insights were found through Exploratory Data Analysis (EDA), which Airbnb can put to use.

9. Next Steps

This solution could be further improved with a few strategies:

Creating even more features from the existing ones.
Trying other classification algorithms, such as Neural Networks.
Using a paid Cloud, such as AWS.

Contact

brunodifranco99@gmail.com

