A Classification Project
Note: The business problem is fictitious, although both the company and the data are real.
The in-depth Python code explanation is available in this Jupyter Notebook.
Airbnb is an online marketplace for short-term homestays, and its business model consists of charging a commission for each booking. To better understand its customers' behaviors and most desired booking locations, a Data Scientist was hired to predict the five most likely countries for a USA user to make their next booking. Airbnb provided data from over 200 thousand users, split into two different datasets (more information in Section 2), so that predictions could be made for around 61 thousand users. There are 12 possible outcomes for the destination country: 'USA', 'France', 'Canada', 'Great Britain', 'Spain', 'Italy', 'Portugal', 'New Zealand', 'Germany' and 'Australia', as well as 'NDF' (which means there wasn't a booking) and 'other countries'.
The data was split into users data and sessions data, the latter being web browsing information. The initial feature descriptions are available below:
The data was collected from Kaggle.
- Out of 'action', 'action_type' and 'action_detail', only 'action_type' was kept, since the three are highly correlated and seem to represent similar events. 'action_type' was chosen because it has only 28 unique values, unlike 'action' and 'action_detail', which have hundreds; this made encoding easier later.
- Missing values in 'first_affiliate_tracked' were replaced with 'untracked', the most logical replacement in this case.
- Missing values in 'age' were replaced with the median age.
- 'date_first_booking' was dropped since it doesn't exist in the new users dataset.
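The cleaning decisions above can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the tiny frames, the `id` and `user_id` columns, the sample values and the helper names `clean_users` and `clean_sessions` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames standing in for the users and sessions datasets.
users = pd.DataFrame({
    "id": ["u1", "u2", "u3", "u4"],
    "age": [25.0, np.nan, 40.0, 31.0],
    "first_affiliate_tracked": ["linked", None, "omg", None],
    "date_first_booking": ["2014-01-05", None, None, "2014-03-02"],
})
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "action": ["search", "show", "index"],
    "action_type": ["click", "view", "view"],
    "action_detail": ["view_search_results", "p3", "view_search_results"],
})


def clean_users(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the users-side decisions listed above."""
    out = df.copy()
    # Missing affiliate tracking becomes its own 'untracked' category.
    out["first_affiliate_tracked"] = out["first_affiliate_tracked"].fillna("untracked")
    # Missing ages are filled with the median of the observed ages.
    out["age"] = out["age"].fillna(out["age"].median())
    # This column does not exist for new users, so it cannot be a feature.
    return out.drop(columns=["date_first_booking"])


def clean_sessions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper keeping only 'action_type' of the three action columns."""
    return df.drop(columns=["action", "action_detail"])


users_clean = clean_users(users)
sessions_clean = clean_sessions(sessions)
print(users_clean)
print(sessions_clean)
```

In a real pipeline the median would be computed on the training data only and reused for new users, so that the same fill value is applied at prediction time.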
To predict the five most likely countries for a USA user to make their next booking, the following steps were performed:
- Understanding the Business Problem: Understanding the main objective we are trying to achieve and planning the solution to it.
- Collecting Data: Collecting data from Kaggle.
- Data Cleaning: Checking data types and NaNs, plus other tasks such as renaming columns, dealing with outliers, fixing missing values and changing data types.
- Feature Engineering: Creating new features from the original ones, so that they could be used in the ML model. The full list of new features with their definitions is available here.
- Exploratory Data Analysis (EDA): Exploring the data in order to gain business experience, look for data inconsistencies, find useful business insights and identify important features for the ML model. This process is split into Univariate, Bivariate (Checking Hypotheses) and Multivariate Analysis. The univariate analysis was done using the Pandas Profiling library. The report is available for download here. The top business insights found are available in Section 5.
- Data Preparation: Applying rescaling techniques to the data, as well as encoding methods to deal with categorical variables.
- Feature Selection: Selecting the best features to use in the ML model by using Random Forest.
- Machine Learning Modeling and Model Evaluation: Training classification algorithms. The best model was selected to be improved via Bayesian Optimization with Optuna. More information in Section 6.
- Model Deployment and Results: Providing a list of the five most likely destination predictions for 61 thousand USA Airbnb users, as well as graphical analysis of the predictions by age, by gender and overall. This is the project's Data Science Product, and it can be accessed from anywhere in a Streamlit App. In addition, if data from new users comes in, it's easy to get new predictions, as a Flask application was built and deployed on Render Cloud. More information in Section 7.
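As an illustration of the Feature Selection step, the sketch below ranks features by a Random Forest's impurity-based importances and keeps those above the mean importance. The README does not say how the forest was used, so the importance-based rule, the threshold, the synthetic data and the feature names are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one informative feature and two pure-noise features.
rng = np.random.default_rng(0)
n_samples = 300
signal = rng.normal(size=n_samples)
X = np.column_stack([signal, rng.normal(size=n_samples), rng.normal(size=n_samples)])
y = (signal > 0).astype(int)  # the target depends only on the first column
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Importances sum to 1, so the mean importance is 1 / n_features.
importances = forest.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
threshold = 1.0 / len(feature_names)
selected = [name for name, importance in ranked if importance > threshold]

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
print("selected:", selected)
```

Impurity-based importances tend to favor high-cardinality features, so on real data it can be worth cross-checking the ranking with permutation importance.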
- Python 3.10.9, Pandas, Matplotlib, Seaborn and scikit-learn.
- SQL and PostgreSQL.
- Jupyter Notebook and VSCode.
- Flask and Render Cloud.
- Streamlit.
- Git and GitHub.
- Exploratory Data Analysis (EDA).
- Techniques for Feature Selection.
- Classification Algorithms (Logistic Regression, Decision Tree, Random Forest, ExtraTrees, AdaBoost, XGBoost and LGBM Classifiers).
- Cross-Validation Methods, Bayesian Optimization with Optuna and Performance Metrics (NDCG at rank K).
Initially, seven models were trained using cross-validation to predict the five most likely countries for a US Airbnb user's next booking: Logistic Regression, Decision Tree, Random Forest, Extra Trees, AdaBoost, XGBoost and Light GBM.
The initial cross-validation performance of all seven algorithms is displayed below:
Model | NDCG at K |
---|---|
Light GBM | 0.8496 +/- 0.0006 |
XGBoost | 0.8482 +/- 0.0004 |
Random Forest | 0.8451 +/- 0.0006 |
AdaBoost | 0.8429 +/- 0.0019 |
Extra Trees | 0.8390 +/- 0.0008 |
Logistic Regression | 0.8377 +/- 0.0010 |
Decision Tree | 0.7242 +/- 0.0023 |
Where K is equal to 5, given our business problem.
Light GBM was chosen as the final model, since it's fast to train and tune, while also giving the best result without any tuning. In addition, it's much better suited for deployment, as it's much lighter than XGBoost or Random Forest, for instance, which matters because we're using a free deployment cloud. More information in Section 7.
Instead of using cross-validation, which uses only the training dataset, we tuned the model's hyperparameters by comparing its performance on the test dataset, which was split off before Data Preparation to avoid data leakage. After tuning LGBM's hyperparameters using Bayesian Optimization with Optuna, the model's performance improved, as expected:
[Figures: model performance, Before Tuning vs. Final Model]
As the goal of this project is to predict not only the most likely next booking destination for each user, but the five most likely ones, the Normalized Discounted Cumulative Gain (NDCG) at rank K was chosen as the performance metric.
NDCG at K “measures the performance of a recommendation system based on the graded relevance of the recommended entities. It varies from 0.0 to 1.0, with 1.0 representing the ideal ranking of the entities.” Therefore, in this case (where K equals 5), it measures not only how well we can predict the five most likely next booking locations for each user, but also how well we can rank them from the most likely to the least.
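To make the metric concrete, here is a small worked sketch of NDCG at K for a single user. It assumes binary relevance with exactly one true destination per user (relevance 1 for the true country, 0 for every other), in which case the ideal DCG is 1 and the score reduces to 1 / log2(position + 1). The function name `ndcg_at_k` and the sample ranking are illustrative only.

```python
import math


def ndcg_at_k(ranked_countries, true_country, k=5):
    """NDCG at K when exactly one country is relevant (binary relevance)."""
    for position, country in enumerate(ranked_countries[:k], start=1):
        if country == true_country:
            # DCG is 1 / log2(position + 1); the ideal DCG (hit at position 1) is 1.
            return 1.0 / math.log2(position + 1)
    # The true country is not among the top k predictions.
    return 0.0


# Hypothetical top-5 prediction for one user, most likely first.
predictions = ["NDF", "USA", "other countries", "France", "Italy"]

scores = {
    "hit_first": ndcg_at_k(predictions, "NDF"),
    "hit_second": ndcg_at_k(predictions, "USA"),
    "hit_fifth": ndcg_at_k(predictions, "Italy"),
    "miss": ndcg_at_k(predictions, "Spain"),
}
for case, score in scores.items():
    print(f"{case}: {score:.4f}")
```

A correct first guess scores 1.0, a correct second guess about 0.63, and a miss scores 0.0; the reported model score would then be the mean over all users.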
The model deployment was performed in three steps:
- Step 1: The original data (both datasets in Section 2) was saved in a PostgreSQL database hosted on Neon.tech.
- Step 2: A Flask application was built and deployed on Render Cloud. It extracts the original data from that PostgreSQL database, cleans and transforms the data, loads the saved ML model, creates predictions for each user and writes these predictions to a different table in the same database. Let's name this table 'df_pred' for the sake of the explanation.
- Step 3: Streamlit retrieves the df_pred data from the database and displays it in a filterable table, where you can find the five most likely destination predictions for the 61 thousand USA Airbnb users. In addition, graphical analyses of the predictions were built, split by age, by gender and overall. This is the project's Data Science Product, and it can be accessed from anywhere in a Streamlit App.
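The prediction part of Step 2 can be sketched as turning a matrix of class probabilities into a top-five table shaped like 'df_pred'. This is only a sketch: the probabilities are made up, and the column names `user_id` and `destination_1` to `destination_5`, as well as the helper `top_k_table`, are assumptions. In a real run the matrix would typically come from the model's `predict_proba` and the labels from its `classes_` attribute.

```python
import numpy as np
import pandas as pd

# Made-up class labels and probabilities for two users (each row sums to 1).
classes = np.array(["USA", "France", "Italy", "NDF", "Spain", "other countries"])
user_ids = ["u1", "u2"]
proba = np.array([
    [0.30, 0.05, 0.02, 0.50, 0.03, 0.10],
    [0.10, 0.40, 0.20, 0.15, 0.08, 0.07],
])


def top_k_table(user_ids, proba, classes, k=5):
    """Hypothetical helper: one row per user, k most likely destinations in order."""
    # Sort each row by descending probability and keep the first k column indices.
    order = np.argsort(-proba, axis=1, kind="stable")[:, :k]
    columns = [f"destination_{rank}" for rank in range(1, k + 1)]
    table = pd.DataFrame(classes[order], columns=columns)
    table.insert(0, "user_id", user_ids)
    return table


df_pred = top_k_table(user_ids, proba, classes)
print(df_pred)
```

A table like this could then be written back to the database (for example with `DataFrame.to_sql`) for the Streamlit App to read.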
The Flask App is particularly useful when new data comes in, as we can get new predictions with the click of a button, which can later be retrieved by the Streamlit App. The Streamlit App code is available here and the Flask App code can be seen here.
Because the deployment was made on a free cloud (Render Cloud), the Flask App may be slow. On the other hand, the main deployment product, the Streamlit App, should work quickly.
In this project the main objective was accomplished:
We managed to provide a list of the five most likely destination predictions for 61 thousand USA Airbnb users, as well as graphical analysis of the predictions by age, by gender and overall. This can all be found in a Streamlit App, for better visualization. Also, a Flask application was built for when new data comes in, making it possible to get new predictions easily. In addition, three interesting and useful insights were found through Exploratory Data Analysis (EDA), so that they can be put to use by Airbnb.
Further on, this solution could be improved by a few strategies:
- Creating even more features from the existing ones.
- Trying other classification algorithms, such as Neural Networks.
- Using a paid Cloud, such as AWS.