brunodifranco/project-outleto-clustering


Customer Loyalty Program for E-commerce

A Clustering Project


Note: The company and business problem are both fictitious, although the data is real.

The in-depth Python code explanation is available in this Jupyter Notebook.

1. Outleto and Business Problem

Outleto is a multibrand outlet company, meaning it sells second-line products of various companies at lower prices through an E-commerce platform. Outleto's Marketing Team noticed that some customers tend to buy more expensive products, in higher quantities and more frequently than others, therefore contributing a higher share of Outleto's total gross revenue. Because of that, the Marketing Team wishes to launch a customer loyalty program, dividing the 5,702 customers into clusters, with the best customers placed in a cluster named Insiders.

To achieve this goal, the Data Science Team was requested to provide a list of customers that will participate in Insiders, as well as a business report regarding the clusters, answering the following questions:

1) How many customers will be a part of Insiders?
2) How many clusters were created?
3) How are the customers distributed amongst the clusters?
4) What are these customers' main features?
5) What's the gross revenue percentage coming from Insiders? And what about other clusters?
6) How many items were purchased by each cluster?

With that list and report the Marketing Team will promote actions tailored to each cluster in order to increase revenue, focusing mostly on the Insiders cluster.

2. Data Overview

The data was collected from Kaggle in CSV format. The initial feature descriptions are available below:

| Feature | Definition |
|---------|------------|
| InvoiceNo | A 6-digit integral number uniquely assigned to each transaction |
| StockCode | Product (item) code |
| Description | Product (item) name |
| Quantity | The quantity of each product (item) per transaction |
| InvoiceDate | The day each transaction was generated |
| UnitPrice | Unit price (product price per unit) |
| CustomerID | Customer number (unique id assigned to each customer) |
| Country | The name of the country where each customer resides |
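As a sketch, the CSV can be loaded with pandas and renamed to the snake_case column names used throughout the project; the two sample rows below are illustrative, not taken from the dataset:

```python
import io
import pandas as pd

# Two illustrative rows mimicking the Kaggle CSV layout (not real dataset rows)
raw = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART,6,2010-12-01,2.55,17850,United Kingdom\n"
    "536366,71053,WHITE METAL LANTERN,6,2010-12-01,3.39,17850,United Kingdom\n"
)

df = pd.read_csv(raw, parse_dates=["InvoiceDate"],
                 dtype={"InvoiceNo": str, "StockCode": str})

# Rename to the snake_case names used in the rest of the project
df = df.rename(columns={
    "InvoiceNo": "invoice_no", "StockCode": "stock_code",
    "Description": "description", "Quantity": "quantity",
    "InvoiceDate": "invoice_date", "UnitPrice": "unit_price",
    "CustomerID": "customer_id", "Country": "country",
})
```

Keeping InvoiceNo and StockCode as strings matters because some codes contain letters (e.g. "85123A") and would otherwise fail numeric parsing.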

3. Business Assumptions

  • All observations with unit_price <= 0 were removed: rows with unit_price = 0 are assumed to be gifts, and rows with unit_price < 0 are described as "Adjust bad debt".
  • Some stock_code identifications weren't actual products, so they were removed.
  • Both the description and country columns were removed, since they aren't relevant for modelling.
  • Customer number 16446 was removed because they bought 80,995 items and returned them all on the same day, leading to extraordinary values in other features. Twelve other customers were removed because they returned all items bought. In addition, three other users were removed as data inconsistencies, since their return quantities were greater than the quantities of items bought, which doesn't make sense. These 16 were named "bad users".
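A minimal sketch of these filters in pandas, assuming the snake_case column names; the toy rows, the non-product code set and the bad-user set below are illustrative subsets, not the full lists from the project:

```python
import pandas as pd

# Toy frame with illustrative values (customer 16446 is one of the real "bad users")
df = pd.DataFrame({
    "customer_id": [1, 2, 16446, 3],
    "stock_code":  ["85123A", "POST", "23843", "71053"],
    "unit_price":  [2.55, 0.0, 1.04, -11062.06],
})

# 1) Drop gifts (unit_price = 0) and "Adjust bad debt" entries (unit_price < 0)
df = df[df["unit_price"] > 0]

# 2) Drop stock codes that aren't actual products (illustrative subset of codes)
non_products = {"POST", "D", "M", "DOT", "BANK CHARGES"}
df = df[~df["stock_code"].isin(non_products)]

# 3) Drop the 16 "bad users" identified during cleaning (illustrative subset)
bad_users = {16446}
df = df[~df["customer_id"].isin(bad_users)]
```

After the three filters only customer 1 survives in this toy example.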

4. Solution Plan

4.1. How was the problem solved?

To provide the clusters final report the following steps were performed:

  • Understanding the Business Problem: Understanding the main objective we are trying to achieve and planning the solution to it.

  • Collecting Data: Collecting data from Kaggle.

  • Data Cleaning: Checking data types, treating NaNs, renaming columns, dealing with outliers and filtering the data.

  • Feature Engineering: Creating new features from the original ones, so that those could be used in the ML model. More information in Section 5.

  • Exploratory Data Analysis (EDA): Exploring the data in order to obtain business experience, look for data inconsistencies, useful business insights and find important features for the ML model. This was done by using the Pandas Profiling library. Two EDA profile reports are available for download here, one still with the bad users and one without them.

  • Data Preparation: Applying Rescaling Techniques in the data.

  • Feature Selection: Selecting the best features to use in the Machine Learning algorithm.

  • Space Analysis and Dimensionality Reduction: PCA, UMAP and Tree-Based Embedding were used to get a better data separation.

  • Machine Learning Modeling: Selecting the number of clusters (K) and then training Clustering Algorithms. More information in Section 6.

  • Model Evaluation: Evaluating the model by using Silhouette Score and Silhouette Visualization.

  • Cluster Exploratory Data Analysis: Exploring the clusters to obtain business experience and to find useful business insights. In addition, this step also helped build the business report. The top business insights found are available in Section 7.

  • Final Report and Deployment: Providing a business report regarding the clusters, as well as a list of customers that will participate in Insiders. This report was built using Power BI, as well as Render Cloud and Google Drive, so that it could be accessed from anywhere. More information in Section 8.
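The Data Preparation step above can be sketched with scikit-learn; MinMaxScaler is one common rescaling technique (the project may apply different rescalers per feature), and the feature matrix below is illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative matrix: gross_revenue and recency_days for three customers
X = np.array([[5000.0,  2.0],
              [ 300.0, 90.0],
              [1200.0, 30.0]])

# MinMaxScaler rescales each feature (column) to the [0, 1] range,
# so distance-based clustering isn't dominated by the larger-scale feature
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```

Without rescaling, gross revenue (in the thousands) would dwarf recency (in days) in any Euclidean distance computation.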

4.2. Tools and techniques used:

5. Feature Engineering

In total, 10 new features were created by using the original ones:

| New Feature | Definition |
|-------------|------------|
| Gross Revenue | Gross revenue for each customer, equal to quantity times unit price |
| Average Ticket | Average monetary value spent per purchase |
| Recency Days | Days from the current date to the customer's last purchase |
| Max Recency Days | Maximum time a customer has gone without making any purchases |
| Frequency | Average number of purchases made by each customer between their first and latest purchase |
| Purchases Quantity | Number of times a customer has made a purchase |
| Quantity of Products | Total number of products purchased |
| Quantity of Items | Total quantity of items purchased |
| Returns | Number of items returned |
| Purchased and Returned Difference | Natural log of the difference between items purchased and items returned |
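A sketch of how a few of these features could be derived with a pandas groupby, assuming the snake_case columns; the purchase rows and the snapshot date are illustrative:

```python
import pandas as pd

# Toy purchase data (illustrative values)
df = pd.DataFrame({
    "customer_id":  [1, 1, 2],
    "quantity":     [10, 5, 2],
    "unit_price":   [2.0, 4.0, 50.0],
    "invoice_date": pd.to_datetime(["2011-11-01", "2011-12-01", "2011-10-15"]),
})
df["gross_revenue"] = df["quantity"] * df["unit_price"]

# Reference date for recency: the latest date in the data
snapshot = df["invoice_date"].max()

features = df.groupby("customer_id").agg(
    gross_revenue=("gross_revenue", "sum"),
    recency_days=("invoice_date", lambda d: (snapshot - d.max()).days),
    quantity_of_items=("quantity", "sum"),
).reset_index()
```

Here customer 1 ends up with a gross revenue of 40.0 and a recency of 0 days, while customer 2 has 100.0 and 47 days.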

6. Machine Learning Modeling

In order to get better data separation, a few dimensionality reduction techniques were tested: PCA, UMAP and Tree-Based Embedding. The results were satisfactory with Tree-Based Embedding, which consists of:

  • Setting gross_revenue as a response variable, so it becomes a supervised learning problem.
  • Training a Random Forest (RF) model to predict gross_revenue using all other features.
  • Plotting the embedding based on RF's leaves.
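A minimal sketch of the leaf-based embedding idea with scikit-learn; synthetic data stands in for the customer features, and in the project a reducer such as UMAP is then run on the leaf matrix to obtain a 2-D embedding:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))             # stand-in for the customer features
y = 3 * X[:, 0] + rng.normal(size=200)    # stand-in for gross_revenue (response)

# Train a Random Forest to predict the response from all other features
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Each customer is represented by the index of the leaf it lands in, per tree
leaves = rf.apply(X)   # shape: (n_customers, n_trees)
```

Customers whose feature profiles lead to similar gross revenue tend to fall into the same leaves, so distances in the leaf space reflect business similarity.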

In total four clustering algorithms were tested, for a cluster number varying from 2 to 24:

  • K-Means
  • Gaussian Mixture Models (GMM) with Expectation–Maximization (EM)
  • Agglomerative Hierarchical Clustering (HC)
  • DBSCAN

The models were evaluated by silhouette score, as well as by cluster visualization. The maximum cluster number was set to 11 for practical purposes, so that Outleto's Marketing Team can come up with exclusive actions for each cluster. DBSCAN had its parameters optimized with Bayesian Optimization; however, because it produced a very high number of clusters, it was withdrawn as a possible final model. The results were quite similar for K-Means, GMM and HC with 8 to 11 clusters; however, K-Means was chosen with 8 clusters because its silhouette score was slightly better, at 0.6168. The 8 clusters were named Insiders, Runners Up, Promising, Potentials, Need Attention, About to Sleep, At Risk and About to Lose.
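A sketch of this selection loop using scikit-learn on synthetic blob data; the real run operates on the tree-based embedding, and GMM and HC would be evaluated analogously:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedded customer data
X, _ = make_blobs(n_samples=500, centers=8, cluster_std=0.8, random_state=42)

# Evaluate silhouette score for each candidate K, capped at 11 clusters
scores = {}
for k in range(2, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

The K with the highest silhouette score is kept, subject to the practical cap on the number of clusters.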

Metric Definition

There are two properties we look for when creating clusters:

  • Compactness: Observations from the same cluster must be compact with one another, meaning the distance between them should be as small as possible.

  • Separation: Different clusters must be as far apart from one another as possible.

Silhouette Score covers both of these properties. It ranges from -1 to 1; the higher, the better.

7. Top Business Insights

  • 1st - Customers from Insiders are responsible for 58.3% of the total items purchased.



  • 2nd - Customers from Insiders are responsible for 53.5% of the total gross revenue.



  • 3rd - Customers from Insiders have a number of returns higher than other customers, on average.


That's probably because those customers buy a very high number of items.


8. Final Report and Deployment

The final business report was built using Power BI and contains answers to the questions previously posed by Outleto's Marketing Team. This is how the report was built:

  • Firstly, a new PostgreSQL database was created in Render Cloud.
  • Then, the final data containing all customers already classified in their respective clusters was saved in this PostgreSQL database.
  • Next, the database was connected to Power BI, making it possible to create the final business report, which was then saved in PDF format.
  • Finally, the report was uploaded to Google Drive, so it could be shared.
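The database steps above can be sketched with pandas and SQLAlchemy; an in-memory SQLite engine stands in here for the Render Cloud PostgreSQL database, and the customer rows are illustrative:

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative final data: each customer with its assigned cluster
clusters = pd.DataFrame({
    "customer_id": [12346, 12347],
    "cluster":     ["Insiders", "At Risk"],
})

# In the project the URI points to the Render Cloud PostgreSQL instance,
# e.g. "postgresql://user:password@host/dbname"; SQLite stands in here
engine = create_engine("sqlite://")
clusters.to_sql("customer_clusters", engine, if_exists="replace", index=False)

# Power BI (or any consumer) can then read the classified customers back
loaded = pd.read_sql("SELECT * FROM customer_clusters", engine)
```

With `if_exists="replace"`, re-running the pipeline overwrites the table with the latest cluster assignments.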
Click below to access the report
Power BI

The list of customers that made it to Insiders was saved in CSV format and is available here.

The complete list of Outleto's customers is also available for download here.

9. Conclusion

In this project the main objective was accomplished:

We managed to provide a business report using Power BI, containing answers to the questions previously posed by Outleto's Marketing Team, as well as a list of customers eligible to be a part of Insiders. With that report the Marketing Team will promote actions tailored to each cluster in order to increase revenue, focusing mostly on the Insiders cluster, since it represents 53.5% of the total gross revenue. In addition, some useful business insights were found.

10. Next Steps

Further on, this solution could be improved by a few strategies:

  • Requesting more data from Outleto, such as product associated data.

  • Creating even more features.

  • Making the final report even more automatic for when new data comes in, so it could be run every time requested and the data instantly saved in the PostgreSQL database. This could be done by using the Papermill library.

Contact