brunodifranco/project-outleto-clustering


Customer Loyalty Program for E-commerce

A Clustering Project


Note: The company and business problem are both fictitious, although the data is real.

The in-depth Python code explanation is available in this Jupyter Notebook.

1. Outleto and Business Problem

Outleto is a multibrand outlet company, meaning it sells second-line products of various companies at lower prices through an E-commerce platform. Outleto's Marketing Team noticed that some customers tend to buy more expensive products, in higher quantities and more frequently than others, therefore contributing a higher share of Outleto's total gross revenue. Because of that, the Marketing Team wishes to launch a customer loyalty program, dividing the 5,702 customers into clusters, with the best customers placed in a cluster named Insiders.

To achieve this goal, the Data Science Team was requested to provide a list of customers that will participate in Insiders, as well as a business report regarding the clusters, answering the following questions:

1) How many customers will be a part of Insiders?
2) How many clusters were created?
3) How are the customers distributed amongst the clusters?
4) What are these customers' main features?
5) What's the gross revenue percentage coming from Insiders? And what about other clusters?
6) How many items were purchased by each cluster?

With that list and report the Marketing Team will promote actions tailored to each cluster in order to increase revenue, focusing mostly on the Insiders cluster.

2. Data Overview

The data was collected from Kaggle in CSV format. The initial feature descriptions are available below:

| Feature | Definition |
|---------|------------|
| InvoiceNo | A 6-digit integral number uniquely assigned to each transaction |
| StockCode | Product (item) code |
| Description | Product (item) name |
| Quantity | The quantity of each product (item) per transaction |
| InvoiceDate | The day each transaction was generated |
| UnitPrice | Unit price (product price per unit) |
| CustomerID | Customer number (unique id assigned to each customer) |
| Country | The name of the country where each customer resides |
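As a sketch, the CSV can be loaded with pandas and renamed to the snake_case column names used throughout the project; the two sample rows below are illustrative, not taken from the dataset:

```python
import io
import pandas as pd

# Two illustrative rows mimicking the Kaggle CSV layout (not real dataset rows)
raw = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART,6,2010-12-01,2.55,17850,United Kingdom\n"
    "536366,71053,WHITE METAL LANTERN,6,2010-12-01,3.39,17850,United Kingdom\n"
)

df = pd.read_csv(raw, parse_dates=["InvoiceDate"],
                 dtype={"InvoiceNo": str, "StockCode": str})

# Rename to the snake_case names used in the rest of the project
df = df.rename(columns={
    "InvoiceNo": "invoice_no", "StockCode": "stock_code",
    "Description": "description", "Quantity": "quantity",
    "InvoiceDate": "invoice_date", "UnitPrice": "unit_price",
    "CustomerID": "customer_id", "Country": "country",
})
```

Keeping InvoiceNo and StockCode as strings matters because some codes contain letters (e.g. "85123A") and would otherwise fail numeric parsing.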

3. Business Assumptions

  • All observations with unit_price <= 0 were removed: rows with unit_price = 0 are assumed to be gifts, and rows with unit_price < 0 are described as "Adjust bad debt".
  • Some stock_code identifications weren't actual products, so they were removed.
  • Both the description and country columns were removed, since they aren't relevant for modelling.
  • Customer number 16446 was removed because they bought 80,995 items and returned them all on the same day, leading to extraordinary values in other features. Twelve other customers were removed because they returned all items bought. In addition, three other users were removed as data inconsistencies, since their return quantities were greater than the quantities of items bought, which doesn't make sense. These 16 were named "bad users".
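A minimal sketch of these filters in pandas, assuming the snake_case column names; the toy rows, the non-product code set and the bad-user set below are illustrative subsets, not the full lists from the project:

```python
import pandas as pd

# Toy frame with illustrative values (customer 16446 is one of the real "bad users")
df = pd.DataFrame({
    "customer_id": [1, 2, 16446, 3],
    "stock_code":  ["85123A", "POST", "23843", "71053"],
    "unit_price":  [2.55, 0.0, 1.04, -11062.06],
})

# 1) Drop gifts (unit_price = 0) and "Adjust bad debt" entries (unit_price < 0)
df = df[df["unit_price"] > 0]

# 2) Drop stock codes that aren't actual products (illustrative subset of codes)
non_products = {"POST", "D", "M", "DOT", "BANK CHARGES"}
df = df[~df["stock_code"].isin(non_products)]

# 3) Drop the 16 "bad users" identified during cleaning (illustrative subset)
bad_users = {16446}
df = df[~df["customer_id"].isin(bad_users)]
```

After the three filters only customer 1 survives in this toy example.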

4. Solution Plan

4.1. How was the problem solved?

To provide the clusters final report the following steps were performed:

  • Understanding the Business Problem: Understanding the main objective we are trying to achieve and planning the solution to it.

  • Collecting Data: Collecting data from Kaggle.

  • Data Cleaning: Checking data types, treating NaNs, renaming columns, dealing with outliers and filtering the data.

  • Feature Engineering: Creating new features from the original ones, so that those could be used in the ML model. More information in Section 5.

  • Exploratory Data Analysis (EDA): Exploring the data in order to obtain business experience, look for data inconsistencies, useful business insights and find important features for the ML model. This was done by using the Pandas Profiling library. Two EDA profile reports are available for download here, one still with the bad users and one without them.

  • Data Preparation: Applying Rescaling Techniques in the data.

  • Feature Selection: Selecting the best features to use in the Machine Learning algorithm.

  • Space Analysis and Dimensionality Reduction: PCA, UMAP and Tree-Based Embedding were used to get a better data separation.

  • Machine Learning Modeling: Selecting the number of clusters (K) and then training Clustering Algorithms. More information in Section 6.

  • Model Evaluation: Evaluating the model by using Silhouette Score and Silhouette Visualization.

  • Cluster Exploratory Data Analysis: Exploring the clusters to obtain business experience and to find useful business insights. In addition, this step also helped build the business report. The top business insights found are available in Section 7.

  • Final Report and Deployment: Providing a business report regarding the clusters, as well as a list of customers that will participate in Insiders. This report was built using Power BI, as well as Render Cloud and Google Drive, so that it could be accessed from anywhere. More information in Section 8.
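The Data Preparation step above can be sketched with scikit-learn; MinMaxScaler is one common rescaling technique (the project may apply different rescalers per feature), and the feature matrix below is illustrative:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative matrix: gross_revenue and recency_days for three customers
X = np.array([[5000.0,  2.0],
              [ 300.0, 90.0],
              [1200.0, 30.0]])

# MinMaxScaler rescales each feature (column) to the [0, 1] range,
# so distance-based clustering isn't dominated by the larger-scale feature
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```

Without rescaling, gross revenue (in the thousands) would dwarf recency (in days) in any Euclidean distance computation.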

4.2. Tools and techniques used:

5. Feature Engineering

In total, 10 new features were created by using the original ones:

| New Feature | Definition |
|-------------|------------|
| Gross Revenue | Gross revenue for each customer, equal to quantity times unit price |
| Average Ticket | Average monetary value spent per purchase |
| Recency Days | Days from the current date to the customer's last purchase |
| Max Recency Days | Maximum time a customer has gone without making any purchases |
| Frequency | Average number of purchases made by each customer between their first and latest purchase |
| Purchases Quantity | Number of times a customer has made a purchase |
| Quantity of Products | Total number of products purchased |
| Quantity of Items | Total quantity of items purchased |
| Returns | Number of items returned |
| Purchased and Returned Difference | Natural log of the difference between items purchased and items returned |
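A sketch of how a few of these features could be derived with a pandas groupby, assuming the snake_case columns; the purchase rows and the snapshot date are illustrative:

```python
import pandas as pd

# Toy purchase data (illustrative values)
df = pd.DataFrame({
    "customer_id":  [1, 1, 2],
    "quantity":     [10, 5, 2],
    "unit_price":   [2.0, 4.0, 50.0],
    "invoice_date": pd.to_datetime(["2011-11-01", "2011-12-01", "2011-10-15"]),
})
df["gross_revenue"] = df["quantity"] * df["unit_price"]

# Reference date for recency: the latest date in the data
snapshot = df["invoice_date"].max()

features = df.groupby("customer_id").agg(
    gross_revenue=("gross_revenue", "sum"),
    recency_days=("invoice_date", lambda d: (snapshot - d.max()).days),
    quantity_of_items=("quantity", "sum"),
).reset_index()
```

Here customer 1 ends up with a gross revenue of 40.0 and a recency of 0 days, while customer 2 has 100.0 and 47 days.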

6. Machine Learning Modeling

In order to get better data separation, a few dimensionality reduction techniques were tested: PCA, UMAP and Tree-Based Embedding. The results were satisfactory with Tree-Based Embedding, which consists of:

  • Setting gross_revenue as a response variable, so it becomes a supervised learning problem.
  • Training a Random Forest (RF) model to predict gross_revenue using all other features.
  • Plotting the embedding based on RF's leaves.
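A minimal sketch of the leaf-based embedding idea with scikit-learn; synthetic data stands in for the customer features, and in the project a reducer such as UMAP is then run on the leaf matrix to obtain a 2-D embedding:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))             # stand-in for the customer features
y = 3 * X[:, 0] + rng.normal(size=200)    # stand-in for gross_revenue (response)

# Train a Random Forest to predict the response from all other features
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Each customer is represented by the index of the leaf it lands in, per tree
leaves = rf.apply(X)   # shape: (n_customers, n_trees)
```

Customers whose feature profiles lead to similar gross revenue tend to fall into the same leaves, so distances in the leaf space reflect business similarity.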

In total four clustering algorithms were tested, for a cluster number varying from 2 to 24:

  • K-Means
  • Gaussian Mixture Models (GMM) with Expectation–Maximization (EM)
  • Agglomerative Hierarchical Clustering (HC)
  • DBSCAN

The models were evaluated by silhouette score, as well as by cluster visualization. The maximum cluster number was set to 11 for practical purposes, so that Outleto's Marketing Team can come up with exclusive actions for each cluster. DBSCAN had its parameters optimized with Bayesian Optimization; however, because it produced a very high number of clusters, it was withdrawn as a possible final model. The results were quite similar for K-Means, GMM and HC with 8 to 11 clusters; however, K-Means was chosen with 8 clusters because its silhouette score was slightly better, at 0.6168. The 8 clusters were named Insiders, Runners Up, Promising, Potentials, Need Attention, About to Sleep, At Risk and About to Lose.
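A sketch of this selection loop using scikit-learn on synthetic blob data; the real run operates on the tree-based embedding, and GMM and HC would be evaluated analogously:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedded customer data
X, _ = make_blobs(n_samples=500, centers=8, cluster_std=0.8, random_state=42)

# Evaluate silhouette score for each candidate K, capped at 11 clusters
scores = {}
for k in range(2, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

The K with the highest silhouette score is kept, subject to the practical cap on the number of clusters.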

Metric Definition

There are two properties we look for when creating clusters:

  • Compactness: Observations from the same cluster must be compact with one another, meaning the distance between them should be as small as possible.

  • Separation: Different clusters must be as far apart from one another as possible.

Silhouette Score covers both of these properties. It ranges from -1 to 1; the higher, the better.

7. Top Business Insights

  • 1st - Customers from Insiders are responsible for 58.3% of the total items purchased.



  • 2nd - Customers from Insiders are responsible for 53.5% of the total gross revenue.



  • 3rd - Customers from Insiders have a number of returns higher than other customers, on average.


That's probably because those customers buy a very high number of items.


8. Final Report and Deployment

The final business report was built using Power BI and contains answers to the questions previously posed by Outleto's Marketing Team. This is how the report was built:

  • Firstly, a new PostgreSQL database was created in Render Cloud.
  • Then, the final data containing all customers already classified in their respective clusters was saved in this PostgreSQL database.
  • Next, the database was connected to Power BI, making it possible to create the final business report, which was then saved in PDF format.
  • Finally, the report was uploaded to Google Drive, so it could be shared.
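The database steps above can be sketched with pandas and SQLAlchemy; an in-memory SQLite engine stands in here for the Render Cloud PostgreSQL database, and the customer rows are illustrative:

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative final data: each customer with its assigned cluster
clusters = pd.DataFrame({
    "customer_id": [12346, 12347],
    "cluster":     ["Insiders", "At Risk"],
})

# In the project the URI points to the Render Cloud PostgreSQL instance,
# e.g. "postgresql://user:password@host/dbname"; SQLite stands in here
engine = create_engine("sqlite://")
clusters.to_sql("customer_clusters", engine, if_exists="replace", index=False)

# Power BI (or any consumer) can then read the classified customers back
loaded = pd.read_sql("SELECT * FROM customer_clusters", engine)
```

With `if_exists="replace"`, re-running the pipeline overwrites the table with the latest cluster assignments.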
Click below to access the report
Power BI

The list of customers that made it to Insiders was saved in CSV format and is available here.

The complete list of Outleto's customers is also available for download here.

9. Conclusion

In this project the main objective was accomplished:

We managed to provide a business report using Power BI, containing answers to the questions previously posed by Outleto's Marketing Team, as well as a list of customers eligible to be a part of Insiders. With that report the Marketing Team will promote actions tailored to each cluster in order to increase revenue, focusing mostly on the Insiders cluster, since it represents 53.5% of the total gross revenue. In addition, some useful business insights were found.

10. Next Steps

Further on, this solution could be improved by a few strategies:

  • Requesting more data from Outleto, such as product associated data.

  • Creating even more features.

  • Making the final report even more automatic for when new data comes in, so it could be run every time requested and the data instantly saved in the PostgreSQL database. This could be done by using the Papermill library.

Contact