Project: Optimizing a Wholesale Wine Business by Segmenting Customer Behavior using Machine Learning (K-means), powered by Tableau
This project is based on a book called Data Smart, written by John Foreman, head of product for MailChimp. This book is an excellent business analytics primer that walks you through a variety of machine learning use cases, complete with sample data sets and detailed instructions on how to set up and solve each case in Excel.
In one of these cases, John describes a fictitious wholesale wine business looking to optimize sales by segmenting customers (restaurant and liquor store owners) into groups based on buying behavior. This is a very common problem faced by all kinds of businesses, from large retailers like Walmart and Target to your friendly neighborhood liquor store.
The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent to customers) and transaction level data from customers. The transactional data shows which offer customers responded to, and what the customer ended up buying. The data is presented as an Excel workbook containing two worksheets. Each worksheet contains a different dataset.
The first dataset (WineKMC.xlsx, sheet 0) contains information about each offer, such as the month it is in effect and several attributes of the wine that the offer refers to: the variety, minimum quantity, discount, country of origin, and whether or not it is past peak.
The second dataset (WineKMC.xlsx, sheet 1) contains transactional data: which offer each customer responded to.
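As a first step, the two worksheets can be read into pandas DataFrames. The sketch below is minimal and hedged: the sheet indices follow the description above, while the column names assigned here are illustrative assumptions (the workbook's raw headers may differ).

```python
import pandas as pd

# Offers worksheet (sheet 0): one row per e-mail offer
df_offers = pd.read_excel("WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]

# Transactions worksheet (sheet 1): one row per (customer, offer) response
df_transactions = pd.read_excel("WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]

print(df_offers.head())
print(df_transactions.head())
```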
We're trying to learn more about how our customers behave, so we can use their behavior (whether or not they purchased something based on an offer) as a way to group like-minded customers together. We can then study those groups to look for patterns and trends that can help us formulate future offers.
The first thing we need is a way to compare customers. To do this, we're going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer.
- Merge `df_transactions` and `df_offers` on `offer_id`.
- Make a pivot table to show a matrix with customer names as the row index and offer numbers as column headers, and fill NaNs with zeros (a sketch of both steps follows below).
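A minimal sketch of the merge and pivot, assuming the `df_offers` and `df_transactions` DataFrames and the column names from the loading sketch above:

```python
# Flag each transaction with a 1 so the pivot can record responses
df_transactions["n"] = 1

# Merge transactions with offer details on offer_id
df = pd.merge(df_transactions, df_offers, on="offer_id")

# Customer x offer matrix: 1 if the customer responded to the offer, 0 otherwise
matrix = df.pivot_table(index="customer_name", columns="offer_id", values="n", fill_value=0)
print(matrix.shape)  # roughly 100 customers x 32 offers
```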
Recall that in K-Means Clustering we want to maximize the distance between centroids and minimize the distance between data points and the respective centroid for the cluster they are in. True evaluation for unsupervised learning would require labeled data; however, we can use a variety of intuitive metrics to try to pick the number of clusters K. We will introduce two methods: the Elbow method and the Silhouette method.
- Choosing $K$: The Elbow Method
- Distortion: the average of the squared distances from each data point to the center of its assigned cluster. Typically, the Euclidean distance metric is used.
- Inertia: the sum of squared distances of samples to their closest cluster center. It can be calculated in two ways (see the sketch after this list):
  - using the `inertia_` attribute of the fitted KMeans model; or
  - computing the sum-of-squares error (SSE) in each cluster against $K$ directly, i.e., for each data point, the squared distance to the center (centroid) of the cluster to which the data point was assigned.
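A minimal sketch of both elbow computations, assuming `X` is the customer-by-offer matrix built above (the variable names are assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

X = matrix.values  # customer x offer 0/1 matrix

distortions, inertias = [], []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Distortion: average squared distance from each point to its nearest centroid (via SciPy's cdist)
    d = cdist(X, km.cluster_centers_, "euclidean").min(axis=1)
    distortions.append(np.mean(d ** 2))
    # Inertia: sum of squared distances to the closest centroid, as reported by scikit-learn
    inertias.append(km.inertia_)

# Plot distortions and inertias against K_range and look for the "elbow" where the curve flattens.
```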
- Choosing $K$: The Silhouette Method
There exists another method that measures how well each data point $x_i$ "fits" its assigned cluster and also how poorly it fits into other clusters. This is a different way of looking at the same objective. Denote $a_{x_i}$ as the average distance from $x_i$ to all other points within its own cluster $k$; the lower the value, the better. On the other hand, $b_{x_i}$ is the minimum average distance from $x_i$ to points in a different cluster, minimized over clusters: that is, compute separately for each cluster the average distance from $x_i$ to the points within that cluster, and then take the minimum. The silhouette coefficient for $x_i$ is then $s_{x_i} = \frac{b_{x_i} - a_{x_i}}{\max(a_{x_i}, b_{x_i})}$.
The silhouette score is computed on every data point in every cluster. It ranges from -1 (a poor clustering) to +1 (a very dense clustering), with 0 denoting the situation where clusters overlap. Some criteria for the silhouette coefficient are provided in the table below (a code sketch for computing it follows the table):
Range | Interpretation |
---|---|
0.71 - 1.0 | A strong structure has been found. |
0.51 - 0.7 | A reasonable structure has been found. |
0.26 - 0.5 | The structure is weak and could be artificial. |
< 0.25 | No substantial structure has been found. |
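A minimal sketch of computing the silhouette score for a range of K with scikit-learn, again assuming `X` is the customer-by-offer matrix from earlier:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sil_scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sil_scores[k] = silhouette_score(X, labels)

# Higher is better; compare against the interpretation table above
for k, score in sorted(sil_scores.items(), key=lambda kv: -kv[1]):
    print(f"K={k}: silhouette={score:.3f}")
```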
- The Elbow Method using Distortion from SciPy confirms that the elbow point is at k=3, so the best K will be 4 (the plot starts descending much more slowly after k=3).
- The Elbow Method using Inertia from KMeans confirms that the elbow point is at k=3, so the best K will be 4 (the plot starts descending much more slowly after k=3).
- The Elbow Method using the Sum of Squares (SSE) confirms that the elbow point is at k=3, so the best K will be 4 (the plot starts descending much more slowly after k=3).
- With the Silhouette Method, there does not appear to be a clear way to find the best K, so it is not straightforward to choose the number of clusters this way. However, from the scores above we can see that k=10, 3, and 9 are our best options for now (closest to 1 = best silhouette score, but the silhouette scores are relatively small, suggesting we have a weak structure!).
How do we visualize clusters? If we only had two features, we could likely plot the data as is. But we have 100 data points, each containing 32 features (dimensions). Principal Component Analysis (PCA) will help us reduce the dimensionality of our data from 32 to something lower. For a visualization on the coordinate plane, we will use 2 dimensions. In this exercise, we're going to use it to transform our multi-dimensional dataset into a 2-dimensional dataset.
This is only one use of PCA for dimension reduction. We can also use PCA when we want to perform regression but we have a set of highly correlated variables. PCA untangles these correlations into a smaller number of features/predictors all of which are orthogonal (not correlated). PCA is also used to reduce a large set of variables into a much smaller one.
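A minimal sketch of the PCA projection and cluster plot, assuming `X` and the chosen K=4 from above (the plotting details are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Cluster in the original 32-dimensional space, then project to 2D only for plotting
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
X_2d = PCA(n_components=2, random_state=42).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("K-means clusters (K=4) projected onto 2 principal components")
plt.show()
```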
- Customer Behavior – Unsupervised Machine Learning, K-means Clustering (K=4)
  - “Customer Group 1”: Represents customers who love and appreciate both French Red Wine and Sparkling Wine (89% consume >= 72 Min Qty), specifically Cabernet Sauvignon & Champagne.
  - “Customer Group 2”: Represents customers who love Sparkling Wine (75% consume >= 72 Min Qty), mainly Champagne, but in general they enjoy French Sparkling Wine.
  - “Customers from both Groups 1 & 2”: More than 75% of those who consume >= 72 Min Qty love Cabernet Sauvignon & Champagne, especially when there are high Discounts, since they're heavy drinkers.
  - “Customer Group 3”: Represents customers who are light wine drinkers (almost 100% of them consume = 6 Min Qty) but still appreciate French Wine in general (sparkling, red, and some white).
  - “Customer Group 4”: Represents customers who love Red Wine, specifically Pinot Noir. These customers will buy Pinot Noir regardless of whether there's a big Discount or not, and they don't care about the origin either. They're just Pinot Noir lovers!
- High-Yield Segments that need our attention for their growth potential
  - “Customers from both Groups 1 & 2”: Focus group of customers to increase sales by introducing more Discounts on Cabernet Sauvignon & Champagne.
  - “Customer Group 3”: Focus group of customers to increase revenue by introducing more wine varieties, because these customers are not wine-specific, are not settled yet, and would try more or new varieties.
  - “Customer Group 4”: Focus group of customers to increase revenue by increasing the Pinot Noir price!
- Affinity propagation
- Spectral clustering
- Agglomerative clustering
- DBSCAN
k-means is only one of a ton of clustering algorithms. Below is a brief description of several clustering algorithms, and the table provides references to the other clustering algorithms in scikit-learn.
- Affinity Propagation does not require the number of clusters $K$ to be known in advance! AP uses a "message passing" paradigm to cluster points based on their similarity.
- Spectral Clustering uses the eigenvalues of a similarity matrix to reduce the dimensionality of the data before clustering in a lower-dimensional space. This is tangentially similar to what we did to visualize k-means clusters using PCA. The number of clusters must be known a priori.
- Ward's Method applies to hierarchical clustering. Hierarchical clustering algorithms take a set of data and successively divide the observations into more and more clusters at each layer of the hierarchy. Ward's method is used to determine when two clusters in the hierarchy should be combined into one. It is basically an extension of hierarchical clustering. Hierarchical clustering is divisive, that is, all observations are part of the same cluster at first, and at each successive iteration, the clusters are made smaller and smaller. With hierarchical clustering, a hierarchy is constructed, and there is not really the concept of "number of clusters." The number of clusters simply determines how low or how high in the hierarchy we reference and can be determined empirically or by looking at the dendrogram.
- Agglomerative Clustering is similar to hierarchical clustering but is not divisive, it is agglomerative. That is, every observation is placed into its own cluster and at each iteration or level of the hierarchy, observations are merged into fewer and fewer clusters until convergence. Similar to hierarchical clustering, the constructed hierarchy contains all possible numbers of clusters and it is up to the analyst to pick the number by reviewing statistics or the dendrogram.
- DBSCAN is based on point density rather than distance. It groups together points with many nearby neighbors. DBSCAN is one of the most cited algorithms in the literature. It does not require knowing the number of clusters a priori, but does require specifying the neighborhood size (see the comparison sketch below).
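As a rough way to compare these alternatives on our data, one could fit each scikit-learn estimator and score the resulting labels with the silhouette coefficient. The sketch below assumes `X` from earlier; the parameter values (n_clusters=4, and the DBSCAN eps and min_samples) are illustrative guesses rather than tuned settings.

```python
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.metrics import silhouette_score

models = {
    "Affinity Propagation": AffinityPropagation(random_state=42),
    "Spectral Clustering": SpectralClustering(n_clusters=4, random_state=42),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=4, linkage="ward"),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=3),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in set(labels) else 0)  # DBSCAN marks noise as -1
    if n_clusters > 1:
        print(f"{name}: {n_clusters} clusters, silhouette = {silhouette_score(X, labels):.3f}")
    else:
        print(f"{name}: {n_clusters} cluster(s), silhouette undefined")
```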
How do their results compare? Which performs the best? Tell a story why you think it performs the best.
Method name | Parameters | Scalability | Use Case | Geometry (metric used) |
---|---|---|---|---|
K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances | Any pairwise distance |
DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
Birch | branching factor, threshold, optional global clusterer. | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction. | Euclidean distance between points |
(Cluster visualizations: Affinity Propagation Clustering, Agglomerative Clustering, Spectral Clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN).)