clustfeatimp is a module for measuring feature importance for any clustering method.
The aim of this project was to create a tool for measuring feature importance for any clustering method.
The idea is simple. By providing the ClusteringExplainer object with the data used for clustering and the resulting cluster labels, we can construct a multiclass classifier. The classifier learns the dependencies in the data, and a side effect of that learning is a list of variables with their significance.
More specifically:
The idea was to turn an unsupervised learning problem into a supervised one that is easy to interpret (for example, with tree-based methods). This implementation uses the XGBoost model.
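The transformation described above can be sketched in a few lines. This is only an illustration of the general idea, not the module's own code: it assumes scikit-learn is installed, uses RandomForestClassifier as a stand-in for XGBoost, and builds a made-up dataset with two informative and two noise features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Two informative features (well-separated blobs) plus two pure-noise features.
X_info, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)
rng = np.random.default_rng(0)
X = np.hstack([X_info, rng.normal(size=(300, 2))])

# Step 1: unsupervised clustering produces the labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: a supervised classifier learns to reproduce those labels.
# RandomForestClassifier is a stand-in here; the module itself uses XGBoost.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Step 3: the fitted classifier exposes one importance score per feature.
feature_names = ["f0", "f1", "noise0", "noise1"]  # hypothetical names
for name, score in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```

The informative features should receive most of the importance, because they are the ones that separate the clusters.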
Segmentation models often produce unbalanced clusters, so the classifier uses a parameter that controls the balance of the class weights. As a result, we do not have to worry that our clusters differ in the number of observations.
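To make the effect of class weighting concrete, here is a small sketch of one common scheme: weight each cluster by n_samples / (n_clusters * cluster_size), the same heuristic scikit-learn uses for class_weight="balanced". Whether the module uses exactly this formula is an assumption; the function name and the labels below are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {cluster: weight} with weight = n_samples / (n_clusters * cluster_size)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_clusters = len(counts)
    return {c: n_samples / (n_clusters * size) for c, size in counts.items()}


# Hypothetical cluster assignment: cluster 0 is six times larger than cluster 2.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(labels)

# One weight per observation, usable wherever a classifier accepts sample weights.
sample_weight = [weights[c] for c in labels]

for cluster, weight in sorted(weights.items()):
    print(cluster, round(weight, 3))
```

With these weights every cluster contributes the same total weight to the training loss, regardless of how many observations it contains.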
Since the XGBoost model has many hyperparameters that must be set before training, I used a simple Bayesian hyperparameter optimization with a small number of iterations (which you can define yourself). This lets you find a good set of hyperparameters for the classifier quickly.
It is also possible to skip the hyperparameter optimization process. This speeds up the algorithm, but the default set of parameters does not always give good results, so I recommend using Bayesian optimization (fit_hiperparams = True) even with a small number of iterations (5 by default).
The structure of the XGBoost model makes it possible to measure the significance of variables. This implementation uses the Gain measure.
The clustfeatimp module also allows you to validate the quality of the created classifier using the balanced_accuracy_score and confusion_matrix.
How to get the feature importances for any clustering method?
- Perform clustering with any method.
- Use ClusteringExplainer to measure feature importances.
- Get the results 😃
Below are plots showing two-dimensional relationships between the clusters for the two most important variables and two irrelevant variables.
In the left plot, the two most important variables give a clear division into 3 segments.
With the two irrelevant variables, no clear separation into 3 segments is possible.
- Use pip to install the module from GitHub:
pip install -e git+https://github.com/msoczi/clustfeatimp#egg=clustfeatimp
- Run Python and import the module:
import clustfeatimp as cfi
import clustfeatimp as cfi
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Create dataset
X, _ = make_blobs(n_samples=300, centers=5, n_features=2, random_state=7)
# Clustering with KMeans
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
# Assign cluster values to the variable y
y = kmeans.labels_
# Create ClusteringExplainer object and fit to the data
clust_explnr = cfi.ClusteringExplainer()
clust_explnr.fit(X, y)
# Feature importance for clustering variables
print('--- Feature Importance for KMeans clustering ---')
print(clust_explnr.feature_importance)
# Plot with feature importance
clust_explnr.plot_importances()
plt.show()
# Plot the clusters in 2D, colored by cluster label
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel('f0')
plt.ylabel('f1')
plt.show()
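The validation metrics mentioned earlier, balanced_accuracy_score and confusion_matrix, are scikit-learn functions. The sketch below calls them directly on hand-made labels to show what they report; it does not use clustfeatimp's own validation interface, and the y_true and y_pred values are invented for illustration.

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Hypothetical cluster labels (y_true) and classifier predictions (y_pred).
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2]

# Rows are true clusters, columns are predicted clusters.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Balanced accuracy is the mean of the per-cluster recalls: (3/4 + 2/2 + 1/1) / 3.
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(round(bal_acc, 3))
```

Because each cluster's recall counts equally, a small cluster that is classified badly lowers the score as much as a large one would, which is why this metric suits unbalanced clusters.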
Here is a notebook with an example.
Mateusz Soczewka - msoczewkas@gmail.com
Thank you for any comments. 😃