Potentially confusing behavior/documentation when using Kmeans clustering with spherical=True #2466

FrankPortman · 2022-09-13T19:11:11Z

When using spherical=True during Kmeans clustering, the centroids are normalized after each iteration. The documentation can be found here: https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization

spherical: perform spherical k-means -- the centroids are L2 normalized after each iteration

The same page also says:

The values of the objective function (total square error in case of k-means) along iterations is stored in the variable kmeans.obj, and more extensive statistics in kmeans.iteration_stats.

However, under the hood KMeans with spherical=True we see it setting an InnerProduct index in the Python bindings:

faiss/faiss/python/__init__.py

Line 1661 in 1e4586a

if self.cp.spherical:

and later maximizing the sum of inner products (where the code correctly handles that a higher inner product is better) here:

faiss/faiss/Clustering.cpp

Line 537 in 26abede

if ((lower_is_better && obj < best_obj) ||

To summarize the behavior, if you set spherical=True the objective that faiss Kmeans is optimizing and printing out is the sum of inner products of vectors to centroids (would be a sum of cosine sims if the vectors themselves were normed too).

I think Spherical KMeans is potentially synonymous with clustering via cosine/dot similarity as the distance metric, but it is ambiguous enough within this library that several in my org have been tripped up by this. In particular the objective going up for spherical=True could stand to be more clear in documentation.

Would you be open to some documentation and/or code (perhaps printing something?) changes to make this more clear? If you give me a few pointers on how you might prefer this to be cleared up I can make the change and put up a PR.

The text was updated successfully, but these errors were encountered:

mdouze · 2022-09-14T07:51:04Z

Hmm right, the problem is that the behaviour is different between the Python Kmeans object and the C++ Clustering object (that does not automatically set the INNER_PRODUCT index).
Since I think it would be quite disruptive to change the behavior, we should indeed make this very clear in the docstrings and wiki. Marking this as an enhancement, feel free to submit a PR if you wish.

mdouze added documentation enhancement labels Sep 14, 2022

asadoughi added the Implementation label Jul 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potentially confusing behavior/documentation when using Kmeans clustering with spherical=True #2466

Potentially confusing behavior/documentation when using Kmeans clustering with spherical=True #2466

FrankPortman commented Sep 13, 2022

mdouze commented Sep 14, 2022 •

edited

Loading

Potentially confusing behavior/documentation when using Kmeans clustering with spherical=True #2466

Potentially confusing behavior/documentation when using Kmeans clustering with spherical=True #2466

Comments

FrankPortman commented Sep 13, 2022

mdouze commented Sep 14, 2022 • edited Loading

mdouze commented Sep 14, 2022 •

edited

Loading