Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potentially confusing behavior/documentation when using Kmeans clustering with spherical=True #2466

Open
FrankPortman opened this issue Sep 13, 2022 · 1 comment

Comments

@FrankPortman
Copy link

When using spherical=True during Kmeans clustering, the centroids are normalized after each iteration. The documentation can be found here: https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization

spherical: perform spherical k-means -- the centroids are L2 normalized after each iteration

The same page also says:

The values of the objective function (total square error in case of k-means) along iterations is stored in the variable kmeans.obj, and more extensive statistics in kmeans.iteration_stats.

However, under the hood KMeans with spherical=True we see it setting an InnerProduct index in the Python bindings:

if self.cp.spherical:

and later maximizing the sum of inner products (where the code correctly handles that a higher inner product is better) here:

if ((lower_is_better && obj < best_obj) ||

To summarize the behavior, if you set spherical=True the objective that faiss Kmeans is optimizing and printing out is the sum of inner products of vectors to centroids (would be a sum of cosine sims if the vectors themselves were normed too).

I think Spherical KMeans is potentially synonymous with clustering via cosine/dot similarity as the distance metric, but it is ambiguous enough within this library that several in my org have been tripped up by this. In particular the objective going up for spherical=True could stand to be more clear in documentation.

Would you be open to some documentation and/or code (perhaps printing something?) changes to make this more clear? If you give me a few pointers on how you might prefer this to be cleared up I can make the change and put up a PR.

@mdouze
Copy link
Contributor

mdouze commented Sep 14, 2022

Hmm right, the problem is that the behaviour is different between the Python Kmeans object and the C++ Clustering object (that does not automatically set the INNER_PRODUCT index).
Since I think it would be quite disruptive to change the behavior, we should indeed make this very clear in the docstrings and wiki. Marking this as an enhancement, feel free to submit a PR if you wish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants