R Implementation of the Vendi Score: A Diversity Evaluation Metric for Machine Learning and Science

This repository contains the R implementation of the Vendi Score (VS), a metric for evaluating diversity in machine learning and the natural sciences. The Vendi Scores are a family of diversity metrics that are flexible, interpretable, and unsupervised. Defined as the exponential of the entropy of the eigenvalues of a similarity matrix $\mathbf{K}$, the Vendi Score requires only a pair-wise similarity measure. The Vendi Score is defined in Friedman and Dieng (2022). The order $q$ of the Vendi Score weights the importance of rare and common elements in the diversity computation, as described in Pasarkar and Dieng (2023). Both papers are listed under Citation.
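For reference, the definition can be written out as below. This is a sketch in the notation of the two cited papers: the normalization of $\mathbf{K}$ by the number of samples $n$ and the symbol $\lambda_i$ come from those papers, not from this README, and the convention $0 \log 0 = 0$ is assumed.

```math
\mathrm{VS}_1(\mathbf{K}) = \exp\Big(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\Big),
\qquad
\mathrm{VS}_q(\mathbf{K}) = \exp\Big(\frac{1}{1-q} \log \sum_{i=1}^{n} \lambda_i^{q}\Big) \quad (q \neq 1),
```

where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $\mathbf{K}/n$. The order $q=1$ corresponds to the Shannon entropy, and the other orders to the Rényi entropy of order $q$.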

Check out our Python implementation of the Vendi Scores here!

Installation

The Vendi Scores in R require no additional dependencies and can be directly installed with devtools.

devtools::install_github("vertaix/Vendi-Score-R")

Computing the Vendi Scores

The Vendi Scores have three inputs: your data, a pair-wise similarity function $k$, and an order $q$.

The data can be any data frame, matrix, vector, list, or higher-dimensional array in which index $i$ corresponds to the $i$-th sample. The pair-wise similarity function $k$ should be symmetric and satisfy $k(x,x)=1$. The order $q$ can be any non-negative value.
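To make the requirements on $k$ concrete, here is a minimal base-R sketch that builds the similarity matrix for a small matrix of samples and checks both properties. The helper `build_similarity_matrix` and the toy data `X` are hypothetical and only for illustration; they are not part of the package, which accepts the data and kernel directly.

```r
# Hypothetical helper (not part of the VendiScore package): build the
# n x n matrix K with K[i, j] = k(x_i, x_j), where x_i is row i of X.
build_similarity_matrix <- function(X, k) {
  n <- nrow(X)
  K <- matrix(0, nrow = n, ncol = n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      K[i, j] <- k(X[i, ], X[j, ])
    }
  }
  K
}

# An RBF kernel: symmetric in (x, y), and equal to exp(0) = 1 when x == y.
rbf_kernel <- function(x, y, gamma = 0.1) exp(-gamma * sum((x - y)^2))

# Toy data: three samples (rows) with two features each.
X <- rbind(c(0, 0), c(1, 0), c(0, 2))
K <- build_similarity_matrix(X, rbf_kernel)

# The two requirements on the kernel: symmetry and unit self-similarity.
isSymmetric(K)      # TRUE
all(diag(K) == 1)   # TRUE
```

A kernel that fails either check, for example an unnormalized dot product, does not meet the stated requirements.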

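library(VendiScore)
library(datasets)
data(iris)
# Convert to a numeric matrix and drop the class label column.
iris_mat <- data.matrix(iris)
iris_mat <- iris_mat[, colnames(iris_mat) != 'Species']

# RBF (Gaussian) similarity: symmetric, with k(x, x) = 1.
rbf_kernel <- function(x, y, gamma = 0.1) exp(-gamma * sum((x - y)^2))

score(iris_mat, rbf_kernel, q = 1)
# 3.169735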

A score of about $3$ was expected since we have data from $3$ species, but we did not need to have class labels to measure the diversity of the dataset.
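Reading the score as an effective number of distinct items is easiest to see at the two extremes. The sketch below uses a hypothetical helper, `vendi_q1`, written in base R from the definition (exponential of the Shannon entropy of the eigenvalues of $\mathbf{K}/n$). It is not the package's own code, and the tolerance `1e-12` is an arbitrary choice for discarding numerically zero eigenvalues.

```r
# Hypothetical helper for illustration; not the package's own code.
# Order q = 1 only: exp of the Shannon entropy of the eigenvalues of K / n.
vendi_q1 <- function(K) {
  lambda <- eigen(K / nrow(K), symmetric = TRUE, only.values = TRUE)$values
  lambda <- lambda[lambda > 1e-12]   # drop zeros; 0 * log(0) is taken as 0
  exp(-sum(lambda * log(lambda)))
}

n <- 4
identical_items <- matrix(1, n, n)   # every pair of items has similarity 1
distinct_items  <- diag(n)           # different items have similarity 0

vendi_q1(identical_items)   # approximately 1: one effective item
vendi_q1(distinct_items)    # approximately 4: four effective items
```

Real datasets fall between these two cases, which is why three well-separated groups of similar samples give a score near $3$.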

We can also use the cosine kernel trick to speed up Vendi Score computation for larger numerical datasets when cosine similarity is a suitable kernel. Each sample must be normalized to unit length in this case.

# Scale each row (sample) to unit Euclidean norm.
norm_samples <- t(apply(iris_mat, 1, function(row) row / sqrt(sum(row^2))))
VS <- score_cosine(samples=norm_samples, q=1)
# 1.20783
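The speed-up plausibly comes from a standard linear-algebra identity: for unit-norm samples $X$ of shape $n \times d$, the non-zero eigenvalues of the $n \times n$ matrix $X X^\top / n$ coincide with those of the $d \times d$ matrix $X^\top X / n$, which is much smaller when $d \ll n$. The base-R sketch below checks that both routes give the same order-1 score. Whether `score_cosine` is implemented exactly this way is an assumption, and `vendi_from_eigenvalues` is a hypothetical helper, not a package function.

```r
# Hypothetical helper for illustration; not the package's own code.
# Order-1 score from a vector of eigenvalues (numerical zeros are dropped).
vendi_from_eigenvalues <- function(lambda) {
  lambda <- lambda[lambda > 1e-12]
  exp(-sum(lambda * log(lambda)))
}

X <- as.matrix(iris[, 1:4])     # 150 samples, 4 numeric features
X <- X / sqrt(rowSums(X^2))     # divide each row by its Euclidean norm
n <- nrow(X)

gram_big   <- X %*% t(X) / n    # n x n cosine similarity matrix, scaled by 1/n
gram_small <- crossprod(X) / n  # d x d matrix t(X) %*% X, scaled by 1/n

vs_big   <- vendi_from_eigenvalues(
  eigen(gram_big, symmetric = TRUE, only.values = TRUE)$values)
vs_small <- vendi_from_eigenvalues(
  eigen(gram_small, symmetric = TRUE, only.values = TRUE)$values)

all.equal(vs_big, vs_small)     # TRUE, up to floating-point error
```

The eigendecomposition of the small matrix costs on the order of $d^3$ operations instead of $n^3$, which is where the saving would come from.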

If you have already computed a similarity matrix:

K <- matrix(data=c(1,1,0,1,1,0,0,0,1), nrow=3, ncol=3)
VS <- score_K(K, q=1.)
# 1.88988
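The value above can be reproduced by hand, which also shows how the order $q$ changes the result. The sketch below uses a hypothetical base-R helper, `vendi_order_q`, that follows the definition in the cited papers (Shannon entropy at $q=1$, Rényi entropy otherwise). It is not the package's implementation, and the tolerance `tol` is an illustrative choice.

```r
# Hypothetical helper for illustration; not the package's own code.
vendi_order_q <- function(K, q = 1, tol = 1e-12) {
  lambda <- eigen(K / nrow(K), symmetric = TRUE, only.values = TRUE)$values
  lambda <- lambda[lambda > tol]        # keep only non-zero eigenvalues
  if (q == 1) {
    exp(-sum(lambda * log(lambda)))     # exp of the Shannon entropy
  } else {
    sum(lambda^q)^(1 / (1 - q))         # exp of the Renyi entropy of order q
  }
}

# Same matrix as above: items 1 and 2 are identical, item 3 is distinct.
K <- matrix(c(1, 1, 0, 1, 1, 0, 0, 0, 1), nrow = 3, ncol = 3)

vendi_order_q(K, q = 0)   # about 2: counts the non-zero eigenvalues
vendi_order_q(K, q = 1)   # about 1.88988
vendi_order_q(K, q = 2)   # about 1.8
```

Here the eigenvalues of $\mathbf{K}/3$ are $2/3$, $1/3$, and $0$. Smaller orders give more weight to the rare item, so the score moves toward $2$, while larger orders emphasize the common pair and pull the score down.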

We provide documentation for all functions in the package.

Check out our vignette for a demonstration of the advantages of the Vendi Score over metrics like average similarity.

Citation

@article{friedman2022vendi,
  title={The Vendi Score: A Diversity Evaluation Metric for Machine Learning},
  author={Friedman, Dan and Dieng, Adji Bousso},
  journal={arXiv preprint arXiv:2210.02410},
  year={2022}
}

@article{pasarkar2023cousins,
  title={Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning},
  author={Pasarkar, Amey P and Dieng, Adji Bousso},
  journal={arXiv preprint arXiv:2310.12952},
  year={2023}
}
