vertaix/Vendi-Score-R

R Implementation of the Vendi Score: A Diversity Evaluation Metric for Machine Learning and Science

This repository contains the R implementation of the Vendi Score (VS), a metric for evaluating diversity in machine learning and the natural sciences. The Vendi Scores are a family of diversity metrics that are flexible, interpretable, and unsupervised. Defined as the exponential of the entropy of the eigenvalues of a similarity matrix $\mathbf{K}$, the Vendi Score requires only a pair-wise similarity measure. The Vendi Score is defined in this paper. The order $q$ of the Vendi Score weights the importance of rare and common elements in the diversity computation, as described in this paper.
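
For reference, the verbal definition above can be written out as follows. The notation is an assumption made here for illustration: $n$ is the number of samples and $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $\mathbf{K}/n$, which sum to 1 when $k(x,x)=1$, with the convention $0 \log 0 = 0$.

$$\mathrm{VS}(\mathbf{K}) = \exp\left(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\right)$$

For an order $q \neq 1$, the entropy is replaced by the Rényi entropy of order $q$, and the limit $q \to 1$ recovers the expression above:

$$\mathrm{VS}_q(\mathbf{K}) = \left(\sum_{i=1}^{n} \lambda_i^{q}\right)^{\frac{1}{1-q}}$$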

Check out our Python implementation of the Vendi Scores here!

Installation

The Vendi Scores in R require no additional dependencies and can be directly installed with devtools.

devtools::install_github("vertaix/Vendi-Score-R")

Computing the Vendi Scores

The Vendi Scores have 3 inputs: your data, a pair-wise similarity metric $k$, and an order $q$.

The data can be any data frame, matrix, vector, list, or higher-dimensional array in which index $i$ corresponds to the $i$-th sample. The pair-wise similarity function $k$ should be symmetric and satisfy $k(x,x)=1$. The order $q$ can be any non-negative value.

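library(VendiScore)
library(datasets)
data(iris)
# Keep only the four numeric measurements; drop the Species label column
iris_mat <- data.matrix(iris)
iris_mat <- iris_mat[, colnames(iris_mat) != 'Species']

# RBF (Gaussian) kernel: symmetric, with k(x, x) = 1
rbf_kernel <- function(x, y, gamma = 0.1) exp(-gamma * sum((x - y)^2))

score(iris_mat, rbf_kernel, q=1.)
# 3.169735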

A score of about $3$ was expected since we have data from $3$ species, but we did not need to have class labels to measure the diversity of the dataset.
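
To make the "effective number of distinct samples" reading concrete, here is a minimal base-R sketch that builds a similarity matrix from a kernel and computes the order-1 score from its eigenvalues. The helpers `similarity_matrix` and `vendi_from_K` are hypothetical names introduced for illustration. They are not functions of the VendiScore package and may differ from how the package computes the score internally.

```r
# Hypothetical helpers for illustration only (not part of the VendiScore package).

# Build the n x n matrix K with K[i, j] = k(sample i, sample j).
similarity_matrix <- function(samples, k) {
  n <- nrow(samples)
  K <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      K[i, j] <- k(samples[i, ], samples[j, ])
    }
  }
  K
}

# Exponential of the Shannon entropy of the eigenvalues of K / n.
vendi_from_K <- function(K) {
  lambda <- eigen(K / nrow(K), symmetric = TRUE, only.values = TRUE)$values
  lambda <- lambda[lambda > 1e-12]  # drop numerical zeros: 0 * log(0) is taken as 0
  exp(-sum(lambda * log(lambda)))
}

rbf_kernel <- function(x, y, gamma = 0.1) exp(-gamma * sum((x - y)^2))

identical_pts <- matrix(0, nrow = 3, ncol = 2)           # three copies of one point
distant_pts <- rbind(c(0, 0), c(100, 0), c(0, 100))      # three far-apart points

vendi_from_K(similarity_matrix(identical_pts, rbf_kernel))  # approximately 1
vendi_from_K(similarity_matrix(distant_pts, rbf_kernel))    # approximately 3
```

Three identical samples count as one effective sample, while three mutually dissimilar samples count as three.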

For larger numerical datasets where a cosine similarity kernel is appropriate, we can use the cosine kernel trick to speed up the Vendi Score computation. Each sample must be normalized to unit length in this case.

norm_samples <- t(apply(iris_mat, 1, function(row) row / sqrt(sum(row^2))))
VS <- score_cosine(samples=norm_samples, q=1)
# 1.20783
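
One way to understand why such a trick can be faster is sketched below. This is an assumption about the underlying idea; the source does not describe what `score_cosine` does internally. For unit-norm rows $X$ ($n \times d$), the cosine similarity matrix is $XX^\top$, and its nonzero eigenvalues coincide with those of the much smaller $d \times d$ matrix $X^\top X$. The helper `entropy_score` is a hypothetical name used only here.

```r
# Illustrative sketch in base R; entropy_score is a hypothetical helper,
# not a function of the VendiScore package.
data(iris)
X <- data.matrix(iris[, 1:4])
X <- X / sqrt(rowSums(X^2))  # unit-norm rows, so X %*% t(X) holds cosine similarities
n <- nrow(X)

# Exponential of the Shannon entropy of the eigenvalues of a symmetric matrix M.
entropy_score <- function(M) {
  lambda <- eigen(M, symmetric = TRUE, only.values = TRUE)$values
  lambda <- lambda[lambda > 1e-12]  # drop numerical zeros
  exp(-sum(lambda * log(lambda)))
}

vs_full <- entropy_score(X %*% t(X) / n)  # 150 x 150 eigenproblem
vs_dual <- entropy_score(t(X) %*% X / n)  # 4 x 4 eigenproblem

all.equal(vs_full, vs_dual)  # the two routes should agree up to numerical tolerance
```

The saving grows with the ratio of samples to features: the eigendecomposition is done on a $d \times d$ matrix instead of an $n \times n$ one.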

If you have already computed a similarity matrix:

K <- matrix(data=c(1,1,0,1,1,0,0,0,1), nrow=3, ncol=3)
VS <- score_K(K, q=1.)
# 1.88988
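
The effect of the order $q$ can be checked by hand on this small matrix. The sketch below is a base-R illustration under stated assumptions: `vendi_q` is a hypothetical helper, not the package's `score_K`, and its handling of `q = Inf` as the limiting case is a choice made here for illustration. The eigenvalues of `K / 3` are $2/3$, $1/3$ and $0$.

```r
# Hypothetical helper for illustration; not part of the VendiScore package.
vendi_q <- function(K, q = 1) {
  lambda <- eigen(K / nrow(K), symmetric = TRUE, only.values = TRUE)$values
  lambda <- lambda[lambda > 1e-12]                  # drop numerical zeros
  if (is.infinite(q)) return(1 / max(lambda))       # limit q -> Inf
  if (abs(q - 1) < 1e-8) {
    return(exp(-sum(lambda * log(lambda))))         # limit q -> 1 (Shannon entropy)
  }
  sum(lambda^q)^(1 / (1 - q))                       # Renyi entropy of order q
}

K <- matrix(c(1, 1, 0, 1, 1, 0, 0, 0, 1), nrow = 3, ncol = 3)

vendi_q(K, q = 1)    # about 1.88988, matching the value above
vendi_q(K, q = 2)    # 1 / (4/9 + 1/9) = 1.8
vendi_q(K, q = Inf)  # 1 / (2/3) = 1.5
```

Larger orders put more weight on the dominant (common) component, so the score moves down toward 1.5, while smaller orders give rare components more influence.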

We provide documentation for all functions in the package.

Check out our vignette for a demonstration of the advantages of the Vendi Score over metrics like average similarity.

Citation

@article{friedman2022vendi,
  title={The Vendi Score: A Diversity Evaluation Metric for Machine Learning},
  author={Friedman, Dan and Dieng, Adji Bousso},
  journal={arXiv preprint arXiv:2210.02410},
  year={2022}
}
@article{pasarkar2023cousins,
  title={Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning},
  author={Pasarkar, Amey P and Dieng, Adji Bousso},
  journal={arXiv preprint arXiv:2310.12952},
  year={2023}
}
