UMAP Analysis plugin for the ManiVault visual analytics framework based on the libscran/umappp library.
Clone the repo; all dependencies are downloaded during CMake configuration:
git clone https://github.com/ManiVaultStudio/UMAP-Plugin.git
Left: UMAP embedding of the 10k MNIST test data. Right: 3D UMAP embedding of the Indian Pines data, showing (top) the x and y and (bottom) the y and z embedding dimensions, as well as (right) a re-coloring of the Indian Pines image that interprets the 3D embedding coordinates as HSV color values.
Main settings:
- Epochs: Number of epochs for the gradient descent, i.e., optimization iterations. Larger values improve accuracy at the cost of computational work. For datasets with no more than 10000 observations, the number of epochs is set to 500. For larger datasets, the number of epochs decreases from 500 according to the number of observations beyond 10000, to a lower limit of 200. This choice aims to reduce computational work for very large datasets.
- Initialization: How the initial coordinates of the embedding are obtained.
  - SPECTRAL: attempts initialization based on spectral decomposition of the graph Laplacian. If that fails, it falls back to random draws from a normal distribution.
  - RANDOM: fills the embedding with random draws from a normal distribution.
- Embedding dimensions: Number of output dimensions.
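The default epoch schedule described above can be sketched as follows. This is only an illustration: the text gives the endpoints (500 epochs up to 10000 observations, a lower limit of 200), while the exact decay used here is an assumption, namely that the extra work beyond the 200-epoch minimum stays roughly constant as the dataset grows. The helper name `default_epochs` is hypothetical and not part of the plugin or of umappp.

```python
def default_epochs(num_obs: int) -> int:
    """Hypothetical helper: default epoch count for num_obs observations."""
    limit = 10_000   # datasets up to this size get the full 500 epochs
    minimal = 200    # lower limit stated in the settings description
    extra = 300      # additional epochs at or below the limit (200 + 300 = 500)
    if num_obs <= limit:
        return minimal + extra
    # Assumption: extra epochs shrink in proportion to limit / num_obs, so the
    # extra work (epochs x observations) beyond the minimum stays constant.
    return minimal + -(-extra * limit // num_obs)  # integer ceiling division


if __name__ == "__main__":
    for n in (5_000, 10_000, 20_000, 100_000, 1_000_000):
        print(n, default_epochs(n))
```

Under this assumption the count never drops below 200 and only approaches it for very large datasets. Setting Epochs explicitly overrides any such default.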
Advanced settings:
- local_connectivity: The number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure.
- bandwidth: Effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure.
- mix_ratio: Mixing ratio used when combining fuzzy sets. This symmetrizes the sets by ensuring that the confidence of $A$ belonging to $B$'s set is the same as the confidence of $B$ belonging to $A$'s set. A mixing ratio of 1 will take the union of confidences, a ratio of 0 will take the intersection, and intermediate values will interpolate between them. Larger values (up to 1) favor connectivity and more global structure.
- spread: Scale of the coordinates of the final low-dimensional embedding.
- min_dist: Minimum distance between observations in the final low-dimensional embedding. Smaller values increase local clustering while larger values favor a more even distribution. This is interpreted relative to the value of the spread setting.
- negative_sample_rate: Rate of sampling negative observations to compute repulsive forces. This is interpreted with respect to the number of neighbors with attractive forces, i.e., for each attractive interaction, n negative samples are taken for repulsive interactions, where n is this rate. Smaller values can improve the speed of convergence but at the cost of stability.
- a: Positive value for the $a$ parameter for the fuzzy set membership strength calculations. Larger values yield a sharper decay in membership strength with increasing distance between observations. If this or $b$ is set to zero, a suitable value for this parameter is automatically determined from the values provided to spread and min_dist.
- b: Value in $(0, 1)$ for the $b$ parameter for the fuzzy set membership strength calculations. Larger values yield an earlier decay in membership strength with increasing distance between observations. If this or $a$ is set to zero, a suitable value for this parameter is automatically determined from the values provided to spread and min_dist.
- repulsion_strength: Modifier for the repulsive force. Larger values increase repulsion and favor local structure.
- learning_rate: Initial learning rate used in the gradient descent. Larger values can improve the speed of convergence but at the cost of stability.
- seed: Seed to use for the Mersenne Twister when sampling negative observations.
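To make the effect of mix_ratio more concrete, here is a minimal sketch of the symmetrization step. It assumes the usual UMAP formulation, where the union of two confidences $p$ and $q$ is $p + q - pq$ and the intersection is $pq$. It uses a small dense matrix for readability, while the library works on sparse neighbor lists. The function name `symmetrize` is hypothetical.

```python
def symmetrize(conf, mix_ratio=1.0):
    """Blend fuzzy union and intersection of directed confidences.

    conf[i][j] is the confidence of observation j belonging to i's fuzzy set
    (0.0 if j is not among i's neighbors).
    """
    n = len(conf)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            p, q = conf[i][j], conf[j][i]
            union = p + q - p * q   # fuzzy union of the two confidences
            intersection = p * q    # fuzzy intersection
            out[i][j] = mix_ratio * union + (1.0 - mix_ratio) * intersection
    return out


if __name__ == "__main__":
    # Directed confidences between two observations: 0 -> 1 is 0.8, 1 -> 0 is 0.5.
    directed = [[0.0, 0.8],
                [0.5, 0.0]]
    for ratio in (0.0, 0.5, 1.0):
        print(ratio, symmetrize(directed, ratio)[0][1])
```

The result is symmetric for any ratio. With a ratio of 1, an edge survives if either observation lists the other as a neighbor, which is why larger values favor connectivity; with a ratio of 0, both must list each other.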
knn settings:
- Algorithm: Type of approximate knn algorithm/library to use, either Annoy or HNSW.
- Number knn: Number of neighbors to use to define the fuzzy sets. Larger values improve connectivity and favor preservation of global structure, at the cost of increased computational work.
- Multithreading: Whether to use all available threads for the knn computation. This is faster but uses more memory.
- (Annoy) Trees & Checks: correspond to n_trees and search_k; see the Annoy docs.
- (HNSW) M & ef: detailed in the HNSW docs.
- libscran/umappp: Aaron Lun, BSD 2-Clause License
- UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, McInnes L, Healy J, Melville J (2020), arXiv:1802.03426