UMAP Plugin

UMAP analysis plugin for the ManiVault visual analytics framework, built on the libscran/umappp library.

Clone the repo; all dependencies are downloaded during CMake configuration:

git clone https://github.com/ManiVaultStudio/UMAP-Plugin.git

Figure: Left: UMAP embedding of 10k MNIST test data. Right: 3D UMAP embedding of the Indian Pines data, showing the x and y embedding dimensions (top), the y and z embedding dimensions (bottom), and a re-coloring of the Indian Pines image that interprets the 3D embedding space as HSV color space (right).

Settings

Main settings:

Epochs: Number of epochs for the gradient descent, i.e., optimization iterations. Larger values improve accuracy at the cost of computational work. For datasets with no more than 10000 observations, the number of epochs is set to 500. For larger datasets, the number of epochs decreases from 500 as the number of observations grows beyond 10000, down to a lower limit of 200. This reduces computational work for very large datasets.
Initialization: How should the initial coordinates of the embedding be obtained?
- SPECTRAL: attempts initialization based on spectral decomposition of the graph Laplacian. If that fails, we fall back to random draws from a normal distribution.
- RANDOM: fills the embedding with random draws from a normal distribution.
Embedding dimensions: Number of output dimensions.
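
The default epoch rule described above can be sketched as a small function. This is a minimal illustration, not the plugin's or umappp's actual code: the name default_epochs is hypothetical, and the exact decay (300 extra epochs scaled by 10000 / n) is an assumption chosen to match the stated behavior of 500 epochs up to 10000 observations and a floor of 200.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper, not part of the plugin's API.
// Assumed rule: 200 base epochs plus up to 300 extra epochs, where the extra
// part shrinks in proportion to 10000 / num_observations for large datasets.
inline int default_epochs(std::size_t num_observations) {
    assert(num_observations > 0);

    constexpr std::size_t limit = 10000; // size up to which 500 epochs are used
    constexpr int floor_epochs = 200;    // lower limit for very large datasets
    constexpr int extra_epochs = 300;    // 200 + 300 = 500 for small datasets

    if (num_observations <= limit) {
        return floor_epochs + extra_epochs;
    }

    const double fraction =
        static_cast<double>(limit) / static_cast<double>(num_observations);
    return floor_epochs + static_cast<int>(std::ceil(extra_epochs * fraction));
}
```

Under this assumed rule, default_epochs(20000) yields 350 and the value approaches 200 as the dataset grows. Setting Epochs explicitly overrides any such default.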

Advanced settings:

local_connectivity: The number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure.
bandwidth: Effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure.
mix_ratio: Mixing ratio used when combining fuzzy sets. Combining symmetrizes the sets by ensuring that the confidence of $A$ belonging to $B$'s set is the same as the confidence of $B$ belonging to $A$'s set. A mixing ratio of 1 takes the union of confidences, a ratio of 0 takes the intersection, and intermediate values interpolate between them. Larger values (up to 1) favor connectivity and more global structure.
spread: Scale of the coordinates of the final low-dimensional embedding.
min_dist: Minimum distance between observations in the final low-dimensional embedding. Smaller values increase local clustering, while larger values favor a more even distribution. This is interpreted relative to the value of spread.
negative_sample_rate: Rate of sampling negative observations to compute repulsive forces. This is interpreted with respect to the number of neighbors with attractive forces, i.e., for each attractive interaction, n negative samples are taken for repulsive interactions. Smaller values can improve the speed of convergence but at the cost of stability.
a: Positive value for the $a$ parameter for the fuzzy set membership strength calculations. Larger values yield a sharper decay in membership strength with increasing distance between observations. If this or $b$ is set to zero, a suitable value for this parameter is automatically determined from the values provided to spread and min_dist.
b: Value in $(0, 1)$ for the $b$ parameter for the fuzzy set membership strength calculations. Larger values yield an earlier decay in membership strength with increasing distance between observations. If this or $a$ is set to zero, a suitable value for this parameter is automatically determined from the values provided to spread and min_dist.
repulsion_strength: Modifier for the repulsive force. Larger values increase repulsion and favor local structure.
learning_rate: Initial learning rate used in the gradient descent. Larger values can improve the speed of convergence but at the cost of stability.
seed: Seed to use for the Mersenne Twister when sampling negative observations.
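
Several of the advanced settings correspond to short formulas from the UMAP paper. The sketch below illustrates three of them: how mix_ratio blends the fuzzy union and intersection, the membership curve controlled by a and b, and the target curve defined by spread and min_dist to which a and b are fitted in the reference UMAP implementation. All function names are hypothetical, and it is an assumption that umappp evaluates these quantities in exactly this form.

```cpp
#include <cassert>
#include <cmath>

// mix_ratio: blend the fuzzy union and fuzzy intersection of two directed
// confidences (p_ab: A in B's set, p_ba: B in A's set).
// Hypothetical helper for illustration only.
inline double combine_confidence(double p_ab, double p_ba, double mix_ratio) {
    assert(mix_ratio >= 0.0 && mix_ratio <= 1.0);
    const double intersection = p_ab * p_ba;
    const double set_union = p_ab + p_ba - intersection;
    return mix_ratio * set_union + (1.0 - mix_ratio) * intersection;
}

// a, b: membership strength between two embedded points at distance d,
// using the curve 1 / (1 + a * d^(2b)) from the UMAP paper.
inline double embedding_membership(double d, double a, double b) {
    return 1.0 / (1.0 + a * std::pow(d, 2.0 * b));
}

// spread, min_dist: target curve that a and b approximate when they are
// determined automatically (full strength up to min_dist, then an
// exponential decay scaled by spread).
inline double target_membership(double d, double spread, double min_dist) {
    assert(spread > 0.0);
    if (d <= min_dist) {
        return 1.0;
    }
    return std::exp(-(d - min_dist) / spread);
}
```

For example, with directed confidences 0.5 and 0.2, a mix_ratio of 1 gives the union 0.6, a mix_ratio of 0 gives the intersection 0.1, and 0.5 gives the midpoint 0.35. The curves also show why min_dist is read relative to spread: the decay beyond min_dist is measured in units of spread.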

knn Settings:

Algorithm: Approximate knn algorithm/library to use, either Annoy or HNSW.
Number knn: Number of neighbors to use to define the fuzzy sets. Larger values improve connectivity and favor preservation of global structure, at the cost of increased computational work.
Multithreading: Whether to use all available threads for the knn computation. This is faster but uses more memory.
(Annoy) Trees & Checks: Correspond to Annoy's n_trees and search_k parameters; see the Annoy documentation.
(HNSW) M & ef: Described in the HNSW documentation.

References

libscran/umappp: Aaron Lun, BSD 2-Clause License
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, McInnes L, Healy J, Melville J (2020), arXiv:1802.03426
