Our work has been published at the 39th Annual AAAI Conference on Artificial Intelligence!
LocalMAP (Pairwise Controlled Manifold Approximation with Local Adjusted Graph) is a new dimensionality reduction algorithm that dynamically and locally adjusts the graph to address the challenges of getting a suboptimal graph due to unreliable high-dimensional distances and the limited information extracted from the high-dimensional data.
Many Dimension Reduction (DR) methods first convert the original high-dimensional data into a graph, in which each edge represents the similarity or dissimilarity between a pair of data points. However, this graph is frequently suboptimal due to unreliable high-dimensional distances and the limited information extracted from the high-dimensional data. Therefore, we introduce LocalMAP, a new dimensionality reduction algorithm that dynamically and locally adjusts the graph to address these challenges. LocalMAP is capable of identifying and separating real clusters within the data that other DR methods may overlook or merge.
Please see the release notes, which are shared with PaCMAP.
LocalMAP is currently distributed as part of the PaCMAP package. To try LocalMAP, please install the PaCMAP package.
You can use conda or mamba to install PaCMAP from the conda-forge channel.
conda:
conda install pacmap -c conda-forge
mamba:
mamba install pacmap -c conda-forge
You can use pip to install pacmap from PyPI. It will automatically install the dependencies for you:
pip install pacmap
If you have any problems during the installation of dependencies, such as "Failed building wheel for annoy", you can try to install these dependencies with conda or mamba. Users have also reported that in some cases, you may wish to use numba >= 0.57.
conda install -c conda-forge python-annoy
pip install pacmap
The pacmap package is designed to be compatible with scikit-learn, meaning that its interface is similar to that of the functions in the sklearn.manifold module. To run LocalMAP on your own dataset, install the package following the installation instructions above, and then import the module. The following code snippet shows how to use LocalMAP on the COIL-20 dataset:
from pacmap import LocalMAP
import numpy as np
import matplotlib.pyplot as plt
# loading preprocessed coil_20 dataset
# you can change it with any dataset that is in the ndarray format, with the shape (N, D)
# where N is the number of samples and D is the dimension of each sample
X = np.load("./data/coil_20.npy", allow_pickle=True)
X = X.reshape(X.shape[0], -1)
y = np.load("./data/coil_20_labels.npy", allow_pickle=True)
# initializing the LocalMAP instance
# Setting n_neighbors to None leads to the automatic choice described in the parameters section below
embedding = LocalMAP(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0)
# fit the data (The index of transformed data corresponds to the index of the original data)
X_transformed = embedding.fit_transform(X, init="pca")
# visualize the embedding
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.scatter(X_transformed[:, 0], X_transformed[:, 1], cmap="Spectral", c=y, s=0.6)
# display the figure when running as a script
plt.show()
The following images are visualizations of two datasets, MNIST (n=70,000, d=784) and USPS (n=9,298, d=256), generated by LocalMAP. The two visualizations demonstrate LocalMAP's ability to preserve local and global structure, respectively, and show better separation of true clusters compared to other methods.
The list of the most important parameters is given below.
- n_components: the number of dimensions of the output. Defaults to 2.
- n_neighbors: the number of neighbors considered in the k-Nearest Neighbor graph. Defaults to 10. This parameter can also be set to None to enable automatic selection of the number of neighbors: the number of neighbors is set to 10 for datasets whose sample size is smaller than 10000. For large datasets whose sample size (n) is larger than 10000, the value is 10 + 15 * (log10(n) - 4).
- MN_ratio: the ratio of the number of mid-near pairs to the number of neighbors, n_MN = n_neighbors * MN_ratio. Defaults to 0.5.
- FP_ratio: the ratio of the number of further pairs to the number of neighbors, n_FP = n_neighbors * FP_ratio. Defaults to 2.
- [New for LocalMAP] low_dist_thres: the average low-dimensional distance among all nearest cluster pairs. Defaults to 10.
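The automatic neighbor selection and the pair counts described above can be written out as a small sketch. The helper names below are hypothetical and not part of the pacmap API, and rounding to the nearest integer is an assumption, since the text above only gives the formulas.

```python
import math


def auto_n_neighbors(n_samples: int) -> int:
    """Neighbor count used when n_neighbors is None, following the rule above."""
    if n_samples <= 10000:
        return 10
    # Rounding to the nearest integer is an assumption of this sketch.
    return int(round(10 + 15 * (math.log10(n_samples) - 4)))


def pair_counts(n_neighbors: int, MN_ratio: float = 0.5, FP_ratio: float = 2.0):
    """Return (n_MN, n_FP), the mid-near and further pair counts per point."""
    # Rounding is again an assumption; the text only states the products.
    n_MN = int(round(n_neighbors * MN_ratio))
    n_FP = int(round(n_neighbors * FP_ratio))
    return n_MN, n_FP


if __name__ == "__main__":
    for n in (5_000, 100_000, 1_000_000):
        k = auto_n_neighbors(n)
        print(n, k, pair_counts(k))
```

Under these assumptions, the neighbor count grows by 15 for every tenfold increase in sample size beyond 10000.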
The initialization is also important to the result, but it is a parameter of the fit and fit_transform functions.
- init: the initialization of the lower dimensional embedding. One of "pca" or "random", or a user-provided numpy ndarray with the shape (N, 2). Defaults to "pca".
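As a minimal sketch of the user-provided option, the array passed as init needs one row per sample and two columns. The helper random_init and the toy data are hypothetical and only illustrate the expected shape; the commented-out call mirrors the fit_transform usage shown in the example above.

```python
import numpy as np


def random_init(n_samples: int, n_components: int = 2, seed: int = 0) -> np.ndarray:
    """Build a reproducible (N, n_components) starting layout (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_samples, n_components)).astype(np.float32)


# Toy data standing in for a real (N, D) dataset.
X = np.random.default_rng(1).normal(size=(500, 20)).astype(np.float32)
init = random_init(X.shape[0])
print(init.shape)

# With pacmap installed, the array would be passed in place of "pca":
# from pacmap import LocalMAP
# X_transformed = LocalMAP(n_components=2).fit_transform(X, init=init)
```

Fixing the seed keeps the starting layout identical across runs, which can help when comparing parameter settings.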
Other parameters include:
- num_iters: number of iterations. Defaults to 450. 450 iterations are enough for most datasets to converge.
- pair_neighbors, pair_MN and pair_FP: pre-specified neighbor pairs, mid-near pairs, and further pairs. Allows users to use their own graphs. Defaults to None.
- verbose: print the progress of pacmap. Defaults to False.
- lr: learning rate of the AdaGrad optimizer. Defaults to 1.
- apply_pca: whether localmap should apply PCA to the data before constructing the k-Nearest Neighbor graph. Using PCA to preprocess the data can largely accelerate the DR process without losing too much accuracy. Notice that this option does not affect the initialization of the optimization process.
- intermediate: whether localmap should also output the intermediate stages of the optimization process of the lower dimension embedding. If True, then the output will be a numpy array of the size (n, n_components, 13), where each slice is a "screenshot" of the output embedding at a particular number of steps, from [0, 10, 30, 60, 100, 120, 140, 170, 200, 250, 300, 350, 450].
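To pick out one snapshot from the intermediate output, the iteration number has to be mapped to its position in the list of steps. The sketch below assumes the (n, n_components, 13) layout stated above and uses a zero-filled placeholder array instead of a real LocalMAP result; snapshot_at is a hypothetical helper, not part of the package.

```python
import numpy as np

# Iterations at which snapshots are taken, as listed above.
SNAPSHOT_ITERS = [0, 10, 30, 60, 100, 120, 140, 170, 200, 250, 300, 350, 450]


def snapshot_at(frames: np.ndarray, iteration: int) -> np.ndarray:
    """Return the (n, n_components) embedding recorded at `iteration`."""
    idx = SNAPSHOT_ITERS.index(iteration)  # raises ValueError if not recorded
    return frames[:, :, idx]


# Placeholder standing in for the output obtained with intermediate=True.
frames = np.zeros((100, 2, len(SNAPSHOT_ITERS)), dtype=np.float32)
final = snapshot_at(frames, 450)
print(final.shape)
```

If the installed version orders the axes differently, only the indexing line in snapshot_at needs to change.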
Similar to the scikit-learn API, a LocalMAP instance can generate an embedding for a dataset via the fit, fit_transform and transform methods. We currently support numpy.ndarray as the input format. To convert a pandas DataFrame to an ndarray, please refer to the pandas documentation. For a more detailed walkthrough, please see the demo directory.
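For the DataFrame case, one common route is pandas' to_numpy method. The sketch below uses a made-up three-row table; the column names are hypothetical, and the float32 cast is a choice made here rather than a documented requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical table: two numeric feature columns and one label column.
df = pd.DataFrame(
    {
        "f1": [0.1, 0.4, 0.9],
        "f2": [1.0, 2.0, 3.0],
        "label": ["a", "b", "a"],
    }
)

# Keep only the numeric features and convert them to an (N, D) ndarray.
feature_cols = ["f1", "f2"]
X = df[feature_cols].to_numpy(dtype=np.float32)

# Labels are kept separately, e.g. for coloring a scatter plot.
y = df["label"].to_numpy()

print(X.shape, y.shape)
```

Selecting the feature columns explicitly avoids accidentally passing non-numeric columns into the embedding.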
We provide an option that allows users to supply their own nearest neighbors when mapping large-scale datasets. Please see the demo for a detailed walkthrough of how to use LocalMAP with user-specified nearest neighbors.
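As a rough illustration of what a user-specified neighbor graph could look like, the sketch below builds brute-force Euclidean nearest-neighbor pairs with numpy. The (n * n_neighbors, 2) integer layout, with one (point, neighbor) row per pair, is an assumption made here, and knn_pairs is a hypothetical helper; the demo remains the authoritative reference for the expected format. Brute force is quadratic in the number of samples, so a large dataset would normally use an approximate nearest-neighbor library instead.

```python
import numpy as np


def knn_pairs(X: np.ndarray, n_neighbors: int) -> np.ndarray:
    """Brute-force Euclidean k-nearest-neighbor pairs, one (i, j) row per pair."""
    # Squared Euclidean distances between all rows of X.
    sq_norms = (X ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    # Indices of the n_neighbors closest points for every row.
    nbrs = np.argsort(dists, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(X.shape[0]), n_neighbors)
    # Assumed layout: column 0 is the point, column 1 is its neighbor.
    return np.stack([rows, nbrs.ravel()], axis=1).astype(np.int32)


# Toy data standing in for a real (N, D) dataset.
X = np.random.default_rng(0).normal(size=(200, 16))
pairs = knn_pairs(X, n_neighbors=10)
print(pairs.shape)
```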
We have provided the code we used to run the experiments for better reproducibility. The code is separated into three parts, in three folders:
- data, which includes part of the datasets we used, preprocessed into the file format each DR method uses. Some of the datasets are too large to host on GitHub; if you need a specific dataset, please send an email to yiyang.sun@duke.edu.
- experiments, which includes all the scripts we use to produce DR results.
- evaluation, which includes all the scripts we use to evaluate DR results.
After downloading the code, you may need to specify some of the paths in the script to make them fully functional.
The LocalMAP paper will be released on arXiv soon!
Please see the license file.