UMAP algorithm does not work as expected #797

stefanhahmann · 2024-12-09T11:11:50Z

Describe the bug
When running the UMAP algorithm on sample data, e.g. two distinct 3D point clouds, and reducing them to 2 dimensions, the result contains only the embeddings of the larger of the two point clouds, not the smaller one.

Expected behavior
The UMAP result should contain an embedding for every input data point.

Code snippet
#796

Additional context

JDK21
Smile version 4.0.0
Windows

kklioss · 2024-12-10T04:21:27Z

This happens because the default spectral layout initialization can be applied only to a connected component. I have added PCA initialization for the case where there are multiple connected components. Please try the master branch. On the other hand, the existence of multiple connected components implies that a global view of the data cannot be attained with this initialization. In this case, you may want to increase k of the k-nearest-neighbor graph.
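
To illustrate the point about connected components and k, here is a minimal sketch in plain Java. It is not Smile's implementation: it builds a brute-force k-nearest-neighbor graph over two well-separated 3D point clouds and counts connected components with union-find. The class name `KnnComponents`, its methods, and the toy data are all assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration, not Smile's code: counts the connected
// components of a brute-force k-nearest-neighbor graph.
class KnnComponents {
    // Union-find "find" with path halving.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Links every point to its k nearest neighbors (edges treated as
    // undirected) and returns the number of connected components.
    static int countComponents(double[][] x, int k) {
        int n = x.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;

        for (int i = 0; i < n; i++) {
            final int self = i;
            Integer[] order = new Integer[n];
            for (int j = 0; j < n; j++) order[j] = j;
            Arrays.sort(order, Comparator.comparingDouble(
                    (Integer j) -> squaredDistance(x[self], x[j])));
            // order[0] is the point itself (distance 0), so skip it.
            for (int m = 1; m <= k && m < n; m++) {
                parent[find(parent, i)] = find(parent, order[m]);
            }
        }

        int components = 0;
        for (int i = 0; i < n; i++) {
            if (find(parent, i) == i) components++;
        }
        return components;
    }

    // Toy data: 20 points near the origin and 5 points far away.
    static double[][] twoClouds() {
        double[][] x = new double[25][];
        for (int i = 0; i < 20; i++) x[i] = new double[] {i, 0.0, 0.0};
        for (int i = 0; i < 5; i++) x[20 + i] = new double[] {100.0 + i, 0.0, 0.0};
        return x;
    }

    public static void main(String[] args) {
        double[][] x = twoClouds();
        System.out.println("k=2: " + countComponents(x, 2) + " component(s)");
        System.out.println("k=5: " + countComponents(x, 5) + " component(s)");
    }
}
```

With this toy data, k=2 yields two components and k=5 yields one: the smaller cloud has only 5 points, so each of its points has just 4 neighbors inside the cloud and the fifth neighbor must come from the larger cloud. Real data will need a different k, and a larger k also makes the graph more expensive to build.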

stefanhahmann · 2024-12-11T08:21:04Z

Thanks for the quick fix. The unit test contained in #796 works with this modification.
Thanks also for the hint regarding k of knn. It makes complete sense for the artificial example of the unit test.

Are you planning to publish a patch release of smile-core/smile-base?

Unrelated, but maybe interesting: I did a performance comparison between the smile UMAP implementation and the tagbio UMAP implementation (https://github.com/tag-bio/umap-java). Besides the fact that the two implementations produce slightly different results (which may be due to the random elements inside the UMAP algorithm), the tagbio implementation seems to be 2-5 times faster.

kklioss · 2024-12-11T15:08:58Z

Have you checked if we use the same number of iterations for both implementations?

stefanhahmann · 2024-12-11T16:45:42Z

Yes. The number of iterations can be configured for both implementations, and I have set it to 500 in both cases.

I am also surprised at how different the results look for the different implementations on the same dataset. I would not have expected that. However, this may be explained by different initialization strategies or different random seeds.

[Figures: embeddings from Sklearn (Python), TagBio (Java), and Smile (Java)]

kklioss · 2024-12-12T03:00:34Z

A significant share of the time in Smile's UMAP is spent on the spectral layout used for initialization, as it requires an eigenvalue decomposition. It looks like tagbio doesn't support spectral initialization. This may explain the difference in speed and in the final embedding.

stefanhahmann · 2024-12-12T12:06:29Z

> A significant share of the time in Smile's UMAP is spent on the spectral layout used for initialization, as it requires an eigenvalue decomposition. It looks like tagbio doesn't support spectral initialization. This may explain the difference in speed and in the final embedding.

Yes, indeed. The tagbio implementation does not support spectral initialization. It also does not support PCA initialization, which smile now uses when there are multiple connected components.
And indeed, this shortens the time before the actual 500 iterations start. The optimization iterations seem to take about the same time in both implementations, but the initialization takes longer in the smile implementation, which is easy to understand, since tagbio just uses random initialization.
In

stefanhahmann · 2024-12-12T14:37:42Z

I did some further profiling. As done in https://github.com/haifengl/smile/blob/master/core/src/test/java/smile/manifold/UMAPTest.java, I ran some tests with the MNIST dataset, i.e. I ran UMAP umap = UMAP.of(MNIST.x, 7); and compared the result to equivalent tests with the tagbio implementation and the UMAP Python implementation.

It took 18:36 minutes in total
18:10 minutes of this were used for initializing the graph NearestNeighborGraph cc = NearestNeighborGraph.largest(graph);
The actual optimization of the layout took about 25s
The UMAP Python implementation took ~60s and came close to what is stated in the docs (https://umap-learn.readthedocs.io/en/latest/reproducibility.html).
The tagbio implementation takes ~45s and reaches a qualitatively similar (but numerically different) result:
The results of the smile implementation seem to differ substantially:

kklioss · 2024-12-12T17:13:16Z

It takes only 4.34 seconds on my PC, and the result also looks different from yours. I don't know why it is so slow in your case.

stefanhahmann · 2024-12-12T17:43:40Z

> It takes only 4.34 seconds on my PC, and the result also looks different from yours. I don't know why it is so slow in your case.

Ah, I see the reason. I did not run my test with https://github.com/haifengl/smile/blob/3a8b7f32572e45d496cbea5d5ba32ad90b8efd73/shell/src/universal/data/mnist/mnist2500_X.txt but with the original dataset of 70,000 points. Sorry for not making this clear in the first place. To be fair, I ran this test with the same number of data points for all 3 implementations considered (smile, tagbio, python). Thus, it seems to me that something in the smile implementation has quadratic complexity, probably the NearestNeighborGraph cc = NearestNeighborGraph.largest(graph); part.
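
As a rough sanity check on the complexity argument, the sketch below compares the growth factor expected for linear and quadratic cost with the ratio of the two total runtimes reported in this thread (about 4.4 s for 2,500 points and 18:36 minutes for 70,000 points). The class `ScalingEstimate` and its method are hypothetical helpers, and the comparison assumes both timings are comparable end-to-end measurements.

```java
// Hypothetical helper: how much longer a stage should take when the input
// grows from n1 to n2 points, assuming its cost grows like n^exponent.
class ScalingEstimate {
    static double growthFactor(long n1, long n2, double exponent) {
        return Math.pow((double) n2 / n1, exponent);
    }

    public static void main(String[] args) {
        long small = 2_500;  // mnist2500 subset used in the unit test
        long full = 70_000;  // full MNIST dataset
        System.out.println("linear:    " + growthFactor(small, full, 1.0));
        System.out.println("quadratic: " + growthFactor(small, full, 2.0));
        // Reported in this thread: about 4.4 s vs 18:36 min (1116 s).
        System.out.println("observed:  " + (18 * 60 + 36) / 4.4);
    }
}
```

The input grows by a factor of 28, so linear cost predicts 28x and quadratic cost predicts 784x. The observed ratio is roughly 250x, which sits between the two: clearly superlinear, but two data points cannot pin down the exact exponent.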

If I run with mnist2500, it also takes ~4.4s on my machine. I am surprised that I do not get the same layout, even though I set MathEx.setSeed(19650218); as in the unit test (https://github.com/haifengl/smile/blob/master/core/src/test/java/smile/manifold/UMAPTest.java) before running it. I also chose k=7 and left everything else at the defaults. The layout on my side looks a bit different (I did not color the classes, though):

In the result that you get, it would be hard to distinguish points of different classes if they were not colored by class. Compared with the Python implementation, it looks like this should actually be possible (https://umap-learn.readthedocs.io/en/latest/reproducibility.html).

kklioss · 2024-12-12T18:42:46Z

Can you please try to increase the JVM memory, e.g. with -Xmx16G?
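
To confirm that a heap setting actually took effect, one option is to print the maximum heap size from inside the JVM. This is a minimal sketch using only the standard `Runtime` API; the class name `HeapCheck` is made up for this example. Note that the option is written without an equals sign, as in -Xmx16G.

```java
// Prints the maximum amount of heap memory the running JVM will try to use.
class HeapCheck {
    static double maxHeapGiB() {
        return Runtime.getRuntime().maxMemory() / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        System.out.printf("Max heap: %.2f GiB%n", maxHeapGiB());
    }
}
```

Running it as `java -Xmx16G HeapCheck.java` should report a value close to 16 GiB; the exact number may be slightly lower depending on the garbage collector in use.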

stefanhahmann · 2024-12-13T10:40:39Z

I pulled the latest commits (8289526), rebuilt, and tried with -Xmx16G, but it is still slow with the mnist_70000 dataset:
mnist_70000.zip

The overall runtime with the 2500-point MNIST dataset has now increased from 4.6s to 15.1s on my machine.

kklioss · 2024-12-22T23:32:17Z

We have added a series of optimizations so that UMAP can finish in 4 minutes with the full-size MNIST data on my laptop. We will look into the difference in the embedding next.

kklioss · 2024-12-25T04:14:18Z

These are the latest embedding results on the full-size MNIST data. It takes less than 3 minutes on my PC.

stefanhahmann · 2025-01-06T16:47:21Z

> These are the latest embedding results on the full-size MNIST data. It takes less than 3 minutes on my PC.

The embedding looks a lot better than before. However, I cannot reproduce it with the MNIST 70,000 dataset. Which initialization did you use?

I used:
UMAP.of(data, 15);
and get this result:

The performance improvement is also good to read! It is a bit of a pity that it is still 3-4 times slower than the original python implementation and the other existing java implementation.

kklioss · 2025-01-07T02:45:33Z

Can you please try v4 branch?

double[][] x = Read.csv("mnist_70000.csv").toArray();
long start = System.currentTimeMillis();
var map = UMAP.of(x, 15);
long end = System.currentTimeMillis();
System.out.format("UMAP takes %.2f seconds\n", (end - start) / 1000.0);
ScatterPlot.of(map, '@', Color.ORANGE).canvas().window();

The overall process takes 210s on my machine. The UMAP algorithm itself takes about 60s, which is on par with the Python implementation. The rest of the time is spent on nearest neighbor graph construction. I don't know whether the Python implementation includes this part in its reported time.
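
To separate graph construction time from optimization time in a comparison like this, each stage can be timed on its own. The sketch below is a generic, library-free helper; `StageTimer` and `busyWork` are hypothetical names, and the two placeholder stages stand in for whatever calls the implementation under test actually exposes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for timing pipeline stages separately.
class StageTimer {
    private final Map<String, Double> seconds = new LinkedHashMap<>();

    // Runs the stage and records its wall-clock duration in seconds.
    void time(String name, Runnable stage) {
        long start = System.nanoTime();
        stage.run();
        seconds.put(name, (System.nanoTime() - start) / 1e9);
    }

    // Recorded durations, in the order the stages were run.
    Map<String, Double> results() {
        return seconds;
    }

    // Stand-in workload so the example runs without any library.
    static double busyWork(int n) {
        double sum = 0.0;
        for (int i = 1; i <= n; i++) sum += Math.sqrt(i);
        return sum;
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Placeholders: replace with the real graph construction and
        // layout optimization calls.
        timer.time("nearest neighbor graph", () -> busyWork(2_000_000));
        timer.time("layout optimization", () -> busyWork(500_000));
        timer.results().forEach(
                (name, s) -> System.out.printf("%s: %.3f s%n", name, s));
    }
}
```

A single run like this includes JIT warm-up, so for a fair comparison it is usually better to repeat the measurement several times or use a benchmarking harness.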

stefanhahmann · 2025-01-07T13:30:07Z

> Can you please try v4 branch?

I could reproduce the figure now. It was not about the branch, but about the standardization of the data that I did before running the UMAP algorithm.
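
For context on why preprocessing can change the picture: rescaling each feature changes the Euclidean distances between points, and therefore the nearest neighbor graph that UMAP is built on. The thread does not say which standardization was applied, so the sketch below assumes a plain column-wise z-score; the class `Standardize` is hypothetical and is not part of Smile.

```java
import java.util.Arrays;

// Hypothetical helper: column-wise z-score standardization.
class Standardize {
    static double[][] zScore(double[][] x) {
        int n = x.length;
        int p = x[0].length;
        double[][] z = new double[n][p];
        for (int j = 0; j < p; j++) {
            double mean = 0.0;
            for (double[] row : x) mean += row[j];
            mean /= n;

            double variance = 0.0;
            for (double[] row : x) variance += (row[j] - mean) * (row[j] - mean);
            double sd = Math.sqrt(variance / n); // population standard deviation

            for (int i = 0; i < n; i++) {
                // Constant columns (sd == 0) are mapped to 0 to avoid dividing by zero.
                z[i][j] = sd > 0.0 ? (x[i][j] - mean) / sd : 0.0;
            }
        }
        return z;
    }

    public static void main(String[] args) {
        double[][] z = zScore(new double[][] {{1.0, 10.0}, {3.0, 10.0}});
        System.out.println(Arrays.deepToString(z)); // prints [[-1.0, 0.0], [1.0, 0.0]]
    }
}
```

On image data such as MNIST, a z-score gives rarely used pixels the same weight as frequently used ones, which is one plausible reason the embedding looked different after standardization.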

On my machine, it takes 140s. The Python implementation (https://github.com/lmcinnes/umap/blob/a012b9d8751d98b94935ca21f278a54b3c3e1b7f/umap/umap_.py#L2339) takes 51s, which includes the whole UMAP embedding. I did not get into the details how the embedding is computed there.

Sorry for being so picky. I really love the smile Java library and like using it for my use cases, but I also need to keep an eye on performance.

kklioss · 2025-01-07T14:28:11Z

The Python implementation uses the single-precision float data type, while Smile uses double. I think that is a major source of the performance difference. Java doesn't have C++-like templates, so it is hard for us to support both data types with unified source code.
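
To put the float versus double difference in concrete terms, the sketch below computes the raw size of the full MNIST feature matrix in both precisions. It assumes the usual flattened layout of 784 values (28 x 28 pixels) per image, which is not stated in this thread, and the class `PrecisionFootprint` is made up for this example. It measures memory footprint only, not runtime.

```java
// Hypothetical helper: raw size of a dense matrix at a given precision.
class PrecisionFootprint {
    static long bytes(long rows, long cols, int bytesPerValue) {
        return rows * cols * bytesPerValue;
    }

    public static void main(String[] args) {
        long rows = 70_000; // full MNIST, as used in this thread
        long cols = 784;    // 28 x 28 pixels per image (assumed flattened layout)
        System.out.println("double: " + bytes(rows, cols, Double.BYTES) + " bytes");
        System.out.println("float:  " + bytes(rows, cols, Float.BYTES) + " bytes");
    }
}
```

That is about 439 MB for double and about 220 MB for float, so every pass over the data moves twice as many bytes with double. How much of the runtime gap this explains would need to be measured.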

haifengl · 2025-01-12T18:39:22Z

This is included in release v4.1.

stefanhahmann mentioned this issue Dec 9, 2024

Create failing UMAPTest to exemplify an issue with UMAP implementation #796

Closed

haifengl closed this as completed Jan 12, 2025
