Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UMAP algorithm does not work as expected #797

Closed
stefanhahmann opened this issue Dec 9, 2024 · 18 comments
Closed

UMAP algorithm does not work as expected #797

stefanhahmann opened this issue Dec 9, 2024 · 18 comments

Comments

@stefanhahmann
Copy link

Describe the bug
When running the UMAP algorithm with sample data, e.g. two distinct 3d point clouds and trying to reduce them to 2 dimensions, the UMAP does only contain the embeddings of the larger of the two point clouds, but not the smaller one.

Expected behavior
The UMAP algorithm result should contain the embedding results for all input data points

Code snippet
#796

Additional context

  • JDK21
  • Smile version 4.0.0
  • Windows
@kklioss
Copy link
Collaborator

kklioss commented Dec 10, 2024

It is because the default spectral layout initialization can be applied only on connected components. I add PCA initialization if there are multiple connected components. Please try the master branch. On the other hand, the existence of multiple connected components implies that a global view of the data cannot be attained with this initialization. In this case, you may want to increase k of knn.

@stefanhahmann
Copy link
Author

Thanks for the quick fix. The unit test contained in #796 works with this modification.
Thanks also for the hint regarding k of knn. It makes complete sense for the artificial example of the unit test.

Are you planning to publish a patch release of smile-core/smile-base?

Unrelated, but maybe interesting: I did a performance comparison between UMAP smile implementation and UMAP tagbio implementation (https://github.com/tag-bio/umap-java). Besides the fact that both implementation get slightly different results (which may be due to the random elements inside the UMAP algorithm), the tagbio-implementation seems to be 2-5 times faster, cf.:

@kklioss
Copy link
Collaborator

kklioss commented Dec 11, 2024

Have you checked if we use the same number of iterations for both implementations?

@stefanhahmann
Copy link
Author

Yes. The number of iterations can be configured for both implementations and I have set in both cases to 500.

I am also surprised how different the results look for the different implementations on the same dataset. I would not have expected that. However, this may be explained due to different initilization strategies / different random seeds.

  • Sklearn (Python):
    grafik

  • TagBio (Java):
    grafik

  • Smile (Java)
    grafik

@kklioss
Copy link
Collaborator

kklioss commented Dec 12, 2024

Significant time of Smile's UMAP is spent on spectral layout for initialization as it requires eigen value decomposition. It looks like that tagbio doesn't support spectral initialization. This may explain the difference on speed and final embedding.

@stefanhahmann
Copy link
Author

Significant time of Smile's UMAP is spent on spectral layout for initialization as it requires eigen value decomposition. It looks like that tagbio doesn't support spectral initialization. This may explain the difference on speed and final embedding.

Yes, indeed. The tagbio implementation does not support the spectral initialization. It also does not support PCA initialization, which smile now uses, when there are multiple connected components.
And indeed this does shorten the time before doing the actual 500 iterations. The iterations for the optimization seem to take about the same time for both implementations, but the initialization takes longer for the smile implementation, which can be easily understood, since the tagbio initialization uses just a random initilization approach.
In

@stefanhahmann
Copy link
Author

I did some further profiling. As done in https://github.com/haifengl/smile/blob/master/core/src/test/java/smile/manifold/UMAPTest.java I did some tests with the MNIST dataset, i.e. I run UMAP umap = UMAP.of(MNIST.x, 7); and compared it to equivalent tests with the tagbio implementation and the UMAP python implementation.

  • It took 18:36 minutes in total

  • 18:10 minutes of this were used for initializing the graph NearestNeighborGraph cc = NearestNeighborGraph.largest(graph);

  • The actual optimization of the layout took about 25s

  • The UMAP Python implementation took ~60s and came close to what is stated in the docs (https://umap-learn.readthedocs.io/en/latest/reproducibility.html).
    python_60s

  • The tagbio implementation takes ~45s and reaches a qualitatively similar (but numerically different) result:
    tagbio_45s

  • The results of the smile implementation seem to differ substantially:
    smile_1116s

@kklioss
Copy link
Collaborator

kklioss commented Dec 12, 2024

It takes only 4.34 seconds on my pc. And the result also looks different from yours. I don't know why it is so slow in your case.

umap

@stefanhahmann
Copy link
Author

stefanhahmann commented Dec 12, 2024

It takes only 4.34 seconds on my pc. And the result also looks different from yours. I don't know why it is so slow in your case.

Ah, I see the reason. I did not run my test with https://github.com/haifengl/smile/blob/3a8b7f32572e45d496cbea5d5ba32ad90b8efd73/shell/src/universal/data/mnist/mnist2500_X.txt but instead with the original dataset with 70.000 points. Sorry for not making this clear in the first place. But to be fair I ran this test with the same amount of data points for all 3 considered implementations (smile, tagbio, python). Thus, it seems to me that there is something with quadratic complexity in the smile implementation, probably in the NearestNeighborGraph cc = NearestNeighborGraph.largest(graph); part.

If I run with minist2500, it also takes ~4.4s on my machine. I am surprised that I do not get the same layout, even though I even set MathEx.setSeed(19650218); as in the unit test (https://github.com/haifengl/smile/blob/master/core/src/test/java/smile/manifold/UMAPTest.java) before running it. I also chose k=7 and defaults otherwise. The layout on my side looks a bit different (I did not color the classes though):
grafik

In the result that you get, it would be hard to distinguish points of different classes, if they were not colored according to the class. Compared with the python implementation, it looks that this should actually be possible (https://umap-learn.readthedocs.io/en/latest/reproducibility.html)

grafik

@kklioss
Copy link
Collaborator

kklioss commented Dec 12, 2024

Can you please try to increase JVM memory like -Xmx=16G?

@stefanhahmann
Copy link
Author

I pulled the latest commits (8289526), re-built, and tried with -Xmx=16G, but it is still slow with the mnist_70000 dataset:
mnist_70000.zip

The overall performance with the 2500 mnist dataset has now decreased from 4.6s to 15.1s on my machine.

@kklioss
Copy link
Collaborator

kklioss commented Dec 22, 2024

We have added a series of optimization so that UMAP can finish in 4 minutes with full size MNIST data on my laptop. We will look into the difference of embedding in the next.

@kklioss
Copy link
Collaborator

kklioss commented Dec 25, 2024

This is the latest embedding results on full size MNIST data. It takes less than 3 minutes on my PC.
umap

@stefanhahmann
Copy link
Author

This is the latest embedding results on full size MNIST data. It takes less than 3 minutes on my PC.

The embedding looks a lot better than before. However, I cannot reproduce it with the MINIST 70_000 dataset. Which initialization did you use?

I used:
UMAP.of( data, 15);
and get this result:
grafik

The performance improvement is also good to read! It is a bit of a pity that it is still 3-4 times slower than the original python implementation and the other existing java implementation.

@kklioss
Copy link
Collaborator

kklioss commented Jan 7, 2025

Can you please try v4 branch?

double[][] x = Read.csv("mnist_70000.csv").toArray();
long start = System.currentTimeMillis();
var map = UMAP.of(x, 15);
long end = System.currentTimeMillis();
System.out.format("UMAP takes %.2f seconds\n", (end - start) / 1000.0);
ScatterPlot.of(map, '@', Color.ORANGE).canvas().window();

The overall process takes 210s on my machine. UMAP algorithm itself takes about 60s, which is on par with python implementation. The rest of time is on nearest neighbor graph construction. I don't know if python implementation includes this part of time into their claim.

@stefanhahmann
Copy link
Author

Can you please try v4 branch?

I could reproduce the figure now. It was not about the branch, but about the standardization of the data that I did before running the UMAP algorithm.

On my machine, it takes 140s. The Python implementation (https://github.com/lmcinnes/umap/blob/a012b9d8751d98b94935ca21f278a54b3c3e1b7f/umap/umap_.py#L2339) takes 51s, which includes the whole UMAP embedding. I did not get into the details how the embedding is computed there.

Sorry for being so picky. I really love the smile java library and like using it for my use cases, but I also need to have an eye on performance as well.

@kklioss
Copy link
Collaborator

kklioss commented Jan 7, 2025

Python implementation uses single precision float data type while Smile uses double. I think that it is a major source of performance difference. Java doesn't have c++ like templates. It is hard for us to support both data types with unified source code.

@haifengl
Copy link
Owner

This is included in release v4.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants