UMAP algorithm does not work as expected #797
Comments
It is because the default spectral layout initialization only works when the k-NN graph forms a single connected component. I have added PCA initialization for the case where there are multiple connected components; please try the master branch. On the other hand, the existence of multiple connected components implies that a global view of the data cannot be attained with this initialization. In this case, you may want to increase the k of the k-NN graph.
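A minimal sketch of what increasing k could look like, using synthetic two-cloud data as a stand-in. The exact factory method and result field vary across Smile versions (e.g. Smile 2.x exposes `UMAP.of(data, k)` with a public `coordinates` field, newer releases take an options object), so treat this as an illustration rather than the definitive API:

```java
import java.util.Random;
import smile.manifold.UMAP;

public class UmapLargerK {
    public static void main(String[] args) {
        // Synthetic stand-in data: two well separated 3-D Gaussian clouds,
        // exactly the situation that produces a disconnected k-NN graph.
        Random rnd = new Random(0);
        double[][] data = new double[600][3];
        for (int i = 0; i < 600; i++) {
            double offset = i < 500 ? 0.0 : 50.0;   // 500 points near 0, 100 near 50
            for (int j = 0; j < 3; j++) data[i][j] = offset + rnd.nextGaussian();
        }

        // A larger k links more distant regions of the data, making it less likely
        // that the k-NN graph splits into multiple connected components (the case
        // that triggers the PCA fallback instead of the spectral initialization).
        int k = 30;   // the default is around 15

        UMAP umap = UMAP.of(data, k);
        System.out.println("embedded points: " + umap.coordinates.length);
    }
}
```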
Have you checked if we use the same number of iterations for both implementations?
Yes. The number of iterations can be configured for both implementations, and I have set it to 500 in both cases. I am also surprised how different the results look for the two implementations on the same dataset; I would not have expected that. However, this may be explained by different initialization strategies / different random seeds.
A significant share of Smile's UMAP runtime is spent on the spectral layout used for initialization, as it requires an eigenvalue decomposition. It looks like tagbio doesn't support spectral initialization. This may explain the difference in speed and in the final embedding.
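For reference, a minimal sketch of why the spectral initialization is expensive (plain Java, no Smile API assumed; the adjacency matrix here is just a stand-in for the fuzzy k-NN graph UMAP builds internally):

```java
// Spectral initialization needs the bottom eigenvectors of the normalized graph
// Laplacian L = I - D^{-1/2} A D^{-1/2}, where A is the (weighted) k-NN adjacency
// matrix and D its diagonal degree matrix.
static double[][] normalizedLaplacian(double[][] A) {
    int n = A.length;
    double[] degree = new double[n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) degree[i] += A[i][j];

    double[][] L = new double[n][n];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double norm = Math.sqrt(degree[i] * degree[j]);
            double a = norm > 0 ? A[i][j] / norm : 0.0;
            L[i][j] = (i == j ? 1.0 : 0.0) - a;
        }
    }
    // A dense eigendecomposition of L costs roughly O(n^3); iterative/sparse
    // solvers help, but the step still dominates for large n (e.g. 70,000 points).
    return L;
}
```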
Yes, indeed. The tagbio implementation does not support spectral initialization. It also does not support PCA initialization, which Smile now uses when there are multiple connected components.
I did some further profiling. As done in https://github.com/haifengl/smile/blob/master/core/src/test/java/smile/manifold/UMAPTest.java, I ran some tests with the MNIST dataset.
Ah, I see the reason. I did not run my test with https://github.com/haifengl/smile/blob/3a8b7f32572e45d496cbea5d5ba32ad90b8efd73/shell/src/universal/data/mnist/mnist2500_X.txt but instead with the original dataset of 70,000 points. Sorry for not making this clear in the first place. To be fair, though, I ran this test with the same number of data points for all three implementations considered (Smile, tagbio, Python). Thus, it seems to me that there is something with quadratic complexity in the Smile implementation, probably in the …
If I run with mnist2500, it also takes ~4.4s on my machine. I am surprised that I do not get the same layout, even though I even set … In the result that you get, it would be hard to distinguish points of different classes if they were not colored according to their class. Compared with the Python implementation, it looks like this should actually be possible (https://umap-learn.readthedocs.io/en/latest/reproducibility.html).
Can you please try increasing the JVM memory, e.g. with -Xmx16g?
I pulled the latest commits (8289526), rebuilt, and tried with -Xmx16g, but it is still slow on the mnist_70000 dataset. The overall runtime on the 2500-point MNIST dataset has now increased from 4.6s to 15.1s on my machine.
We have added a series of optimizations so that UMAP can now finish in 4 minutes on the full-size MNIST data on my laptop. We will look into the difference in the embeddings next.
Can you please try the v4 branch?
The overall process takes 210s on my machine. The UMAP algorithm itself takes about 60s, which is on par with the Python implementation. The rest of the time is spent on nearest neighbor graph construction. I don't know if the Python implementation includes this part in the timings they report.
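To illustrate why graph construction can dominate, here is a sketch of the naive baseline. This is not Smile's actual implementation (which uses its own neighbor-search structures); it only shows that a brute-force k-NN graph needs O(n²) distance evaluations:

```java
// Brute-force k-NN graph: for 70,000 points this means ~4.9 billion distance
// computations, which easily outweighs the embedding optimization itself.
static int[][] bruteForceKnn(double[][] x, int k) {
    int n = x.length;
    int[][] neighbors = new int[n][k];
    for (int i = 0; i < n; i++) {
        double[] dist = new double[n];
        for (int j = 0; j < n; j++) {              // n distance evaluations per point
            double s = 0;
            for (int f = 0; f < x[i].length; f++) {
                double diff = x[i][f] - x[j][f];
                s += diff * diff;
            }
            dist[j] = s;
        }
        // Take the k closest indices, excluding the point itself. A full argsort
        // keeps the sketch short; a partial selection would be faster.
        Integer[] idx = new Integer[n];
        for (int j = 0; j < n; j++) idx[j] = j;
        java.util.Arrays.sort(idx, java.util.Comparator.comparingDouble(j -> dist[j]));
        int filled = 0;
        for (int j = 0; j < n && filled < k; j++) {
            if (idx[j] != i) neighbors[i][filled++] = idx[j];
        }
    }
    return neighbors;
}
```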
I could reproduce the figure now. It was not about the branch, but about the standardization of the data that I did before running the UMAP algorithm. On my machine, it takes 140s. The Python implementation (https://github.com/lmcinnes/umap/blob/a012b9d8751d98b94935ca21f278a54b3c3e1b7f/umap/umap_.py#L2339) takes 51s, which includes the whole UMAP embedding; I did not get into the details of how the embedding is computed there. Sorry for being so picky. I really love the Smile Java library and like using it for my use cases, but I also need to keep an eye on performance.
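A minimal sketch of a per-column z-score standardization like the preprocessing mentioned above, in plain Java (Smile also ships feature-scaling utilities, but their package and class names differ between versions, so none of its API is assumed here):

```java
// Zero-mean, unit-variance scaling per feature column, applied to the data
// matrix before running UMAP.
static double[][] standardize(double[][] x) {
    int n = x.length, p = x[0].length;
    double[] mean = new double[p];
    double[] variance = new double[p];
    for (double[] row : x)
        for (int j = 0; j < p; j++) mean[j] += row[j] / n;
    for (double[] row : x)
        for (int j = 0; j < p; j++) variance[j] += (row[j] - mean[j]) * (row[j] - mean[j]) / n;

    double[][] z = new double[n][p];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++)
            z[i][j] = variance[j] > 0 ? (x[i][j] - mean[j]) / Math.sqrt(variance[j]) : 0.0;
    return z;
}
```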
The Python implementation uses the single-precision float data type while Smile uses double. I think that is a major source of the performance difference. Java doesn't have C++-like templates, so it is hard for us to support both data types from a unified source code.
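A rough back-of-the-envelope illustration of what the precision difference means at full MNIST scale (the figures cover the raw 70,000 × 784 feature matrix only, ignoring the graph and the embedding itself):

```java
// Memory footprint of the feature matrix alone, double vs. single precision.
long cells = 70_000L * 784L;
System.out.printf("double: ~%.0f MiB%n", cells * 8 / (1024.0 * 1024)); // ~419 MiB
System.out.printf("float:  ~%.0f MiB%n", cells * 4 / (1024.0 * 1024)); // ~209 MiB
// Halving the working set roughly doubles what fits in cache and halves the
// memory bandwidth needed, independent of any algorithmic differences.
```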
This is included in release v4.1.
Describe the bug
When running the UMAP algorithm on sample data, e.g. two distinct 3D point clouds, and trying to reduce them to 2 dimensions, the UMAP result only contains the embeddings of the larger of the two point clouds, but not of the smaller one.
Expected behavior
The UMAP result should contain embeddings for all input data points.
Code snippet
#796
Additional context