[QST]: Leiden clustering before and after 23.02 #4529

Open · 2 tasks done
wdnlotm opened this issue Jul 9, 2024 · 10 comments
Labels: question (Further information is requested)

wdnlotm commented Jul 9, 2024

What is your question?

Hi,
I use Leiden clustering a lot, and Louvain occasionally. I am seeing big differences in results between 23.02 and newer versions. My dataset has 80 million edges and 6 million nodes. The most notable difference: with 23.02 and older, Leiden produces about 20 clusters, while versions newer than 23.02 sometimes produce 200K clusters for the same dataset and settings.
I now need to move to Dask cuGraph, but this large discrepancy is keeping me from moving to the newer versions; a rough sketch of the multi-GPU setup I am aiming for is included below.
Any idea?
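
The sketch below assumes a single-node LocalCUDACluster; the file name and column names are placeholders, and the exact cugraph.dask signatures for Louvain/Leiden should be checked against the installed version.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import cugraph
import cugraph.dask as dask_cugraph
import cugraph.dask.comms.comms as Comms

# One Dask worker per local GPU, plus cuGraph's communicator.
cluster = LocalCUDACluster()
client = Client(cluster)
Comms.initialize(p2p=True)

# Edge list as a distributed cuDF DataFrame (file name and column names are placeholders).
edges = dask_cudf.read_csv("knn_edges.csv", names=["src", "dst", "weight"])

G = cugraph.Graph(directed=False)
G.from_dask_cudf_edgelist(edges, source="src", destination="dst", edge_attr="weight")

# Multi-GPU Louvain; recent releases also expose a multi-GPU Leiden
# (cugraph.dask.leiden) with a similar signature.
parts, modularity = dask_cugraph.louvain(G, resolution=0.65)

Comms.destroy()
client.close()
cluster.close()
```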
Thank you.

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open issues and have found no duplicates for this question
wdnlotm added the question (Further information is requested) label on Jul 9, 2024
@alexbarghi-nv (Member)

23.02 is a fairly old version of cuGraph. I would start by moving to a newer version of cuGraph (the latest is 24.06) and seeing if that resolves your issues.

Also @ChuckHastings I think we've had a similar issue reported in the past? What was the resolution there?

@ChuckHastings (Collaborator)

Our original implementation of Leiden had a number of functionality issues (in actuality it was not much more than Louvain) as well as significant scaling issues. Leiden was completely rewritten as of version 23.04, so you have probably observed that major shift. The rewritten version is much better, although some minor issues persisted until earlier this year. As of 24.06 we have no known outstanding issues with Leiden, so I would suggest using 24.06 or later.

If you have examples where our Leiden implementation in 24.06 generates answers that you are concerned about, please provide some details (ideally a sample data set, parameters and what you might expect the result to be) and we will be happy to investigate.

wdnlotm (Author) commented Jul 10, 2024

I am testing a few versions.
I have an edge list from a kNN graph (82 million edges, 5.5 million vertices).
I load it into G = cugraph.MultiGraph(directed=False) and run Leiden or Louvain via
parts, modularity_score = cugraph.louvain(G, max_level=500, resolution=0.65) or
parts, modularity_score = cugraph.leiden(G, max_iter=500, resolution=0.65, random_state=123)
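
For reference, the full workflow looks roughly like the sketch below (the file name and column names are placeholders; the algorithm calls are the ones quoted above):

```python
import cudf
import cugraph

# Edge list from the kNN graph; file name and column names are placeholders.
edges = cudf.read_csv("knn_edges.csv", names=["src", "dst", "weight"])

# Undirected multigraph, as described above.
G = cugraph.MultiGraph(directed=False)
G.from_cudf_edgelist(edges, source="src", destination="dst", edge_attr="weight")

# Louvain and Leiden with the parameters quoted above.
louvain_parts, louvain_mod = cugraph.louvain(G, max_level=500, resolution=0.65)
leiden_parts, leiden_mod = cugraph.leiden(G, max_iter=500, resolution=0.65, random_state=123)

# Count how many clusters each algorithm produced.
print("Louvain clusters:", louvain_parts["partition"].nunique())
print("Leiden clusters:", leiden_parts["partition"].nunique())
```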

I use Apptainer, by the way.
24.06 Leiden produced 1741523 clusters
24.06 Louvain produced 11 (eleven) clusters

24.02 Leiden produced 10 clusters.
Great!! But when I reran the same code, I got 11 clusters. That is also fine, but not consistent. After one more run, I got 141 clusters.
24.02 Louvain produced 11 (eleven) clusters

23.02 Leiden produced 19 clusters.
23.02 Louvain produced 20 clusters

In fact, I like the results from 23.02. They are comparable to what I could get from the leidenalg library, which is s l o w.

Thanks.

@ChuckHastings (Collaborator)

Leiden circa 23.02 was really Louvain with just a tiny bit of extra logic. It didn't actually implement much of the Leiden algorithm at all.

The big issue with 23.06 through 24.02 was inconsistency. We identified and corrected several bugs, which made results more consistent. Because we operate in parallel, and Louvain/Leiden are greedy approximation algorithms, we can still be affected by numerical instability: sums of values are accumulated in parallel in a non-deterministic order, so we can see minor variations in the results on larger graphs. We have seen several cases where the best choice in the greedy algorithm is really a tie, and because of that numerical variability a different member of the tie can appear to be the clear winner on different runs. This is a known source of variability in the results that we haven't tried to address, but we have addressed many of the other inconsistencies.
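
As a toy illustration of that last point (plain Python, not cuGraph code): floating-point addition is not associative, so accumulating the same contributions in a different order yields slightly different totals, which is enough to flip a near-tie in a greedy step.

```python
import random

random.seed(0)

# 100,000 modularity-like contributions spanning many orders of magnitude.
vals = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8) for _ in range(100_000)]

total_one_order = sum(vals)                               # one accumulation order
total_other_order = sum(random.sample(vals, len(vals)))   # a different order

print(total_one_order == total_other_order)     # typically False
print(abs(total_one_order - total_other_order)) # small but nonzero difference
```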

Producing 1.7 million clusters seems like a bug of some kind. Is the graph you are using something you can share? I can try to find comparable size graphs and see if I can get this type of behavior, but starting from your graph will be the easiest path.

wdnlotm (Author) commented Jul 10, 2024

I can share my graph and code. Give me an hour or two.
Thanks.

wdnlotm (Author) commented Jul 10, 2024

I put the files here. Thank you for your help!!
https://drive.google.com/drive/folders/10lziXlXut-ZGcXDlwbk6SxqPB9OIunSS?usp=sharing

@ChuckHastings (Collaborator)

It may take some time to investigate, but we will keep you informed on progress.

ChuckHastings self-assigned this on Jul 10, 2024
@ChuckHastings (Collaborator)

Just added this to our development plan for the 24.10 release. We should at least get some analysis done soon.

wdnlotm (Author) commented Aug 19, 2024

If you don't mind, I can add another, much smaller edge list that shows similar results.
https://drive.google.com/drive/folders/10lziXlXut-ZGcXDlwbk6SxqPB9OIunSS?usp=sharing
Go into the smaller_knn folder.

Results using 24.06:
Leiden: 35,387 clusters
Louvain: 15 clusters.

Thanks.

@ChuckHastings (Collaborator)

Smaller graphs are always easier to debug. Thanks!
