[QST]: Benchmarking cugraph.leiden() #4488

Closed
2 tasks done
wolfram77 opened this issue Jun 14, 2024 · 3 comments

Comments

@wolfram77

What is your question?

Hello @afender, I want to benchmark the runtime of cugraph.leiden(). A benchmark of the algorithm should consider only the runtime of the algorithm itself and exclude the time spent on validation and initial memory allocation. Timing the cugraph call directly includes all of these. Is it possible to get an "algorithm runtime" from the call to cugraph.leiden()?

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open issues and have found no duplicates for this question
@wolfram77 wolfram77 added the question Further information is requested label Jun 14, 2024
@ChuckHastings
Collaborator

@rlratzel should have a better answer for your question. Alex Fender has moved on to our cuopt effort and doesn't work on this software anymore.

I'm fuzzy on the performance overheads of the python API - where they exist and if/how you can avoid them. I know at one time we had (and perhaps still have) some lazy computations that occur on the first call to an algorithm. I believe there is a way to avoid those. @rlratzel should be able to clarify.

Expensive validation steps are enabled in the C/C++ layer by passing a parameter called do_expensive_check, which is set to False by default. My quick glance at the latest Python code for Leiden indicates there is no mechanism for you to override this. So the only error checks that occur are fast ones (whether you passed in an edge weights pointer is, I think, the only validation that occurs in the Leiden algorithm).

As implemented, memory allocation for the result is done inside of Leiden. That allocation does not include initialization; we copy the result into uninitialized memory. So the performance overhead of allocating the result should be minimal. All other memory allocation done inside of Leiden is dynamic, based on the progress of the clustering algorithm. If you configure RMM to use the pool allocator, then memory allocations should be pretty fast. Perhaps @rlratzel can clarify how to do that from Python.

@rlratzel
Contributor

Hi @wolfram77 , I don't know if this is acceptable, but I think the best way to benchmark only the algorithm implementation and eliminate any additional allocations/conversions/input checks done in the cugraph python library would be to benchmark leiden from the C++ library in C++. Because the cugraph python library calls the libcugraph C++ library implementation, you'd be benchmarking as close to the algorithm implementation as possible (without modifying C++ source code to isolate further beyond the API).

If C++ isn't an option, you could benchmark leiden from our lower-level python library (pylibcugraph.leiden). The cugraph python library wraps pylibcugraph and adds various conveniences and additional checks which you'd want to avoid in the benchmark you're describing, so pylibcugraph.leiden might be the next best function to benchmark after C++.
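One way to see how much the convenience layer costs is to time the same workload through both entry points and compare. The sketch below is illustrative only: `best_time`, `wrapper_overhead`, `high_level`, and `low_level` are hypothetical names. In practice the two callables would wrap cugraph.leiden and pylibcugraph.leiden on the same graph; their exact arguments are not shown here, and both are assumed to block until the GPU work has finished.

```python
import time


def best_time(fn, repeats=5):
    """Smallest wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def wrapper_overhead(high_level, low_level, repeats=5):
    """Estimate the extra time the high-level call spends over the low-level one.

    Returns (high_seconds, low_seconds, difference_seconds).
    """
    high = best_time(high_level, repeats)
    low = best_time(low_level, repeats)
    return high, low, high - low


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without a GPU. The sleep
    # mimics extra checks and conversions done by a wrapper layer.
    def low_level():
        return sum(range(50_000))

    def high_level():
        time.sleep(0.01)
        return low_level()

    high, low, diff = wrapper_overhead(high_level, low_level)
    print(f"high {high:.6f} s, low {low:.6f} s, overhead {diff:.6f} s")
```

Taking the minimum over several calls is a design choice that reduces the influence of one-off delays; a median would work as well.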

Finally, configuring RMM to use pool allocation might also be something to consider, as @ChuckHastings mentioned. You can read about how to do that from python here.

@wolfram77
Author

Thanks @ChuckHastings and @rlratzel

As suggested, I configured RMM to use pool allocation (code below). This seems to help a lot.

import rmm
# Pool allocator on top of the CUDA memory resource; 2**36 bytes = 64 GiB initial pool
pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource(), initial_pool_size=2**36)
rmm.mr.set_current_device_resource(pool)

I also discard the runtime of the first call to cugraph.leiden(). This also helps.
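The warm-up approach described above can be sketched as a small timing helper. This is a minimal sketch, not cuGraph code: `time_call`, `summarize`, `warmup`, and `repeats` are hypothetical names, and the callable passed in (for example `lambda: cugraph.leiden(G)`) is assumed to block until the GPU work has finished, so that wall-clock time reflects the full run.

```python
import statistics
import time


def time_call(fn, warmup=1, repeats=5):
    """Time fn(), discarding the first `warmup` calls.

    Returns the per-call wall-clock times in seconds for the
    `repeats` measured calls.
    """
    # Warm-up calls absorb one-time costs (lazy initialization,
    # first allocations); their timings are thrown away.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (minimum, median) of the measured timings."""
    return min(timings), statistics.median(timings)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU.
    # With cuGraph one would pass something like
    # `lambda: cugraph.leiden(G)` instead (assumed usage).
    def workload():
        return sum(i * i for i in range(100_000))

    best, median = summarize(time_call(workload, warmup=1, repeats=5))
    print(f"min {best:.6f} s, median {median:.6f} s")
```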

Below are the runtimes we observed for cuGraph Leiden (including other comparisons).
[image]

cuGraph Leiden fails to run on the arabic-2005, uk-2005, webbase-2001, it-2004, and sk-2005 graphs due to out of memory issues. We use an NVIDIA A100 GPU.
