Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA]: nx-cugraph should intercept built-in constructors like from_pandas_edgelist if NETWORKX_BACKEND_PRIORITY=cugraph #4456

Open
2 tasks done
beckernick opened this issue May 31, 2024 · 5 comments
Assignees
Labels
feature request New feature or request

Comments

@beckernick
Copy link
Member

beckernick commented May 31, 2024

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

When we use nx-cugraph, we currently need to create the NetworkX graph on the CPU regardless of whether every algorithm we intend to use is supported by the cuGraph backend. As a result, we pay a non-trivial performance penalty converting between CPU and GPU graphs.

The new caching mechanism configurable via CACHE_CONVERTED_GRAPH=True was designed to address this problem, making it possible to only pay this cost once per graph if you're going to run multiple algorithms.

But it would be great to avoid this cost in the first place by dispatching on the graph construction operators in addition to the algorithms. In the example below, we spend significant time in from_pandas_edgelist and _convert_graph (the latter of which is only a one-time cost if we use caching).

If I've already committed to using the cuGraph backend as the top priority backend, I'd ideally just create the graph on the GPU and only pay the CPU/GPU conversion cost if I need to fallback to the CPU.

# !wget https://data.rapids.ai/cugraph/datasets/cit-Patents.csv

%env NETWORKX_BACKEND_PRIORITY=cugraph

import pandas as pd
import networkx as nx

df = pd.read_csv("cit-Patents.csv", sep=" ", names=["src", "dst"], dtype="int32")
%%snakeviz

G = nx.from_pandas_edgelist(df.head(1000000), source="src", target="dst")
pr = nx.pagerank(G, alpha=0.9)
Screenshot 2024-05-31 at 10 03 40 AM

But cuGraph supports from_pandas_edgelist and it's much faster (100ms vs 8s in this case):

%timeit -n3 -r3 G_gpu = cugraph.from_pandas_edgelist(df.head(1000000), source="src", destination="dst")
71.2 ms ± 11.8 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)

Describe your ideal solution

The following code should dispatch to the cuGraph backend for from_pandas_edgelist in addition to pagerank.

# !wget https://data.rapids.ai/cugraph/datasets/cit-Patents.csv

%env NETWORKX_BACKEND_PRIORITY=cugraph

import pandas as pd
import networkx as nx

df = pd.read_csv("cit-Patents.csv", sep=" ", names=["src", "dst"], dtype="int32")

G = nx.from_pandas_edgelist(df.head(1000000), source="src", target="dst")
pr = nx.pagerank(G, alpha=0.9)

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@beckernick beckernick added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 31, 2024
@beckernick
Copy link
Member Author

Seems like this would be an expansion of networkx/networkx#6876

@eriknw
Copy link
Contributor

eriknw commented Jun 3, 2024

Thanks for raising this issue! I think this behavior ought to be controllable via a different environment variable and/or config option... and I do think this is important.

Today, you can add backend="cugraph" to e.g. nx.from_pandas_edgelist, but this is not longer "zero code change", so we'd like to make this story better.

@BradReesWork BradReesWork removed the ? - Needs Triage Need team to review and classify label Jun 13, 2024
@eriknw
Copy link
Contributor

eriknw commented Jun 17, 2024

I have begun to explore a solution to this in networkx/networkx#7502

@rlratzel rlratzel self-assigned this Aug 15, 2024
@rlratzel
Copy link
Contributor

Update: we've been discussing this issue quite a lot, and we're currently working through some solutions for it to include in 24.10.

We've also defined some criteria that we think needs to be supported by any solution in order to close this issue. We think these were implied in the issue, but we're explicitly listing them to be sure. @beckernick please don't hesitate to update this as needed:

Requirements:

  • Does not break existing work flows (i.e. truly zero-code-change)
  • Works for the "scipy example" (the workflow presented at the SciPy 2024 talk on Accelerated Pandas and NetworkX - TODO: add link)
  • Works for dynamic graph manipulation and associated algorithms

@rlratzel
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants