Add distributed feature info for distributed training #7715
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #7715      +/-   ##
==========================================
- Coverage   90.17%   89.73%    -0.44%
==========================================
  Files         456      456
  Lines       26397    26516      +119
==========================================
- Hits        23803    23794        -9
- Misses       2594     2722      +128
... and 21 files with indirect coverage changes.
Any chance we can rebase this one on the other PRs? Would make reviewing easier.
I think this looks good, but I cannot merge these with failing tests. Can you fix these on your end?
# Meta information related to partition and graph store info:
self.meta: Optional[Dict[Any, Any]] = None
# Partition labels:
self.labels: Optional[Union[Tensor, Dict[EdgeType, Tensor]]] = None
Not sure why we need these. Can you clarify?
`self.meta` holds the partition information (e.g. `num_partition`, `is_hetero`, `edge_type`, ...), which simplifies later code: you don't need to check whether the stored features are a dict or not, etc. `self.labels` holds the label information (`y`) for the whole dataset, which also simplifies later code; otherwise you would need to pass one more argument from the top-level functions all the way down to the low-level ones.
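For illustration, a minimal sketch of how such a meta dict can simplify downstream code; the key names (`num_partitions`, `partition_idx`, `is_hetero`, `edge_types`) are assumptions, not the exact keys used in this PR:

```python
# Hypothetical sketch; key names are assumptions for illustration only.
from typing import Any, Dict

meta: Dict[str, Any] = {
    'num_partitions': 2,
    'partition_idx': 0,
    'is_hetero': True,
    'edge_types': [('paper', 'cites', 'paper')],
}

def is_hetero(store) -> bool:
    # One dictionary lookup instead of re-checking whether the stored
    # features are a plain Tensor or a Dict at every call site:
    return store.meta['is_hetero']
```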
Thanks @ZhengHongming888. I think this looks good. The one issue I have with this PR is the `is_node_feat` argument in the RPC calls. I am not sure why we don't pass the attribute name directly. That way, we are not limited to two features only, and we get a much more general RPC feature store. I don't want to break your internal code base by making this change, but it might be good to revisit.
Good suggestion! Will fix it later. Thanks.
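To make the suggestion concrete, a hedged sketch in Python; the function names and signatures below are illustrative, not the actual RPC API in this PR:

```python
import torch

# Hypothetical helpers; not the actual PR signatures.

def rpc_lookup_by_flag(store, ids: torch.Tensor, is_node_feat: bool):
    # A boolean flag can only ever distinguish two feature kinds:
    name = 'x' if is_node_feat else 'edge_attr'
    return store.get_local(name, ids)

def rpc_lookup_by_name(store, ids: torch.Tensor, attr_name: str):
    # Passing the attribute name directly supports any stored attribute
    # (e.g. 'x', 'edge_attr', 'y', ...), giving a more general feature store:
    return store.get_local(attr_name, ids)
```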
This code is part of the overall distributed training support for PyG. (This PR replaces #7678.)
This PR was originally designed around the DistFeature class, which has now been merged into LocalFeatureStore:
- Add partition/RPC info to LocalFeatureStore, such as num_partition, partition_idx, feature_pb (the feature partition book), partition_meta, RpcRouter, etc.
- Add one new class (RpcCallFeatureLookup) to do the actual remote RPC feature lookup work.
- Add one API (.lookup_features()) to look up features on the local node and on remote nodes, based on sampled global node/edge IDs, using the torch RPC APIs (a sketch follows this list).
- Add one unit test under the test/distributed/ folder to verify local/remote feature lookup.
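A minimal sketch of the local/remote routing pattern behind `.lookup_features()`, built on `torch.distributed.rpc`; the attribute and helper names (`partition_book`, `partition_idx`, `get_local`, `feat_dim`) are assumptions for illustration, not the actual LocalFeatureStore internals:

```python
import torch
from torch.distributed import rpc

_local_store = None  # assumed to be set on each worker at startup

def _remote_lookup(ids: torch.Tensor) -> torch.Tensor:
    # Runs on the worker that owns these ids; reads its local shard.
    return _local_store.get_local(ids)

def lookup_features(store, global_ids: torch.Tensor) -> torch.Tensor:
    out = torch.empty(global_ids.numel(), store.feat_dim)
    pending = []
    for part in range(store.num_partitions):
        # The partition book maps each global id to its owning partition:
        mask = store.partition_book[global_ids] == part
        ids = global_ids[mask]
        if ids.numel() == 0:
            continue
        if part == store.partition_idx:
            out[mask] = store.get_local(ids)  # local partition: direct read
        else:
            # Remote partition: asynchronous RPC to the owning worker.
            fut = rpc.rpc_async(f'worker{part}', _remote_lookup, args=(ids,))
            pending.append((mask, fut))
    for mask, fut in pending:
        out[mask] = fut.wait()
    return out
```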
We have now combined the local feature store and the distributed feature properties (partition info and RPC remote-access APIs) into one feature store. Later on, we will rename the class from LocalFeatureStore to PartitionFeatureStore in another PR.
If you have any comments, please let us know. Thanks.