Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add distributed feature info for distributed training #7715

Merged
merged 46 commits into from
Aug 7, 2023

Conversation

ZhengHongming888
Copy link
Contributor

This code belongs to the part of the whole distributed training for PyG. (This PR is to replace #7678)

This PR originally designed for the DistFeature class and now merged with LocalFeatureStore -

Add partition/rpc info into LocalFeatureStore like num_partition, partition_idx, feature_pb (feature_partitionbook), partition_meta, RpcRouter, etc
Add one new class (RpcCallFeatureLookup) to do real remote rpc feature_lookup work
Add one api ( .lookup_features() ) to do feature lookup in local node and remote nodes based on sampled global node ids/edge ids based on torch rpc apis
one unit test to verify the function of local/remote feature lookup under .test/distributed/. folder
Now we combined the local feature store and distributed feature properties (partition info and rpc remote access apis) into one FeatureStore. later on we will change the class name from LocalFeatureStore into PartitionFeatureStore with another PR.

Any comments please let us know. thanks.

@codecov
Copy link

codecov bot commented Jul 9, 2023

Codecov Report

Merging #7715 (e256e0b) into master (ca5311c) will decrease coverage by 0.44%.
Report is 1 commits behind head on master.
The diff coverage is 68.70%.

@@            Coverage Diff             @@
##           master    #7715      +/-   ##
==========================================
- Coverage   90.17%   89.73%   -0.44%     
==========================================
  Files         456      456              
  Lines       26397    26516     +119     
==========================================
- Hits        23803    23794       -9     
- Misses       2594     2722     +128     
Files Changed Coverage Δ
torch_geometric/data/hetero_data.py 88.72% <ø> (-0.03%) ⬇️
torch_geometric/distributed/local_graph_store.py 91.20% <57.89%> (-8.80%) ⬇️
torch_geometric/distributed/local_feature_store.py 80.25% <69.15%> (-9.51%) ⬇️
torch_geometric/distributed/rpc.py 87.62% <100.00%> (+6.18%) ⬆️
torch_geometric/typing.py 43.91% <100.00%> (-4.39%) ⬇️

... and 21 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@mananshah99
Copy link
Contributor

Any chance we can rebase this one on the other PRs? Would make reviewing easier.

Copy link
Member

@rusty1s rusty1s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, but I cannot merge these with failing tests. Can you fix these on your end?

test/distributed/test_distributed_feature_lookup.py Outdated Show resolved Hide resolved
torch_geometric/distributed/__init__.py Outdated Show resolved Hide resolved
torch_geometric/distributed/local_feature_store.py Outdated Show resolved Hide resolved
torch_geometric/distributed/local_feature_store.py Outdated Show resolved Hide resolved
torch_geometric/distributed/local_feature_store.py Outdated Show resolved Hide resolved
torch_geometric/distributed/local_feature_store.py Outdated Show resolved Hide resolved
torch_geometric/distributed/local_feature_store.py Outdated Show resolved Hide resolved
torch_geometric/distributed/local_feature_store.py Outdated Show resolved Hide resolved
torch_geometric/distributed/local_feature_store.py Outdated Show resolved Hide resolved
# meta information related to partition and graph store info
self.meta: Optional[Dict[Any, Any]] = None
# partition labels
self.labels: Union[Tensor, Dict[EdgeType, Tensor]] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we need these. Can you clarify?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.meta is the partition information like num_partition, is_hetero, edge_type, .. which is easy/ very helpful for later code simplicity and you don't need judge based on dict or not,etc. self.labels is the whole dataset information for labels y and also easy for later code simplicity otherwise you need put one more argument from top to low level code function.

@rusty1s rusty1s requested a review from mananshah99 as a code owner August 7, 2023 06:48
Copy link
Member

@rusty1s rusty1s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ZhengHongming888. I think this looks good. The one issue I have with this PR is that the is_node_feat argument in the RPC calls. I am not sure why we don't pass the attribute name directly. This way, we are not limited to two features only, and get a much more general RPC feature store. I don't want to break your internal code base by making this change, but might be good to revisit.

@rusty1s rusty1s merged commit aadb135 into pyg-team:master Aug 7, 2023
@ZhengHongming888
Copy link
Contributor Author

Good suggestion! Will fix it later. thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants