feat(learning): Integrate GLTorch with GraphScope in Server-Client Mode with K8S Deployment Supports #3624

Merged
60 commits merged on Apr 2, 2024

husimplicity
Collaborator

What do these changes do?

As titled.

Zhanghyi and others added 27 commits March 5, 2024 18:07
…n_torch dataset and training on a single machine (alibaba#3156)

This PR introduces the following changes:
1. A new graphlearn_torch API for session.
2. A script to launch the graphlearn_torch server with a handle and
config.
3. An example of GraphSAGE node classification with the
ogbn-arxiv dataset in GraphScope on a single machine.
alibaba#3157
The dynamic world size feature in server-client mode has been
implemented in GLT; this PR picks up that change and uses the feature.
…ibaba#3227)

For graphlearn_torch_vineyard, the cxx11_abi should be adjusted to the gcc environment. For graphlearn_torch, the cxx11_abi should always be set to false to match PyTorch.
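The ABI rule in the commit message above can be sketched as a small helper that picks the libstdc++ `_GLIBCXX_USE_CXX11_ABI` define per target. This is only an illustration under stated assumptions: the function name `cxx11_abi_define` and the `gcc_uses_cxx11_abi` parameter are hypothetical and are not taken from the PR's build scripts.

```python
def cxx11_abi_define(target: str, gcc_uses_cxx11_abi: bool) -> str:
    """Return the libstdc++ dual-ABI define for a build target (illustrative only)."""
    if target == "graphlearn_torch":
        # Per the commit message: always false, to match PyTorch.
        use_new_abi = False
    elif target == "graphlearn_torch_vineyard":
        # Per the commit message: follows the local gcc environment.
        use_new_abi = gcc_uses_cxx11_abi
    else:
        raise ValueError(f"unknown target: {target}")
    return f"-D_GLIBCXX_USE_CXX11_ABI={int(use_new_abi)}"


# Hypothetical usage: the gcc environment is assumed to use the new ABI.
print(cxx11_abi_define("graphlearn_torch", gcc_uses_cxx11_abi=True))
print(cxx11_abi_define("graphlearn_torch_vineyard", gcc_uses_cxx11_abi=True))
```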
@husimplicity husimplicity changed the title from "feat[learning]: Integrate GLTorch with GraphScope in Server-Client Mode with K8S Deployment Supports" to "feat(learning): Integrate GLTorch with GraphScope in Server-Client Mode with K8S Deployment Supports" on Mar 11, 2024
@LiSu LiSu self-requested a review March 28, 2024 12:23
@husimplicity husimplicity requested review from siyuan0322 and removed request for LiSu March 28, 2024 12:23
LiSu
LiSu previously approved these changes Mar 28, 2024
siyuan0322
siyuan0322 previously approved these changes Mar 28, 2024
- api_groups=",apps,extensions", # The leading comma is necessary, represents for core api group.
- resources="configmaps,deployments,deployments/status,statefulsets,statefulsets/status,endpoints,events,pods,pods/log,pods/exec,pods/status,services,replicasets", # noqa: E501
+ api_groups=",apps,extensions,kubeflow.org", # The leading comma is necessary, represents for core api group.
+ resources="configmaps,deployments,deployments/status,statefulsets,statefulsets/status,endpoints,events,pods,pods/log,pods/exec,pods/status,services,replicasets,pytorchjobs", # noqa: E501
Collaborator

Will this work normally in a cluster without Kubeflow installed?

Collaborator Author

It seems to work normally, as tested.
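The role of the leading comma in `api_groups` can be shown with a minimal sketch. In Kubernetes RBAC, the empty string names the core API group, and splitting a string that starts with a comma yields that empty string as its first element. The helper `build_rbac_rule` below is hypothetical and is not how GraphScope necessarily consumes these strings; the resource list is shortened for readability.

```python
def build_rbac_rule(api_groups: str, resources: str) -> dict:
    """Turn comma-separated strings into an RBAC-style rule dict (illustrative only)."""
    return {
        # ",apps".split(",") gives ["", "apps"]; "" stands for the core API group.
        "apiGroups": api_groups.split(","),
        "resources": resources.split(","),
    }


rule = build_rbac_rule(
    ",apps,extensions,kubeflow.org",
    "pods,services,pytorchjobs",
)
print(rule["apiGroups"])  # the first entry is the empty string (core group)
```

Dropping the leading comma would leave the core group out of the list, so resources such as pods and services would no longer be covered by the rule.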

@husimplicity husimplicity dismissed stale reviews from siyuan0322 and LiSu via d619041 March 29, 2024 08:42
LiSu
LiSu previously approved these changes Apr 1, 2024
sighingnow
sighingnow previously approved these changes Apr 1, 2024
Collaborator

@sighingnow sighingnow left a comment

LGTM, with some minor comments.

except K8SApiException as e:
    print(
        f"Exception when calling CoreV1Api->delete_namespaced_config_map: {e}"
    )
Collaborator

Use logger.info for logging.

Collaborator Author

fixed
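A minimal sketch of what the reviewer's suggestion might look like, using the standard `logging` module instead of `print`. The `K8SApiException` class here is a local stand-in, and `delete_config_map` with its `delete_fn` parameter is a hypothetical wrapper; neither is taken from the merged code.

```python
import logging

logger = logging.getLogger(__name__)


class K8SApiException(Exception):
    """Local stand-in for the Kubernetes client's API exception (assumption)."""


def delete_config_map(delete_fn, name: str, namespace: str) -> None:
    """Call delete_fn and log any API failure through the logger, not print()."""
    try:
        delete_fn(name, namespace)
    except K8SApiException as e:
        # Lazy %-formatting: the message is only built if the record is emitted.
        logger.info(
            "Exception when calling CoreV1Api->delete_namespaced_config_map: %s", e
        )
```

Routing the message through a logger lets the deployment control its level and destination, which `print` does not.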

except K8SApiException as e:
    print(
        f"Exception when calling CustomObjectsApi->create_namespaced_custom_object: {e}"
    )
Collaborator

I suggest re-raising here rather than continuing silently.

If creating the PyTorchJob fails, the watch loop that follows will never return.

Collaborator Author

fixed
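The concern raised in this thread can be sketched as follows: if job creation fails and the exception is swallowed, the code goes on to wait for a job that does not exist. Re-raising stops execution before the watch step. The names `launch_and_wait`, `create_job`, and `watch_job` are hypothetical, and `K8SApiException` is a local stand-in; this is not the merged implementation.

```python
import logging

logger = logging.getLogger(__name__)


class K8SApiException(Exception):
    """Local stand-in for the Kubernetes client's API exception (assumption)."""


def launch_and_wait(create_job, watch_job):
    """Create a job, then watch it; a creation failure propagates to the caller."""
    try:
        create_job()
    except K8SApiException as e:
        logger.error(
            "Exception when calling "
            "CustomObjectsApi->create_namespaced_custom_object: %s",
            e,
        )
        # Re-raise so the watch step below is never entered for a missing job.
        raise
    return watch_job()
```

A bare `raise` keeps the original traceback, so the caller sees where the API call failed.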

self._graphlearn_instance_processes = {}

self._graphlearn_torch_services = {}
self._graphlearn_torch_instance_processes = {}

Collaborator

Can graphlearn and graphlearn_torch share the same set of ports/processes/services to make this part simpler?

Collaborator Author

We will deprecate graphlearn later, so keeping the implementations separate is fine for now :)

@husimplicity husimplicity dismissed stale reviews from sighingnow and LiSu via 1179f2f April 1, 2024 09:10
@sighingnow sighingnow merged commit a7ec218 into alibaba:main Apr 2, 2024
45 checks passed
@husimplicity husimplicity deleted the glt-k8s-latest branch April 25, 2024 09:09
7 participants