[SDK] Create Unify Training Client #1719

andreyvelich · 2023-01-10T20:21:19Z

I created a unified TrainingClient for our SDK to improve UX and reduce code duplication.
I added the following APIs:
Common APIs:

get_job_conditions to get Training Job conditions.
is_job_created, is_job_running, is_job_restarting, is_job_succeeded, is_job_failed to check Job status.
wait_for_job_conditions to wait for Job conditions. I removed the watch parameter from the get API and now print the Job status in the wait_for_job_conditions API. I don't think users rely on the watch parameter in the get API. If it is required, we can add it back later. Does that sound good?
get_job_pod_names to get Job pod names.
get_job_logs to get Job logs.

Job-related APIs (TFJob, PyTorchJob, MXJob, XGBoostJob, MPIJob, PaddleJob):

create_tfjob to create Job.
create_tfjob_from_func to create Job from func.
get_tfjob to get Job.
list_tfjobs to list Jobs.
delete_tfjob to delete Job.

Please let me know if we can reduce code duplication more and how we can improve our SDK further.

Also, I deleted the old py and test ksonnet files since we are no longer using them.

TODO: Modify SDK examples.

/assign @kubeflow/wg-training-leads @tenzen-y @anencore94 @kuizhiqing @alembiewski

andreyvelich · 2023-01-10T20:21:31Z

/hold for review

coveralls · 2023-01-10T20:27:56Z

Pull Request Test Coverage Report for Build 3914850839

0 of 0 changed or added relevant lines in 0 files are covered.
11 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.1%) to 39.26%

Files with Coverage Reduction	New Missed Lines	%
pkg/controller.v1/mpi/mpijob_controller.go	2	76.99%
pkg/controller.v1/pytorch/pytorchjob_controller.go	9	59.52%

Totals
Change from base Build 3894456879:	0.1%
Covered Lines:	2665
Relevant Lines:	6788

💛 - Coveralls


review-notebook-app · 2023-01-12T19:05:46Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

andreyvelich · 2023-01-12T19:12:13Z

I've updated the SDK examples; this PR is ready for review.

johnugeorge · 2023-01-13T03:01:44Z

sdk/python/test/e2e/test_e2e_mxjob.py

-                                           namespace=SDK_TEST_NAMESPACE):
-        raise RuntimeError("The MXJob is not succeeded.")
+    TRAINING_CLIENT.wait_for_job_conditions(
+        JOB_NAME, SDK_TEST_NAMESPACE, constants.MXJOB_KIND


Do we need to validate the success condition in the Job conditions?

@johnugeorge By default it waits for the Succeeded condition:

training-operator/sdk/python/kubeflow/training/api/training_client.py

Line 328 in 33705c2

expected_conditions: Set = {constants.JOB_CONDITION_SUCCEEDED},

So if the Job does not reach the expected condition, this API fails.
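
For readers following the thread, here is a minimal, self-contained sketch of the "wait for expected conditions" behaviour described above. It does not use the real SDK: `wait_for_conditions`, `get_condition_types`, and `polling_interval` are hypothetical names, and the fail-fast handling of a Failed condition is an assumption rather than a confirmed detail of `wait_for_job_conditions`.

```python
import time
from typing import AbstractSet, Callable, Iterable, Set

# Condition names used for illustration only.
JOB_CONDITION_SUCCEEDED = "Succeeded"
JOB_CONDITION_FAILED = "Failed"


def wait_for_conditions(
    get_condition_types: Callable[[], Iterable[str]],
    expected_conditions: AbstractSet[str] = frozenset({JOB_CONDITION_SUCCEEDED}),
    timeout: float = 600,
    polling_interval: float = 15,
) -> Set[str]:
    """Poll until one of the expected conditions is reported, or raise."""
    deadline = time.monotonic() + timeout
    while True:
        current = set(get_condition_types())
        # Stop as soon as any expected condition is present.
        if current & set(expected_conditions):
            return current
        # Assumption: a Failed Job that was not expected to fail is an error.
        if JOB_CONDITION_FAILED in current:
            raise RuntimeError(f"Job failed with conditions: {sorted(current)}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Timed out waiting for conditions: {sorted(expected_conditions)}"
            )
        time.sleep(polling_interval)
```

With the default argument, a caller that passes no `expected_conditions` waits for Succeeded, so a test does not need a separate success check after the wait returns.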

johnugeorge · 2023-01-13T03:05:16Z

sdk/python/test/e2e/test_e2e_mxjob.py

-        raise RuntimeError("The MXJob is not succeeded.")
+    TRAINING_CLIENT.wait_for_job_conditions(
+        JOB_NAME, SDK_TEST_NAMESPACE, constants.MXJOB_KIND
+    )


If possible, can you add a few more validations using the other SDK functions that are currently not covered, such as get_job_pod_names, list_mxjobs, is_job_succeeded, is_job_created, get_job_conditions, etc.?

@johnugeorge Sure, I'll add them.

johnugeorge · 2023-01-13T03:06:48Z

FYI: this is a breaking SDK change.

/cc @terrytangyuan @zw0610 @tenzen-y

tenzen-y · 2023-01-13T08:12:42Z

docs/testing/e2e_debugging.md

@@ -75,7 +78,7 @@ $ cat /tmp/output/artifact/junit_test_simple_tfjob_cpu.xml

 ## Common issues

-1. ksonnet is not installed 
+1. ksonnet is not installed


I guess we removed all configuration files for ksonnet in this PR, so can we remove this section?

We need to update these docs for the new E2Es. Do we want to do it in follow-up PRs?

I see. I'm ok with either PR.

tenzen-y · 2023-01-13T09:15:15Z

hack/python-sdk/post_gen.py

+    # Add Kubernetes models to proper deserialization of Training models.
+    with open(os.path.join(sdk_dir, "kubeflow/training/models/__init__.py"), "r") as f:
+        new_lines = []
+        for line in f.readlines():
+            new_lines.append(line)
+            if line.startswith("from __future__ import absolute_import"):
+                new_lines.append("\n")
+                new_lines.append("# Import Kubernetes models.\n")
+                new_lines.append("from kubernetes.client import *\n")
+    with open(os.path.join(sdk_dir, "kubeflow/training/models/__init__.py"), "w") as f:
+        f.writelines(new_lines)


It might be better to generate a swagger.json that contains the Kubernetes APIs, rather than importing them into the Python SDK.
That would help users generate SDKs for other languages by themselves.

For example, we can generate swagger.json with kubernetes core API in the following:

$ openapi-gen --input-dirs github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1,github.com/kubeflow/common/pkg/apis/common/v1,k8s.io/api/core/v1 \
    --report-filename=hack/violation_exception.list \
    --output-package github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1 \
    --go-header-file hack/boilerplate/boilerplate.go.txt \
    --output-base "${TEMP_DIR}"

WDYT?

Makes sense, but in that case we are going to store all the Kubernetes models in our repo.
Do we want it @tenzen-y @johnugeorge ?

Hmm... You're correct.
It might be better to import the Kubernetes models into the Python SDK in this PR, and then create an issue to follow up on this.
WDYT?

I feel that we can keep it as it is for this PR and create a separate issue.

Sure, let's discuss the long term plan for it separately.
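
As an illustration of the post-generation step discussed in this thread, the following sketch applies the same "append an import after the marker line" idea to an in-memory list of lines. The function name `add_kubernetes_import` and the early-return guard against adding the import twice are assumptions made for this example; the diff above rewrites the file directly and has no such guard.

```python
from typing import Iterable, List

MARKER = "from __future__ import absolute_import"
IMPORT_LINE = "from kubernetes.client import *\n"


def add_kubernetes_import(lines: Iterable[str]) -> List[str]:
    """Return the lines with the Kubernetes import added after the marker."""
    lines = list(lines)
    # Assumption for this sketch: do nothing if the import is already there,
    # so that running the script twice does not duplicate it.
    if IMPORT_LINE in lines:
        return lines

    new_lines: List[str] = []
    for line in lines:
        new_lines.append(line)
        if line.startswith(MARKER):
            new_lines.extend(["\n", "# Import Kubernetes models.\n", IMPORT_LINE])
    return new_lines
```

Keeping the transformation separate from file I/O makes it easy to unit test without touching the generated SDK files.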

tenzen-y · 2023-01-13T09:40:52Z

sdk/python/kubeflow/training/api/training_client.py

+        namespace: str = utils.get_default_target_namespace(),
+        job_kind: str = constants.TFJOB_KIND,
+        job: object = None,
+        timeout: int = constants.DEFAULT_TIMEOUT,


Can we make timeout configurable, the same as Katib SDK?

Yeah, I've already done it in this commit: c6d1516.
Does that sound good?

I have checked that commit. Sounds good. Thanks!

tenzen-y · 2023-01-13T09:44:21Z

sdk/python/kubeflow/training/api/training_client.py

+                    models.KubeflowOrgV1TFJob,
+                    models.KubeflowOrgV1PyTorchJob,
+                    models.KubeflowOrgV1MXJob,
+                    models.KubeflowOrgV1XGBoostJob,
+                    models.KubeflowOrgV1MPIJob,
+                    models.KubeflowOrgV1PaddleJob,


Should we use constants, {list(constants.JOB_KINDS.keys())}?

@tenzen-y Currently, I store the Job classes under the JOB_KINDS["key"]["model"] parameter, so list(constants.JOB_KINDS.keys()) returns ["TFJob", "PyTorchJob", ...].
Any ideas on how to simplify this check?

Maybe we can use list comprehensions in the following:

f"{[d.get('model') for d in list(constants.JOB_KIND.values())]}"

Although, the above expressions might make it more complex.

Actually, I like it, thanks @tenzen-y!
Now we can track all jobs only in constants: 3339089.

Looks great :)
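
A small, self-contained sketch of the approach agreed on in this thread: derive the accepted Job classes from a single constants dictionary instead of listing them by hand. The two classes, the `plural` entries, and the `validate_job` helper are hypothetical stand-ins for illustration, not the SDK's actual definitions.

```python
class KubeflowOrgV1TFJob:
    """Stand-in for the generated TFJob model."""


class KubeflowOrgV1PyTorchJob:
    """Stand-in for the generated PyTorchJob model."""


# Single source of truth: one entry per Job kind.
JOB_KINDS = {
    "TFJob": {"model": KubeflowOrgV1TFJob, "plural": "tfjobs"},
    "PyTorchJob": {"model": KubeflowOrgV1PyTorchJob, "plural": "pytorchjobs"},
}

# Collect every model class registered in the constants.
JOB_MODELS = tuple(d["model"] for d in JOB_KINDS.values())


def validate_job(job: object) -> object:
    """Raise ValueError unless the object is one of the registered Job models."""
    if not isinstance(job, JOB_MODELS):
        names = [model.__name__ for model in JOB_MODELS]
        raise ValueError(f"Job must be one of these types: {names}")
    return job
```

Adding a new Job kind then only requires a new entry in `JOB_KINDS`; the type check and its error message pick it up automatically.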

tenzen-y · 2023-01-13T09:55:30Z

sdk/python/kubeflow/training/api/training_client.py

+        label_selector = f"{constants.JOB_NAME_LABEL}={name}"
+
+        # Add Job role label if that is required.
+        if is_master:
+            label_selector += f",{constants.JOB_ROLE_LABEL}={constants.JOB_ROLE_MASTER}"
+
+        # Add Replica type label if that is required.
+        if replica_type:
+            label_selector += (
+                f",{constants.REPLICA_TYPE_LABEL}={str.lower(replica_type)}"
+            )
+
+        # Add Replica index label if that is required.
+        if replica_index is not None:
+            label_selector += f",{constants.REPLICA_INDEX_LABEL}={replica_index}"


What is the intention of using label selector instead of OwnerReference?

It's needed if a user wants to get the Job's pod names for a specific replica type or replica index.

Makes sense. Thanks for clarifying.
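
To make the label-selector answer concrete, here is a standalone version of the logic from the diff above. The label keys and the `master` role value are assumptions chosen for illustration; the real values live in the SDK's constants module and may differ.

```python
from typing import Optional

# Assumed label keys and values, for illustration only.
JOB_NAME_LABEL = "training.kubeflow.org/job-name"
JOB_ROLE_LABEL = "training.kubeflow.org/job-role"
JOB_ROLE_MASTER = "master"
REPLICA_TYPE_LABEL = "training.kubeflow.org/replica-type"
REPLICA_INDEX_LABEL = "training.kubeflow.org/replica-index"


def build_label_selector(
    name: str,
    is_master: bool = False,
    replica_type: Optional[str] = None,
    replica_index: Optional[int] = None,
) -> str:
    """Build a comma-separated label selector for a Job's pods."""
    selector = f"{JOB_NAME_LABEL}={name}"
    if is_master:
        selector += f",{JOB_ROLE_LABEL}={JOB_ROLE_MASTER}"
    if replica_type:
        selector += f",{REPLICA_TYPE_LABEL}={replica_type.lower()}"
    # Compare against None so that replica index 0 is still applied.
    if replica_index is not None:
        selector += f",{REPLICA_INDEX_LABEL}={replica_index}"
    return selector
```

An OwnerReference lookup would return all pods owned by the Job, whereas a selector like this can narrow the result to, for example, only the first Worker replica.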

tenzen-y · 2023-01-13T09:59:11Z

sdk/python/kubeflow/training/api/training_client.py

+        if (
+            num_chief_replicas is None
+            and num_ps_replicas is None
+            and num_worker_replicas is None
+        ):


Maybe, we can remove those validations once we introduce CEL validations.

Ref: #1708

Sure, let me add the comment about it.

Looks good.

tenzen-y · 2023-01-13T10:01:14Z

sdk/python/kubeflow/training/api/training_client.py

+        # Check if at least one worker replica is set.
+        if num_worker_replicas is None:
+            raise ValueError("At least one Worker replica for PyTorchJob must be set")
+
+        # Check if function is callable.
+        if not callable(func):
+            raise ValueError(
+                f"Training function must be callable, got function type: {type(func)}"
+            )


Maybe, we can remove those validations once we introduce CEL validations.

Ref: #1708

I'm not sure we can validate the training function with CEL validation, though.

Sorry for the confusion.
I meant only L870~L872. Maybe we can validate num_worker_replicas with something like the following:

+kubebuilder:validation:XValidation:rule="self['Master'].replicas == 0 && self['Worker'].replicas == 0"

tenzen-y · 2023-01-13T10:04:49Z

sdk/python/kubeflow/training/api/training_client.py

+        # Add Chief, PS, and Worker replicas to the TFJob.
+        if num_chief_replicas is not None:
+            tfjob.spec.tf_replica_specs["Chief"] = models.V1ReplicaSpec(
+                replicas=num_chief_replicas, template=pod_template_spec,
+            )
+
+        if num_ps_replicas is not None:
+            tfjob.spec.tf_replica_specs["PS"] = models.V1ReplicaSpec(
+                replicas=num_ps_replicas, template=pod_template_spec,
+            )
+
+        if num_worker_replicas is not None:
+            tfjob.spec.tf_replica_specs["Worker"] = models.V1ReplicaSpec(
+                replicas=num_worker_replicas, template=pod_template_spec,
+            )


Can we define roles such as Master and Worker as constants?
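
One way this suggestion could look, as a sketch only: the replica type names become module-level constants and the three near-identical `if` blocks collapse into one loop. `ReplicaSpec` is a hypothetical stand-in for `models.V1ReplicaSpec`, and the constant and function names are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Replica type names defined once instead of repeated string literals.
REPLICA_TYPE_CHIEF = "Chief"
REPLICA_TYPE_PS = "PS"
REPLICA_TYPE_WORKER = "Worker"


@dataclass
class ReplicaSpec:
    """Hypothetical stand-in for models.V1ReplicaSpec."""

    replicas: int
    template: dict


def build_tf_replica_specs(
    pod_template_spec: dict,
    num_chief_replicas: Optional[int] = None,
    num_ps_replicas: Optional[int] = None,
    num_worker_replicas: Optional[int] = None,
) -> Dict[str, ReplicaSpec]:
    """Create a replica spec for every replica type that was requested."""
    requested = {
        REPLICA_TYPE_CHIEF: num_chief_replicas,
        REPLICA_TYPE_PS: num_ps_replicas,
        REPLICA_TYPE_WORKER: num_worker_replicas,
    }
    # Skip replica types the caller left unset (None).
    return {
        role: ReplicaSpec(replicas=count, template=pod_template_spec)
        for role, count in requested.items()
        if count is not None
    }
```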

tenzen-y · 2023-01-13T10:07:00Z

sdk/python/kubeflow/training/api/training_client.py

+    def create_mxjob_from_func(self):
+        """Create MXJob from the function.
+        TODO (andreyvelich): Implement this function.
+        """
+        logging.warning("This API has not been implemented yet.")


Would you like to work on this in this PR? Or another PR?

I think, we can implement those in the following PRs.

Sounds good.

tenzen-y · 2023-01-13T10:07:41Z

sdk/python/kubeflow/training/api/training_client.py

+    def create_xgboostjob_from_func(self):
+        """Create XGBoost from the function.
+        TODO (andreyvelich): Implement this function.
+        """
+        logging.warning("This API has not been implemented yet.")


Would you like to work on this in this PR? Or another PR?

Let's add those additional APIs in follow-up PRs @tenzen-y (maybe in the next release).

andreyvelich · 2023-01-16T16:41:21Z

@johnugeorge @tenzen-y @terrytangyuan I believe I have addressed all of your suggestions.
Please let me know if you have any other comments.

tenzen-y · 2023-01-16T17:17:42Z

@andreyvelich Looks great! Thanks for this!
/assign @johnugeorge

johnugeorge · 2023-01-16T17:50:09Z

/lgtm

/assign @terrytangyuan

terrytangyuan

/lgtm
/approve

google-oss-prow · 2023-01-16T18:57:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2023-01-17T15:27:59Z

Thanks everyone for the review!
/hold cancel

andreyvelich added 4 commits January 6, 2023 16:58

Remove legacy ksonnet tests

3523221

[SDK] Create Unify Training Client

383c757

Modify E2E tests

385e632

Rename Training to Operator version

b31e778

google-oss-prow bot added do-not-merge/work-in-progress size/XXL labels Jan 10, 2023

google-oss-prow bot added the do-not-merge/hold label Jan 10, 2023

google-oss-prow bot requested review from jinchihe and kuizhiqing January 10, 2023 20:21

andreyvelich added 3 commits January 10, 2023 20:30

Add missing exception

963932e

Add API server timeout parameter

c6d1516

Add delete options

bc7cd54

terrytangyuan reviewed Jan 11, 2023

andreyvelich added 2 commits January 11, 2023 14:58

Fix import for V1JobCondition

55f6922

Fix container name in e2e tests

32e32f2

andreyvelich force-pushed the sdk-unify-training-client branch from 27a1b70 to 32e32f2 Compare January 11, 2023 16:04

andreyvelich added 3 commits January 11, 2023 16:57

Fix mxnet container

65ef786

Import all Kubernetes models

3c5d4db

Update SDK Examples

33705c2

andreyvelich changed the title ~~[WIP] [SDK] Create Unify Training Client~~ [SDK] Create Unify Training Client Jan 12, 2023

google-oss-prow bot removed the do-not-merge/work-in-progress label Jan 12, 2023

johnugeorge reviewed Jan 13, 2023

google-oss-prow bot requested review from tenzen-y and terrytangyuan January 13, 2023 03:06

google-oss-prow bot requested a review from zw0610 January 13, 2023 03:06

tenzen-y reviewed Jan 13, 2023

andreyvelich added 5 commits January 13, 2023 14:39

Verify other SDK APIs in e2e tests

71bd898

Add replica types to const

8ade008

Use logging in e2e tests

ad6d693

Fix logging for status

3cb3443

Use const for job types

3339089

andreyvelich mentioned this pull request Jan 13, 2023

[SDK] Improve Kubernetes Modules Dependency #1723

Open

johnugeorge mentioned this pull request Jan 16, 2023

*job API(master) cannot compatible with old job #1725

Closed

google-oss-prow bot assigned johnugeorge Jan 16, 2023

google-oss-prow bot assigned terrytangyuan Jan 16, 2023

google-oss-prow bot added the lgtm label Jan 16, 2023

terrytangyuan approved these changes Jan 16, 2023

google-oss-prow bot added the approved label Jan 16, 2023

google-oss-prow bot removed the do-not-merge/hold label Jan 17, 2023

google-oss-prow bot merged commit b87c6fa into kubeflow:master Jan 17, 2023

andreyvelich deleted the sdk-unify-training-client branch January 17, 2023 15:28

johnugeorge mentioned this pull request Feb 22, 2023

Training operator 1.6 Roadmap #1683

Closed

9 tasks

andreyvelich mentioned this pull request Jul 19, 2023

[SDK] Get Job Pods Events #1863

Closed

andreyvelich mentioned this pull request Aug 2, 2023

[SDK] Consolidate Naming for CRUD APIs #1877

Closed

andreyvelich mentioned this pull request May 27, 2024

PyTorchJobClient not found #2126

Closed
