
[SDK] Add resources per worker for Create Job API #1990

Merged
andreyvelich merged 8 commits into kubeflow:master from sdk-resource-per-worker on Jan 18, 2024

Conversation

andreyvelich
Member

Blocked by: #1988.
/hold

I added the resources_per_worker parameter to the create_job API.
Also, this includes some refactoring of our SDK utils functions:

  • I removed the resources_per_worker validation from the train API. Let's add validation in the future if it is required. We might have users who want to do fine-tuning with the train API on CPUs.
  • We have 3 new functions in utils: get_pod_template_spec returns the Pod template spec, get_container_spec returns the Container spec, and get_command_using_train_func returns the command and args for the train function (a rough sketch follows this list).
  • I made a few changes to reduce the number of typing errors in Pylance.
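
A minimal sketch of how the first two helpers might fit together (the signatures below are assumptions for illustration, not the real implementation; get_command_using_train_func is omitted because it serializes the train function):

from kubernetes import client

def get_container_spec(name, base_image, command=None, args=None, resources=None):
    # Build the worker container; resources is an optional V1ResourceRequirements
    # derived from the user's resources_per_worker.
    return client.V1Container(
        name=name,
        image=base_image,
        command=command,
        args=args,
        resources=resources,
    )

def get_pod_template_spec(containers):
    # Wrap the containers into the Pod template used by the replica specs.
    return client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(),
        spec=client.V1PodSpec(containers=containers),
    )

# Example composition for a single worker:
worker_container = get_container_spec(
    name="pytorch",
    base_image="docker.io/hello-world",
    resources=client.V1ResourceRequirements(limits={"cpu": "1", "memory": "1Gi"}),
)
worker_pod_template = get_pod_template_spec(containers=[worker_container])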

Please take a look.
/assign @deepanker13 @johnugeorge @tenzen-y @droctothorpe @kuizhiqing


@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich force-pushed the sdk-resource-per-worker branch from aef4735 to 64039fc on January 16, 2024 20:11
@coveralls

coveralls commented Jan 16, 2024

Pull Request Test Coverage Report for Build 7571809122

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall first build on sdk-resource-per-worker at 42.873%

Totals Coverage Status
  • Change from base Build 7546373933: 42.9%
  • Covered Lines: 3754
  • Relevant Lines: 8756

💛 - Coveralls

@andreyvelich
Member Author

/hold cancel
This PR is ready.

@@ -83,18 +101,15 @@

# PyTorchJob constants
PYTORCHJOB_KIND = "PyTorchJob"
PYTORCHJOB_MODEL = "KubeflowOrgV1PyTorchJob"
Member

Any reason to override?

Member Author

@johnugeorge What do you mean by override here?
I just made it a string type rather than an object to reduce the number of typing errors in Pylance.
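
For context, a rough sketch of the change under discussion (the earlier object form shown below is an assumption; the string form is from the diff above):

# Before (assumed): the constant referenced the generated model class,
# which Pylance could not always infer cleanly.
# PYTORCHJOB_MODEL = models.KubeflowOrgV1PyTorchJob
# After: a plain string naming the model.
PYTORCHJOB_MODEL = "KubeflowOrgV1PyTorchJob"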

@@ -47,7 +47,8 @@ def verify_job_e2e(

# Job should have Created, Running, and Succeeded conditions.
conditions = client.get_job_conditions(job=job)
if len(conditions) != 3:
# If the Job completes quickly, it has only 2 conditions: Created and Succeeded.
if len(conditions) != 3 and len(conditions) != 2:
Member
@johnugeorge Jan 17, 2024

This looks a bit odd. Can we clean up this check?

Member Author

@johnugeorge Do you want to remove this check?
I noticed that a PyTorchJob has just 2 conditions (Created and Succeeded) if you run it with an image that executes very fast, e.g. docker.io/hello-world.
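
One possible cleanup, sketched here only as a suggestion (the condition type names and attribute access are assumptions):

# Fast-finishing Jobs may skip the Running condition, so require only
# Created and Succeeded to be present.
conditions = client.get_job_conditions(job=job)
condition_types = {c.type for c in conditions}
if not {"Created", "Succeeded"}.issubset(condition_types):
    raise Exception(f"Job conditions are invalid: {conditions}")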

Member

Isn't that a bug which needs to be resolved separately?

Member Author

Probably. @tenzen-y @kuizhiqing What are your thoughts here?
The main problem is that when the reconciliation loop starts, the training Pod has already Succeeded, so we never add the Running status to the PyTorchJob.

):
if pvc_name is None or namespace is None or storage_size is None:
if pvc_name is None or namespace is None or "size" not in storage_config is None:
Contributor

The last condition needs correction.

Member Author

Good catch!
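
For reference, a hedged sketch of what the corrected condition could look like (the error handling shown is an assumption):

# The original line mixes a membership test with "is None"; the likely intent:
if pvc_name is None or namespace is None or "size" not in storage_config:
    raise ValueError("pvc_name, namespace, and storage_config['size'] must be set")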

@@ -139,64 +139,50 @@ def train(

namespace = namespace or self.namespace

if isinstance(resources_per_worker, dict):
Contributor

@andreyvelich how are these validations stopping the user from running the training on CPUs?

Member Author

I was wrong. It isn't stopping the user from running the train API on CPUs, but we validate that cpu and memory are set in the resources_per_worker parameter, which might not be required.
E.g. a user can specify only the number of GPUs in resources_per_worker (see the sketch below).
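
For example, a GPU-only spec could look like the sketch below, mirroring the create_job example later in this thread (the exact key name accepted for GPUs is an assumption):

from kubeflow.training import TrainingClient

# Only the GPU count is specified; cpu and memory are left for Kubernetes
# to assign. "docker.io/my-training-image" is a hypothetical image.
TrainingClient().create_job(
    name="gpu-only-job",
    num_workers=1,
    base_image="docker.io/my-training-image",
    resources_per_worker={"gpu": 2},
)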

Contributor

@andreyvelich then shall we change the default value of resources_per_worker? It is None currently; what if the user passes an empty dict?

Member Author
@andreyvelich Jan 18, 2024

@deepanker13 It will be fine if the user passes an empty dict, since Kubernetes will assign the resources automatically.
E.g. this works for me:

TrainingClient().create_job(
    resources_per_worker={},
    name="test-empty",
    num_workers=1,
    base_image="docker.io/hello-world",
)

As I said, if we find that we need additional validation in the future, we can always add it in a separate PR.

Contributor

Sure

@johnugeorge
Member

@deepanker13 Can you complete the review?

@andreyvelich
Member Author

@johnugeorge I've made the changes to the condition check in the test that we discussed.

@@ -323,13 +333,13 @@ def get_pytorchjob_template(
spec=models.KubeflowOrgV1PyTorchJobSpec(
run_policy=models.KubeflowOrgV1RunPolicy(clean_pod_policy=None),
pytorch_replica_specs={},
elastic_policy=elastic_policy,
Contributor

Should we check that elastic policy is not an empty dict? Otherwise the default env variables will get appended:

podTemplateSpec.Spec.Containers[i].Env = append(

Member Author
@andreyvelich Jan 18, 2024

Actually, if elastic_policy is None, Python doesn't assign a value to the PyTorchJob spec, and we don't set default values.
If the user accidentally sets elastic_policy={}, our controller will fail with an invalid spec error:

E0118 15:47:43.737955       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 398 [running]:

Member Author
@andreyvelich Jan 18, 2024

As you can see, elastic_policy has the type KubeflowOrgV1ElasticPolicy, so the user should set an appropriate instance value, similar to other parameters (e.g. worker_pod_template_spec).
Right now, we don't even use elastic_policy in our public APIs:

job = utils.get_pytorchjob_template(
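
For completeness, a sketch of passing a proper instance (the field names on KubeflowOrgV1ElasticPolicy are assumptions based on the PyTorchJob elastic policy fields):

from kubeflow.training import models

# Sketch only: construct a real ElasticPolicy object rather than an empty dict.
elastic_policy = models.KubeflowOrgV1ElasticPolicy(
    min_replicas=1,
    max_replicas=3,
    rdzv_backend="c10d",
)
# It would then be passed through to the template, e.g.
# job = utils.get_pytorchjob_template(..., elastic_policy=elastic_policy)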

@deepanker13
Contributor

/lgtm
Thanks @andreyvelich !


@deepanker13: changing LGTM is restricted to collaborators



@johnugeorge
Member

/lgtm

google-oss-prow bot added the lgtm label on Jan 18, 2024
google-oss-prow bot merged commit 07d1a61 into kubeflow:master on Jan 18, 2024
35 checks passed
andreyvelich deleted the sdk-resource-per-worker branch on January 18, 2024 16:39
andreyvelich added this to the v0.8.0 Release milestone on Jan 24, 2024
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
* [SDK] Add resources for create Job API

* Fix unbound var

* Assign values in get pod template

* Add torchrun issue

* Test to create PyTorchJob from Image

* Fix e2e to create from image

* Fix condition

* Modify check test conditions