[SDK] Add resources per worker for Create Job API #1990
Conversation
@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Pull Request Test Coverage Report for Build 7571809122
💛 - Coveralls
/hold cancel
@@ -83,18 +101,15 @@

# PyTorchJob constants
PYTORCHJOB_KIND = "PyTorchJob"
PYTORCHJOB_MODEL = "KubeflowOrgV1PyTorchJob"
Any reason to override?
@johnugeorge What do you mean by override here?
I just made it a string type rather than an object to reduce the number of typing errors in Pylance.
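For context, a minimal sketch of the typing trade-off being discussed; the class-reference alternative below is only hypothetical and is not the previous code:

```python
# Sketch only: keeping the model reference as a plain string keeps any mapping
# of job kinds to model names typed as Dict[str, str], which Pylance handles
# without warnings.
PYTORCHJOB_KIND = "PyTorchJob"
PYTORCHJOB_MODEL = "KubeflowOrgV1PyTorchJob"

# Hypothetical alternative: storing the generated model class itself would mix
# value types in the same constants module and trigger typing errors.
# PYTORCHJOB_MODEL = models.KubeflowOrgV1PyTorchJob
```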
sdk/python/test/e2e/utils.py
@@ -47,7 +47,8 @@ def verify_job_e2e(

     # Job should have Created, Running, and Succeeded conditions.
     conditions = client.get_job_conditions(job=job)
-    if len(conditions) != 3:
+    # If Job is complete fast, it has 2 conditions: Created and Succeeded.
+    if len(conditions) != 3 and len(conditions) != 2:
This looks a bit odd. Can we clean up this check?
@johnugeorge Do you want to remove this check?
I noticed that the PyTorchJob has just 2 conditions (Created and Succeeded) if you run it with an image that executes very fast, e.g. docker.io/hello-world.
Isn't that a bug which needs to be resolved separately?
Probably. @tenzen-y @kuizhiqing What are your thoughts here?
The main problem is that when the reconciliation loop starts, the training Pod has already Succeeded, so we didn't add the Running status to the PyTorchJob.
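For illustration, a hedged sketch of how the check could assert on the expected condition types instead of their count; this is only a sketch of the idea discussed above, not the merged test code:

```python
def verify_job_conditions(client, job):
    # Always present once the Job has completed successfully.
    expected = {"Created", "Succeeded"}
    # May be missing when the Pod finishes before the reconciliation loop
    # ever observes it in the Running state.
    optional = {"Running"}

    condition_types = {c.type for c in client.get_job_conditions(job=job)}
    if not expected <= condition_types <= expected | optional:
        raise Exception(f"Unexpected Job conditions: {condition_types}")
```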
    ):
-        if pvc_name is None or namespace is None or storage_size is None:
+        if pvc_name is None or namespace is None or "size" not in storage_config is None:
The last condition needs correction.
Good catch!
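For reference, a hedged sketch of a corrected guard, assuming `storage_config` is a dict that carries the PVC size under a `"size"` key. The flagged line is a chained comparison: `"size" not in storage_config is None` evaluates as `("size" not in storage_config) and (storage_config is None)`, which can never be true when a dict is passed:

```python
def validate_pvc_inputs(pvc_name, namespace, storage_config):
    # Sketch only: fail early if any required PVC input is missing.
    if pvc_name is None or namespace is None or "size" not in storage_config:
        raise ValueError("pvc_name, namespace, and storage_config['size'] must be set")
```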
@@ -139,64 +139,50 @@ def train(

    namespace = namespace or self.namespace

    if isinstance(resources_per_worker, dict):
@andreyvelich how are these validations stopping the user from running the training on CPUs?
I was wrong. It isn't stopping the user from running the `train` API on CPUs, but we validate whether `cpu` and `memory` are set in the `resources_per_worker` parameter, which might not be required.
E.g. the user can specify only the number of GPUs in `resources_per_worker`.
@andreyvelich then shall we change the default value of `resources_per_worker`? It is `None` currently; what if the user passes an empty dict?
@deepanker13 It will be fine if the user passes an empty dict, since Kubernetes will assign the resources automatically.
E.g. this works for me:
TrainingClient().create_job(
resources_per_worker={},
name="test-empty",
num_workers=1,
base_image="docker.io/hello-world",
)
As I said, if we find that we need additional validation in the future, we can always add it in a separate PR.
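As an aside, a hedged sketch of how a `resources_per_worker` dict could be mapped to a Kubernetes resource spec; the helper name below is illustrative, not necessarily the SDK's actual implementation:

```python
from kubernetes.client import V1ResourceRequirements

def to_resource_requirements(resources_per_worker):
    # An empty dict yields empty requests/limits, so the scheduler and any
    # namespace LimitRange decide the actual resources, as described above.
    if not resources_per_worker:
        return V1ResourceRequirements()
    return V1ResourceRequirements(
        requests=resources_per_worker,
        limits=resources_per_worker,
    )
```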
Sure
@deepanker13 Can you complete the review?
@johnugeorge I've made the changes for the condition check in the test that we discussed.
@@ -323,13 +333,13 @@ def get_pytorchjob_template(

    spec=models.KubeflowOrgV1PyTorchJobSpec(
        run_policy=models.KubeflowOrgV1RunPolicy(clean_pod_policy=None),
        pytorch_replica_specs={},
        elastic_policy=elastic_policy,
Should we check that elastic_policy is not an empty dict? Otherwise default env variables will get appended:
podTemplateSpec.Spec.Containers[i].Env = append(
Actually, if elastic_policy is None, Python doesn't assign a value to the PyTorchJob spec, and we don't set default values.
If a user accidentally sets elastic_policy={}, our controller will fail with an invalid spec error:
E0118 15:47:43.737955 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 398 [running]:
As you can see, elastic_policy has the type KubeflowOrgV1ElasticPolicy, so the user should set an appropriate instance value, similar to other parameters (e.g. worker_pod_template_spec).
Right now, we don't even use elastic_policy in our public APIs:
job = utils.get_pytorchjob_template(
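To make the point concrete, a hedged sketch of passing a typed elastic policy instead of a dict; the import path and field values are assumptions for the example:

```python
from kubeflow.training import models

# Sketch only: build the generated model (or leave the parameter as None so the
# field is never set on the PyTorchJob spec), rather than passing a plain dict.
elastic_policy = models.KubeflowOrgV1ElasticPolicy(
    min_replicas=1,
    max_replicas=3,
    rdzv_backend="c10d",
)
```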
/lgtm
@deepanker13: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
* [SDK] Add resources for create Job API
* Fix unbound var
* Assign values in get pod template
* Add torchrun issue
* Test to create PyTorchJob from Image
* Fix e2e to create from image
* Fix condition
* Modify check test conditions
Blocked by: #1988.
/hold
I added the `resources_per_worker` parameter to the `create_job` API. Also, this has some refactoring for our SDK utils functions:

- I didn't add extra validation to the `train` API for resources per worker. Let's add the validation in the future if that is required. We might have users who want to do fine-tuning with the `train` API on CPUs.
- `get_pod_template_spec` to return the Pod template spec, `get_container_spec` to return the Container spec, and `get_command_using_train_func` to return the args and command for the train function (a rough sketch of such helpers follows below).

Please take a look.
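As referenced above, a hedged sketch of what helpers along these lines could look like; the signatures are assumptions for illustration, not the SDK's exact API:

```python
from kubernetes.client import V1Container, V1ObjectMeta, V1PodSpec, V1PodTemplateSpec

def get_container_spec(name, base_image, command=None, args=None, resources=None):
    # Illustrative helper: build the training container for a worker replica.
    return V1Container(
        name=name, image=base_image, command=command, args=args, resources=resources
    )

def get_pod_template_spec(containers, labels=None):
    # Illustrative helper: wrap the containers in a Pod template spec.
    return V1PodTemplateSpec(
        metadata=V1ObjectMeta(labels=labels),
        spec=V1PodSpec(containers=containers),
    )
```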
/assign @deepanker13 @johnugeorge @tenzen-y @droctothorpe @kuizhiqing