-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SDK] Add resources per worker for Create Job API #1990
Merged
google-oss-prow
merged 8 commits into
kubeflow:master
from
andreyvelich:sdk-resource-per-worker
Jan 18, 2024
Merged
Changes from 6 commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
d509aa9
[SDK] Add resources for create Job API
andreyvelich fbb23fb
Fix unbound var
andreyvelich 64039fc
Assign values in get pod template
andreyvelich 5ffb32a
Add torchrun issue
andreyvelich 5031fee
Test to create PyTorchJob from Image
andreyvelich f8dfdc1
Fix e2e to create from image
andreyvelich 3dd7ab7
Fix condition
andreyvelich 174b050
Modify check test conditions
andreyvelich File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich how are these validations stopping the user from running the training on cpus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong. I isn't stopping user of running
train
API on CPUs, but we validate ifcpu
andmemory
is set in theresources_per_worker
parameter. Which might be not required.E.g. user can only specify number of GPUs in
resources_per_worker
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich then shall we change the default value of resources_per_worker as it is None currently, what if the user passes an empty dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deepanker13 If will be fine if user passes the empty dict, since Kubernetes will assign the resources automatically.
E.g. this works for me:
As I said, if we understand that we need additional validation in the future, we can always do it in a separate PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure