fix nproc env in elastic mode for pytorchjob #1948

Merged
1 commit merged into kubeflow:master
Nov 20, 2023

Conversation

kuizhiqing
Member

What this PR does / why we need it:

In PyTorchJob elastic mode, the nproc environment variable may not be set.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1947

Checklist:

  • Docs included if any changes are user facing

coveralls commented Nov 19, 2023

Pull Request Test Coverage Report for Build 6921477150

  • 6 of 6 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 42.872%

Totals Coverage Status
Change from base Build 6854571899: 0.03%
Covered Lines: 3753
Relevant Lines: 8754

💛 - Coveralls

Member

@terrytangyuan terrytangyuan left a comment

/lgtm

Value: *pytorchjob.Spec.NprocPerNode,
})
}

Contributor

This change removes the need to add a master spec in order to run multiple processes per pod with the python/torchrun launch method when the YAML has no elastic spec, but it doesn't solve #1947.

Contributor

To solve #1947, I think we need to make changes here:

if elasticPolicy.NProcPerNode != nil {

Member

@deepanker13 elastic.go sets the env vars when the elastic policy is set. As per the original issue, it works correctly when elasticPolicy.NProcPerNode is set.

Contributor

@deepanker13 deepanker13 Nov 20, 2023

@johnugeorge Following PR #1840, I raised issue #1947, which, when implemented, should complete the point struck out in the screenshot below.
[Screenshot 2023-11-20 at 11 37 10 AM]

Member Author

To solve #1947, I think we need to make changes here:

if elasticPolicy.NProcPerNode != nil {

No, this part will be deprecated. We should change the global one (spec.nprocPerNode).

@johnugeorge
Member

@kuizhiqing I am a bit confused. How does this resolve the case where spec.nprocPerNode is set in elastic mode? In #1947, the issue happens when spec.nprocPerNode is set in elastic mode.

@kuizhiqing
Member Author

@kuizhiqing I am a bit confused. How does this resolve the case where spec.nprocPerNode is set in elastic mode? In #1947, the issue happens when spec.nprocPerNode is set in elastic mode.

@johnugeorge
The issue arises when no master is set, because before this PR spec.nprocPerNode did not affect the related env var.

It works correctly if elasticPolicy.NProcPerNode is set, but we recommend that users set spec.nprocPerNode rather than elasticPolicy.NProcPerNode. In the long term, we plan to deprecate elasticPolicy.NProcPerNode.

After this PR, spec.nprocPerNode will always work, just as we suggest.
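
For readers following the thread, here is a minimal sketch of the behavior described above: the nproc env var is derived from spec.nprocPerNode whenever that field is set, whether or not a master replica exists. This is not the operator's actual code. The `EnvVar` and `jobSpec` types, the `withNprocEnv` helper, and the env var name `PET_NPROC_PER_NODE` are assumptions made for illustration only.

```go
package main

import "fmt"

// EnvVar is a local stand-in for a container env entry (name/value pair),
// so the sketch needs no Kubernetes imports.
type EnvVar struct {
	Name  string
	Value string
}

// jobSpec is a hypothetical, trimmed-down stand-in for the PyTorchJob spec.
type jobSpec struct {
	NprocPerNode *string // corresponds to spec.nprocPerNode
	HasMaster    bool    // whether a master replica is defined
}

// envNprocPerNode is an assumed env var name, used here for illustration.
const envNprocPerNode = "PET_NPROC_PER_NODE"

// withNprocEnv appends the nproc env var whenever spec.nprocPerNode is set.
// It deliberately does not look at HasMaster: the point of the discussion is
// that the value should take effect even when no master is defined.
func withNprocEnv(spec jobSpec, env []EnvVar) []EnvVar {
	if spec.NprocPerNode == nil {
		return env
	}
	return append(env, EnvVar{
		Name:  envNprocPerNode,
		Value: *spec.NprocPerNode,
	})
}

func main() {
	n := "4"
	// Elastic-style job with workers only and no master replica.
	env := withNprocEnv(jobSpec{NprocPerNode: &n, HasMaster: false}, nil)
	fmt.Println(env)
}
```

The sketch keeps the nil check on the pointer field, mirroring the `*pytorchjob.Spec.NprocPerNode` dereference in the reviewed snippet. How the deprecated elasticPolicy.NProcPerNode interacts with this value is left out, since the thread does not spell it out.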

@johnugeorge
Member

Thanks @kuizhiqing for the explanation.

/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, kuizhiqing

@google-oss-prow google-oss-prow bot merged commit 2856aa0 into kubeflow:master Nov 20, 2023
32 checks passed