Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add default Intel MPI env variables to MPIJob #1804

Merged
merged 1 commit into from
Jun 19, 2023

Conversation

tkatila
Copy link
Contributor

@tkatila tkatila commented May 9, 2023

What this PR does / why we need it:

When the container is equipped with Intel MPI, the end user has had to manually add Intel MPI specific env variables for the MPIJob. This modifies the operator to add the variables with default values, if there are no user defined variables.

@google-cla
Copy link

google-cla bot commented May 9, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@google-oss-prow google-oss-prow bot requested review from jinchihe and kuizhiqing May 9, 2023 12:35
@johnugeorge
Copy link
Member

/assign @tenzen-y

@coveralls
Copy link

coveralls commented May 9, 2023

Pull Request Test Coverage Report for Build 5002839939

  • 28 of 28 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.4%) to 40.135%

Totals Coverage Status
Change from base Build 4988117671: 0.4%
Covered Lines: 2791
Relevant Lines: 6954

💛 - Coveralls

@tenzen-y
Copy link
Member

@tkatila Can you sign a CLA? Thanks for your contributions.

@tkatila
Copy link
Contributor Author

tkatila commented May 10, 2023

@tkatila Can you sign a CLA? Thanks for your contributions.

Yes, I will. I just have to get confirmation from corp level on it first.

pkg/controller.v1/mpi/mpijob.go Outdated Show resolved Hide resolved
pkg/controller.v1/mpi/mpijob.go Outdated Show resolved Hide resolved
pkg/controller.v1/mpi/mpijob_controller.go Outdated Show resolved Hide resolved
@@ -1152,6 +1152,20 @@ func (jc *MPIJobReconciler) newLauncher(mpiJob *kubeflowv1.MPIJob, kubectlDelive
})
}

// Add default Intel MPI bootstrap variables if none is provided by the user
if !hasIntelMPIBootstrap(container.Env) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this implementation, if only I_MPI_HYDRA_BOOTSTRAP is set (I_MPI_HYDRA_BOOTSTRAP_EXEC is unset), it will skip this section.

Can we check both I_MPI_HYDRA_BOOTSTRAP and I_MPI_HYDRA_BOOTSTRAP_EXEC?
And then, should we set env vars if either is unset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this implementation, if only I_MPI_HYDRA_BOOTSTRAP is set (I_MPI_HYDRA_BOOTSTRAP_EXEC is unset), it will skip this section.

Can we check both I_MPI_HYDRA_BOOTSTRAP and I_MPI_HYDRA_BOOTSTRAP_EXEC?

hasIntelMPIBootstrap checks for I_MPI_HYDRA_BOOTSTRAP prefix so it includes both ..BOOTSTRAP and ..BOOSTRAP_EXEC. Actually, it also applies to two other variables that I didn't take into account: https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-9/hydra-environment-variables.html#GUID-CF3DA5C7-8B4D-468F-B463-42DCA7E0E55E and https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-9/hydra-environment-variables.html#GUID-F0F7163C-3674-4AEC-A945-9D62BFA1CE68

And then, should we set env vars if either is unset?

Maybe it's better that way. I'll change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasIntelMPIBootstrap checks for I_MPI_HYDRA_BOOTSTRAP prefix so it includes both ..BOOTSTRAP and ..BOOSTRAP_EXEC. Actually, it also applies to two other variables that I didn't take into account: https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-9/hydra-environment-variables.html#GUID-CF3DA5C7-8B4D-468F-B463-42DCA7E0E55E and https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-9/hydra-environment-variables.html#GUID-F0F7163C-3674-4AEC-A945-9D62BFA1CE68

Oh, it is great information :)
Mentioning which env vars (I_MPI_HYDRA_BOOTSTRAP, I_MPI_HYDRA_BOOTSTRAP_EXEC , I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS , and I_MPI_HYDRA_BOOTSTRAP_AUTOFORK ) are considered in the comment would be great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a mention of the extra two variables to the comments. I also modified the code to set either of the missing variables.

fyi, still waiting on the CLA, should get it sorted during this week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a mention of the extra two variables to the comments. I also modified the code to set either of the missing variables.

Thank you for the updates! I left a few comments.

fyi, still waiting on the CLA, should get it sorted during this week.

Thanks for letting me know!

@tkatila tkatila force-pushed the intel-mpi-env-variables branch from f37df17 to 1ae623a Compare May 15, 2023 09:40
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels May 15, 2023
pkg/controller.v1/mpi/mpijob_controller.go Outdated Show resolved Hide resolved
pkg/controller.v1/mpi/mpijob_controller_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/mpi/mpijob_controller_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/mpi/mpijob_controller_test.go Outdated Show resolved Hide resolved
@tkatila tkatila force-pushed the intel-mpi-env-variables branch from 1ae623a to b5948f1 Compare May 16, 2023 06:46
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing my comments.
I left the last review comments for some nit.

pkg/controller.v1/mpi/mpijob_controller_test.go Outdated Show resolved Hide resolved
pkg/controller.v1/mpi/mpijob_controller_test.go Outdated Show resolved Hide resolved
@tkatila tkatila force-pushed the intel-mpi-env-variables branch from b08dee8 to 8d43989 Compare May 17, 2023 11:52
@eero-t
Copy link

eero-t commented May 17, 2023

There were several mpi-operator IntelMPI support PRs in 2021, that this can be compared against: https://github.com/kubeflow/mpi-operator/pulls?q=is%3Apr+intel+mpi+is%3Aclosed

Most important ones are:

Without them, user needs to manually specify slot counts on the MPIJob command lines.

These could also relevant, but do not belong to this PR:

@tenzen-y
Copy link
Member

tenzen-y commented May 17, 2023

There were several mpi-operator IntelMPI support PRs in 2021, that this can be compared against: https://github.com/kubeflow/mpi-operator/pulls?q=is%3Apr+intel+mpi+is%3Aclosed

Most important ones are:

Without them, user needs to manually specify slot counts on the MPIJob command lines.

These could also relevant, but do not belong to this PR:

@eero-t Thank you for the great suggestion. However, we can discuss your proposals in another issue.
So, can you create an issue?

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkatila Thank you for the great contribution!

This PR looks good to me.

Please let me know once you signed the CLA.

@eero-t eero-t mentioned this pull request May 17, 2023
@eero-t
Copy link

eero-t commented May 17, 2023

@eero-t Thank you for the great suggestion. However, we can discuss your proposals in another issue. So, can you create an issue?

Done (see above). Most likely I won't be able to verify it when it eventually gets fixed as mpi-operator project seems to be as active as this project and already support all 3 major MPI variants, so I'll go with it. I just noticed those issues when reviewing k8s MPI implementation alternatives...

@tenzen-y
Copy link
Member

@eero-t Thank you for the great suggestion. However, we can discuss your proposals in another issue. So, can you create an issue?

Done (see above). Most likely I won't be able to verify it when it eventually gets fixed as mpi-operator project seems to be as active as this project and already support all 3 major MPI variants, so I'll go with it. I just noticed those issues when reviewing k8s MPI implementation alternatives...

Yes, the mpi-operator is an active project with some performance improvement and supports the queueing feature by Kueue.
So, I would suggest using the mpi-operator if you don't have any issues to use the mpi-operator.

@tenzen-y
Copy link
Member

tenzen-y commented Jun 9, 2023

@tkatila ping

@tkatila
Copy link
Contributor Author

tkatila commented Jun 10, 2023

@tkatila ping

pong. I'm really sorry for the delay. I hit some corporate scheninagans, but most of the things sorted themselves out this Friday. I'm hoping to sign the CLA next week.

@tenzen-y
Copy link
Member

@tkatila ping

pong. I'm really sorry for the delay. I hit some corporate scheninagans, but most of the things sorted themselves out this Friday. I'm hoping to sign the CLA next week.

Thank you for letting me know!

@tkatila
Copy link
Contributor Author

tkatila commented Jun 12, 2023

The CLA is now done.

@tenzen-y
Copy link
Member

tenzen-y commented Jun 12, 2023

@tkatila Thanks! Welcome to Kubeflow Community 🎉
/lgtm
/assign @johnugeorge @terrytangyuan

@google-oss-prow google-oss-prow bot added the lgtm label Jun 12, 2023
@tenzen-y
Copy link
Member

ping @johnugeorge

@johnugeorge
Copy link
Member

Sorry. I missed this.

@tkatila Can you do a rebase?

Co-authored-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
@tkatila tkatila force-pushed the intel-mpi-env-variables branch from 8d43989 to 55b13b6 Compare June 18, 2023 18:17
@google-oss-prow google-oss-prow bot removed the lgtm label Jun 18, 2023
@tkatila
Copy link
Contributor Author

tkatila commented Jun 18, 2023

@tkatila Can you do a rebase?

Sure. Rebased.

@tenzen-y
Copy link
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jun 18, 2023
Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

/lgtm

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan, tkatila

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 7383114 into kubeflow:master Jun 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants