Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835

Merged
merged 1 commit into from
Jun 18, 2023

tenzen-y
Member

@tenzen-y tenzen-y commented Jun 17, 2023

What this PR does / why we need it:
When gang scheduling is enabled, the e2e tests don't check whether jobs stay pending for a while.
As a result, the gang-scheduling tests pass if a job meets the Created=true and Running=false conditions for just a moment.
I added a check that jobs stay pending for a while.

Also, I fixed a test bug where volcano wasn't set as the schedulerName on jobs when testing the volcano integration.
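
To illustrate why a momentary Created=true / Running=false observation is weak evidence, here is a minimal sketch of a snapshot predicate over job conditions. The function name `looks_pending` and the dict shape (`type` and `status` keys holding strings) are assumptions for illustration, not the SDK's actual types.

```python
from typing import Dict, List


def looks_pending(conditions: List[Dict[str, str]]) -> bool:
    """Return True if Created is True and Running is absent or not True."""
    status_by_type = {c["type"]: c["status"] for c in conditions}
    created = status_by_type.get("Created") == "True"
    running = status_by_type.get("Running") == "True"
    return created and not running


# A job that was just created has no Running condition yet, so the predicate
# holds even if its pods are about to be scheduled normally.
just_created = [{"type": "Created", "status": "True"}]
# A moment later the operator may add Running=True to the same job.
started = just_created + [{"type": "Running", "status": "True"}]
```

Both `just_created` and a genuinely unschedulable job satisfy the predicate at a single point in time, which is why the check needs to be repeated over a period.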

Note: I faced errors in #1834.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

@tenzen-y tenzen-y changed the title from "WIP: Add a checking that pods are pending for a while" to "WIP: Add a check that pods have been pending for a while" Jun 17, 2023
@coveralls

coveralls commented Jun 17, 2023

Pull Request Test Coverage Report for Build 5302424174

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 13 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.06%) to 33.821%

Files with Coverage Reduction | New Missed Lines | %
pkg/controller.v1/mpi/mpijob_controller.go | 4 | 77.5%
pkg/controller.v1/pytorch/pytorchjob_controller.go | 9 | 58.19%

Totals (Coverage Status)
Change from base Build 5296670827: -0.06%
Covered Lines: 3327
Relevant Lines: 9837

💛 - Coveralls

@tenzen-y tenzen-y changed the title from "WIP: Add a check that pods have been pending for a while" to "Add a check that pods have been pending for a while" Jun 17, 2023
@tenzen-y tenzen-y marked this pull request as ready for review June 17, 2023 16:08
@tenzen-y
Member Author

Maybe this PR will resolve #1832?

@tenzen-y
Member Author

cc: @lowang-bh
/assign @johnugeorge

if client.is_job_running(name, namespace, job_kind):
    raise Exception(f"{job_kind} shouldn't be in Running condition")
# Job shouldn't have a Running condition.
if client.is_job_running(name, namespace, job_kind):
Member

Can you explain what you mean by "pending for a while"? Are you referring to a job that has been created but is not running? If so, will the job get into the running state after a retry?

Member Author

I meant unschedulable pods (gang scheduling).

Member Author

@tenzen-y tenzen-y Jun 17, 2023

If this test code reads the job condition before the training-operator updates it from Running=false to Running=true, and the job has Running=false or no Running condition at all, the test passes unintentionally.

So, let's imagine the following situation:

Current e2e:

  1. Test: Deploys a job with the gang scheduling setting (.runPolicy.schedulingPolicy).
  2. Operator: Fails to set schedulerName=volcano on the job, or creates an incorrect PodGroup.
  3. Test: Gets the job with Running=false or without a Running condition.
  4. Pods: Pods are immediately scheduled to a Node and start, since the job doesn't have the appropriate gang scheduling settings.
  5. Operator: Updates the job condition to Running=true.
  6. Test: Succeeds! (Unintended)
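
The race in the steps above can be narrowed by polling several times instead of once. Below is a minimal sketch, not the code merged in this PR: the helper name `assert_stays_not_running`, the attempt count, and the interval are hypothetical, and the job-state lookup is passed in as a callable so the example runs without a cluster.

```python
import time
from typing import Callable


def assert_stays_not_running(
    is_running: Callable[[], bool],
    attempts: int = 3,
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Raise if the job is observed Running on any of `attempts` polls."""
    for attempt in range(1, attempts + 1):
        if is_running():
            raise RuntimeError(
                f"job reached Running on attempt {attempt}/{attempts}"
            )
        # Wait between polls, but not after the final one.
        if attempt < attempts:
            sleep(interval_seconds)


# Simulate the race: Running=false on the first poll, Running=true afterwards.
observations = iter([False, True, True])
try:
    assert_stays_not_running(lambda: next(observations), sleep=lambda _: None)
    caught = False
except RuntimeError:
    caught = True
```

In the e2e test the callable would presumably wrap the call shown in the diff, for example `lambda: client.is_job_running(name, namespace, job_kind)`. A single poll would have accepted the first `False` observation; the repeated polls catch the later transition.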

Member Author

@tenzen-y tenzen-y Jun 17, 2023

Note: The verify_unschedulable_job_e2e function verifies that the gang-scheduler integrations work correctly.

Member

Thanks for the explanation.

elif gang_scheduler_name == TEST_GANG_SCHEDULER_NAME_VOLCANO:
-    return ""
+    return TEST_GANG_SCHEDULER_NAME_VOLCANO
Member Author

@johnugeorge In fact, even though we forgot to set schedulerName to volcano in the podSpec, the e2e passed in #1831.

Member

This is necessary; it is used to set the scheduler for the gang-scheduling e2e. I forgot to change it in the last PR, sorry.
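
For readers following the thread, here is a minimal sketch of the mapping being discussed. It is an illustration only: the helper names `pod_scheduler_name` and `build_pod_spec`, the container fields, and the constant's value are assumptions (the thread shows only the constant's name and schedulerName=volcano).

```python
# Assumed value; the thread shows only the constant's name.
TEST_GANG_SCHEDULER_NAME_VOLCANO = "volcano"


def pod_scheduler_name(gang_scheduler_name: str) -> str:
    """Hypothetical helper: map the gang scheduler under test to a pod schedulerName."""
    if gang_scheduler_name == TEST_GANG_SCHEDULER_NAME_VOLCANO:
        return TEST_GANG_SCHEDULER_NAME_VOLCANO
    # No gang scheduler under test: leave schedulerName unset.
    return ""


def build_pod_spec(gang_scheduler_name: str) -> dict:
    """Hypothetical pod spec builder used only for this illustration."""
    spec = {"containers": [{"name": "worker", "image": "example/image"}]}
    scheduler_name = pod_scheduler_name(gang_scheduler_name)
    # Only set schedulerName when a gang scheduler is actually selected.
    if scheduler_name:
        spec["schedulerName"] = scheduler_name
    return spec
```

A unit-level assertion such as `build_pod_spec("volcano")["schedulerName"] == "volcano"` would catch an empty return value before the e2e run, independently of the pending check.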

@tenzen-y tenzen-y changed the title from "Add a check that pods have been pending for a while" to "Add a check pods are not scheduled when testing gang-scheduler integrations in e2e" Jun 18, 2023
@tenzen-y tenzen-y changed the title from "Add a check pods are not scheduled when testing gang-scheduler integrations in e2e" to "Add check pods are not scheduled when testing gang-scheduler integrations in e2e" Jun 18, 2023
…ions in e2e

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@johnugeorge
Member

/lgtm
/approve

@google-oss-prow google-oss-prow bot added the lgtm label Jun 18, 2023
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit e002b8a into kubeflow:master Jun 18, 2023
24 checks passed
@tenzen-y tenzen-y deleted the fix-gang-scheduling-e2e branch June 18, 2023 17:43