This repository has been archived by the owner on May 25, 2023. It is now read-only.

Added backfill #805

Closed
wants to merge 2 commits into from

Conversation


@pdgetrf pdgetrf commented Apr 19, 2019

Added a backfill action that allows lower-priority jobs to run when higher-priority jobs cannot run but still hold resources.

Tested with new unit tests, e2e tests, and local manual testing.
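
To make the idea concrete, here is a minimal sketch of backfill selection. It is not the code from this PR: the `Job` type, its fields, and the `Backfill` function are hypothetical, and resources are reduced to a single integer. Jobs are visited from highest to lowest priority, and a job that does not fit in the free capacity is skipped rather than blocking the jobs behind it.

```go
package main

import "sort"

// Job is a hypothetical, simplified stand-in for a scheduler job.
type Job struct {
	Name     string
	Priority int
	Request  int // resource units the whole job needs in order to start
}

// Backfill returns the names of the jobs that can start with the free
// capacity. Jobs are visited from highest to lowest priority; a job that
// does not fit is skipped, so lower-priority jobs behind it may still run.
func Backfill(jobs []Job, free int) []string {
	// Sort a copy so the caller's slice is left untouched.
	ordered := append([]Job(nil), jobs...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Priority > ordered[j].Priority
	})

	var started []string
	for _, j := range ordered {
		if j.Request <= free {
			free -= j.Request
			started = append(started, j.Name)
		}
	}
	return started
}
```

With 3 free units, a high-priority job that needs 8 units is skipped and a low-priority job that needs 2 units is started in its place.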

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 19, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pdgetrf
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: k82cn

If they are not already assigned, you can assign the PR to them by writing /assign @k82cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @pdgetrf. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot added the "needs-ok-to-test" (indicates a PR that requires an org member to verify it is safe to test) and "size/XL" (denotes a PR that changes 500-999 lines, ignoring generated files) labels on Apr 19, 2019
@pdgetrf
Author

pdgetrf commented Apr 19, 2019

signed CLA

@k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) and removed the "cncf-cla: no" label on Apr 19, 2019
@k8s-ci-robot added the "size/XXL" label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the "size/XL" label on Apr 19, 2019
@pdgetrf force-pushed the backfill branch 4 times, most recently from cf61bce to bb6056e (April 19, 2019 23:23)
config/kube-batch-conf.yaml (outdated, resolved)
@k82cn
Contributor

k82cn commented Apr 22, 2019

btw, please also open a PR for the design doc.

@k82cn
Contributor

k82cn commented Apr 22, 2019

and please also append the review comments from last time and highlight how they were addressed.

@dhzhuo

dhzhuo commented Apr 22, 2019

@k82cn The design doc for backfill and starvation prevention can be found in PR 821.

defer wg.Done()

annotation := map[string]string{v1alpha1.BackfillAnnotationKey: "true"}
err := ssn.PatchAnnotation(pendingTask, annotation)
Contributor

The annotation should also be cached; otherwise the scheduler cache and the apiserver may get out of sync.
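
One way to read this suggestion is "patch the API server first, then mirror the change into the local cache only if the patch succeeded". The sketch below illustrates that ordering. Everything in it (`PatchFunc`, `AnnotationCache`, `PatchAndCache`, `ErrEmptyPodKey`) is a hypothetical stand-in, not part of kube-batch or of this PR.

```go
package main

import (
	"errors"
	"sync"
)

// ErrEmptyPodKey is returned when no pod key is supplied (hypothetical).
var ErrEmptyPodKey = errors.New("empty pod key")

// PatchFunc is a hypothetical stand-in for the call that patches a pod's
// annotations on the API server.
type PatchFunc func(podKey string, annotations map[string]string) error

// AnnotationCache is a hypothetical in-memory mirror of pod annotations.
type AnnotationCache struct {
	mu    sync.Mutex
	byPod map[string]map[string]string
}

// NewAnnotationCache returns an empty cache.
func NewAnnotationCache() *AnnotationCache {
	return &AnnotationCache{byPod: map[string]map[string]string{}}
}

// PatchAndCache patches the API server first and mirrors the annotations
// into the cache only if the patch succeeded, so the cache never claims an
// annotation that the API server rejected.
func (c *AnnotationCache) PatchAndCache(patch PatchFunc, podKey string, annotations map[string]string) error {
	if podKey == "" {
		return ErrEmptyPodKey
	}
	if err := patch(podKey, annotations); err != nil {
		return err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.byPod[podKey] == nil {
		c.byPod[podKey] = map[string]string{}
	}
	for k, v := range annotations {
		c.byPod[podKey][k] = v
	}
	return nil
}

// Get returns the cached value of one annotation, if present.
func (c *AnnotationCache) Get(podKey, key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.byPod[podKey][key]
	return v, ok
}
```

The lock is taken only after the remote call returns, so a slow API server does not block readers of the cache.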

Author

WIP

func (ssn *Session) BackFillEligible(obj interface{}) bool {
	for _, tier := range ssn.Tiers {
		for _, plugin := range tier.Plugins {
			jrf, found := ssn.backFillEligibleFns[plugin.Name]
Contributor

should also be configurable

Author

"configurable" as in "can be configured to turn on and off"?

Contributor

yes.


The backfill function can be enabled or disabled through configuration. BackfillEligible is a function provided by each plugin to decide which jobs are eligible to be backfill jobs. What is the use case for conditionally turning the BackfillEligible function on or off?

Contributor

What is the use case for conditionally turning the BackfillEligible function on or off?

Different plugins may provide this callback, so we need a configuration option to control it.
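
A per-plugin switch of the kind requested here could look roughly like the sketch below. The types are trimmed-down stand-ins for the scheduler's own, the `BackfillEligibleDisabled` field name is hypothetical, and the "every enabled plugin must agree" semantics is an assumption, since the excerpt above shows only the first lines of the real function.

```go
package main

// PluginOption is a hypothetical, trimmed-down plugin configuration entry.
type PluginOption struct {
	Name string
	// BackfillEligibleDisabled is a hypothetical flag that turns off this
	// plugin's BackfillEligible callback without disabling the plugin.
	BackfillEligibleDisabled bool
}

// Tier groups plugins, mirroring the tier/plugin layout in the excerpt.
type Tier struct {
	Plugins []PluginOption
}

// EligibleFn is the callback a plugin registers to vote on eligibility.
type EligibleFn func(obj interface{}) bool

// Session is a simplified stand-in for the scheduler session.
type Session struct {
	Tiers               []Tier
	backFillEligibleFns map[string]EligibleFn
}

// BackFillEligible reports whether obj may be backfilled. Assumption: every
// enabled plugin that registered a callback must agree; disabled callbacks
// and plugins without a callback are ignored.
func (ssn *Session) BackFillEligible(obj interface{}) bool {
	for _, tier := range ssn.Tiers {
		for _, plugin := range tier.Plugins {
			if plugin.BackfillEligibleDisabled {
				continue
			}
			fn, found := ssn.backFillEligibleFns[plugin.Name]
			if !found {
				continue
			}
			if !fn(obj) {
				return false
			}
		}
	}
	return true
}
```

With this shape, a plugin's veto can be switched off from configuration while the plugin's other callbacks stay active.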

pkg/scheduler/scheduler.go (outdated, resolved)
pkg/scheduler/scheduler.go (outdated, resolved)

"github.com/kubernetes-sigs/kube-batch/pkg/scheduler/conf"
"github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework"
"github.com/kubernetes-sigs/kube-batch/pkg/scheduler/plugins"
"gopkg.in/yaml.v2"
Contributor

keep the original format.

Author

what's the original format?

Contributor

	"io/ioutil"

	"gopkg.in/yaml.v2"

	"github.com/kubernetes-sigs/kube-batch/pkg/scheduler/conf"

Author

don't we have tools to enforce this? oh right, it's reverted!

Author

will wait until the tool is brought back to fix this. this shouldn't be fixed by hand.

pkg/scheduler/util.go (outdated, resolved)
pkg/scheduler/util.go (outdated, resolved)
@pdgetrf force-pushed the backfill branch 2 times, most recently from 0bddba3 to 21fedf1 (May 1, 2019 00:10)
@pdgetrf
Author

pdgetrf commented May 1, 2019

still working on a few issues. the e2e tests from the previous rollback in the master branch (not related to this PR) left a mess. attempting to fix it now, at least for backfill.

case err := <-errChannel:
	if err != nil {
		fmt.Println("error ", err)
		return err
Contributor

What happens if the Pod is patched but not dispatched because of some other error?


If dispatch fails, the pod remains unbound in the current scheduling round and therefore will not be started. In the next scheduling round, when we create the TaskInfo for a Pod in NewTaskInfo, we will notice that the Pod is unbound but has the backfill annotation. In this case, we will remove the backfill annotation from the Pod.

@pdgetrf It looks to me like we do not yet have the logic to remove the backfill annotation in NewTaskInfo.
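
The cleanup described above could be sketched as follows. This is an illustration, not code from the PR: `Pod` is a simplified stand-in for the Kubernetes type, `ClearStaleBackfill` is a hypothetical helper, and the annotation key's value is invented because the real value of `v1alpha1.BackfillAnnotationKey` is not shown in this thread.

```go
package main

// BackfillAnnotationKey is a hypothetical value; the real constant is
// v1alpha1.BackfillAnnotationKey, whose value is not shown in this thread.
const BackfillAnnotationKey = "example.com/backfill"

// Pod is a simplified stand-in for the Kubernetes Pod type.
type Pod struct {
	NodeName    string // empty while the pod is unbound
	Annotations map[string]string
}

// ClearStaleBackfill removes the backfill annotation from a pod that is
// still unbound, i.e. one that was patched in an earlier round but never
// dispatched. It reports whether the annotation was removed.
func ClearStaleBackfill(p *Pod) bool {
	if p.NodeName != "" {
		return false // bound pods keep their annotation
	}
	if _, ok := p.Annotations[BackfillAnnotationKey]; !ok {
		return false // nothing to clean up
	}
	delete(p.Annotations, BackfillAnnotationKey)
	return true
}
```

A real implementation would also need to patch the API server so that the removal is not local only.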

Contributor

In this case, we will remove the backfill annotation from the Pod.

It seems this is not implemented in this PR, right?

Author

@pdgetrf pdgetrf May 2, 2019

this can be fixed in a follow-up PR which, as you already suggested, would be smaller and easier to review. I do not see why this should block the current PR; fixing it here will only make this one even bigger. We are going to address this in the starvation PR anyway, and backfill will be disabled in this PR anyway. If your purpose is to further delay the starvation PR, including all the necessary testing, then yes, I agree we should fix it here in this PR.

Author

What happens if the Pod is patched but not dispatched because of some other error?

code has the answer.

pkg/scheduler/util.go (outdated, resolved)
@pdgetrf force-pushed the backfill branch 2 times, most recently from 61815db to 0f1900d (May 3, 2019 22:36)
@k8s-ci-robot added the "needs-rebase" label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 5, 2019
@k8s-ci-robot removed the "needs-rebase" label on May 6, 2019
@k8s-ci-robot
Contributor

@pdgetrf: PR needs rebase.


@k8s-ci-robot added the "needs-rebase" label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 18, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the "lifecycle/stale" label (denotes an issue or PR that has remained open with no activity and has become stale) on Aug 16, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the "lifecycle/rotten" label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the "lifecycle/stale" label on Sep 16, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close


Labels
cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
lifecycle/rotten (denotes an issue or PR that has aged beyond stale and will be auto-closed)
needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD)
ok-to-test (indicates a non-member PR verified by an org member that is safe to test)
size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files)

6 participants