
Podgroup state changed from running to inqueue after pod deleted #2208

Closed
shinytang6 opened this issue May 2, 2022 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@shinytang6
Member

shinytang6 commented May 2, 2022

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
job yaml:

apiVersion: batch.paddlepaddle.org/v1
kind: PaddleJob
metadata:
  name: wide-ande-deep2
spec:
  cleanPodPolicy: OnCompletion
  withGloo: 1
  worker:
    replicas: 1
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: paddle
            image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
  ps:
    replicas: 1
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: paddle
            image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1

In this case (cleanPodPolicy=OnCompletion), when the pods complete, they are deleted by the paddlejob controller.
The PodGroup status then transitions Inqueue => Running => Pending (all the pods are deleted) => Inqueue (it passes the enqueue action again) and stays Inqueue from then on, which results in unnecessary resource occupation.

related logic:

func jobStatus(ssn *Session, jobInfo *api.JobInfo) scheduling.PodGroupStatus {
	status := jobInfo.PodGroup.Status

	unschedulable := false
	for _, c := range status.Conditions {
		if c.Type == scheduling.PodGroupUnschedulableType &&
			c.Status == v1.ConditionTrue &&
			c.TransitionID == string(ssn.UID) {
			unschedulable = true
			break
		}
	}

	// If running tasks && unschedulable, unknown phase
	if len(jobInfo.TaskStatusIndex[api.Running]) != 0 && unschedulable {
		status.Phase = scheduling.PodGroupUnknown
	} else {
		allocated := 0
		for status, tasks := range jobInfo.TaskStatusIndex {
			if api.AllocatedStatus(status) || status == api.Succeeded {
				allocated += len(tasks)
			}
		}

		// If there're enough allocated resource, it's running
		if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
			status.Phase = scheduling.PodGroupRunning
		} else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue {
			// here the PodGroup status converts from Running to Pending
			status.Phase = scheduling.PodGroupPending
		}
	}

	status.Running = int32(len(jobInfo.TaskStatusIndex[api.Running]))
	status.Failed = int32(len(jobInfo.TaskStatusIndex[api.Failed]))
	status.Succeeded = int32(len(jobInfo.TaskStatusIndex[api.Succeeded]))

	return status
}

Environment:

  • Volcano Version: latest image
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@shinytang6 shinytang6 added the kind/bug Categorizes issue or PR as related to a bug. label May 2, 2022
@shinytang6
Member Author

My workaround: if the podgroup is unschedulable and its current state is Running, convert it to Unknown instead of letting it become Pending and then enter the enqueue action again.
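
A minimal sketch of how that workaround could look inside the jobStatus snippet above (same identifiers as that snippet; the exact condition and its placement are assumptions, not a merged patch):

	// Workaround sketch: an unschedulable podgroup whose recorded phase is still
	// Running (or that still has running tasks) is marked Unknown instead of
	// falling through to the branch that demotes it to Pending and lets the
	// enqueue action pick it up again.
	if unschedulable &&
		(len(jobInfo.TaskStatusIndex[api.Running]) != 0 ||
			jobInfo.PodGroup.Status.Phase == scheduling.PodGroupRunning) {
		status.Phase = scheduling.PodGroupUnknown
	} else {
		// ... allocated / MinMember logic unchanged from the snippet above ...
	}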

@shinytang6 shinytang6 changed the title Podgroup state becomes inqueue after pod deleted Podgroup state changed from running to inqueue after pod deleted May 2, 2022
@stale

stale bot commented Aug 10, 2022

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@Thor-wl Thor-wl removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 12, 2022
@stale

stale bot commented Nov 12, 2022

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2022
@stale

stale bot commented Jan 22, 2023

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jan 22, 2023
@bood

bood commented Nov 5, 2024

I ran into a similar issue when adopting Volcano in our project. I think there is a bug in the state-change logic of the jobStatus function:

  1. When there are not enough resources, the podgroup should fall back from Inqueue to Pending, according to the state changes in the delay-pod-creation design doc.
    The else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue condition seems to be inverted now. If it is changed to follow the design doc, the Running state will not be changed to Pending in your case (see the sketch after this list).

  2. PodGroupCompleted, introduced in PR Add podGroup completed phase #2667, also helps in this case if the pods complete successfully, but I think the pod-error case is still missed here.
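
For discussion, a rough sketch of what suggestion 1 could look like against the jobStatus snippet in the issue body (illustration only; identifiers come from that snippet, and the flipped condition is my reading of the design doc, not an actual PR):

		// Condition flipped to follow the delay-pod-creation design doc: only a
		// podgroup that is currently Inqueue falls back to Pending when it no
		// longer has enough allocated tasks; a Running podgroup is left as-is.
		if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
			status.Phase = scheduling.PodGroupRunning
		} else if jobInfo.PodGroup.Status.Phase == scheduling.PodGroupInqueue {
			status.Phase = scheduling.PodGroupPending
		}
		// The pod-error case (all pods Failed) would still need separate handling,
		// analogous to the PodGroupCompleted phase added in #2667.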

I can make a PR if it makes sense. @shinytang6
