
Podgroup state changed from running to inqueue after pod deleted #2208

Closed
shinytang6 opened this issue May 2, 2022 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@shinytang6
Member

shinytang6 commented May 2, 2022

What happened:

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
job yaml:

apiVersion: batch.paddlepaddle.org/v1
kind: PaddleJob
metadata:
  name: wide-ande-deep2
spec:
  cleanPodPolicy: OnCompletion
  withGloo: 1
  worker:
    replicas: 1
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: paddle
            image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1
  ps:
    replicas: 1
    template:
      spec:
        schedulerName: volcano
        containers:
          - name: paddle
            image: registry.baidubce.com/paddle-operator/demo-wide-and-deep:v1

In this case (cleanPodPolicy=OnCompletion), when the pods complete, they are deleted by the paddlejob controller.
The PodGroup status then transitions Inqueue => Running => Pending (all the pods are deleted) => Inqueue (it passes the enqueue action again) and stays Inqueue from then on, which results in unnecessary resource occupation.

related logic:

func jobStatus(ssn *Session, jobInfo *api.JobInfo) scheduling.PodGroupStatus {
	status := jobInfo.PodGroup.Status

	unschedulable := false
	for _, c := range status.Conditions {
		if c.Type == scheduling.PodGroupUnschedulableType &&
			c.Status == v1.ConditionTrue &&
			c.TransitionID == string(ssn.UID) {
			unschedulable = true
			break
		}
	}

	// If running tasks && unschedulable, unknown phase
	if len(jobInfo.TaskStatusIndex[api.Running]) != 0 && unschedulable {
		status.Phase = scheduling.PodGroupUnknown
	} else {
		allocated := 0
		for status, tasks := range jobInfo.TaskStatusIndex {
			if api.AllocatedStatus(status) || status == api.Succeeded {
				allocated += len(tasks)
			}
		}

		// If there're enough allocated resource, it's running
		if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
			status.Phase = scheduling.PodGroupRunning
		} else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue {
			// here the PodGroup status converts from Running to Pending
			status.Phase = scheduling.PodGroupPending
		}
	}

	status.Running = int32(len(jobInfo.TaskStatusIndex[api.Running]))
	status.Failed = int32(len(jobInfo.TaskStatusIndex[api.Failed]))
	status.Succeeded = int32(len(jobInfo.TaskStatusIndex[api.Succeeded]))

	return status
}

Environment:

  • Volcano Version: latest image
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@shinytang6 shinytang6 added the kind/bug Categorizes issue or PR as related to a bug. label May 2, 2022
@shinytang6
Member Author

My workaround: if the podgroup is unschedulable and its current state is Running, convert it to Unknown instead of letting it become Pending and then enter the enqueue action again.
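
A minimal sketch of how that workaround could look inside the jobStatus snippet above (same identifiers as that snippet; the exact condition and its placement are assumptions, not a merged patch):

	// Workaround sketch: an unschedulable podgroup whose recorded phase is still
	// Running (or that still has running tasks) is marked Unknown instead of
	// falling through to the branch that demotes it to Pending and lets the
	// enqueue action pick it up again.
	if unschedulable &&
		(len(jobInfo.TaskStatusIndex[api.Running]) != 0 ||
			jobInfo.PodGroup.Status.Phase == scheduling.PodGroupRunning) {
		status.Phase = scheduling.PodGroupUnknown
	} else {
		// ... allocated / MinMember logic unchanged from the snippet above ...
	}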

@shinytang6 shinytang6 changed the title Podgroup state becomes inqueue after pod deleted Podgroup state changed from running to inqueue after pod deleted May 2, 2022
@stale

stale bot commented Aug 10, 2022

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@Thor-wl Thor-wl removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 12, 2022
@stale

stale bot commented Nov 12, 2022

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2022
@stale

stale bot commented Jan 22, 2023

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Jan 22, 2023
@bood

bood commented Nov 5, 2024

I ran into a similar issue when adopting Volcano in our project. I think there is a bug in the state-change logic of the jobStatus function:

  1. When there are not enough resources, the podgroup should fall back from Inqueue to Pending, according to the state changes in the delay-pod-creation design doc.
    The else if jobInfo.PodGroup.Status.Phase != scheduling.PodGroupInqueue condition seems to be inverted now. If it is changed to follow the design doc, the Running state will not be changed to Pending in your case (see the sketch after this list).

  2. PodGroupCompleted, introduced in PR Add podGroup completed phase #2667, also helps in this case if the pods complete successfully, but I think the pod-error case is still missed here.
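
For discussion, a rough sketch of what suggestion 1 could look like against the jobStatus snippet in the issue body (illustration only; identifiers come from that snippet, and the flipped condition is my reading of the design doc, not an actual PR):

		// Condition flipped to follow the delay-pod-creation design doc: only a
		// podgroup that is currently Inqueue falls back to Pending when it no
		// longer has enough allocated tasks; a Running podgroup is left as-is.
		if int32(allocated) >= jobInfo.PodGroup.Spec.MinMember {
			status.Phase = scheduling.PodGroupRunning
		} else if jobInfo.PodGroup.Status.Phase == scheduling.PodGroupInqueue {
			status.Phase = scheduling.PodGroupPending
		}
		// The pod-error case (all pods Failed) would still need separate handling,
		// analogous to the PodGroupCompleted phase added in #2667.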

I can make a PR if it makes sense. @shinytang6
