studyJob cannot recover once Completed or Failed #291

Closed
hougangliu opened this issue Dec 13, 2018 · 8 comments

@hougangliu
Member

Currently a studyJob cannot recover once it is Completed or Failed.
After a studyJob CR is created, I can update it with "kubectl apply" or by other means, but if the studyJob condition is Completed or Failed, the controller never starts the next suggestion schedule.
For example, a user creates a studyJob with an invalid workerSpec, which causes spawnWorker to return an error and the studyJob to go to Failed. The user then corrects the workerSpec with "kubectl apply", expecting the studyJob to be re-triggered. However, nothing happens.

@here we should discuss what should happen when a studyJob CR is updated while it is Running, Completed, or Failed. The relevant check in the controller is:

	if instance.Status.Condition == katibv1alpha1.ConditionCompleted || instance.Status.Condition == katibv1alpha1.ConditionFailed {
		nextSuggestionSchedule = false
	}
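
One possible direction, as a minimal sketch only: let a Failed studyJob be re-handled when its spec has changed since the controller last looked at it, while a Completed one stays finished. The types below are simplified stand-ins rather than the real katibv1alpha1 definitions, and the ObservedGeneration status field is an assumption (it is not mentioned anywhere in this issue). The sketch also assumes that metadata.generation advances when the spec is edited.

```go
package main

import "fmt"

// Condition is a simplified stand-in for the studyJob condition type.
type Condition string

const (
	ConditionRunning   Condition = "Running"
	ConditionCompleted Condition = "Completed"
	ConditionFailed    Condition = "Failed"
)

// StudyJob is a hypothetical, trimmed-down model of the CR.
type StudyJob struct {
	Generation         int64     // assumed to advance when the spec is edited
	ObservedGeneration int64     // hypothetical: last generation the controller handled
	Condition          Condition // current condition of the studyJob
}

// shouldScheduleNextSuggestion reports whether the controller should
// start the next suggestion schedule for the given job.
func shouldScheduleNextSuggestion(job StudyJob) bool {
	switch job.Condition {
	case ConditionCompleted:
		// A Completed job is never re-triggered.
		return false
	case ConditionFailed:
		// A Failed job is retried only if its spec changed after the failure.
		return job.Generation > job.ObservedGeneration
	default:
		return true
	}
}

func main() {
	failedUnchanged := StudyJob{Generation: 1, ObservedGeneration: 1, Condition: ConditionFailed}
	failedCorrected := StudyJob{Generation: 2, ObservedGeneration: 1, Condition: ConditionFailed}

	fmt.Println("failed, spec unchanged:", shouldScheduleNextSuggestion(failedUnchanged))
	fmt.Println("failed, spec corrected:", shouldScheduleNextSuggestion(failedCorrected))
}
```

With a check like this, the invalid-workerSpec example above would be retried after the corrected manifest is applied, and an untouched Failed job would stay idle.
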
@hougangliu
Member Author

/help

@k8s-ci-robot added the "help wanted" label on Dec 15, 2018
@hougangliu
Member Author

/remove-help

@k8s-ci-robot removed the "help wanted" label on Dec 20, 2018
@hougangliu
Member Author

@YujiOshima can you add the community/discussion label to this issue?

@richardsliu added this to New in 0.5.0 via automation on Jan 18, 2019
@richardsliu moved this from New to Releasing & Testing in 0.5.0 on Jan 18, 2019
@richardsliu moved this from Releasing & Testing to Hyperparameter Tuning in 0.5.0 on Jan 18, 2019
@hougangliu
Member Author

hougangliu commented Jan 23, 2019

  1. At the very least, a Failed studyJob should be re-handled when it is updated.
  2. For a Completed studyJob, we should reject updates with a validating webhook. This requires upgrading controller-runtime; the webhook could also validate the studyJob itself, which would fix "studyJob controller is blocked by bad CR manifests" #314.
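
The two rules above could be expressed as an update check roughly like the following. This is a sketch of the decision logic only, not a controller-runtime webhook: the types, field names, and the validateUpdate helper are all hypothetical, and wiring it into an admission webhook is left out.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// Condition is a simplified stand-in for the studyJob condition type.
type Condition string

const (
	ConditionCompleted Condition = "Completed"
	ConditionFailed    Condition = "Failed"
)

// StudyJobSpec is a hypothetical, trimmed-down spec; field names are illustrative.
type StudyJobSpec struct {
	StudyName  string
	WorkerSpec string
}

// StudyJob is a hypothetical model holding just the spec and condition.
type StudyJob struct {
	Spec      StudyJobSpec
	Condition Condition
}

var errCompletedImmutable = errors.New("spec of a Completed studyJob cannot be updated")

// validateUpdate rejects spec changes to a Completed studyJob and allows
// everything else, including corrections to a Failed one.
func validateUpdate(oldJob, newJob StudyJob) error {
	if oldJob.Condition == ConditionCompleted && !reflect.DeepEqual(oldJob.Spec, newJob.Spec) {
		return errCompletedImmutable
	}
	return nil
}

func main() {
	completed := StudyJob{Spec: StudyJobSpec{StudyName: "demo", WorkerSpec: "v1"}, Condition: ConditionCompleted}
	edited := completed
	edited.Spec.WorkerSpec = "v2"

	fmt.Println("update to Completed job:", validateUpdate(completed, edited))
}
```

Comparing only the spec means status-only writes from the controller are not blocked by this check.
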

@jlewi
Contributor

jlewi commented Mar 10, 2019

This seems like it's working as intended to me.

Once a job reaches a terminal state (failed or succeeded), updates to the job should not be allowed.
This is consistent with how native K8s jobs work.

If a user wants to update the spec they could create a new job.

/cc @johnugeorge @richardsliu

@johnugeorge
Member

Agreed. K8s Jobs work the same way.

@gaocegege
Member

/close

We deprecated v1alpha1.

@k8s-ci-robot

@gaocegege: Closing this issue.

