Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Reconcilation fails when upgrading common to 0.3.6 #1394

Closed
gaocegege opened this issue Aug 30, 2021 · 5 comments
Closed

[bug] Reconcilation fails when upgrading common to 0.3.6 #1394

gaocegege opened this issue Aug 30, 2021 · 5 comments
Assignees
Labels

Comments

@gaocegege
Copy link
Member

gaocegege commented Aug 30, 2021

When the pods are created and running, the tfjob status is still hanging on Created.

The log of tf-operator:

{"filename":"record/event.go:282","level":"info","msg":"Event(v1.ObjectReference{Kind:\"TFJob\", Namespace:\"default\", Name:\"dist-mnist-for-e2e-test\", UID:\"95fff85a-176f-4fd1-afab-ce5298890be5\", APIVersion:\"kubeflow.org/v1\", ResourceVersion:\"1494926\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedCreatePod' Error creating: pods \"dist-mnist-for-e2e-test-worker-0\" already exists","time":"2021-08-30T10:48:59+08:00"}
{"filename":"tensorflow/controller.go:255","job":"default.dist-mnist-for-e2e-test","level":"info","msg":"TFJob has been deleted: default/dist-mnist-for-e2e-test","time":"2021-08-30T10:49:19+08:00"}
{"filename":"tensorflow/controller.go:255","job":"default.dist-mnist-for-e2e-test","level":"info","msg":"TFJob has been deleted: default/dist-mnist-for-e2e-test","time":"2021-08-30T10:50:21+08:00"}

Seems that the operator always tries to create the pod.

**We definitely need a way to prove that the changes in common are correct.

@gaocegege gaocegege changed the title [bug] Reconcilation fails when upgrade common to 0.3.6 [bug] Reconcilation fails when upgrading common to 0.3.6 Aug 30, 2021
@Jeffwan
Copy link
Member

Jeffwan commented Aug 30, 2021

Let me run some manual e2e test to check the problem. I think the error either comes from expectation pr or rtype pr.

@Jeffwan
Copy link
Member

Jeffwan commented Aug 30, 2021

I revert rtype PR and everything works fine. I think the problem is when it filters the pods with specific rtype (case sensitive), it doesn't find corresponding pods. In the end, it tries to create new ones.

I think at this moment, we should revert the change and consider refactor later and make it more careful. WDYT? @gaocegege

see kubeflow/common#158 for more details. If we plan to go that way, that's much easier to us to bump dependencies to 0.3.7 without that many rType conversion in tf-operator repo.

@Jeffwan Jeffwan self-assigned this Aug 30, 2021
@gaocegege
Copy link
Member Author

LGTM

@Jeffwan
Copy link
Member

Jeffwan commented Sep 1, 2021

both master and v1.2-branch are patched. We can close this issue

/close

@google-oss-robot
Copy link

@Jeffwan: Closing this issue.

In response to this:

both master and v1.2-branch are patched. We can close this issue

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants