
flake: deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied #16025

Closed
smarterclayton opened this issue Aug 29, 2017 · 23 comments
Labels: component/apps, kind/test-flake, priority/P0

@smarterclayton (Contributor):

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16022/test_pull_request_origin_extended_conformance_gce/6827/#-deploymentconfigs-with-minimum-ready-seconds-set-conformance-should-not-transition-the-deployment-to-complete-before-satisfied

deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied 1m32s

/tmp/openshift/build-rpm-release/tito/rpmbuild-originBXy0jB/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:965
Expected error:
    <*errors.errorString | 0xc420481fb0>: {
        s: "deployment shouldn't be completed before ReadyReplicas become AvailableReplicas",
    }
    deployment shouldn't be completed before ReadyReplicas become AvailableReplicas
not to have occurred
/tmp/openshift/build-rpm-release/tito/rpmbuild-originBXy0jB/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:950
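For context, a minimal sketch (not the actual test code in deployments.go) of the invariant the test enforces: the DC must never report the Complete condition while ReadyReplicas still exceeds AvailableReplicas, i.e. while some ready pods have not yet waited out minReadySeconds. The types and names below are simplified stand-ins, not the origin API types.

```go
package main

import (
	"errors"
	"fmt"
)

// dcStatus is a simplified stand-in for the DeploymentConfig status fields
// the test inspects; this is not the origin API type.
type dcStatus struct {
	ReadyReplicas     int32
	AvailableReplicas int32
	Complete          bool // true once the Complete condition has been set
}

// checkNotCompleteEarly returns the error from the flake report whenever the
// DC claims completion while ready pods are still waiting out minReadySeconds
// (visible as ReadyReplicas > AvailableReplicas).
func checkNotCompleteEarly(s dcStatus) error {
	if s.Complete && s.ReadyReplicas > s.AvailableReplicas {
		return errors.New("deployment shouldn't be completed before ReadyReplicas become AvailableReplicas")
	}
	return nil
}

func main() {
	// The buggy ordering seen with the old deployer: Complete is set while a
	// ready pod is still inside its minReadySeconds window.
	if err := checkNotCompleteEarly(dcStatus{ReadyReplicas: 1, AvailableReplicas: 0, Complete: true}); err != nil {
		fmt.Println("flake reproduced:", err)
	}
}
```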
@smarterclayton added the kind/test-flake and component/apps labels on Aug 29, 2017
@smarterclayton (Author):

@openshift/sig-platform-management @mfojtik @tnozicka

@tnozicka (Contributor):

On first look:

Aug 28 23:05:53.732: INFO: At 2017-08-28 23:05:52 -0400 EDT - event for minreadytest-1-deploy: {kubelet ci-prtest-5a37c28-6827-ig-n-qgx8} Killing: Killing container with id docker://deployment:Need to kill Pod

@smarterclayton It seems we are out of memory; can we increase that for our jobs? I saw this in #16003 as well.

@smarterclayton (Author):

That can happen for other reasons too. What's in the master logs?

We can't really increase memory; nothing our tests are doing should be hitting 8 GB of total use. Maybe someone should trace a run.

@bparees (Contributor) commented Aug 29, 2017

@tnozicka (Contributor):

@openshift/sig-continuous-infrastructure can someone please trace this? It has been happening quite often lately.

@smarterclayton any chance that your parallelism PR raised the memory requirements?

@smarterclayton (Author):

The parallelism change was for integration tests; this is e2e. The master logs are in the artifacts dir.

@tnozicka (Contributor):

Is it possible that the test infra doesn't build the deployer image and uses an old one?

I can see in the logs that it uses openshift/origin-deployer:v3.7.0-alpha.0, but it could have built that from the current commit and just tagged it this way...

If that's the case, I might have been derailed in my previous investigation.

Actually, trying this with the four-week-old deployer image gives me a failure rate of 1/3; using the one built from the same commit, 0/16 and counting.

@tnozicka (Contributor):

So it seems to be true:
https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/6827/s3/download/nodes/qgx8/generated/docker.info

docker.io/openshift/origin-deployer     v3.7.0-alpha.0 a5bb3aff72a9  3 weeks ago   1.052 GB

This isn't built from current master, which is why this test is flaky. It needs the updated openshift/origin-deployer image.

Why aren't we building the images and using them for tests, since they are part of the codebase as well? @stevekuznetsov @Kargakis Is this a CI bug?

@smarterclayton (Author) commented Aug 29, 2017 via email

@stevekuznetsov (Contributor):
The GCE test cannot and will not build images. How did the change to the image merge without GCE passing?

@stevekuznetsov (Contributor):
From the most recent release job:

...
[openshift/origin-deployer] --> FROM openshift/origin
[openshift/origin-deployer] --> LABEL io.k8s.display-name="OpenShift Origin Deployer"       io.k8s.description="This is a component of OpenShift Origin and executes the user deployment process to roll out new containers. It may be used as a base image for building your own custom deployer image."       io.openshift.tags="openshift,deployer"
[openshift/origin-deployer] --> USER 1001
[openshift/origin-deployer] --> ENTRYPOINT ["/usr/bin/openshift-deploy"]
[openshift/origin-deployer] --> Committing changes to openshift/origin-deployer:cf93cc3 ...
[openshift/origin-deployer] --> Tagged as openshift/origin-deployer:latest
[openshift/origin-deployer] --> Done
...
[INFO] Pushing docker.io/openshift/origin-deployer:latest...
The push refers to a repository [docker.io/openshift/origin-deployer]
3fd512fc9b33: Preparing
65d78bfd863a: Preparing
7e927f48afaa: Preparing
b362758f4793: Preparing
b362758f4793: Layer already exists
65d78bfd863a: Mounted from openshift/origin-base
3fd512fc9b33: Mounted from openshift/origin
7e927f48afaa: Mounted from openshift/origin-pod
latest: digest: sha256:de40deb9369540c4dcd620839cb6e21b323889947a04498c22d276cc25d97347 size: 1161

@tnozicka (Contributor):

@smarterclayton we fixed how minReadySeconds is counted in the deployer pod a few days ago.

The old deployer pod and the new one count it differently, so depending on timing and cluster utilization the results may or may not match (they mostly do).
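To illustrate the difference being described, here is a minimal sketch of the availability rule, assuming the standard Kubernetes semantics: a ready pod only counts as available once it has been ready for at least minReadySeconds, so two components sampling from different clocks can briefly disagree. This is illustrative, not the deployer's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// isAvailable follows the standard Kubernetes rule (a simplified sketch, not
// the deployer's actual code): a ready pod counts as available only once it
// has been ready for at least minReadySeconds.
func isAvailable(readySince time.Time, minReadySeconds int32, now time.Time) bool {
	return !readySince.IsZero() &&
		now.Sub(readySince) >= time.Duration(minReadySeconds)*time.Second
}

func main() {
	readySince := time.Now().Add(-4 * time.Second) // pod became ready 4s ago

	// Two observers sampling at slightly different moments can disagree near
	// the minReadySeconds boundary -- the timing-dependent mismatch between
	// the deployer's count and the RC controller's count described above.
	deployerNow := time.Now()
	controllerNow := deployerNow.Add(2 * time.Second)

	fmt.Println("deployer sees available:  ", isAvailable(readySince, 5, deployerNow))   // false (4s < 5s)
	fmt.Println("controller sees available:", isAvailable(readySince, 5, controllerNow)) // true (6s >= 5s)
}
```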

@mfojtik (Contributor) commented Aug 29, 2017:

Do we know when this first occurred? What changes to deployments did we make recently that would cause three tests to flake heavily?

@stevekuznetsov (Contributor):

@smarterclayton looks like the release job is not tagging the images as 3.7 or whatever is necessary; can you update that?

@mfojtik (Contributor) commented Aug 29, 2017:

@tnozicka thanks, we'll see if this improves when we get the updated images out.

@tnozicka (Contributor):

BTW, this is the PR that fixed the deployer: #14954. (The test suddenly became flaky while it was merging.)

@stevekuznetsov (Contributor) commented Aug 29, 2017:

The latest tag should also be pushed in openshift-eng/aos-cd-jobs@fab6ea7.

Not doing that for now.

@smarterclayton (Author) commented Aug 29, 2017 via email

@tnozicka (Contributor) commented Aug 29, 2017:

@smarterclayton I don't think it will. The deployer used to count minReadySeconds by pod availability itself, but over time the RC gained minReadySeconds as well, counted separately by the controller. So when you deployed a DC there was a brief moment when the RC wasn't available due to minReadySeconds but the DC was. We have unified it so the DC now uses the RC's minReadySeconds.

The test won't tolerate the old deployer because it checks the precise order of events and validates every state, to make sure this works now, because it sometimes didn't before.

> Will this cause a problem when a user updates to master 3.7 while deployer 3.6 pods are running?

It shouldn't cause any issues. (It just won't fix the issue described above without updating the deployer. That said, we should make sure QA tests it as well.)
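A sketch of what the unified approach looks like in client-go terms (modern client-go signatures; illustrative, not the actual origin deployer code): instead of computing pod availability itself, the deployer-side wait can simply watch the RC's status, which already accounts for minReadySeconds.

```go
package deployutil

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForRCAvailable defers availability accounting to the RC controller: the
// RC's status.availableReplicas already honors minReadySeconds, so this code
// never has to re-implement that clock itself.
func waitForRCAvailable(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			rc, err := client.CoreV1().ReplicationControllers(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			// Done only when every desired replica has been ready for at
			// least minReadySeconds, as judged by the RC controller.
			return rc.Spec.Replicas != nil && rc.Status.AvailableReplicas == *rc.Spec.Replicas, nil
		})
}
```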

@smarterclayton (Author) commented Aug 30, 2017 via email

@tnozicka (Contributor):

I can't get to the master logs because Jenkins has been down with a 503 for several hours now. (We don't have master logs on appspot.com AFAICS, only in the S3 artifacts in Jenkins.)

From the build log this is a different (timeout) issue, which hopefully won't manifest that frequently, although I should fix it once I can see the master logs and reproduce it.

@smarterclayton (Author) commented Aug 30, 2017 via email
