
flake: deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied #16025

Closed
smarterclayton opened this issue Aug 29, 2017 · 23 comments
Labels: component/apps, kind/test-flake, priority/P0

@smarterclayton (Contributor):

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16022/test_pull_request_origin_extended_conformance_gce/6827/#-deploymentconfigs-with-minimum-ready-seconds-set-conformance-should-not-transition-the-deployment-to-complete-before-satisfied

deploymentconfigs with minimum ready seconds set [Conformance] should not transition the deployment to Complete before satisfied 1m32s

/tmp/openshift/build-rpm-release/tito/rpmbuild-originBXy0jB/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:965
Expected error:
    <*errors.errorString | 0xc420481fb0>: {
        s: "deployment shouldn't be completed before ReadyReplicas become AvailableReplicas",
    }
    deployment shouldn't be completed before ReadyReplicas become AvailableReplicas
not to have occurred
/tmp/openshift/build-rpm-release/tito/rpmbuild-originBXy0jB/BUILD/origin-3.7.0/_output/local/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:950
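For context, a minimal sketch (not the actual test code in deployments.go) of the invariant the test enforces: the DC must never report the Complete condition while ReadyReplicas still exceeds AvailableReplicas, i.e. while some ready pods have not yet waited out minReadySeconds. The types and names below are simplified stand-ins, not the origin API types.

```go
package main

import (
	"errors"
	"fmt"
)

// dcStatus is a simplified stand-in for the DeploymentConfig status fields
// the test inspects; this is not the origin API type.
type dcStatus struct {
	ReadyReplicas     int32
	AvailableReplicas int32
	Complete          bool // true once the Complete condition has been set
}

// checkNotCompleteEarly returns the error from the flake report whenever the
// DC claims completion while ready pods are still waiting out minReadySeconds
// (visible as ReadyReplicas > AvailableReplicas).
func checkNotCompleteEarly(s dcStatus) error {
	if s.Complete && s.ReadyReplicas > s.AvailableReplicas {
		return errors.New("deployment shouldn't be completed before ReadyReplicas become AvailableReplicas")
	}
	return nil
}

func main() {
	// The buggy ordering seen with the old deployer: Complete is set while a
	// ready pod is still inside its minReadySeconds window.
	if err := checkNotCompleteEarly(dcStatus{ReadyReplicas: 1, AvailableReplicas: 0, Complete: true}); err != nil {
		fmt.Println("flake reproduced:", err)
	}
}
```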
@smarterclayton added the kind/test-flake and component/apps labels on Aug 29, 2017
@smarterclayton (Author):

@openshift/sig-platform-management @mfojtik @tnozicka

@tnozicka (Contributor):

On first look:

Aug 28 23:05:53.732: INFO: At 2017-08-28 23:05:52 -0400 EDT - event for minreadytest-1-deploy: {kubelet ci-prtest-5a37c28-6827-ig-n-qgx8} Killing: Killing container with id docker://deployment:Need to kill Pod

@smarterclayton It seems we are out of memory; can we increase that for our jobs? I saw this in #16003 as well.

@smarterclayton (Author):

That can happen for other reasons too. What's in the master logs?

We can't really increase memory; nothing our tests are doing should be hitting 8 GB of total use. Maybe someone should trace a run.

@bparees (Contributor) commented Aug 29, 2017

@tnozicka (Contributor):

@openshift/sig-continuous-infrastructure can someone please trace this? It has been happening quite often lately.

@smarterclayton any chance that your parallelism PR raised the memory requirements?

@smarterclayton (Author):

The parallelism change was for integration tests; this is e2e. The master logs are in the artifacts dir.

@tnozicka (Contributor):

Is it possible that the test infra doesn't build the deployer image and uses an old one?

I can see in the logs that it uses openshift/origin-deployer:v3.7.0-alpha.0, but it could have built that from the current commit and just tagged it this way...

If that's the case, I might have been derailed in my previous investigation.

Actually, trying this with the four-week-old deployer image gives me a failure rate of 1/3; using the one built from the same commit, 0/16 and counting.

@tnozicka (Contributor):

So it seems to be true:
https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended_conformance_gce/6827/s3/download/nodes/qgx8/generated/docker.info

docker.io/openshift/origin-deployer     v3.7.0-alpha.0 a5bb3aff72a9  3 weeks ago   1.052 GB

This isn't built from current master, which is why this test is flaky. It needs the updated openshift/origin-deployer image.

Why aren't we building the images and using them for tests, since they are part of the codebase as well? @stevekuznetsov @Kargakis Is this a CI bug?

@smarterclayton (Author) commented Aug 29, 2017 via email

@stevekuznetsov (Contributor):
The GCE test cannot and will not build images. How did the change to the image merge without GCE passing?

@stevekuznetsov (Contributor):
From the most recent release job:

...
[openshift/origin-deployer] --> FROM openshift/origin
[openshift/origin-deployer] --> LABEL io.k8s.display-name="OpenShift Origin Deployer"       io.k8s.description="This is a component of OpenShift Origin and executes the user deployment process to roll out new containers. It may be used as a base image for building your own custom deployer image."       io.openshift.tags="openshift,deployer"
[openshift/origin-deployer] --> USER 1001
[openshift/origin-deployer] --> ENTRYPOINT ["/usr/bin/openshift-deploy"]
[openshift/origin-deployer] --> Committing changes to openshift/origin-deployer:cf93cc3 ...
[openshift/origin-deployer] --> Tagged as openshift/origin-deployer:latest
[openshift/origin-deployer] --> Done
...
[INFO] Pushing docker.io/openshift/origin-deployer:latest...
The push refers to a repository [docker.io/openshift/origin-deployer]
3fd512fc9b33: Preparing
65d78bfd863a: Preparing
7e927f48afaa: Preparing
b362758f4793: Preparing
b362758f4793: Layer already exists
65d78bfd863a: Mounted from openshift/origin-base
3fd512fc9b33: Mounted from openshift/origin
7e927f48afaa: Mounted from openshift/origin-pod
latest: digest: sha256:de40deb9369540c4dcd620839cb6e21b323889947a04498c22d276cc25d97347 size: 1161

@tnozicka (Contributor):

@smarterclayton we fixed how minReadySeconds is counted in the deployer pod a few days ago.

The old deployer pod and the new one count it differently, so depending on timing and cluster utilization the results may or may not match (they mostly do).
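To illustrate the difference being described, here is a minimal sketch of the availability rule, assuming the standard Kubernetes semantics: a ready pod only counts as available once it has been ready for at least minReadySeconds, so two components sampling from different clocks can briefly disagree. This is illustrative, not the deployer's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// isAvailable follows the standard Kubernetes rule (a simplified sketch, not
// the deployer's actual code): a ready pod counts as available only once it
// has been ready for at least minReadySeconds.
func isAvailable(readySince time.Time, minReadySeconds int32, now time.Time) bool {
	return !readySince.IsZero() &&
		now.Sub(readySince) >= time.Duration(minReadySeconds)*time.Second
}

func main() {
	readySince := time.Now().Add(-4 * time.Second) // pod became ready 4s ago

	// Two observers sampling at slightly different moments can disagree near
	// the minReadySeconds boundary -- the timing-dependent mismatch between
	// the deployer's count and the RC controller's count described above.
	deployerNow := time.Now()
	controllerNow := deployerNow.Add(2 * time.Second)

	fmt.Println("deployer sees available:  ", isAvailable(readySince, 5, deployerNow))   // false (4s < 5s)
	fmt.Println("controller sees available:", isAvailable(readySince, 5, controllerNow)) // true (6s >= 5s)
}
```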

@mfojtik (Contributor) commented Aug 29, 2017:

Do we know when this first occurred? What changes to deployments did we make recently that would cause three tests to flake heavily?

@stevekuznetsov (Contributor):

@smarterclayton looks like the release job is not tagging the images as 3.7 or whatever is necessary; can you update that?

@mfojtik (Contributor) commented Aug 29, 2017:

@tnozicka thanks, we'll see if this improves when we get the updated images out.

@tnozicka (Contributor):

BTW, this is the PR that fixed the deployer: #14954. (The test suddenly became flaky while it was merging.)

@stevekuznetsov (Contributor) commented Aug 29, 2017:

The latest tag should also be pushed in openshift-eng/aos-cd-jobs@fab6ea7.

Not doing that for now.

@smarterclayton (Author) commented Aug 29, 2017 via email

@tnozicka (Contributor) commented Aug 29, 2017:

@smarterclayton I don't think it will. The deployer used to count minReadySeconds by pod availability itself, but over time the RC gained minReadySeconds as well, counted separately by the controller. So when you deployed a DC there was a brief moment when the RC wasn't available due to minReadySeconds but the DC was. We have unified it so the DC now uses the RC's minReadySeconds.

The test won't tolerate the old deployer because it checks the precise order of events and validates every state, to make sure this works now, because it sometimes didn't before.

> Will this cause a problem when a user updates to master 3.7 while deployer 3.6 pods are running?

It shouldn't cause any issues. (It just won't fix the issue described above without updating the deployer. That said, we should make sure QA tests it as well.)
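A sketch of what the unified approach looks like in client-go terms (modern client-go signatures; illustrative, not the actual origin deployer code): instead of computing pod availability itself, the deployer-side wait can simply watch the RC's status, which already accounts for minReadySeconds.

```go
package deployutil

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForRCAvailable defers availability accounting to the RC controller: the
// RC's status.availableReplicas already honors minReadySeconds, so this code
// never has to re-implement that clock itself.
func waitForRCAvailable(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			rc, err := client.CoreV1().ReplicationControllers(ns).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			// Done only when every desired replica has been ready for at
			// least minReadySeconds, as judged by the RC controller.
			return rc.Spec.Replicas != nil && rc.Status.AvailableReplicas == *rc.Spec.Replicas, nil
		})
}
```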

@smarterclayton (Author) commented Aug 30, 2017 via email

@tnozicka (Contributor):

I can't get to the master logs because Jenkins has been down with a 503 for several hours now. (We don't have master logs on appspot.com AFAICS, only in the S3 artifacts in Jenkins.)

From the build log this is a different (timeout) issue, which hopefully won't manifest that frequently, although I should fix it once I can see the master logs and reproduce it.

@smarterclayton (Author) commented Aug 30, 2017 via email
