
ci-kubernetes-e2e-kubeadm-gce is pulling ci/latest not prior job build results #6978

Closed
leblancd opened this issue Feb 23, 2018 · 19 comments

@leblancd (Contributor) commented Feb 23, 2018

The ci-kubernetes-e2e-kubeadm-gce test jobs are consistently failing. If you look at the test results and logs for a given failing test job and compare them with the corresponding prior (prerequisite) bazel build job, you can see that the build job is pushing its build artifacts to the proper gs://kubernetes-release-dev/bazel/... storage location, but the test job is extracting (or attempting to extract) build results from ci/latest. The test job also seems to be using inconsistent versions of kubeadm/kubelet/kubernetes.

This is very likely causing the CI test outages described in kubernetes/kubernetes#59762.

This failure mode is also seen in other ci-kubernetes-e2e-XXXX test jobs, but I'd like to try a fix on one representative test job first, and then replicate it to the other test jobs if it works.

Consider, for example, this recent failing test job along with its prior build:
Build job: ci-kubernetes-bazel-build/228062
Test job: ci-kubernetes-e2e-kubeadm-gce/9615

In the bazel build log, the bazel build is pushing to the proper GCS bucket:

W0221 20:48:38.031] Run: ('bazel', 'run', '//:push-build', '--', 'gs://kubernetes-release-dev/bazel/v1.10.0-beta.0.260+ecc5eb67d96529')

However, as seen in the test job build log, the test job is calling kubetest with inconsistent kubeadm/kubelet/kubernetes versions:

W0221 20:51:11.985] Run: ('kubetest', '-v', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--deployment=kubernetes-anywhere', '--provider=kubernetes-anywhere', '--cluster=e2e-9615', '--gcp-network=e2e-9615', '--extract=ci/latest', '--gcp-zone=us-central1-f', '--kubernetes-anywhere-dump-cluster-logs=true', '--kubernetes-anywhere-kubelet-ci-version=latest', '--kubernetes-anywhere-kubernetes-version=ci/latest', '--test_args=--ginkgo.focus=\\[Conformance\\]|\\[Feature:BootstrapTokens\\]|\\[Feature:NodeAuthorizer\\] --minStartupPods=8', '--timeout=300m', '--kubernetes-anywhere-path=/workspace/kubernetes-anywhere', '--kubernetes-anywhere-phase2-provider=kubeadm', '--kubernetes-anywhere-cluster=e2e-9615', '--kubernetes-anywhere-kubeadm-version=gs://kubernetes-release-dev/bazel/v1.10.0-beta.0.260+ecc5eb67d96529/bin/linux/amd64/')

and it is needlessly/incorrectly extracting the ci/latest build:

W0221 20:51:20.857] 2018/02/21 20:51:20 extract_k8s.go:283: U=https://storage.googleapis.com/kubernetes-release-dev/ci R=v1.11.0-alpha.0.243+7a50f4a12fc158 get-kube.sh
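
For context, a ci/&lt;marker&gt; reference like this resolves to whatever concrete version the corresponding version-marker file in the kubernetes-release-dev bucket currently points at, independent of what the chained bazel build just pushed. Below is a minimal sketch of that resolution; the latest.txt marker name and the helper function are assumptions for illustration, not kubetest's actual code:

    # Minimal sketch, not kubetest's implementation: resolve a ci/<marker>
    # reference by reading the version marker file from the release-dev bucket.
    import urllib.request

    CI_BASE = "https://storage.googleapis.com/kubernetes-release-dev/ci"

    def resolve_ci_marker(marker="latest"):
        """Return the concrete version a ci/<marker> reference points at right now."""
        with urllib.request.urlopen("%s/%s.txt" % (CI_BASE, marker)) as resp:
            return resp.read().decode().strip()

    # e.g. resolve_ci_marker("latest") returned v1.11.0-alpha.0.243+7a50f4a12fc158
    # at the time of the run above, which moves independently of the version the
    # bazel build job pushed (v1.10.0-beta.0.260+ecc5eb67d96529).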

Furthermore, if you look at the serial log collected from the Kubernetes master node, you can see that the kubernetes-anywhere startup script is trying to pull down build artifacts from a GCS path that does not exist (a bazel/ path keyed by the ci/latest version, which the build job never pushed):

Feb 21 20:53:05 e2e-9615-master startup-script: INFO startup-script: + gsutil rsync gs://kubernetes-release-dev/bazel/v1.11.0-alpha.0.243+7a50f4a12fc158/bin/linux/amd64/ /tmp/k8s-debs

resulting in failures later on when the script tries to unpack debs packages from the /tmp/k8s-debs directory:

Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: + dpkg -i /tmp/k8s-debs/kubelet.deb /tmp/k8s-debs/kubeadm.deb /tmp/k8s-debs/kubectl.deb /tmp/k8s-debs/kubernetes-cni.deb
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: dpkg: error processing archive /tmp/k8s-debs/kubelet.deb (--install):
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script:  cannot access archive: No such file or directory
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: dpkg: error processing archive /tmp/k8s-debs/kubeadm.deb (--install):
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script:  cannot access archive: No such file or directory
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: dpkg: error processing archive /tmp/k8s-debs/kubectl.deb (--install):
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script:  cannot access archive: No such file or directory
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: dpkg: error processing archive /tmp/k8s-debs/kubernetes-cni.deb (--install):
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script:  cannot access archive: No such file or directory
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: Errors were encountered while processing:
leblancd pushed a commit to leblancd/test-infra that referenced this issue Feb 23, 2018
…acts

Fixes kubernetes#6978
@BenTheElder (Member)

/assign
The solution to this is not particularly generic: bazel push-build doesn't currently push the same way as release builds. However, since this job is chained, IIRC we can get it to use the special "shared build" logic, which looks up the build location.
cc @ixdy

@leblancd (Contributor, Author)

@BenTheElder - I proposed a simple change to what's there now... which doesn't use the "shared build". Is this worth trying, or do we need the shared build approach?

@BenTheElder (Member) commented Feb 23, 2018 via email

@leblancd (Contributor, Author)

@BenTheElder - On the other hand, it looks like I'll have to add a gratuitous extract of ci/latest. My PR is showing this error for the pull-test-infra-bazel check ("--extract" or "--use-shared-build" is required):

I0223 21:04:17.259] AssertionError: e2e job needs --extract or --use-shared-build: ci-kubernetes-e2e-kubeadm-gce [u'--cluster=', u'--deployment=kubernetes-anywhere', u'--env-file=jobs/platform/gce.env', u'--gcp-zone=us-central1-f', u'--kubeadm=ci', u'--kubernetes-anywhere-dump-cluster-logs=true', u'--provider=kubernetes-anywhere', u'--test_args=--ginkgo.focus=\\[Conformance\\]|\\[Feature:BootstrapTokens\\]|\\[Feature:NodeAuthorizer\\] --minStartupPods=8', u'--timeout=300m']
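
That assertion comes from test-infra's config checks, which require every e2e job to declare where its build comes from. The function below is a rough, illustrative sketch of that kind of check, not the actual test code:

    # Illustrative sketch only: every e2e job must say where it gets a build
    # from, via --extract or --use-shared-build.
    def check_e2e_job(job_name, args):
        has_build_source = any(
            arg.startswith('--extract') or arg.startswith('--use-shared-build')
            for arg in args
        )
        assert has_build_source, (
            'e2e job needs --extract or --use-shared-build: %s %s' % (job_name, args))

    # The kubeadm-gce args in the error above include neither flag, so the
    # assertion fires, e.g.:
    # check_e2e_job('ci-kubernetes-e2e-kubeadm-gce',
    #               ['--cluster=', '--deployment=kubernetes-anywhere', '--kubeadm=ci'])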

leblancd pushed a commit to leblancd/test-infra that referenced this issue Feb 23, 2018
…acts

Fixes kubernetes#6978
@BenTheElder (Member) commented Feb 23, 2018 via email

@BenTheElder (Member)

OK, so this job is run_after_success on ci-kubernetes-bazel-build, which should publish a pointer consumable by --use-shared-build on the e2e job; we probably do want to use that.

@BenTheElder (Member)

Apologies, juggling perhaps a few too many things today :-)

@leblancd (Contributor, Author) commented Feb 23, 2018

@BenTheElder - The good news with what you're suggesting, I think, is that if we switch ci-kubernetes-e2e-kubeadm-gce to use --use-shared-build, it will run almost identically to the way pull-kubernetes-e2e-kubeadm-gce runs. That may eliminate some conditionals in scenarios/kubernetes_e2e.py, and there's a good chance the changes to the "ci" tests will work, since we have a working reference model. I'll take a look at what we need to do; it might not be too complicated.

@BenTheElder (Member)

Pointed out elsewhere, but https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-all#kubeadm-gce seems to be green now?

FWIW, shared builds in the PR jobs are not a great model right now: often the build job finishes but the refs have changed since then, so the build is stale by the time the e2e triggers. We're probably moving away from it toward #6808 for presubmits.

@leblancd (Contributor, Author)

@BenTheElder - The tests are still failing intermittently, and if you look at some of the most recent failures, e.g.:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-gce/9620/
you can see that the "Version" listed is different from the "job-version". And in the build log for that failing test case, the versions still don't match:

W0223 02:56:14.450] Run: ('kubetest', '-v', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--deployment=kubernetes-anywhere', '--provider=kubernetes-anywhere', '--cluster=e2e-9620', '--gcp-network=e2e-9620', '--extract=ci/latest', '--gcp-zone=us-central1-f', '--kubernetes-anywhere-dump-cluster-logs=true', '--kubernetes-anywhere-kubelet-ci-version=latest', '--kubernetes-anywhere-kubernetes-version=ci/latest', '--test_args=--ginkgo.focus=\\[Conformance\\]|\\[Feature:BootstrapTokens\\]|\\[Feature:NodeAuthorizer\\] --minStartupPods=8', '--timeout=300m', '--kubernetes-anywhere-path=/workspace/kubernetes-anywhere', '--kubernetes-anywhere-phase2-provider=kubeadm', '--kubernetes-anywhere-cluster=e2e-9620', '--kubernetes-anywhere-kubeadm-version=gs://kubernetes-release-dev/bazel/v1.11.0-alpha.0.378+f0ca996274fb4b/bin/linux/amd64/')

So maybe, if the timing is just right, ci/latest happens to coincide with the version that was just built? I don't know enough to explain this, but I think it's just luck whenever these tests pass.

@leblancd (Contributor, Author)

Hi @BenTheElder: Looks like the shared-build approach won't work in this case.

The ci-kubernetes-bazel-build job that runs before ci-kubernetes-e2e-kubeadm-gce is configured as a periodic job:

- name: ci-kubernetes-bazel-build
  interval: 6h

Because it runs as a periodic, Prow will not set the $PULL_REFS environment variable (see the Job Environment Variables doc).

When $PULL_REFS isn't set, the test-infra/scenarios/kubernetes_bazel.py script won't upload a build location:

                pull_refs = os.getenv('PULL_REFS', '')
                gcs_shared = os.path.join(args.gcs_shared, pull_refs, 'bazel-build-location.txt')
                if pull_refs:
                    upload_string(gcs_shared, gcs_build)
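
For comparison, the consumer side that --use-shared-build relies on would look up that same pointer for the current PULL_REFS, so with no PULL_REFS there is nothing to read. A sketch under that assumption (illustrative only, not the actual kubetest implementation):

    # Sketch of the lookup --use-shared-build depends on; illustrative only.
    import os
    import subprocess

    def lookup_shared_build(gcs_shared):
        pull_refs = os.getenv('PULL_REFS', '')
        if not pull_refs:
            # Periodic jobs land here: Prow set no PULL_REFS, so the build job
            # never uploaded bazel-build-location.txt and there is nothing to find.
            return None
        location = os.path.join(gcs_shared, pull_refs, 'bazel-build-location.txt')
        return subprocess.check_output(['gsutil', 'cat', location]).decode().strip()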

@BenTheElder (Member)

Hi @leblancd, I added a TODO yesterday to stop running these as run_after_success so we can decouple the scheduling. I'm going to look into options, including just having the kubeadm job perform a build itself or, better yet, fixing #5905. I think the latter will work well; we use this pattern with other e2es consuming the non-bazel build.

leblancd pushed a commit to leblancd/test-infra that referenced this issue Feb 24, 2018
…acts

Fixes kubernetes#6978
@leblancd (Contributor, Author)

@BenTheElder - Yeah, the #5905 approach makes sense for overall efficiency, and it would eliminate the confusion between the last-build and ci/latest versions.

@leblancd (Contributor, Author)

@BenTheElder - Is there an approximate timeline for #5905? If that looks like it will happen soon, maybe we can hold off on this proposal and close this issue when #5905 is implemented?

@BenTheElder (Member)

O(soon), I want to see this fixed [I use kubeadm myself on the weekends so :-)] and I hope it's actually not too much work to go that route...

@BenTheElder (Member)

I've fixed #5905 and am working on switching the kubeadm jobs over to consume it.

@leblancd (Contributor, Author) commented Mar 2, 2018

@BenTheElder Outstanding! Do you have ideas on what's causing the race condition that @jessicaochen described in k/k issue #59766 (problem "[1]"), whereby test jobs look for the kubeadm binary in a GCS bucket before it has been populated?

@BenTheElder (Member)

Not sure; I haven't flipped them over yet. I need to refactor that PR, but first we needed to roll out a prow upgrade... which hit a bug :(

I'm going to flip ci-kubeadm-gce-e2e over first, to run independently and pick up builds from the ci/latest-bazel tag, and then switch the rest once that's ironed out.
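
The ci/latest-bazel tag suggests a version-marker pattern: after a successful bazel build, write the built version into a marker file that e2e jobs resolve on their own schedule instead of being chained via run_after_success. A rough sketch of the publishing side, where the object path, bucket, and upload mechanics are assumptions for illustration:

    # Hypothetical sketch of publishing a ci/latest-bazel version marker after a
    # successful bazel build; the object path and mechanics are assumptions.
    import subprocess

    def publish_version_marker(version, bucket='gs://kubernetes-release-dev'):
        marker = '%s/ci/latest-bazel.txt' % bucket
        # gsutil cp - <dest> reads the object contents from stdin.
        subprocess.run(['gsutil', 'cp', '-', marker], input=version.encode(), check=True)

    # e.g. publish_version_marker('v1.10.0-beta.0.260+ecc5eb67d96529'); an e2e job
    # resolving ci/latest-bazel would then pick up exactly this build.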

@BenTheElder (Member)

This should be fixed now.
