
ci-kubernetes-e2e-kubeadm-gce is pulling ci/latest not prior job build results #6978

Closed
leblancd opened this issue Feb 23, 2018 · 19 comments

@leblancd (Contributor) commented Feb 23, 2018

The ci-kubernetes-e2e-kubeadm-gce test jobs are consistently failing. If you look at the test results and logs for a given failing test job and compare them with the corresponding prior (prerequisite) bazel build job, you can see that the build job is pushing its build artifacts to the proper gs://kubernetes-release-dev/bazel/... storage location, but the test job is extracting (or attempting to extract) build results from ci/latest. The test job also seems to be using inconsistent versions of kubeadm/kubelet/kubernetes.

This is very likely causing the CI test outages described in kubernetes/kubernetes#59762.

This failure mode is also seen in other ci-kubernetes-e2e-XXXX test jobs, but I'd like to try a fix on one representative test job first, and then replicate it to the other test jobs if it works.

Consider, for example, this recent failing test job along with its prior build:
Build job: ci-kubernetes-bazel-build/228062
Test job: ci-kubernetes-e2e-kubeadm-gce/9615

In the bazel build log, the bazel build is pushing to the proper GCS bucket:

W0221 20:48:38.031] Run: ('bazel', 'run', '//:push-build', '--', 'gs://kubernetes-release-dev/bazel/v1.10.0-beta.0.260+ecc5eb67d96529')

However, as seen in the test job build log, the test job is calling kubetest with inconsistent kubeadm/kubelet/kubernetes versions:

W0221 20:51:11.985] Run: ('kubetest', '-v', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--deployment=kubernetes-anywhere', '--provider=kubernetes-anywhere', '--cluster=e2e-9615', '--gcp-network=e2e-9615', '--extract=ci/latest', '--gcp-zone=us-central1-f', '--kubernetes-anywhere-dump-cluster-logs=true', '--kubernetes-anywhere-kubelet-ci-version=latest', '--kubernetes-anywhere-kubernetes-version=ci/latest', '--test_args=--ginkgo.focus=\\[Conformance\\]|\\[Feature:BootstrapTokens\\]|\\[Feature:NodeAuthorizer\\] --minStartupPods=8', '--timeout=300m', '--kubernetes-anywhere-path=/workspace/kubernetes-anywhere', '--kubernetes-anywhere-phase2-provider=kubeadm', '--kubernetes-anywhere-cluster=e2e-9615', '--kubernetes-anywhere-kubeadm-version=gs://kubernetes-release-dev/bazel/v1.10.0-beta.0.260+ecc5eb67d96529/bin/linux/amd64/')

and it is needlessly/incorrectly extracting the ci/latest build:

W0221 20:51:20.857] 2018/02/21 20:51:20 extract_k8s.go:283: U=https://storage.googleapis.com/kubernetes-release-dev/ci R=v1.11.0-alpha.0.243+7a50f4a12fc158 get-kube.sh
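
For context, a ci/&lt;marker&gt; reference like this resolves to whatever concrete version the corresponding version-marker file in the kubernetes-release-dev bucket currently points at, independent of what the chained bazel build just pushed. Below is a minimal sketch of that resolution; the latest.txt marker name and the helper function are assumptions for illustration, not kubetest's actual code:

    # Minimal sketch, not kubetest's implementation: resolve a ci/<marker>
    # reference by reading the version marker file from the release-dev bucket.
    import urllib.request

    CI_BASE = "https://storage.googleapis.com/kubernetes-release-dev/ci"

    def resolve_ci_marker(marker="latest"):
        """Return the concrete version a ci/<marker> reference points at right now."""
        with urllib.request.urlopen("%s/%s.txt" % (CI_BASE, marker)) as resp:
            return resp.read().decode().strip()

    # e.g. resolve_ci_marker("latest") returned v1.11.0-alpha.0.243+7a50f4a12fc158
    # at the time of the run above, which moves independently of the version the
    # bazel build job pushed (v1.10.0-beta.0.260+ecc5eb67d96529).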

Furthermore, if you look at the serial log collected from the Kubernetes master node, you can see that the kubernetes-anywhere startup script is trying to pull down build artifacts from a GCS path that does not exist (a bazel/ path keyed by the ci/latest version, which the build job never pushed):

Feb 21 20:53:05 e2e-9615-master startup-script: INFO startup-script: + gsutil rsync gs://kubernetes-release-dev/bazel/v1.11.0-alpha.0.243+7a50f4a12fc158/bin/linux/amd64/ /tmp/k8s-debs

resulting in failures later on when the script tries to unpack debs packages from the /tmp/k8s-debs directory:

Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: + dpkg -i /tmp/k8s-debs/kubelet.deb /tmp/k8s-debs/kubeadm.deb /tmp/k8s-debs/kubectl.deb /tmp/k8s-debs/kubernetes-cni.deb
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: dpkg: error processing archive /tmp/k8s-debs/kubelet.deb (--install):
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script:  cannot access archive: No such file or directory
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: dpkg: error processing archive /tmp/k8s-debs/kubeadm.deb (--install):
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script:  cannot access archive: No such file or directory
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: dpkg: error processing archive /tmp/k8s-debs/kubectl.deb (--install):
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script:  cannot access archive: No such file or directory
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: dpkg: error processing archive /tmp/k8s-debs/kubernetes-cni.deb (--install):
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script:  cannot access archive: No such file or directory
Feb 21 20:53:08 e2e-9615-master startup-script: INFO startup-script: Errors were encountered while processing:
leblancd pushed a commit to leblancd/test-infra that referenced this issue Feb 23, 2018
…acts

Fixes kubernetes#6978
@BenTheElder (Member)

/assign
The solution to this is not particularly generic: bazel push-build doesn't currently push the same way as release builds. However, since this job is chained, IIRC we can get it to use the special "shared build" logic, which looks up the build location.
cc @ixdy

@leblancd (Contributor, Author)

@BenTheElder - I proposed a simple change to what's there now... which doesn't use the "shared build". Is this worth trying, or do we need the shared build approach?

@BenTheElder (Member) commented Feb 23, 2018 via email

@leblancd (Contributor, Author)

@BenTheElder - On the other hand, it looks like I'll have to add a gratuitous extract of ci/latest. My PR is showing this error for the pull-test-infra-bazel check ("--extract" or "--use-shared-build" is required):

I0223 21:04:17.259] AssertionError: e2e job needs --extract or --use-shared-build: ci-kubernetes-e2e-kubeadm-gce [u'--cluster=', u'--deployment=kubernetes-anywhere', u'--env-file=jobs/platform/gce.env', u'--gcp-zone=us-central1-f', u'--kubeadm=ci', u'--kubernetes-anywhere-dump-cluster-logs=true', u'--provider=kubernetes-anywhere', u'--test_args=--ginkgo.focus=\\[Conformance\\]|\\[Feature:BootstrapTokens\\]|\\[Feature:NodeAuthorizer\\] --minStartupPods=8', u'--timeout=300m']
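
That assertion comes from test-infra's config checks, which require every e2e job to declare where its build comes from. The function below is a rough, illustrative sketch of that kind of check, not the actual test code:

    # Illustrative sketch only: every e2e job must say where it gets a build
    # from, via --extract or --use-shared-build.
    def check_e2e_job(job_name, args):
        has_build_source = any(
            arg.startswith('--extract') or arg.startswith('--use-shared-build')
            for arg in args
        )
        assert has_build_source, (
            'e2e job needs --extract or --use-shared-build: %s %s' % (job_name, args))

    # The kubeadm-gce args in the error above include neither flag, so the
    # assertion fires, e.g.:
    # check_e2e_job('ci-kubernetes-e2e-kubeadm-gce',
    #               ['--cluster=', '--deployment=kubernetes-anywhere', '--kubeadm=ci'])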

leblancd pushed a commit to leblancd/test-infra that referenced this issue Feb 23, 2018
…acts

Fixes kubernetes#6978
@BenTheElder (Member) commented Feb 23, 2018 via email

@BenTheElder (Member)

OK, so this job is run_after_success on ci-kubernetes-bazel-build, which should publish a pointer consumable by --use-shared-build on the e2e job; we probably do want to use that.

@BenTheElder (Member)

Apologies, juggling perhaps a few too many things today :-)

@leblancd (Contributor, Author) commented Feb 23, 2018

@BenTheElder - The good news with what you're suggesting, I think, is that if we switch ci-kubernetes-e2e-kubeadm-gce to use --use-shared-build, it will run almost identically to the way pull-kubernetes-e2e-kubeadm-gce runs. That may eliminate some conditionals in scenarios/kubernetes_e2e.py, and there's a good chance the changes to the "ci" tests will work, since we have a working reference model. I'll take a look at what we need to do; it might not be too complicated.

@BenTheElder (Member)

Pointed out elsewhere, but https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-all#kubeadm-gce seems to be green now?

FWIW, shared builds in the PR jobs are not a great model right now: often the build job finishes but the refs have changed since then, so the build is stale by the time the e2e triggers. We're probably moving away from it toward #6808 for presubmits.

@leblancd (Contributor, Author)

@BenTheElder - The tests are still failing intermittently, and if you look at some of the most recent failures, e.g.:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-gce/9620/
you can see that the "Version" listed is different from the "job-version". And in the build log for that failing test case, the versions still don't match:

W0223 02:56:14.450] Run: ('kubetest', '-v', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--deployment=kubernetes-anywhere', '--provider=kubernetes-anywhere', '--cluster=e2e-9620', '--gcp-network=e2e-9620', '--extract=ci/latest', '--gcp-zone=us-central1-f', '--kubernetes-anywhere-dump-cluster-logs=true', '--kubernetes-anywhere-kubelet-ci-version=latest', '--kubernetes-anywhere-kubernetes-version=ci/latest', '--test_args=--ginkgo.focus=\\[Conformance\\]|\\[Feature:BootstrapTokens\\]|\\[Feature:NodeAuthorizer\\] --minStartupPods=8', '--timeout=300m', '--kubernetes-anywhere-path=/workspace/kubernetes-anywhere', '--kubernetes-anywhere-phase2-provider=kubeadm', '--kubernetes-anywhere-cluster=e2e-9620', '--kubernetes-anywhere-kubeadm-version=gs://kubernetes-release-dev/bazel/v1.11.0-alpha.0.378+f0ca996274fb4b/bin/linux/amd64/')

So maybe, if the timing is just right, ci/latest happens to coincide with the version that was just built? I don't know enough to explain this, but I think it's just luck whenever these tests pass.

@leblancd (Contributor, Author)

Hi @BenTheElder: Looks like the shared-build approach won't work in this case.

The ci-kubernetes-bazel-build job that runs before ci-kubernetes-e2e-kubeadm-gce is configured as a periodic job:

- name: ci-kubernetes-bazel-build
  interval: 6h

Because it runs as a periodic, Prow will not set the $PULL_REFS environment variable (see the Job Environment Variables doc).

When $PULL_REFS isn't set, the test-infra/scenarios/kubernetes_bazel.py script won't upload a build location:

                pull_refs = os.getenv('PULL_REFS', '')
                gcs_shared = os.path.join(args.gcs_shared, pull_refs, 'bazel-build-location.txt')
                if pull_refs:
                    upload_string(gcs_shared, gcs_build)
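
For comparison, the consumer side that --use-shared-build relies on would look up that same pointer for the current PULL_REFS, so with no PULL_REFS there is nothing to read. A sketch under that assumption (illustrative only, not the actual kubetest implementation):

    # Sketch of the lookup --use-shared-build depends on; illustrative only.
    import os
    import subprocess

    def lookup_shared_build(gcs_shared):
        pull_refs = os.getenv('PULL_REFS', '')
        if not pull_refs:
            # Periodic jobs land here: Prow set no PULL_REFS, so the build job
            # never uploaded bazel-build-location.txt and there is nothing to find.
            return None
        location = os.path.join(gcs_shared, pull_refs, 'bazel-build-location.txt')
        return subprocess.check_output(['gsutil', 'cat', location]).decode().strip()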

@BenTheElder (Member)

Hi @leblancd, I added a TODO yesterday to stop running these as run_after_success so we can decouple the scheduling. I'm going to look into options, including just having the kubeadm job perform a build itself or, better yet, fixing #5905. I think the latter will work well; we use this pattern with other e2es consuming the non-bazel build.

leblancd pushed a commit to leblancd/test-infra that referenced this issue Feb 24, 2018
…acts

Fixes kubernetes#6978
@leblancd (Contributor, Author)

@BenTheElder - Yeah, the #5905 approach makes sense for overall efficiency, and it would eliminate the confusion between the last-build and ci/latest versions.

@leblancd (Contributor, Author)

@BenTheElder - Is there an approximate timeline for #5905? If that looks like it will happen soon, maybe we can hold off on this proposal and close this issue when #5905 is implemented?

@BenTheElder (Member)

O(soon), I want to see this fixed [I use kubeadm myself on the weekends so :-)] and I hope it's actually not too much work to go that route...

@BenTheElder (Member)

I've fixed #5905 and am working on switching the kubeadm jobs over to consume it.

@leblancd (Contributor, Author) commented Mar 2, 2018

@BenTheElder Outstanding! Do you have ideas on what's causing the race condition that @jessicaochen described in k/k issue #59766 (problem "[1]"), whereby test jobs look for the kubeadm binary in a GCS bucket before it has been populated?

@BenTheElder (Member)

Not sure; I haven't flipped them over yet. I need to refactor that PR, but first we needed to roll out a prow upgrade... which hit a bug :(

I'm going to flip ci-kubeadm-gce-e2e over first, to run independently and pick up builds from the ci/latest-bazel tag, and then switch the rest once that's ironed out.
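
The ci/latest-bazel tag suggests a version-marker pattern: after a successful bazel build, write the built version into a marker file that e2e jobs resolve on their own schedule instead of being chained via run_after_success. A rough sketch of the publishing side, where the object path, bucket, and upload mechanics are assumptions for illustration:

    # Hypothetical sketch of publishing a ci/latest-bazel version marker after a
    # successful bazel build; the object path and mechanics are assumptions.
    import subprocess

    def publish_version_marker(version, bucket='gs://kubernetes-release-dev'):
        marker = '%s/ci/latest-bazel.txt' % bucket
        # gsutil cp - <dest> reads the object contents from stdin.
        subprocess.run(['gsutil', 'cp', '-', marker], input=version.encode(), check=True)

    # e.g. publish_version_marker('v1.10.0-beta.0.260+ecc5eb67d96529'); an e2e job
    # resolving ci/latest-bazel would then pick up exactly this build.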

@BenTheElder (Member)

This should be fixed now.
