ci-kubernetes-e2e-kubeadm-gce is pulling ci/latest not prior job build results #6978
Comments
/assign
@BenTheElder - I proposed a simple change to what's there now... which doesn't use the "shared build". Is this worth trying, or do we need the shared build approach?
Actually I think your change will indeed work for CI. We have some logic for presubmit to publish a file pointer to a well-known location, since the bazel build locations are a bit different. I think we actually don't need that for CI. My bad.
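For illustration, the presubmit "file pointer" mechanism mentioned above could look roughly like the following. This is a minimal sketch only; the bucket layout and file names are assumptions, not the actual paths test-infra uses.

```python
# Minimal sketch of a "shared build" pointer: the presubmit build job writes
# the GCS location of its bazel artifacts to a well-known file, and the e2e
# job reads that file instead of guessing the location. All paths here are
# hypothetical, chosen only to illustrate the idea.
import subprocess
import tempfile


def publish_build_pointer(build_gcs_path, pointer_gcs_path):
    """Record where the bazel build pushed its artifacts."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt') as f:
        f.write(build_gcs_path + '\n')
        f.flush()
        subprocess.check_call(['gsutil', 'cp', f.name, pointer_gcs_path])


def read_build_pointer(pointer_gcs_path):
    """Resolve the pointer back to the concrete build location."""
    out = subprocess.check_output(['gsutil', 'cat', pointer_gcs_path])
    return out.decode('utf-8').strip()
```

The build job would call publish_build_pointer() after pushing artifacts, and the e2e job would call read_build_pointer() before extracting, so both sides agree on exactly one build.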
@BenTheElder - On the other hand, it looks like I'll have to add a gratuitous extract of ci/latest. My PR diffs are showing this error for pull-test-infra-bazel ("--extract" or "--use-shared-build" is required):
I0223 21:04:17.259] AssertionError: e2e job needs --extract or --use-shared-build: ci-kubernetes-e2e-kubeadm-gce [u'--cluster=', u'--deployment=kubernetes-anywhere', u'--env-file=jobs/platform/gce.env', u'--gcp-zone=us-central1-f', u'--kubeadm=ci', u'--kubernetes-anywhere-dump-cluster-logs=true', u'--provider=kubernetes-anywhere', u'--test_args=--ginkgo.focus=\\[Conformance\\]|\\[Feature:BootstrapTokens\\]|\\[Feature:NodeAuthorizer\\] --minStartupPods=8', u'--timeout=300m']
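For reference, the check that emits this error is presumably something along these lines; this is a hedged sketch, not the actual test-infra config test:

```python
# Sketch of the kind of job-config check that produces the assertion above.
# Illustrative only; the real check lives in test-infra's config tests.
def check_e2e_job(job_name, args):
    """Every e2e job must say where its build comes from."""
    has_extract = any(a.startswith('--extract') for a in args)
    has_shared_build = any(a.startswith('--use-shared-build') for a in args)
    assert has_extract or has_shared_build, (
        'e2e job needs --extract or --use-shared-build: %s %s' % (job_name, args))


# Example: this fails for a job that specifies neither flag.
# check_e2e_job('ci-kubernetes-e2e-kubeadm-gce', ['--cluster=', '--kubeadm=ci'])
```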
I'll take a closer look soon. The presubmit here assumes these are the only ways to get a build. --shared-build is the flag implemented for presubmits (and isn't actually used much at the moment). I don't remember the exact behavior off the top of my head.
OK so this job is
Apologies, juggling perhaps a bit too many things today :-)
@BenTheElder - The good news with what you're suggesting, I think, is that if we switch ci-kubernetes-e2e-kubeadm-gce to use --shared-build, it will run almost identically to the way pull-kubernetes-e2e-kubeadm-gce runs. That may eliminate some conditionals in scenarios/kubernetes_e2e.py, and there's a good chance the changes to the "ci" tests will work, since we have a working reference model. I'll take a look at what we need to do; it might not be too complicated.
Pointed out elsewhere, but https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-all#kubeadm-gce seems to be green now? FWIW, shared builds in the PR jobs are not a great model right now: often the build job finishes but the refs have changed by the time the e2e triggers, so the build is already stale. We're probably moving away from it towards #6808 for presubmits.
@BenTheElder - The tests are still failing intermittently, and if you look at some of the most recent failures, e.g.:
So maybe, if the timing is just right, the ci/latest just happens to coincide with the version that was just built? I don't know enough to explain this, but I think it's just luck whenever these tests pass.
Hi @BenTheElder: It looks like the shared-build approach won't work in this case. The ci-kubernetes-bazel-build job that runs before ci-kubernetes-e2e-kubeadm-gce is configured as a periodic job:
Because it runs as a periodic, Prow does not set the $PULL_REFS environment variable (see the Job Environment Variables doc). When $PULL_REFS isn't set, the test-infra/scenarios/kubernetes_bazel.py script won't upload a build location:
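As a rough illustration of the guard described above (the function name and the stand-in upload callback are assumptions, not the actual kubernetes_bazel.py code):

```python
# Hedged sketch: a shared-build location is only published when Prow provides
# PULL_REFS, which it sets for presubmits but not for periodic jobs.
# upload_build_location stands in for whatever the real script does.
import os


def maybe_upload_build_location(build_gcs_path, upload_build_location):
    pull_refs = os.environ.get('PULL_REFS')
    if not pull_refs:
        # Periodic job: nothing is published, so a downstream e2e job using
        # --use-shared-build would have no pointer to read.
        print('PULL_REFS not set; skipping shared-build upload')
        return
    upload_build_location(pull_refs, build_gcs_path)
```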
Hi @leblancd, I added a
@BenTheElder - Yeah, the #5905 approach makes sense for overall efficiency, and it would eliminate the confusion between last build vs. ci/latest versions.
@BenTheElder - Is there an approximate timeline for #5905? If that looks like it will happen soon, maybe we can hold off on this proposal and close this issue when #5905 is implemented?
O(soon), I want to see this fixed [I use kubeadm myself on the weekends so :-)] and I hope it's actually not too much work to go that route...
I've fixed #5905, working on switching over the kubeadm jobs to consume this.
@BenTheElder Outstanding! Do you have any ideas on what's causing the race condition that @jessicaochen described in k/k issue #59766 (problem "[1]"), whereby test jobs look for the kubeadm binary in a GCS bucket that hasn't been populated yet?
Not sure; I haven't flipped them over yet. I need to refactor that PR, but first we needed to roll out a prow upgrade... which hit a bug :( I'm going to flip
This should be fixed now.
Original issue description:
The ci-kubernetes-e2e-kubeadm-gce test jobs are consistently failing. If you look at the test results and logs for a given failing test job, and then compare them to the corresponding prior (prerequisite) bazel build job, you can see that the build job is pushing its build artifacts to the proper gs://kubernetes-release/bazel/... storage bucket, but the test job is extracting (or attempting to extract) build results from ci/latest. The test job also seems to be using inconsistent versions of kubeadm/kubelet/kubernetes.
This is very likely causing the CI test outages described in kubernetes/kubernetes#59762.
This failure mode is also seen in other ci-kubernetes-e2e-XXXX test jobs, but I'd like to try a fix first on one representative test job, and then replicate the fix to the other test jobs if it works.
Consider, for example, this recent failing test job along with its prior build:
Build job:
ci-kubernetes-bazel-build/228062
Test job:
ci-kubernetes-e2e-kubeadm-gce/9615
In the bazel build log, the bazel build is pushing to the proper GCS bucket:
However, as seen in the test job build log, the test job is calling kubetest with inconsistent kubeadm/kubelet/kubernetes versions:
and it is needlessly/incorrectly extracting the ci/latest build:
Furthermore, if you look at the serial log collected from the kubernetes master node, you can see that the kubernetes-anywhere startup script is trying to pull down build artifacts from a nonsensical, non-existent GCS bucket:
resulting in failures later on when the script tries to unpack the .deb packages from the /tmp/k8s-debs directory:
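As a quick way to see this mismatch, one could compare the version the bazel build job reports against whatever ci/latest resolves to at extraction time. This is a hypothetical diagnostic; the marker path below is an assumption for illustration, not a documented contract.

```python
# Hypothetical diagnostic: compare the version the bazel build job pushed with
# the version that ci/latest resolves to when the e2e job extracts.
import subprocess


def gcs_cat(path):
    return subprocess.check_output(['gsutil', 'cat', path]).decode('utf-8').strip()


def check_versions(bazel_build_version):
    # bazel_build_version: the version string printed by the build job.
    ci_latest = gcs_cat('gs://kubernetes-release-dev/ci/latest.txt')  # assumed marker path
    if ci_latest != bazel_build_version:
        print('mismatch: build pushed %s but ci/latest points at %s'
              % (bazel_build_version, ci_latest))
    else:
        print('versions agree (possibly by luck of timing): %s' % ci_latest)
```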