
dns-liveness pod restarts failing k8s 1.14 tests #919

Closed
mboersma opened this issue Mar 29, 2019 · 14 comments

@mboersma
Member

Recently—since the introduction of CoreDNS 1.3.1?—there have been many test failures indicating restarts of the dns-liveness pod. We should fix this or skip the test.
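For context, the dns-liveness pod used by the E2E suite is essentially a pod whose liveness probe performs an in-cluster DNS lookup, so any CoreDNS outage surfaces as container restarts. A minimal sketch of that shape, assuming a busybox image and an nslookup probe (not the exact aks-engine manifest):

```yaml
# Illustrative sketch only; the image and probe settings are assumptions,
# not the exact aks-engine test manifest.
apiVersion: v1
kind: Pod
metadata:
  name: dns-liveness
spec:
  containers:
  - name: dns-liveness
    image: busybox:1.30
    command: ["sleep", "3600"]
    livenessProbe:
      exec:
        # The pod stays "alive" only while cluster DNS resolves, so a
        # CoreDNS crash turns into restarts that fail the E2E check.
        command: ["nslookup", "kubernetes.default.svc.cluster.local"]
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3
```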

@CecileRobertMichon
Contributor

Is this specific to 1.14 or is it more general?

@mboersma
Member Author

I think I have seen it with 1.12 and 1.13 as well, but I'm not sure now. I'll collect some data.

@mboersma
Member Author

Based on the E2E tests of the last 48 hours, this appears to happen only with Kubernetes 1.14 on Linux. That would make sense if the new CoreDNS is the problem.

I tried to use CoreDNS 1.4.0 based on the following comment, which may be relevant here, but the Docker image is not published:

There is a known issue coredns/coredns#2629 in CoreDNS 1.3.1, wherein if the Kubernetes API shuts down while CoreDNS is connected, CoreDNS will crash. The issue is fixed in CoreDNS 1.4.0 in coredns/coredns#2529.

@mboersma mboersma changed the title dns-liveness pod restarts failing tests dns-liveness pod restarts failing k8s 1.14 tests Mar 29, 2019
@CecileRobertMichon
Contributor

This is failing more than 50% of 1.14 E2E tests. Should we pause the 1.14 E2E while we wait for the new CoreDNS image? The result is more often red than green and we're not actively troubleshooting failures. @jackfrancis thoughts?

@jackfrancis
Member

Let's just skip the "dns liveness validation tests" for >= 1.14

@mboersma
Member Author

mboersma commented Apr 3, 2019

Closed by #931

@mboersma mboersma closed this as completed Apr 3, 2019
@CecileRobertMichon
Contributor

Should we keep the issue open, since we still need to fix the root cause? My PR just removed the test...

@mboersma
Member Author

mboersma commented Apr 3, 2019

Yes, let's do keep it open—thanks for paying attention. I think there's a fighting chance that k8s.gcr.io/coredns:1.4.0 fixes this behavior, whenever that actually gets published.

@mboersma mboersma reopened this Apr 3, 2019
@gjtempleton

Possibly worth noting that CoreDNS 1.4.0 has been published as coredns/coredns:1.4.0, just not to the k8s.gcr.io registry yet.

That said, the discussion here is worth noting: kubernetes/kubernetes#75414 (comment). v1.4.0 looks unlikely to ever end up in vanilla Kubernetes.

@johnbelamaric

@fturib have you seen this?

@mboersma
Member Author

mboersma commented Apr 4, 2019

FWIW the emptyDir workaround suggested for CoreDNS 1.3.1 does seem to have fixed the AKS Engine test case in #949.
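For reference, a sketch of the emptyDir workaround as applied to the coredns Deployment in kube-system, per the upstream discussion (abbreviated; unrelated fields are omitted and the volume name is an assumption):

```yaml
# Sketch of the emptyDir workaround: mount a writable volume at /tmp so
# klog can create its log files despite readOnlyRootFilesystem: true.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: coredns
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}
```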

@fturib

fturib commented Apr 4, 2019

@chrisohaver, @rajansandeep: could you help us understand the cause and advise on a solution?

@chrisohaver

chrisohaver commented Apr 5, 2019

Since the emptyDir workaround resolved the test case, the API watches may be trying to log something. (CoreDNS 1.3.1's klog tries to write its log files to /tmp, which fails on a read-only root filesystem unless a writable volume is mounted there.)

Assuming that they are, applying the emptyDir workaround seems like the best fix for now. When 1.5.0 is released (date TBD), it won't require the emptyDir fix, but it contains feature deprecations that make it incompatible with the 1.3.1 CoreDNS config file.

If we want to dig into what is being logged, clues could be in the CoreDNS logs or the API server logs. Knowing why the message is logged would help us understand the scope.
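On the deprecation point: the best-known 1.5.0 incompatibility is the removal of the proxy plugin in favor of forward, which means the coredns ConfigMap has to change along with the image. A hedged sketch of the relevant Corefile line (the surrounding config is abbreviated and may not match the stock 1.3.1 Corefile exactly):

```yaml
# Abbreviated sketch; not the complete stock Corefile.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa
        # 1.3.1-era configs use:   proxy . /etc/resolv.conf
        # CoreDNS 1.5.0 removes the proxy plugin; use forward instead:
        forward . /etc/resolv.conf
        cache 30
    }
```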

@mboersma
Member Author

I'm closing this since the emptyDir fix is the best option for now. We can revisit this for aks-engine when CoreDNS 1.5.0 or later is blessed by a Kubernetes release.
