Linkerd stable-2.13.4 - linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused #11156
The message suggests that the destination controller is not able to connect to the policy controller for some reason. I'd recommend looking at the logs of the policy container in the linkerd-destination pod to see if there are any errors there.
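For reference, a kubectl invocation along these lines should surface those logs; this is a sketch assuming the control plane is installed in the default `linkerd` namespace:

```sh
# Tail the policy container inside the destination deployment
# (namespace and deployment name assume a default install)
kubectl logs -n linkerd deploy/linkerd-destination -c policy --tail=200

# The destination container in the same pod can be checked the same way
kubectl logs -n linkerd deploy/linkerd-destination -c destination --tail=200
```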
@adleong it's straightforward to replicate this using the load test here: #11055 (comment). These issues might all be related? Destination proxy logs:
Policy logs:
Hi @adleong, thanks a lot for your quick response. Please find below the logs of the policy controller:
Any idea what could be the issue?
I have seen exactly what @valentinwidmer has described. This only seems to happen when we deploy a new version of an application, and it also seems to only affect a specific application.
@valentinwidmer Seeing some warnings for a few seconds after starting a Linkerd controller isn't unexpected, since it can take a few seconds for Linkerd to sync its caches. Is this impacting your traffic, or is it just a logs issue?
@adleong I am getting a similar issue, but in the policy controller pod I get errors:
Getting a similar issue when a new version of an app is deployed. Workload error:
Policy container error:
I am facing a similar issue while installing Linkerd version 2.13.6.
I am installing Linkerd with kubectl.
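For reference, a typical kubectl-based install of a 2.13.x release looks roughly like the sketch below; the exact flags the reporter used were not included:

```sh
# Install the Linkerd CRDs first, then the control plane, using the CLI to render manifests
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -

# Verify the installation
linkerd check
```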
Same here ☝🏻
@valentinwidmer thanks for providing those logs!
This suggests that the problem is that the policy controller is failing to connect to the Kubernetes API. When it attempts to establish a connection, it receives an EOF from the Kubernetes server, causing the connection to fail. It's not clear why this is happening. The next thing to look at would be the controller metrics. These should give us some information about how many watches the control plane is maintaining, how many connections it has to the k8s API, etc. That way, we can determine whether these failed connections are due to, for example, a connection limit being reached. You can fetch the controller metrics by running the command below.
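The exact command was not captured in this thread; the usual way to dump these metrics with the Linkerd CLI is something like:

```sh
# Dump Prometheus-format metrics from the control plane components to a file
linkerd diagnostics controller-metrics > controller-metrics.txt
```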
It seems I have a similar issue.
@omidraha did you find anything? Because it is affecting production.
No, I just added more info.
@adleong could you look into this? It is impacting production. I need to be sure whether this issue is related to the compatibility of Linkerd with the Kubernetes version, because v1.10 of Linkerd works fine with GKE v1.24, but when I upgraded to v1.27 it throws an error.
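As a side note, capturing the exact versions on both sides makes this easier to triage; for example:

```sh
# Report the Linkerd CLI and control plane versions
linkerd version

# Report the Kubernetes client and server (GKE) versions
kubectl version
```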
@adleong Is there a way to exempt the destination-controller from failfast? Right now it seems like as soon as a couple of pods have problems reaching the destination-controller, it's deemed to be in failfast, and that exacerbates the problem. If it is indeed an issue with the number of watches etc., then it would be nice if the health check failed and the pod were replaced. When it goes into failfast in the middle of a canary rollout, it just causes chaos.
@omidraha It seems Linkerd is now working fine. Initially, the linkerd-proxy container tries to hit the policy container in the same pod, and the policy container needs some time to spin up, so we got the error. Once the policy container is up, the service recovers automatically, as seen in the log.
Any update on this? We are also facing a similar issue.
We are also facing a similar issue just after installing Linkerd.
@ThomasCardin @JesseAhh @bjoernw @omidraha Could I repeat the plea for the controller metrics from anyone who's seeing this? If you run the controller-metrics command above and share the output, that would help us narrow this down.
The complete log is provided here:
@omidraha, are you on the Linkerd Slack? If so, can you ping me (@flynn) there? Thanks!
Hey guys, we are also experiencing the same issue. We upgraded linkerd-control-plane-chart to v1.16.3. When this upgrade was applied to one of our Kubernetes clusters, we observed the destination service being OOMKilled. We increased resources to accommodate this; however, we still observed the same errors in the logs. We have rolled back linkerd-control-plane-chart to v1.16.2 to stabilise the cluster. Thanks
@danny-devops That sounds like a separate issue from anything mentioned here, given you are seeing OOM kills and are not using 2.13.x. Do you mind creating a new ticket so we can triage it there?
Any way to resolve this? I am facing similar issues when trying to install Linkerd on EKS. #11697
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
What is the issue?
I have installed Linkerd via Helm (chart version 1.12.5) on an EKS cluster and am observing the errors listed below.
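For context, a Helm install of that release typically looks something like the sketch below; the chart version and certificate file names are illustrative, and the exact values used in this report were not included:

```sh
# Add the stable Helm repo, then install the CRDs and the control plane
helm repo add linkerd https://helm.linkerd.io/stable
helm repo update

helm install linkerd-crds linkerd/linkerd-crds \
  -n linkerd --create-namespace

helm install linkerd-control-plane linkerd/linkerd-control-plane \
  -n linkerd --version 1.12.5 \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key
```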
How can it be reproduced?
Installing Linkerd stable-2.13.4
Logs, error output, etc
linkerd-destination
Workload which has proxy injected
Output of `linkerd check -o short`:
Status check results are √
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None