Linkerd stable-2.13.4 - linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused #11156

Closed
valentinwidmer opened this issue Jul 24, 2023 · 25 comments


valentinwidmer commented Jul 24, 2023

What is the issue?

I installed Linkerd via Helm (1.12.5) on an EKS cluster and am observing the errors listed below.

How can it be reproduced?

Installing Linkerd stable-2.13.4

Logs, error output, etc

linkerd-destination

Defaulted container "linkerd-proxy" out of: linkerd-proxy, destination, sp-validator, policy, linkerd-init (init)
[     0.002839s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.003837s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.003868s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.003872s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.003875s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.003878s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.003881s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.003885s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
[     0.004410s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.017185s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.110593s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.316984s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.722792s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     1.223601s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     1.724346s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     2.226078s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     2.727832s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     3.228568s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     3.729321s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     4.231132s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
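The uptime prefixes in the log above show the proxy retrying the policy controller at 127.0.0.1:8090 after roughly 0.1 s, 0.2 s, 0.4 s and then about every 0.5 s. That looks like a doubling backoff capped near 500 ms, but this is only an observation from these timestamps, not a statement about the proxy's configured defaults. The hypothetical helper below is a minimal Python sketch that extracts those intervals from pasted proxy logs, assuming the `[   <uptime>s]` prefix format shown here; the names `retry_intervals`, `NEEDLE` and `SAMPLE` are illustrative only.

```python
import re

# Matches the proxy's uptime prefix, e.g. "[     0.110593s]".
UPTIME = re.compile(r"^\[\s*([0-9.]+)s\]")

# Hypothetical default: the policy-controller reconnect failures seen in this log.
NEEDLE = "Failed to connect error=endpoint 127.0.0.1:8090"


def retry_intervals(lines, needle=NEEDLE):
    """Return gaps in seconds (rounded to ms) between consecutive lines containing `needle`."""
    times = []
    for line in lines:
        match = UPTIME.match(line)
        if match and needle in line:
            times.append(float(match.group(1)))
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


# A few lines from the log above; the error.sources tail is omitted.
REFUSED = ("WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:"
           "endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect "
           "error=endpoint 127.0.0.1:8090: Connection refused (os error 111)")
SAMPLE = [
    "[     0.004410s]  " + REFUSED,
    "[     0.017185s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity",
    "[     0.110593s]  " + REFUSED,
    "[     0.316984s]  " + REFUSED,
    "[     0.722792s]  " + REFUSED,
    "[     1.223601s]  " + REFUSED,
]

print(retry_intervals(SAMPLE))  # [0.106, 0.206, 0.406, 0.501]
```

The INFO line is skipped because it does not contain the needle, so only the reconnect attempts contribute to the intervals.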

Workload with the proxy injected

[333255.038731s]  INFO ThreadId(01) linkerd_stack::failfast: Service has recovered
[333256.099452s]  WARN ThreadId(01) outbound:proxy{addr=172.16.64.160:8081}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.129.110.112:8090}: linkerd_reconnect:Failed to connect error=endpoint 10.129.110.112:8090: connect timed out after 1s error.sources=[connect timed out after 1s]
[334074.421535s]  WARN ThreadId(01) outbound:proxy{addr=172.16.0.1:443}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.129.111.71:8090}: linkerd_reconnect: Service failed error=endpoint 10.129.111.71:8090: channel closed error.sources=[channel closed]
[334074.513021s]  WARN ThreadId(01) outbound:proxy{addr=172.16.12.201:6379}:balance{addr=argo-cd-redis-ha-haproxy.argocd.svc.cluster.local:6379}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=10.129.111.71:8086}: linkerd_reconnect: Service failed error=endpoint 10.129.111.71:8086: channel closed error.sources=[channel closed]
[334074.530198s]  WARN ThreadId(01) outbound:proxy{addr=172.16.0.1:443}:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:endpoint{addr=10.129.111.71:8090}: linkerd_reconnect: Failed to connect error=endpoint 10.129.111.71:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[334074.622297s]  WARN ThreadId(01) outbound:proxy{addr=172.16.12.201:6379}:balance{addr=argo-cd-redis-ha-haproxy.argocd.svc.cluster.local:6379}:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:endpoint{addr=10.129.111.71:8086}: linkerd_reconnect: Failed to connect error=endpoint 10.129.111.71:8086: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

Output of linkerd check -o short

Status check results are √

Environment

  • Kubernetes version: v1.26.6-eks-a5565ad
  • Cluster env: EKS
  • Host OS: Linux
  • Linkerd Version: stable-2.13.4
  • CNI: AWS VPC CNI v1.26.2-eksbuild.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

adleong (Member) commented Jul 26, 2023

Hi @valentinwidmer

The message

Failed to connect error=endpoint 127.0.0.1:8090

suggests that the destination controller is unable to connect to the policy controller for some reason. I'd recommend checking the logs of the policy container in the linkerd-destination pod for errors:

kubectl -n linkerd logs deploy/linkerd-destination -c policy
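On a long policy log, a small filter can separate the startup line from watch errors. The sketch below is a hypothetical Python helper, assuming the `<timestamp> <LEVEL> <span>: ...` line format visible in the policy logs posted in this thread. It reports when the gRPC server started listening on 8090 and counts WARN/ERROR lines per Kubernetes watch (services, pods, httproutes, and so on). The function names and the sample are illustrative, not part of Linkerd.

```python
import re
from collections import Counter

# "<timestamp> <LEVEL> <span>..." where the span name ends at ':', '{' or whitespace,
# e.g. "2023-07-27T05:48:27.240005Z ERROR services: kube_client::client::builder: ..."
LINE = re.compile(r"^(\S+)\s+(INFO|WARN|ERROR)\s+([^\s:{]+)")


def grpc_listening_since(lines):
    """Timestamp of the first 'policy gRPC server listening' line, or None."""
    for line in lines:
        if "policy gRPC server listening" in line:
            return line.split()[0]
    return None


def problems_by_watch(lines, levels=("WARN", "ERROR")):
    """Count lines at the given levels, keyed by the leading span (e.g. 'services')."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group(2) in levels:
            counts[match.group(3)] += 1
    return counts


# Abbreviated lines taken from the policy logs posted in this thread.
SAMPLE = [
    "2023-07-26T09:19:35.291752Z  INFO grpc{port=8090}: linkerd_policy_controller: policy gRPC server listening addr=0.0.0.0:8090",
    "2023-07-27T05:48:27.239136Z  WARN services: kube_client::client: eof in poll: error reading a body from connection",
    "2023-07-27T05:48:27.240005Z ERROR services: kube_client::client::builder: failed with error connection closed before message completed",
    "2023-07-27T05:48:27.291039Z  WARN pods: kube_client::client: eof in poll: error reading a body from connection",
    "2023-07-27T05:48:27.293979Z ERROR services: kube_client::client::builder: failed with error error trying to connect: Connection reset by peer (os error 104)",
]

print(grpc_listening_since(SAMPLE))  # 2023-07-26T09:19:35.291752Z
print(problems_by_watch(SAMPLE))     # Counter({'services': 3, 'pods': 1})
```

To use it on real output, redirect the kubectl command above to a file and pass `open("policy.log").read().splitlines()` to both functions. If no "listening" line appears at all, the policy server never came up, which would explain persistent refusals on 127.0.0.1:8090.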


bjoernw commented Jul 26, 2023

@adleong it's straightforward to replicate this using the load test here: #11055 (comment)

Might these issues all be related?

Destination proxy logs:

[     0.001961s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.002955s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.002981s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.002983s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.002984s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.002985s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.002986s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.002988s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
[     0.003519s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.004269s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_app_core::control: Failed to resolve control-plane component error=failed SRV and A record lookups: failed to resolve SRV record: no record found for Query { name: Name("linkerd-identity-headless.linkerd.svc.cluster.local."), query_type: SRV, query_class: IN }; failed to resolve A record: no record found for Query { name: Name("linkerd-identity-headless.linkerd.svc.cluster.local."), query_type: AAAA, query_class: IN } error.sources=[failed to resolve A record: no record found for Query { name: Name("linkerd-identity-headless.linkerd.svc.cluster.local."), query_type: AAAA, query_class: IN }, no record found for Query { name: Name("linkerd-identity-headless.linkerd.svc.cluster.local."), query_type: AAAA, query_class: IN }]
[     0.005812s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_app_core::control: Failed to resolve control-plane component error=failed SRV and A record lookups: failed to resolve SRV record: no record found for Query { name: Name("linkerd-identity-headless.linkerd.svc.cluster.local."), query_type: SRV, query_class: IN }; failed to resolve A record: no record found for Query { name: Name("linkerd-identity-headless.linkerd.svc.cluster.local."), query_type: AAAA, query_class: IN } error.sources=[failed to resolve A record: no record found for Query { name: Name("linkerd-identity-headless.linkerd.svc.cluster.local."), query_type: AAAA, query_class: IN }, no record found for Query { name: Name("linkerd-identity-headless.linkerd.svc.cluster.local."), query_type: AAAA, query_class: IN }]
[     0.111468s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.323257s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.754783s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     1.255966s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     1.756611s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     2.258819s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     2.760748s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     3.008522s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_stack::failfast: Service entering failfast after 3s
[     3.008786s] ERROR ThreadId(02) identity: linkerd_proxy_identity_client::certify: Failed to obtain identity error=status: Unknown, message: "controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: service in fail-fast", details: [], metadata: MetadataMap { headers: {} } error.sources=[controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: service in fail-fast, service in fail-fast]
[     3.262034s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     3.764521s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     4.266589s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     4.768371s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     5.269691s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     5.770976s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     6.273257s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     6.775043s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     7.276555s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     7.777981s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     8.279963s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     8.783531s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     9.285752s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     9.789637s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    10.006307s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s

Policy logs:

2023-07-26T23:21:19.465832Z  INFO linkerd_policy_controller: created Lease resource lease=Lease { metadata: ObjectMeta { annotations: None, cluster_name: None, creation_timestamp: Some(Time(2023-07-26T23:21:19Z)), deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: None, labels: Some({"linkerd.io/control-plane-component": "destination", "linkerd.io/control-plane-ns": "linkerd"}), managed_fields: Some([ManagedFieldsEntry { api_version: Some("coordination.k8s.io/v1"), fields_type: Some("FieldsV1"), fields_v1: Some(FieldsV1(Object {"f:metadata": Object {"f:labels": Object {"f:linkerd.io/control-plane-component": Object {}, "f:linkerd.io/control-plane-ns": Object {}}, "f:ownerReferences": Object {"k:{\"uid\":\"3fdfa0b7-8774-473f-9c77-23cf7ec43c86\"}": Object {}}}})), manager: Some("policy-controller"), operation: Some("Apply"), time: Some(Time(2023-07-26T23:21:19Z)) }]), name: Some("policy-controller-write"), namespace: Some("linkerd"), owner_references: Some([OwnerReference { api_version: "apps/v1", block_owner_deletion: None, controller: Some(true), kind: "Deployment", name: "linkerd-destination", uid: "3fdfa0b7-8774-473f-9c77-23cf7ec43c86" }]), resource_version: Some("724"), self_link: None, uid: Some("ef44e99a-3b32-4ba0-9583-cfbcee1030d6") }, spec: Some(LeaseSpec { acquire_time: None, holder_identity: None, lease_duration_seconds: None, lease_transitions: None, renew_time: None }) }
2023-07-26T23:21:19.468485Z  INFO grpc{port=8090}: linkerd_policy_controller: policy gRPC server listening addr=0.0.0.0:8090
2023-07-26T23:21:41.935230Z  INFO authorizationpolicies:apply{ns=linkerd-viz saz=metrics-api}:reindex{ns=linkerd-viz}:pod{pod=metrics-api-78d5c76d8c-7skzb}: linkerd_policy_controller_k8s_index::inbound::index: Illegal AuthorizationPolicy; ignoring server=metrics-api authorizationpolicy=metrics-api error=could not find MeshTLSAuthentication metrics-api-web in namespace linkerd-viz
2023-07-26T23:21:42.025280Z  INFO pods:apply{ns=linkerd-viz name=tap-5896c4fb56-p87mf}: linkerd_policy_controller_k8s_index::inbound::index: Illegal AuthorizationPolicy; ignoring server=tap-api authorizationpolicy=tap error=could not find NetworkAuthentication kube-api-server in namespace linkerd-viz
2023-07-26T23:21:42.066387Z  INFO servers:apply{ns=linkerd-viz name=tap-injector-webhook}:reindex{ns=linkerd-viz}:pod{pod=tap-5896c4fb56-p87mf}: linkerd_policy_controller_k8s_index::inbound::index: Illegal AuthorizationPolicy; ignoring server=tap-api authorizationpolicy=tap error=could not find NetworkAuthentication kube-api-server in namespace linkerd-viz
2023-07-26T23:21:42.071896Z  INFO authorizationpolicies:apply{ns=linkerd-viz saz=tap-injector}:reindex{ns=linkerd-viz}:pod{pod=tap-5896c4fb56-p87mf}: linkerd_policy_controller_k8s_index::inbound::index: Illegal AuthorizationPolicy; ignoring server=tap-api authorizationpolicy=tap error=could not find NetworkAuthentication kube-api-server in namespace linkerd-viz

valentinwidmer (Author) commented Jul 28, 2023


Hi @adleong,

Thanks a lot for your quick response.

Please find the policy controller logs below:

2023-07-26T09:19:35.286371Z  INFO linkerd_policy_controller: Lease already exists, no need to create it
2023-07-26T09:19:35.291752Z  INFO grpc{port=8090}: linkerd_policy_controller: policy gRPC server listening addr=0.0.0.0:8090
2023-07-27T05:48:27.239136Z  WARN services: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:27.240005Z ERROR services: kube_client::client::builder: failed with error connection closed before message completed
2023-07-27T05:48:27.240045Z  INFO services: kubert::errors: stream failed error=failed to start watching object: HyperError: connection closed before message completed
2023-07-27T05:48:27.291039Z  WARN pods: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:27.293979Z ERROR services: kube_client::client::builder: failed with error error trying to connect: Connection reset by peer (os error 104)
2023-07-27T05:48:27.294024Z  INFO services: kubert::errors: stream failed error=failed to start watching object: HyperError: error trying to connect: Connection reset by peer (os error 104)
2023-07-27T05:48:43.509105Z  WARN networkauthentications: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:43.509292Z  WARN serverauthorizations: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:43.511921Z  WARN servers: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:43.512479Z  WARN meshtlsauthentications: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:43.512810Z  WARN httproutes: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:43.559371Z  WARN authorizationpolicies: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:43.561063Z ERROR httproutes: kube_client::client::builder: failed with error error trying to connect: Connection reset by peer (os error 104)
2023-07-27T05:48:43.561104Z  INFO httproutes: kubert::errors: stream failed error=failed to start watching object: HyperError: error trying to connect: Connection reset by peer (os error 104)
2023-07-27T05:48:43.561396Z ERROR httproutes: kube_client::client::builder: failed with error error trying to connect: tcp connect error: Connection refused (os error 111)
2023-07-27T05:48:43.561422Z  INFO httproutes: kubert::errors: stream failed error=failed to start watching object: HyperError: error trying to connect: tcp connect error: Connection refused (os error 111)
2023-07-27T05:48:43.618628Z  INFO servers: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 414267543 (414268048): Expired
2023-07-27T05:48:43.943441Z  INFO serverauthorizations: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 414267566 (414268569): Expired
2023-07-27T05:48:48.564104Z  INFO httproutes: kubert::errors: stream failed error=error returned by apiserver during watch: too old resource version: 414268262 (414268572): Expired

Any idea what could be the issue?


nicklanng commented Jul 31, 2023

I have seen exactly the behavior @valentinwidmer described.

This only seems to happen when we deploy a new version of an application, and it seems to affect only one specific application.
Deleting or killing the pods just brings them back with the same behavior.
What fixes it for me is tagging a new version of the container and deploying that.

adleong (Member) commented Aug 10, 2023

@valentinwidmer Seeing some warnings for a few seconds after a Linkerd controller starts isn't unexpected, since Linkerd can take a few seconds to sync its caches. Is this impacting your traffic, or is it just a logs issue?
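One way to check whether the warnings are only the startup window described above is to look at when the refusals stop and whether the policy watch ever gives up. The hypothetical Python sketch below labels a destination-proxy log as "transient" if every refusal on `controller{addr=localhost:8090}` falls within a grace window after proxy start and no `Service entering failfast` line appears for that controller; otherwise it returns "persistent". The 5-second default grace window is an assumption standing in for "a few seconds", not a Linkerd setting, and `classify_startup` is an illustrative name.

```python
import re

# Matches the proxy's uptime prefix, e.g. "[    10.006307s]".
UPTIME = re.compile(r"^\[\s*([0-9.]+)s\]")
# Span prefix of the destination proxy's connection to the local policy controller.
POLICY = "controller{addr=localhost:8090}"


def classify_startup(lines, grace_s=5.0):
    """Return 'none', 'transient', or 'persistent' for policy-controller refusals."""
    last_refusal = None
    failfast = False
    for line in lines:
        match = UPTIME.match(line)
        if not match or POLICY not in line:
            continue
        uptime = float(match.group(1))
        if "Connection refused" in line:
            last_refusal = uptime
        if "entering failfast" in line:
            failfast = True
    if last_refusal is None and not failfast:
        return "none"
    # Failfast means the proxy gave up waiting; late refusals mean startup is long over.
    if failfast or last_refusal > grace_s:
        return "persistent"
    return "transient"


REFUSED = ("WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:"
           "endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect "
           "error=endpoint 127.0.0.1:8090: Connection refused (os error 111)")
FAILFAST = ("WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}: "
            "linkerd_stack::failfast: Service entering failfast after 10s")

# Excerpts: first and last refusals of the first log; tail of the second log.
first = ["[     0.004410s]  " + REFUSED, "[     4.231132s]  " + REFUSED]
second = ["[     9.789637s]  " + REFUSED, "[    10.006307s]  " + FAILFAST]

print(classify_startup(first))   # transient
print(classify_startup(second))  # persistent
```

Note that the first log in this issue is cut off at about 4.2 s, so "transient" only describes the excerpt; the full log would be needed to rule out a later failfast like the one in the second log.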

jsong336 commented

@adleong I am getting a similar issue, but in the policy controller pod I get these errors:

kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: Connection reset by peer (os error 104)
kubert::errors: stream failed error=watch stream failed: Error reading events stream: error reading a body from connection: error reading a body from connection: timed out

Sul-ss commented Aug 30, 2023

Getting a similar issue when a new version of an app is deployed.
Linkerd version 2.13.5, deployed using Helm 1.12.5.

Workload error:

 ERROR ThreadId(02) identity: linkerd_proxy_identity_client::certify: Failed to obtain identity error=status: Unknown, message: "controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: service in fail-fast", details: [], metadata: MetadataMap { headers: {} } error.sources=[controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: service in fail-fast, service in fail-fast

Policy container error:

ERROR servers: kube_client::client::builder: failed with error error trying to connect: Connection reset by peer (os error 104)
INFO servers: kubert::errors: stream failed error=failed to start watching object: HyperError: error trying to connect: Connection reset by peer (os error 104)
ERROR authorizationpolicies: kube_client::client::builder: failed with error error trying to connect: Connection reset by peer (os error 104)
INFO authorizationpolicies: kubert::errors: stream failed error=failed to start watching object: HyperError: error trying to connect: Connection reset by peer (os error 104)
ERROR httproutes: kube_client::client::builder: failed with error error trying to connect: tcp connect error: Connection refused (os error 111)

mayank-ag-dev commented Sep 6, 2023

I am facing a similar issue while installing Linkerd version 2.13.6.
Logs of linkerd-destination:

[     0.003374s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[     0.004492s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[     0.004516s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[     0.004519s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[     0.004524s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
[     0.004528s]  INFO ThreadId(01) linkerd2_proxy: Local identity is linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[     0.004530s]  INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[     0.004534s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via localhost:8086
[     0.005368s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.010692s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: rustls::conn: Sending fatal alert BadCertificate
[     0.010777s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.129.192.22:8080: invalid peer certificate contents: invalid peer certificate: UnknownIssuer error.sources=[invalid peer certificate contents: invalid peer certificate: UnknownIssuer]
[     0.115916s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.119620s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: rustls::conn: Sending fatal alert BadCertificate
[     0.119767s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.129.192.22:8080: invalid peer certificate contents: invalid peer certificate: UnknownIssuer error.sources=[invalid peer certificate contents: invalid peer certificate: UnknownIssuer]
[     0.321645s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.341749s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: rustls::conn: Sending fatal alert BadCertificate
[     0.341880s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.129.192.22:8080: invalid peer certificate contents: invalid peer certificate: UnknownIssuer error.sources=[invalid peer certificate contents: invalid peer certificate: UnknownIssuer]
[     0.726543s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     0.781415s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: rustls::conn: Sending fatal alert BadCertificate
[     0.781558s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.129.192.22:8080: invalid peer certificate contents: invalid peer certificate: UnknownIssuer error.sources=[invalid peer certificate contents: invalid peer certificate: UnknownIssuer]
[     1.228510s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     1.283593s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: rustls::conn: Sending fatal alert BadCertificate
[     1.283688s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.129.192.22:8080: invalid peer certificate contents: invalid peer certificate: UnknownIssuer error.sources=[invalid peer certificate contents: invalid peer certificate: UnknownIssuer]
[     1.730383s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     1.785340s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: rustls::conn: Sending fatal alert BadCertificate
[     1.785436s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.129.192.22:8080: invalid peer certificate contents: invalid peer certificate: UnknownIssuer error.sources=[invalid peer certificate contents: invalid peer certificate: UnknownIssuer]
[     2.231389s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     2.287294s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: rustls::conn: Sending fatal alert BadCertificate
[     2.287394s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.129.192.22:8080: invalid peer certificate contents: invalid peer certificate: UnknownIssuer error.sources=[invalid peer certificate contents: invalid peer certificate: UnknownIssuer]
[     2.732343s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     2.788895s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}:endpoint{addr=10.129.192.22:8080}: linkerd_reconnect: Failed to connect error=endpoint 10.129.192.22:8080: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     3.006358s]  WARN ThreadId(02) identity:controller{addr=linkerd-identity-headless.linkerd.svc.cluster.local:8080}: linkerd_stack::failfast: Service entering failfast after 3s
[     3.006433s] ERROR ThreadId(02) identity: linkerd_proxy_identity_client::certify: Failed to obtain identity error=status: Unknown, message: "controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: service in fail-fast", details: [], metadata: MetadataMap { headers: {} } error.sources=[controller linkerd-identity-headless.linkerd.svc.cluster.local:8080: service in fail-fast, service in fail-fast]
[     3.233320s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     3.734831s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     4.236756s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     4.738579s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     5.240583s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     5.742614s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     6.244831s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     6.746087s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     7.247070s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     7.748981s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     8.249872s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     8.751831s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     9.253951s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[     9.755793s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    10.006367s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s
[    10.257199s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    10.758780s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    11.260705s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    11.762684s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    12.264864s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    12.765962s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    13.267769s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    13.507733s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local
[    13.769740s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    14.271638s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    14.774180s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[    15.274971s]  WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

I am installing Linkerd with kubectl.
Note: GKE version is 1.26.5-gke.2700

@yoramshai

Same here ☝🏻
I should mention that it is impacting real traffic and caused a production incident.

@adleong
Member

adleong commented Sep 11, 2023

@valentinwidmer thanks for providing those logs!

2023-07-27T05:48:27.240005Z ERROR services: kube_client::client::builder: failed with error connection closed before message completed
2023-07-27T05:48:27.240045Z  INFO services: kubert::errors: stream failed error=failed to start watching object: HyperError: connection closed before message completed
2023-07-27T05:48:27.291039Z  WARN pods: kube_client::client: eof in poll: error reading a body from connection: error reading a body from connection: unexpected EOF during chunk size line
2023-07-27T05:48:27.293979Z ERROR services: kube_client::client::builder: failed with error error trying to connect: Connection reset by peer (os error 104)

This suggests that the problem is that the policy controller is failing to connect to the Kubernetes API. When it attempts to establish a connection, it receives an EOF from the Kubernetes API server, causing the connection to fail. It's not clear why this is happening. The next thing to look at would be the controller metrics. These should give us some information about how many watches the control plane is maintaining, how many connections it has to the k8s API, etc. That way, we can determine whether these failed connections are due to, for example, a connection limit being reached.

You can fetch the controller metrics by running `linkerd diagnostics controller-metrics`.
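As a rough aid for reading that output, here is a minimal Python sketch that sums Prometheus text-format samples per metric name, keeping only names that contain a given substring. It assumes the command prints metrics in the Prometheus text exposition format (lines starting with `#` are skipped as comments). The substring to filter on (e.g. something like `watch`) is a placeholder, since the actual metric names are not given in this thread.

```python
import re
from collections import defaultdict

# One sample line in Prometheus text format: name{labels} value [timestamp]
# (simplified: label values containing "}" are not handled).
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)"
)


def sum_by_metric(text: str, substring: str) -> dict:
    """Sum sample values per metric name for names containing `substring`."""
    totals = defaultdict(float)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE / other comments
        m = SAMPLE_RE.match(line)
        if m is None or substring not in m.group("name"):
            continue
        try:
            totals[m.group("name")] += float(m.group("value"))
        except ValueError:
            continue  # ignore values Python cannot parse as a float
    return dict(totals)
```

For example, save the output with `linkerd diagnostics controller-metrics > metrics.txt` and then call `sum_by_metric(open("metrics.txt").read(), "watch")`. Summing across label sets hides per-pod or per-resource detail, so drop the filter and inspect the raw lines once something stands out.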

@omidraha

It seems I have a similar issue.

@mayank-ag-dev

@omidraha, did you find anything? It is affecting production.

@omidraha

No, I just added more info.

@mayank-ag-dev

@adleong, could you look into this? It is impacting production. I need to be sure whether this is a compatibility issue between Linkerd and the Kubernetes version, because v1.10 of Linkerd works fine with GKE v1.24, but when I upgraded to v1.27 it throws an error.

@bjoernw

bjoernw commented Sep 19, 2023

@adleong, is there a way to exempt the destination-controller from failfast? Right now it seems that as soon as a couple of pods have problems reaching the destination-controller, it is deemed to be in failfast, and that exacerbates the problem. If it is indeed an issue with the number of watches etc., then it would be nice if the health check failed and the pod were replaced.

When it goes into failfast in the middle of a canary rollout, it just causes chaos.
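To make the failfast behavior discussed here concrete, below is a deliberately simplified Python model of what the logs in this thread show ("Service entering failfast after 10s", "service in fail-fast", "Service has recovered"). It is a hypothetical illustration, not linkerd2-proxy's implementation: the class name, the `timeout` parameter and the "waiting"/"rejected"/"forwarded" outcomes are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailFastModel:
    """Hypothetical, simplified failfast wrapper (NOT linkerd2-proxy's code).

    If the inner service stays unready for `timeout` seconds, the wrapper
    enters failfast and rejects requests immediately; as soon as the inner
    service is ready again, it recovers.
    """

    timeout: float
    unready_since: Optional[float] = None
    in_failfast: bool = False

    def observe(self, now: float, inner_ready: bool) -> None:
        """Update state from one readiness observation at time `now` (seconds)."""
        if inner_ready:
            # Comparable to the log line "Service has recovered".
            self.unready_since = None
            self.in_failfast = False
            return
        if self.unready_since is None:
            self.unready_since = now
        if now - self.unready_since >= self.timeout:
            # Comparable to "Service entering failfast after <timeout>".
            self.in_failfast = True

    def handle(self, now: float, inner_ready: bool) -> str:
        """Decide what happens to a request arriving at time `now`."""
        self.observe(now, inner_ready)
        if self.in_failfast:
            return "rejected: service in fail-fast"
        # Assumption: while unready but not yet in failfast, requests wait.
        return "forwarded" if inner_ready else "waiting"
```

In this model, a backend that stays unready for at least `timeout` seconds trips failfast, after which every request is rejected immediately until a single ready observation clears the state.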

@mayank-ag-dev

@omidraha, it seems Linkerd is now working fine. Initially, the linkerd-proxy container tries to reach the policy container in the same pod, and the policy container needs some time to spin up, so we get this error. Once the policy container is up, the service recovers automatically, as seen in the log.
Attaching logs for reference:
linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 8.782819s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 9.284685s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 9.786430s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 10.004825s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s
[ 10.286995s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 10.788804s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 11.290575s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 11.791364s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 12.292052s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 12.793850s] WARN ThreadId(01) watch{port=9997}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 13.295349s] INFO ThreadId(01) linkerd_stack::failfast: Service has recovered
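To check the timing in logs like these, here is a minimal Python sketch (not part of Linkerd) that scans proxy log lines for one tracing span, by default `watch{port=9997}`, and reports how many "Connection refused" retries occurred, when failfast was entered, and when the service recovered. It assumes the `[ <uptime>s] LEVEL ...` line format shown above. Because the "Service has recovered" line carries no span in these logs, it is matched without the span filter, so in a busier log it could belong to a different service.

```python
import re
from dataclasses import dataclass
from typing import Optional

# "[ 8.782819s] WARN ThreadId(01) ..." -> uptime, level, rest of the line
LINE_RE = re.compile(r"^\[\s*(?P<t>\d+\.\d+)s\]\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


@dataclass
class Summary:
    refused: int                    # "Connection refused" lines within the span
    first_refused: Optional[float]  # proxy uptime (s) of the first refusal
    last_refused: Optional[float]   # proxy uptime (s) of the last refusal
    failfast_at: Optional[float]    # first "Service entering failfast" in the span
    recovered_at: Optional[float]   # first "Service has recovered" (any span)


def summarize(log: str, span: str = "watch{port=9997}") -> Summary:
    refused = []
    failfast_at: Optional[float] = None
    recovered_at: Optional[float] = None
    for raw in log.splitlines():
        m = LINE_RE.match(raw.strip())
        if m is None:
            continue  # not a complete "[ uptime ] LEVEL ..." line
        t, msg = float(m.group("t")), m.group("msg")
        if span in msg and "Connection refused" in msg:
            refused.append(t)
        elif span in msg and "Service entering failfast" in msg and failfast_at is None:
            failfast_at = t
        elif "Service has recovered" in msg and recovered_at is None:
            # In these logs the recovery line carries no span, so it is not filtered.
            recovered_at = t
    return Summary(
        refused=len(refused),
        first_refused=refused[0] if refused else None,
        last_refused=refused[-1] if refused else None,
        failfast_at=failfast_at,
        recovered_at=recovered_at,
    )
```

On the excerpt above, this reports refusals from about 8.8s to 12.8s, failfast at about 10.0s and recovery at about 13.3s. That fits the explanation that the proxy keeps retrying 127.0.0.1:8090 until the policy container starts listening.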

@jessectl

Any update on this? We are also facing a similar issue.

@ThomasCardin

ThomasCardin commented Oct 16, 2023

We are also facing a similar issue right after installing Linkerd.

@kflynn
Member

kflynn commented Oct 18, 2023

@ThomasCardin @JesseAhh @bjoernw @omidraha Could I repeat the plea for the controller metrics from anyone who's seeing this? Running `linkerd diagnostics controller-metrics` should give us exactly what we're after: more information about what's going on when the control plane talks to the K8s APIServer. Many thanks in advance!

@omidraha

The complete log is provided here:
#10994 (comment)
and I successfully deployed Linkerd.
The deployment code, using Python and Pulumi, is available here:
https://github.com/omidraha/pulumi_example/tree/main/linkerd

@kflynn
Member

kflynn commented Nov 2, 2023

@omidraha, are you on the Linkerd Slack? if so, can you ping me (@flynn) there? Thanks!

@danny-devops

Hey guys,

We are also experiencing the same issue.

We upgraded linkerd-control-plane-chart to v1.16.3. When this upgrade was applied to one of our Kubernetes clusters, we observed the destination service being OOM-killed. We increased its resources to accommodate this; however, the logs still showed linkerd_reconnect: "Failed to connect","error" and Connection refused (os error 111) error.sources=[Connection refused (os error 111)].

We have rolled back linkerd-control-plane-chart to v1.16.2 to stabilise the cluster.
Please can you provide an update on this?

Thanks

@DavidMcLaughlin
Contributor

@danny-devops That sounds like a separate issue to anything mentioned here given you are seeing OOM kills and are not using 2.13.x. Do you mind creating a new ticket so we can triage it there?

@Nutties93

Is there any way to resolve this? I am facing similar issues when trying to install Linkerd on EKS: #11697

stale bot commented Mar 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Mar 10, 2024
@stale stale bot closed this as completed Mar 30, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 30, 2024