Outbound HTTP call failing intermittently #11055
Since our upgrade to 2.13.4 we've been experiencing the same errors in our cluster, which result in transient 503s returned to the clients. It looks like the same or a very similar problem to the one reported in https://linkerd.buoyant.io/t/503-service-unavailable-from-proxy-with-large-number-of-connections/126. It doesn't affect any service specifically; the whole mesh is affected randomly during peak load. There's no throttling or resource constraint issue, and we're not using service profiles, Linkerd's circuit breaking, or dynamic routing. The log of the affected clients is filled with information, but I don't see an answer as to why the proxy is marking services as unavailable.
Metrics are not helpful either: according to them, everything is working fine, but it isn't. The proxy is not counting those 503s as failed requests, so they're not visible in viz or Prometheus. The only real clue comes from transport-level metrics, where we see the same pattern. Sadly I don't have reproduction instructions, but something must have changed between 2.12 and 2.13 that causes this odd behaviour.
@someone-stole-my-name What version were you running before?
@MarkSRobinson Issues started for me after upgrading from 2.11.1 to 2.13.4. The issue still exists on 2.13.5 as well.
2.12.4. We've reverted just the data plane to the previous version and things are stable again, so far. We replaced some core components first, and in the rollback process noticed something that may give some clues on where to look. Reverting just the outbound proxies doesn't fix the issue: you still get slightly different errors from v2.12, but the same outcome, with the service marked as unavailable and a 503 returned to the client:
Reverting the inbound proxy to 2.12 apparently fixes this issue. Here are some logs from the inbound side while still using 2.13.4. Notice how the pattern changes.
@vkperumal Can you tell if the target services are under high load when those errors occur, as reported by @someone-stole-my-name?
@alpeb My application is a scheduler which makes outbound HTTP calls to other apps. There is no load on the application.
We're having the same errors after upgrading to 2.13.5, and the only load was communication between the services in our application.
Also experiencing the same issue - it seems it was introduced in 2.13.0, as I have switched between 2.12.5 / 2.13.0 today and seen no issues in the former. It also very specifically occurs when making over 100 concurrent outgoing requests. Any fewer and everything works as normal; any more (even 101) causes the same errors described in the original issue and crashes the respective client pod.
We are also experiencing unpredictable connection issues between our services since upgrading from 2.12.5 to 2.13.5. There is no heavy load on our apps at the moment, and we still can't figure out what could be causing this.
I'm having this same issue on a cluster with several hundred meshed pods. I'm on 2.13.4. I worry that it's an out-of-date endpoints cache issue that is causing the whole service to be deemed unavailable. The only way to recover from this issue is to restart all the proxies that are logging these errors.
Similar to other posters, we have run into this issue and reported it elsewhere (see https://linkerd.buoyant.io/t/linkerd-destination-has-stale-data-in-endpoint-cache/171). @bjoernw, we also think it is a stale endpoints cache, but we found that we could recover from instances of it by restarting the destination controllers (presumably this sends a RESET to the affected proxies, which clears their caches?). We could not reproduce it after ~2 weeks of trying, but our suspicion is that what happens is:
Of course, this is all conjecture: without a reproduction it is really hard to prove any of this is happening, or to get the necessary log info to track down which cache is failing in what way. Since we don't log at debug level by default, all we have are our metrics, which show a correlation between loadshedding behavior and stale IPs (note that stale IPs were being accessed before the major errors started, which indicates that some bad endpoints in the cache were being hit on a relatively frequent basis, but not considered for routing HTTP requests until something -- a rise in latency or a few 504s -- triggered the failure). The graphs also show the relationship between loadshedding errors and endpoints (the divot in the graph is where we attempted to restart the affected inbound proxies/pods; we were unable to recover from the situation because we did not restart the nginx pods that were sending the outbound traffic to the affected deployment). Hopefully someone can come up with a reproduction, but we are going to downgrade to 2.12.5 for the time being.
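For reference, a minimal sketch of that recovery procedure, assuming the default `linkerd` namespace and the stock `linkerd-destination` deployment name (the application namespace and deployment names below are placeholders):

```bash
# Restart the destination controllers; proxies re-resolve against them and
# should drop any stale endpoint state they were previously fed.
kubectl -n linkerd rollout restart deploy/linkerd-destination
kubectl -n linkerd rollout status deploy/linkerd-destination

# If a workload is still being marked unavailable, restarting the *client*
# side (the pods whose proxies log the 503s) forces fresh discovery lookups.
kubectl -n <app-namespace> rollout restart deploy/<affected-client-deployment>
```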
@jandersen-plaid Great write-up, makes sense to me. There have been a couple of issues related to endpoints being out-of-date recently (#10003). I'm going to set up a job that continuously monitors the diff between the Kubernetes Endpoints API and Linkerd's diagnostics command. I just ran this during one of our canary deploys and I did see some IPs in the Linkerd list that weren't in the endpoints list for a few seconds. That's definitely expected since this is eventually consistent, but I'm going to be looking for the scenario where this list doesn't reconcile over time for a specific destination pod. I see you're also having this issue primarily on your ingress controller. That's usually where it starts for us, but then it spreads randomly to other service-to-service calls as well. I guess the one thing they all have in common is the cache in the destination controller... If we run this once a minute we should be able to see when a certain pod's cache doesn't recover.
#!/bin/bash
# Compare the endpoint IPs Kubernetes publishes for a service with the IPs
# each Linkerd destination controller is currently serving for it.
NAMESPACE=my-namespace
ENDPOINT=my-service
PORT=3000

# IPs listed in the service's EndpointSlices.
ENDPOINTS=$(kubectl get endpointslice -n "$NAMESPACE" -l kubernetes.io/service-name="$ENDPOINT" -o json | jq -r '.items[].endpoints[].addresses[]' | sort)

LINKERD_DESTINATION_PODS=$(kubectl get pods -n linkerd -l linkerd.io/control-plane-component=destination -o jsonpath='{.items[*].metadata.name}')

for POD in $LINKERD_DESTINATION_PODS
do
  # IPs this destination controller resolves for the service (skip the header row).
  LINKERD_DIAGNOSTICS=$(linkerd diagnostics endpoints --destination-pod "$POD" "$ENDPOINT.$NAMESPACE.svc.cluster.local:$PORT" | awk 'NR>1 {print $2}' | sort)

  ONLY_IN_ENDPOINTS=$(comm -23 <(echo "$ENDPOINTS") <(echo "$LINKERD_DIAGNOSTICS"))
  ONLY_IN_LINKERD=$(comm -13 <(echo "$ENDPOINTS") <(echo "$LINKERD_DIAGNOSTICS"))

  if [[ -z $ONLY_IN_ENDPOINTS && -z $ONLY_IN_LINKERD ]]
  then
    echo "Both IP sets are identical for Pod $POD."
  else
    if [[ -n $ONLY_IN_ENDPOINTS ]]
    then
      echo "IPs only in EndpointSlices for Pod $POD:"
      echo "$ONLY_IN_ENDPOINTS"
    fi
    if [[ -n $ONLY_IN_LINKERD ]]
    then
      echo "IPs only in Linkerd Diagnostics for Pod $POD:"
      echo "$ONLY_IN_LINKERD"
    fi
  fi
done
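A rough way to run this comparison once a minute and keep a timestamped record (assuming the script above is saved as `compare-endpoints.sh`); gaps in the resulting log would also reveal when the diagnostics command itself hangs:

```bash
# Run the endpoint-drift check every 60 seconds and append the output,
# with a UTC timestamp, to a log file for later correlation with incidents.
while true; do
  echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
  ./compare-endpoints.sh
  sleep 60
done | tee -a endpoint-drift.log
```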
@jandersen-plaid Are you seeing any endpoint update failures in your k8s events around the time of the issue? I'm seeing an elevated rate of these on the problematic service:
Hmmm, is this coming from the endpoint controller or Linkerd? On the endpoint controller side, we do see this error (a 409 in the API server logs because of the resource conflict) around the incident, but it also happens rather frequently in the cluster outside of the incident window as well. We may have seen it on the Linkerd side during the incident, but some of our logs from the most recent occurrence have reduced granularity at this point, so I can't confirm that.
@jandersen-plaid that's from the endpoint controller. In the past I had suspected that throttling at the API server level was causing the Endpoints API itself to be out of sync with the current set of Pods, hence leading the Linkerd cache to be out of sync, but it seems like the problem lies between the endpoints and Linkerd. I'm running a version of the script above in the cluster now, with automatic restarts of the destination controller and extensive logging. Will report back if I find anything interesting.
We have experienced the same issue, and rolling back the proxy to linkerd2 stable-2.12.x worked. When observing the exposed Linkerd metrics we could see the "loadshed" error as outlined above by @jandersen-plaid. This is easily reproduced by following the steps below.
Prerequisites
Steps to replicate proxy errors "service unavailable"
Thanks for the reproduction @nikskiz! The results of performing the test on 2.13.4 vs 2.12.5 are staggering. Note: the difference in requests sent can likely be explained by the faster failure in 2.13.4: less time per request leads to more iterations, but they shouldn't fail in the first place!
Linkerd 2.13.4
k6 Summary
k6 logs
`linkerd-proxy` container logs on `web` pod
Logs from `web-svc` container on the `web` pod
Linkerd 2.12.5
k6 Summary
k6 logs
`linkerd-proxy` container logs on `web` pod
Logs from `web-svc` container on the `web` pod
Clarifying a bit on the particular problem we are seeing in 2.13 (happy to file an additional issue if this is too far off topic): @nikskiz, your test perfectly encapsulates some really eager request failures (our second problem) within Linkerd.
@jandersen-plaid Thanks for the details on your first issue. Maybe we can still try to replicate your first issue by modifying the load-test parameters. We use stages in k6 to ramp down the test and try to make the error persist during "normal" load. The stages will look as follows:
In the first 10 seconds, we should be able to get the failures. Then, over 5 seconds, we ramp down to 20 iterations. We hold the 20 iterations for another 60s. You should restart the downstream app in the first 10 seconds so that it gets new endpoints. Assuming the proxy is still sending requests to the old endpoint, we should still see errors during the ramp-down (after 15 seconds). I ran this locally and wasn't able to reproduce any errors. I could see the errors disappear when the ramp-down started, even though the downstream app received a new endpoint. The only way I was able to see the "service unavailable" error again was when I forcefully deleted the voting app endpoint, so it had no endpoints left.
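For anyone wanting to try the same shape of test, k6 can take the stages on the command line instead of in the script. This is only a sketch of the ramp described above: `load-test.js` is a placeholder for whatever script the repro uses, and `--stage` ramps virtual users rather than iterations.

```bash
# Ramp hard for 10s to trigger the failures, ramp down to a light load over
# 5s, then hold that load for 60s while watching whether the 503s persist
# after the downstream app has been restarted.
k6 run \
  --stage 10s:100 \
  --stage 5s:20 \
  --stage 60s:20 \
  load-test.js
```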
This is a huge problem. The destination controller is broken on 2.13.4. Occasionally under load (I guess?) it will stop responding to diagnostics commands and also won't serve up new IPs to proxies, leading to a cluster-wide failure until the controller is restarted. We're running 300+ meshed pods with 5 pods per control-plane component, so maybe this only shows up at this scale? I have a job that runs every minute and diffs the IPs for a handful of endpoints per destination controller pod and per endpoint slice. During the outage this job was the canary in the coal mine, because it stopped logging - most likely because the diagnostics command hung. The destination controllers stopped logging anything at all (they usually complain about one thing or another). I'm going to downgrade to 2.12 as soon as possible.
Hey folks - thanks for the reports here! Linkerd maintainers are actively investigating multiple things raised in this thread. Thanks to @nikskiz for the repro steps; we're using these to investigate the loadshed behavior in the proxy and determine its root cause. The issues around destination controller stability are independent of the loadshed behavior; @alpeb will be following up to gather more information to determine what may be at play. More soon!
In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results in proxies shedding load much more aggressively while under high load to a single destination service, resulting in increased error rates in comparison to 2.12 (see linkerd/linkerd2#11055 for details).

This commit changes the default HTTP request queue capacities for the inbound and outbound proxies back to 10,000 requests, the way they were in 2.12 and earlier. In manual load testing I've verified that increasing the queue capacity results in a substantial decrease in 503 Service Unavailable errors emitted by the proxy: with a queue capacity of 100 requests, the load test described [here] observed a failure rate of 51.51% of requests, while with a queue capacity of 10,000 requests, the same load test observes no failures.

Note that I did not modify the TCP connection queue capacities, or the control plane request queue capacity. These were previously configured by the same variable before #2078, but were split out into separate vars in that change. I don't think the queue capacity limits for TCP connection establishment or for control plane requests are currently resulting in instability the way the decreased request queue capacity is, so I decided to make a more focused change to just the HTTP request queues for the proxies.

[here]: linkerd/linkerd2#11055 (comment)
This edge release restores a proxy setting for it to shed load less aggressively while under high load, which should result in lower error rates (addressing #11055). It also removes the usage of host networking in the linkerd-cni extension.

* Changed the default HTTP request queue capacities for the inbound and outbound proxies back to 10,000 requests (see #11055 and #11198)
* Lifted need of using host networking in the linkerd-cni Daemonset (#11141) (thanks @abhijeetgauravm!)
I'm happy to report that the aggressive load-shedding behavior has just been fixed in edge-23.8.1, released yesterday, and the fix will be back-ported into stable-2.13.6.
Thank you so much, @alpeb!
Hi folks, I just wanted to chime in about edge-23.8.1, which includes the fix for a regression that made proxies much more likely to shed load under moderate load (linkerd/linkerd2-proxy#2449). We believe that change should resolve the issues described in this thread; it certainly resolves the issue reproduced by @nikskiz in #11055 (comment). Therefore, if anyone has the opportunity to test the edge release in your staging/testing environments, we'd love confirmation that there are no additional issues going on here. In any case, the fix will make it into stable-2.13.6 later this week.
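For anyone able to test in staging, a rough outline of moving a test cluster onto the edge release via the CLI-based workflow (Helm users would bump their chart and proxy versions instead; the application namespace below is a placeholder):

```bash
# Install the edge-channel CLI alongside the stable one.
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install-edge | sh
linkerd version --client

# Upgrade the control plane: CRDs first, then the core manifests.
linkerd upgrade --crds | kubectl apply -f -
linkerd upgrade | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -

# Restart meshed workloads so their sidecars pick up the new proxy image.
kubectl -n <app-namespace> rollout restart deploy
```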
This stable release fixes a regression introduced in stable-2.13.0 which resulted in proxies shedding load too aggressively while under moderate request load to a single service ([#11055]). In addition, it updates the base image for the `linkerd-cni` initcontainer to resolve a CVE in `libdb` ([#11196]), fixes a race condition in the Destination controller that could cause it to crash ([#11163]), as well as fixing a number of other issues.

* Control Plane
  * Fixed a race condition in the destination controller that could cause it to panic ([#11169]; fixes [#11193])
  * Improved the granularity of logging levels in the control plane ([#11147])
  * Replaced incorrect `server_port_subscribers` gauge in the Destination controller's metrics with `server_port_subscribes` and `server_port_unsubscribes` counters ([#11206]; fixes [#10764])
* Proxy
  * Changed the default HTTP request queue capacities for the inbound and outbound proxies back to 10,000 requests ([#11198]; fixes [#11055])
* CLI
  * Updated extension CLI commands to prefer the `--registry` flag over the `LINKERD_DOCKER_REGISTRY` environment variable, making the precedence more consistent (thanks @harsh020!) (see [#11144])
* CNI
  * Updated `linkerd-cni` base image to resolve [CVE-2019-8457] in `libdb` ([#11196])
  * Changed the CNI plugin installer to always run in 'chained' mode; the plugin will now wait until another CNI plugin is installed before appending its configuration ([#10849])
  * Removed `hostNetwork: true` from linkerd-cni Helm chart templates ([#11158]; fixes [#11141]) (thanks @abhijeetgauravm!)
* Multicluster
  * Fixed the `linkerd multicluster check` command failing in the presence of lots of mirrored services ([#10764])

[#10764]: #10764
[#10849]: #10849
[#11055]: #11055
[#11141]: #11141
[#11144]: #11144
[#11147]: #11147
[#11158]: #11158
[#11163]: #11163
[#11169]: #11169
[#11196]: #11196
[#11198]: #11198
[#11206]: #11206
[CVE-2019-8457]: https://avd.aquasec.com/nvd/2019/cve-2019-8457/
@risingspiral We are still seeing the issue after upgrading to 2.13.6.
We have upgraded to 2.13.6 to try to resolve the endpoint failures, and I'm still seeing tons of these errors.
Thanks for the update @tgolly @vkperumal. This is still on our radar.
Those "Failed to get node topology zone" errors calmed down after a couple of days. Not seeing them anymore. So is this issue considered resolved in 2.13.6? |
Hey folks - thanks for all the help on this issue. For now we're marking this as resolved, as the regression between 2.12 and 2.13 that led to premature loadshed due to a small queue size has been addressed. @vkperumal, if you're still seeing errors it would be good to understand the total request volume the service is seeing when the errors are encountered. @bjoernw, for destination controller reliability and consistency under high endpoint churn we'll track the investigation in #11279.
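If it helps with gathering that data, one way to pull per-proxy numbers is the diagnostics command below. This is only a sketch: the namespace and pod name are placeholders, and whether a loadshed-labelled error counter appears in the output depends on the proxy version.

```bash
# Dump the raw Prometheus metrics from one affected pod's proxy and look at
# overall request volume plus any load-shedding related counters.
linkerd diagnostics proxy-metrics -n <app-namespace> po/<affected-pod> \
  | grep -E 'request_total|loadshed'
```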
What is the issue?
Outbound HTTP calls to an application fail intermittently with a "connection closed" error. The source application whose HTTP calls are failing is a scheduler which makes regular calls to other applications on a schedule. After a failure, the next scheduled run works fine. Please see the linkerd-proxy logs below.
How can it be reproduced?
Linkerd Version: 2.13.4
Logs, error output, etc
[129664.685446s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39860
[129664.691331s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39872
[129664.691557s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39878
[129664.692554s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39894
[129664.695277s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39900
[129784.342535s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42724
[129784.342604s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42738
[129784.342642s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42746
[129784.342682s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42760
Output of `linkerd check -o short`
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None