The proxy on a random pod enters fail-fast mode preventing it from working correctly. #8934

Closed
agalue opened this issue Jul 20, 2022 · 3 comments

agalue commented Jul 20, 2022

What is the issue?

The Linkerd proxy on a random Cortex Ingester enters fail-fast mode, blocking communication with the distributors but not with other Cortex components such as the Queriers.

That effectively breaks replication, as the distributors cannot see the Ingester as healthy, even though communication via memberlist is unaffected and the Ingester appears active on the ring.

Restarting the Ingester only solves the problem temporarily. The strange part is that, sometimes, another Ingester enters the fail-fast state after the affected one is restarted, which is why I describe the problem as random.

How can it be reproduced?

The problem appears when handling a considerable amount of traffic. Currently, the distributors receive a constant rate of 50K samples per second in batches of 500, which works out to roughly 100 requests per second (with 500 samples each), and according to the linkerd viz dashboard, the Ingesters receive a similar number of RPS.

In my initial tests with orders of magnitude less traffic, the problem does not appear.

Logs, error output, etc

The logs are very verbose due to the ingestion rate, so here are only the last 5 seconds from the affected Ingester (i.e., ingester-0) and the two distributors:

https://gist.github.com/agalue/5ecbbfcf37ecf8b5798bf18bbe0473b1

Here is how I got the logs:

kubectl logs -n cortex distributor-ddd56cf9-4sz4s -c linkerd-proxy --since=5s > distributor-4sz4s-proxy-logs.txt
kubectl logs -n cortex distributor-ddd56cf9-wzjdd -c linkerd-proxy --since=5s > distributor-wzjdd-proxy-logs.txt
kubectl logs -n cortex ingester-0 linkerd-proxy --since=5s > ingester-0-proxy-logs.txt

output of linkerd check -o short

~ linkerd check -o short
Status check results are √

Environment

  • Kubernetes Version: 1.23.3
  • Cluster Environment: AKS with Kubenet and Calico
  • Host OS: Ubuntu 18.04.6 LTS, 5.4.0-1083-azure, containerd://1.5.11+azure-2 (managed by AKS)
  • Linkerd Version: 2.11.4

Note: the problem appears with and without Calico (tested on different clusters).

Possible solution

No response

Additional context

In Cortex, all components talk to each other via Pod IP, meaning all communication happens Pod-to-Pod over gRPC.
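
To double-check that this Pod-to-Pod gRPC traffic is actually flowing through the proxies, something like the following can be used (a sketch; adjust the resource type and namespace as needed):

linkerd viz edges po -n cortex
linkerd viz stat po -n cortex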

To give more context on what you will see in the proxy logs, here is the pod list:

~ kubectl get pod -n cortex -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE                                   NOMINATED NODE   READINESS GATES
compactor-0                       2/2     Running   0          47h   10.244.24.10   aks-rpsupport-13025066-vmss000004      <none>           <none>
distributor-ddd56cf9-4sz4s        2/2     Running   0          46h   10.244.2.26    aks-distributors-15878912-vmss000000   <none>           <none>
distributor-ddd56cf9-wzjdd        2/2     Running   0          46h   10.244.28.8    aks-distributors-15878912-vmss00000d   <none>           <none>
ingester-0                        2/2     Running   0          24h   10.244.12.14   aks-ingesters-39599960-vmss000001      <none>           <none>
ingester-1                        2/2     Running   0          47h   10.244.4.13    aks-ingesters-39599960-vmss000000      <none>           <none>
ingester-2                        2/2     Running   0          43h   10.244.3.17    aks-ingesters-39599960-vmss000002      <none>           <none>
memcached-chunks-0                3/3     Running   0          47h   10.244.8.26    aks-rpsupport-13025066-vmss000001      <none>           <none>
memcached-chunks-1                3/3     Running   0          47h   10.244.10.28   aks-rpsupport-13025066-vmss000002      <none>           <none>
memcached-chunks-2                3/3     Running   0          47h   10.244.6.33    aks-rpsupport-13025066-vmss000003      <none>           <none>
memcached-frontend-0              3/3     Running   0          47h   10.244.10.30   aks-rpsupport-13025066-vmss000002      <none>           <none>
memcached-frontend-1              3/3     Running   0          47h   10.244.8.25    aks-rpsupport-13025066-vmss000001      <none>           <none>
memcached-frontend-2              3/3     Running   0          47h   10.244.6.32    aks-rpsupport-13025066-vmss000003      <none>           <none>
memcached-index-0                 3/3     Running   0          47h   10.244.10.29   aks-rpsupport-13025066-vmss000002      <none>           <none>
memcached-index-1                 3/3     Running   0          47h   10.244.8.24    aks-rpsupport-13025066-vmss000001      <none>           <none>
memcached-index-2                 3/3     Running   0          47h   10.244.6.31    aks-rpsupport-13025066-vmss000003      <none>           <none>
memcached-metadata-0              3/3     Running   0          47h   10.244.6.34    aks-rpsupport-13025066-vmss000003      <none>           <none>
querier-794978b45f-2b7z2          2/2     Running   0          47h   10.244.23.6    aks-storegws-30442145-vmss000013       <none>           <none>
querier-794978b45f-h2fmf          2/2     Running   0          47h   10.244.15.13   aks-storegws-30442145-vmss00000w       <none>           <none>
querier-794978b45f-vbjqn          2/2     Running   0          47h   10.244.17.8    aks-storegws-30442145-vmss00000z       <none>           <none>
query-frontend-5b57ddb6cf-bkvpk   2/2     Running   0          47h   10.244.6.35    aks-rpsupport-13025066-vmss000003      <none>           <none>
query-frontend-5b57ddb6cf-jxgq2   2/2     Running   0          47h   10.244.8.27    aks-rpsupport-13025066-vmss000001      <none>           <none>
store-gateway-0                   2/2     Running   0          47h   10.244.23.5    aks-storegws-30442145-vmss000013       <none>           <none>
store-gateway-1                   2/2     Running   0          47h   10.244.17.7    aks-storegws-30442145-vmss00000z       <none>           <none>
store-gateway-2                   2/2     Running   0          47h   10.244.15.12   aks-storegws-30442145-vmss00000w       <none>           <none>

The only error I found in the distributors' proxy logs is:

[165766.325281s] DEBUG ThreadId(01) outbound:accept{client.addr=10.244.2.26:49942}: linkerd_app_core::serve: Connection closed reason=connection error: server: Transport endpoint is not connected (os error 107)

In terms of the applications, the affected Ingester reports nothing in its log, as the distributor traffic is not reaching the application.

The distributors, on the other hand, are flooded with the following message, as (I presume) the proxy on the affected Ingester is rejecting the traffic:

level=warn ts=2022-07-18T15:57:05.907166697Z caller=pool.go:184 msg="removing ingester failing healthcheck" addr=10.244.3.14:9095 reason="rpc error: code = Unavailable desc = HTTP Logical service in fail-fast"
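
While an Ingester is in this state, dumping the raw metrics from its proxy can help confirm whether requests are being shed at the proxy rather than at the application (a sketch; metric names vary between proxy versions, so adjust the grep as needed):

linkerd diagnostics proxy-metrics po/ingester-0 -n cortex | grep -E 'request_total|response_total|tcp_open_connections'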

Would you like to work on fixing this bug?

No response

@agalue agalue added the bug label Jul 20, 2022
@olix0r olix0r added support and removed bug labels Jul 20, 2022

agalue commented Jul 28, 2022

After much research and testing, I found that increasing the maximum number of concurrent streams helps.

Since I did that on the affected microservice, I haven't seen the problem in about 24 hours. That is the only method I've found to keep the system working for that long; previously, an Ingester would enter fail-fast within 30 minutes to 2 hours.

The way to increase the maximum number of concurrent streams on an Ingester is to add the following argument to its StatefulSet:

-server.grpc-max-concurrent-streams=100000

The default is 100, and for testing purposes, I decided to use 100000 based on what some people suggested on Cortex's Slack Channel.
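
For reference, this is roughly where the flag ends up in the Ingester StatefulSet; the container name, image tag, and the other arguments below are just placeholders for my actual configuration:

      containers:
        - name: ingester
          image: quay.io/cortexproject/cortex:v1.13.0  # placeholder tag
          args:
            - -target=ingester
            - -config.file=/etc/cortex/cortex.yaml
            - -server.grpc-max-concurrent-streams=100000  # raised from the default of 100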

I consider this a workaround rather than a permanent solution, as without Linkerd I never needed to increase that limit when handling high traffic in previous tests.

However, I'm trying to understand why that helped.

I'll continue monitoring the system to ensure it stays working as it is now.


wowq commented Aug 13, 2022

We have the same issue in our gRPC service. It can only be solved by restarting, and no meaningful information can be found in the logs.
Linkerd Version: stable-2.11.4
This occurs after upgrading the cluster from 1.20 to 1.22 and switching from the init container to the CNI plugin.
It also leads to failures from the nginx ingress to the back-end service.
Could this problem be caused when the CNI plugin and the init container run at the same time?


stale bot commented Nov 17, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 17, 2022
@stale stale bot closed this as completed Dec 2, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 2, 2023