Outbound HTTP call failing intermittently #11055
Since our upgrade to 2.13.4 we've been experiencing the same errors in our cluster, which result in transient 503s returned to the clients. It looks like the same or a very similar problem to the one reported in https://linkerd.buoyant.io/t/503-service-unavailable-from-proxy-with-large-number-of-connections/126. It doesn't affect any service specifically; the whole mesh is affected randomly during peak load. There's no throttling or resource constraint issue, and we're not using service profiles, Linkerd's circuit breaking, or dynamic routing. The log of the affected clients is filled with information, but I don't see an answer as to why the proxy is marking services as unavailable.
Metrics are not helpful either: according to them, everything is working fine, but it isn't. The proxy is not counting those 503s as failed requests, so they're not visible in viz or Prometheus. The only real clue comes from transport-level metrics, where we see the same pattern. Sadly I don't have reproduction instructions, but something must have changed between 2.12 and 2.13 that causes this odd behaviour.
@someone-stole-my-name What version were you running before?
@MarkSRobinson Issues started for me after upgrading from 2.11.1 to 2.13.4. The issue still exists on 2.13.5 as well.
2.12.4. We've reverted just the data plane to the previous version and things are stable again, so far. We replaced some core components first, and in the rollback process noticed something that may give some clues on where to look. Reverting just the outbound proxies doesn't fix the issue: you still get slightly different errors from v2.12, but the same outcome, with the service marked as unavailable and a 503 returned to the client:
Reverting the inbound proxy to 2.12 apparently fixes this issue. Here are some logs from the inbound side while still using 2.13.4. Notice how the pattern changes.
@vkperumal Can you tell if the target services are under high load when those errors occur, as reported by @someone-stole-my-name?
@alpeb My application is a scheduler which makes outbound HTTP calls to other apps. There is no load on the application.
We're having the same errors after upgrading to 2.13.5, and the only load was communication between the services in our application.
Also experiencing the same issue - it seems it was introduced in 2.13.0, as I have switched between 2.12.5 / 2.13.0 today and seen no issues in the former. It also very specifically occurs when making over 100 concurrent outgoing requests. Any fewer and everything works as normal; any more (even 101) causes the same errors described in the original issue and crashes the respective client pod.
We are also experiencing unpredictable connection issues between our services since upgrading from 2.12.5 to 2.13.5. There is no heavy load on our apps at the moment, and we still can't figure out what could be causing this.
I'm having this same issue on a cluster with several hundred meshed pods. I'm on 2.13.4. I worry that it's an out-of-date endpoints cache issue that is causing the whole service to be deemed unavailable. The only way to recover from this issue is to restart all the proxies that are logging these errors.
Similar to other posters, we have run into this issue and reported it elsewhere (see https://linkerd.buoyant.io/t/linkerd-destination-has-stale-data-in-endpoint-cache/171). @bjoernw, we also think it is a stale endpoints cache, but we found that we could recover from instances of it by restarting the destination controllers (presumably this sends a RESET to the affected proxies, which clears their caches?). We could not reproduce it after ~2 weeks of trying, but our suspicion is that what happens is:
Of course, this is all conjecture: without a reproduction it is really hard to prove any of this is happening, or to get the necessary log info to track down which cache is failing in what way. Since we don't log at debug level by default, all we have are our metrics, which show a correlation between loadshedding behavior and stale IPs (note that stale IPs were being accessed before the major errors started, which indicates that some bad endpoints in the cache were being hit on a relatively frequent basis, but not considered for routing HTTP requests until something -- a rise in latency or a few 504s -- triggered the failure). The graphs also show the relationship between loadshedding errors and endpoints (the divot in the graph is where we attempted to restart the affected inbound proxies/pods; we were unable to recover from the situation because we did not restart the nginx pods that were sending the outbound traffic to the affected deployment). Hopefully someone can come up with a reproduction, but we are going to downgrade to 2.12.5 for the time being.
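For reference, a minimal sketch of that recovery procedure, assuming the default `linkerd` namespace and the stock `linkerd-destination` deployment name (the application namespace and deployment names below are placeholders):

```bash
# Restart the destination controllers; proxies re-resolve against them and
# should drop any stale endpoint state they were previously fed.
kubectl -n linkerd rollout restart deploy/linkerd-destination
kubectl -n linkerd rollout status deploy/linkerd-destination

# If a workload is still being marked unavailable, restarting the *client*
# side (the pods whose proxies log the 503s) forces fresh discovery lookups.
kubectl -n <app-namespace> rollout restart deploy/<affected-client-deployment>
```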
@jandersen-plaid Great write-up, makes sense to me. There have been a couple of issues related to endpoints being out-of-date recently (#10003). I'm going to set up a job that continuously monitors the diff between the Kubernetes Endpoints API and Linkerd's diagnostics command. I just ran this during one of our canary deploys and I did see some IPs in the Linkerd list that weren't in the endpoints list for a few seconds. That's definitely expected since this is eventually consistent, but I'm going to be looking for the scenario where this list doesn't reconcile over time for a specific destination pod. I see you're also having this issue primarily on your ingress controller. That's usually where it starts for us, but then it spreads randomly to other service-to-service calls as well. I guess the one thing they all have in common is the cache in the destination controller... If we run this once a minute we should be able to see when a certain pod's cache doesn't recover.
#!/bin/bash
# Compare the endpoint IPs Kubernetes publishes for a service with the IPs
# each Linkerd destination controller is currently serving for it.
NAMESPACE=my-namespace
ENDPOINT=my-service
PORT=3000

# IPs listed in the service's EndpointSlices.
ENDPOINTS=$(kubectl get endpointslice -n "$NAMESPACE" -l kubernetes.io/service-name="$ENDPOINT" -o json | jq -r '.items[].endpoints[].addresses[]' | sort)

LINKERD_DESTINATION_PODS=$(kubectl get pods -n linkerd -l linkerd.io/control-plane-component=destination -o jsonpath='{.items[*].metadata.name}')

for POD in $LINKERD_DESTINATION_PODS
do
  # IPs this destination controller resolves for the service (skip the header row).
  LINKERD_DIAGNOSTICS=$(linkerd diagnostics endpoints --destination-pod "$POD" "$ENDPOINT.$NAMESPACE.svc.cluster.local:$PORT" | awk 'NR>1 {print $2}' | sort)

  ONLY_IN_ENDPOINTS=$(comm -23 <(echo "$ENDPOINTS") <(echo "$LINKERD_DIAGNOSTICS"))
  ONLY_IN_LINKERD=$(comm -13 <(echo "$ENDPOINTS") <(echo "$LINKERD_DIAGNOSTICS"))

  if [[ -z $ONLY_IN_ENDPOINTS && -z $ONLY_IN_LINKERD ]]
  then
    echo "Both IP sets are identical for Pod $POD."
  else
    if [[ -n $ONLY_IN_ENDPOINTS ]]
    then
      echo "IPs only in EndpointSlices for Pod $POD:"
      echo "$ONLY_IN_ENDPOINTS"
    fi
    if [[ -n $ONLY_IN_LINKERD ]]
    then
      echo "IPs only in Linkerd Diagnostics for Pod $POD:"
      echo "$ONLY_IN_LINKERD"
    fi
  fi
done
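A rough way to run this comparison once a minute and keep a timestamped record (assuming the script above is saved as `compare-endpoints.sh`); gaps in the resulting log would also reveal when the diagnostics command itself hangs:

```bash
# Run the endpoint-drift check every 60 seconds and append the output,
# with a UTC timestamp, to a log file for later correlation with incidents.
while true; do
  echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
  ./compare-endpoints.sh
  sleep 60
done | tee -a endpoint-drift.log
```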
@jandersen-plaid Are you seeing any endpoint update failures in your k8s events around the time of the issue? I'm seeing an elevated rate of these on the problematic service:
Hmmm, is this coming from the endpoint controller or Linkerd? On the endpoint controller side, we do see this error (a 409 in the API server logs because of the resource conflict) around the incident, but it also happens rather frequently in the cluster outside of the incident window as well. We may have seen it on the Linkerd side during the incident, but some of our logs from the most recent occurrence have reduced granularity at this point, so I can't confirm that.
@jandersen-plaid that's from the endpoint controller. In the past I had suspected that throttling at the API server level was causing the Endpoints API itself to be out of sync with the current set of Pods, hence leading the Linkerd cache to be out of sync, but it seems like the problem lies between the endpoints and Linkerd. I'm running a version of the script above in the cluster now, with automatic restarts of the destination controller and extensive logging. Will report back if I find anything interesting.
We have experienced the same issue, and rolling back the proxy to linkerd2 stable-2.12.x worked. When observing the exposed Linkerd metrics we could see the "loadshed" error as outlined above by @jandersen-plaid. This is easily reproduced by following the steps below.
Prerequisites
Steps to replicate proxy errors "service unavailable"
Thanks for the reproduction @nikskiz! The results of performing the test on 2.13.4 vs 2.12.5 are staggering. Note: the difference in requests sent can likely be explained by the faster failure in 2.13.4: less time per request leads to more iterations, but they shouldn't fail in the first place!
Linkerd 2.13.4
k6 Summary
k6 logs
`linkerd-proxy` container logs on `web` pod
Logs from `web-svc` container on the `web` pod
Linkerd 2.12.5
k6 Summary
k6 logs
`linkerd-proxy` container logs on `web` pod
Logs from `web-svc` container on the `web` pod
Clarifying a bit on the particular problem we are seeing in 2.13 (happy to file an additional issue if this is too far off topic): @nikskiz, your test perfectly encapsulates some really eager request failures (our second problem) within Linkerd.
@jandersen-plaid Thanks for the details on your first issue. Maybe we can still try to replicate your first issue by modifying the load-test parameters. We use stages in k6 to ramp down the test and try to make the error persist during "normal" load. The stages will look as follows:
In the first 10 seconds, we should be able to get the failures. Then, over 5 seconds, we ramp down to 20 iterations. We hold the 20 iterations for another 60s. You should restart the downstream app in the first 10 seconds so that it gets new endpoints. Assuming the proxy is still sending requests to the old endpoint, we should still see errors during the ramp-down (after 15 seconds). I ran this locally and wasn't able to reproduce any errors. I could see the errors disappear when the ramp-down started, even though the downstream app received a new endpoint. The only way I was able to see the "service unavailable" error again was when I forcefully deleted the voting app endpoint, so it had no endpoints left.
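For anyone wanting to try the same shape of test, k6 can take the stages on the command line instead of in the script. This is only a sketch of the ramp described above: `load-test.js` is a placeholder for whatever script the repro uses, and `--stage` ramps virtual users rather than iterations.

```bash
# Ramp hard for 10s to trigger the failures, ramp down to a light load over
# 5s, then hold that load for 60s while watching whether the 503s persist
# after the downstream app has been restarted.
k6 run \
  --stage 10s:100 \
  --stage 5s:20 \
  --stage 60s:20 \
  load-test.js
```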
This is a huge problem. The destination controller is broken on 2.13.4. Occasionally under load (I guess?) it will stop responding to diagnostics commands and also won't serve up new IPs to proxies, leading to a cluster-wide failure until the controller is restarted. We're running 300+ meshed pods with 5 pods per control-plane component, so maybe this only shows up at this scale? I have a job that runs every minute and diffs the IPs for a handful of endpoints per destination controller pod and per endpoint slice. During the outage this job was the canary in the coal mine, because it stopped logging - most likely because the diagnostics command hung. The destination controllers stopped logging anything at all (they usually complain about one thing or another). I'm going to downgrade to 2.12 as soon as possible.
Hey folks - thanks for the reports here! Linkerd maintainers are actively investigating multiple things raised in this thread. Thanks to @nikskiz for the repro steps; we're using these to investigate the loadshed behavior in the proxy and determine its root cause. The issues around destination controller stability are independent of the loadshed behavior; @alpeb will be following up to gather more information to determine what may be at play. More soon!
In 2.13, the default inbound and outbound HTTP request queue capacity decreased from 10,000 requests to 100 requests (in PR #2078). This change results in proxies shedding load much more aggressively while under high load to a single destination service, resulting in increased error rates in comparison to 2.12 (see linkerd/linkerd2#11055 for details).

This commit changes the default HTTP request queue capacities for the inbound and outbound proxies back to 10,000 requests, the way they were in 2.12 and earlier. In manual load testing I've verified that increasing the queue capacity results in a substantial decrease in 503 Service Unavailable errors emitted by the proxy: with a queue capacity of 100 requests, the load test described [here] observed a failure rate of 51.51% of requests, while with a queue capacity of 10,000 requests, the same load test observes no failures.

Note that I did not modify the TCP connection queue capacities, or the control plane request queue capacity. These were previously configured by the same variable before #2078, but were split out into separate vars in that change. I don't think the queue capacity limits for TCP connection establishment or for control plane requests are currently resulting in instability the way the decreased request queue capacity is, so I decided to make a more focused change to just the HTTP request queues for the proxies.

[here]: linkerd/linkerd2#11055 (comment)
This edge release restores a proxy setting for it to shed load less aggressively while under high load, which should result in lower error rates (addressing #11055). It also removes the usage of host networking in the linkerd-cni extension.

* Changed the default HTTP request queue capacities for the inbound and outbound proxies back to 10,000 requests (see #11055 and #11198)
* Lifted need of using host networking in the linkerd-cni Daemonset (#11141) (thanks @abhijeetgauravm!)
I'm happy to report that the aggressive load-shedding behavior has just been fixed in edge-23.8.1, released yesterday, and the fix will be back-ported into stable-2.13.6.
Thank you so much, @alpeb!
Hi folks, I just wanted to chime in about edge-23.8.1, which includes the fix for a regression that made proxies much more likely to shed load under moderate load (linkerd/linkerd2-proxy#2449). We believe that change should resolve the issues described in this thread; it certainly resolves the issue reproduced by @nikskiz in #11055 (comment). Therefore, if anyone has the opportunity to test the edge release in your staging/testing environments, we'd love confirmation that there are no additional issues going on here. In any case, the fix will make it into stable-2.13.6 later this week.
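For anyone able to test in staging, a rough outline of moving a test cluster onto the edge release via the CLI-based workflow (Helm users would bump their chart and proxy versions instead; the application namespace below is a placeholder):

```bash
# Install the edge-channel CLI alongside the stable one.
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install-edge | sh
linkerd version --client

# Upgrade the control plane: CRDs first, then the core manifests.
linkerd upgrade --crds | kubectl apply -f -
linkerd upgrade | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -

# Restart meshed workloads so their sidecars pick up the new proxy image.
kubectl -n <app-namespace> rollout restart deploy
```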
This stable release fixes a regression introduced in stable-2.13.0 which resulted in proxies shedding load too aggressively while under moderate request load to a single service ([#11055]). In addition, it updates the base image for the `linkerd-cni` initcontainer to resolve a CVE in `libdb` ([#11196]), fixes a race condition in the Destination controller that could cause it to crash ([#11163]), as well as fixing a number of other issues.

* Control Plane
  * Fixed a race condition in the destination controller that could cause it to panic ([#11169]; fixes [#11193])
  * Improved the granularity of logging levels in the control plane ([#11147])
  * Replaced incorrect `server_port_subscribers` gauge in the Destination controller's metrics with `server_port_subscribes` and `server_port_unsubscribes` counters ([#11206]; fixes [#10764])
* Proxy
  * Changed the default HTTP request queue capacities for the inbound and outbound proxies back to 10,000 requests ([#11198]; fixes [#11055])
* CLI
  * Updated extension CLI commands to prefer the `--registry` flag over the `LINKERD_DOCKER_REGISTRY` environment variable, making the precedence more consistent (thanks @harsh020!) (see [#11144])
* CNI
  * Updated `linkerd-cni` base image to resolve [CVE-2019-8457] in `libdb` ([#11196])
  * Changed the CNI plugin installer to always run in 'chained' mode; the plugin will now wait until another CNI plugin is installed before appending its configuration ([#10849])
  * Removed `hostNetwork: true` from linkerd-cni Helm chart templates ([#11158]; fixes [#11141]) (thanks @abhijeetgauravm!)
* Multicluster
  * Fixed the `linkerd multicluster check` command failing in the presence of lots of mirrored services ([#10764])

[#10764]: #10764
[#10849]: #10849
[#11055]: #11055
[#11141]: #11141
[#11144]: #11144
[#11147]: #11147
[#11158]: #11158
[#11163]: #11163
[#11169]: #11169
[#11196]: #11196
[#11198]: #11198
[#11206]: #11206
[CVE-2019-8457]: https://avd.aquasec.com/nvd/2019/cve-2019-8457/
@risingspiral We are still seeing the issue after upgrading to 2.13.6.
We have upgraded to 2.13.6 to try to resolve the endpoint failures, and I'm still seeing tons of these errors.
Thanks for the update @tgolly @vkperumal. This is still on our radar.
Those "Failed to get node topology zone" errors calmed down after a couple of days. Not seeing them anymore. So is this issue considered resolved in 2.13.6? |
Hey folks - thanks for all the help on this issue. For now we're marking this as resolved, as the regression between 2.12 and 2.13 that led to premature loadshed due to a small queue size has been addressed. @vkperumal, if you're still seeing errors it would be good to understand the total request volume the service is seeing when the errors are encountered. @bjoernw, for destination controller reliability and consistency under high endpoint churn we'll track the investigation in #11279.
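If it helps with gathering that data, one way to pull per-proxy numbers is the diagnostics command below. This is only a sketch: the namespace and pod name are placeholders, and whether a loadshed-labelled error counter appears in the output depends on the proxy version.

```bash
# Dump the raw Prometheus metrics from one affected pod's proxy and look at
# overall request volume plus any load-shedding related counters.
linkerd diagnostics proxy-metrics -n <app-namespace> po/<affected-pod> \
  | grep -E 'request_total|loadshed'
```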
What is the issue?
Outbound HTTP calls to an application fail intermittently with a "connection closed" error. The source application whose HTTP calls are failing is a scheduler which makes regular calls to other applications on a schedule. After a failure, the next scheduled run works fine. Please see the linkerd-proxy logs below.
How can it be reproduced?
Linkerd Version: 2.13.4
Logs, error output, etc
[129664.685446s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39860
[129664.691331s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39872
[129664.691557s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39878
[129664.692554s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39894
[129664.695277s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:39900
[129784.342535s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42724
[129784.342604s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42738
[129784.342642s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42746
[129784.342682s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=service unavailable client.addr=10.50.29.85:42760
Output of `linkerd check -o short`
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None