linkerd cni plugin blocks pods initialisation on GKE #10849
Comments
For now, I have got it working by adding an init container with a one-second sleep to the daemonset. However, I am not sure whether this has any ramifications.
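For reference, here is a minimal sketch of that workaround as a JSON patch against the DaemonSet; the DaemonSet name (`linkerd-cni`), namespace (`linkerd-cni`), and busybox image are assumptions based on a default install and may need adjusting:

```bash
# Sketch only: prepend a short sleep as an init container so linkerd-cni starts
# after the node's own CNI plugin has had a chance to drop its config.
# DaemonSet name, namespace, and image are assumptions, not verified defaults.
kubectl patch daemonset linkerd-cni -n linkerd-cni --type=json -p '[
  {
    "op": "add",
    "path": "/spec/template/spec/initContainers",
    "value": [
      {
        "name": "wait-for-node-cni",
        "image": "busybox:1.36",
        "command": ["sh", "-c", "sleep 1"]
      }
    ]
  }
]'
```

The sleep duration is arbitrary; it only makes the race less likely rather than removing it.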
The network-validator init container gets stuck, which is expected behaviour given the implementation. Restarting the pod manually or rolling out resolves the issue. However, having the controller restart the pod automatically, rather than requiring a manual restart, would be a more comprehensive solution.
Thanks for the detailed description and the follow-ups. We're currently working on a solution for this. We'll let you know when we have something that you can test. OTOH, if you manage to find a way to reproduce this consistently, it'd be of great help! :-)
Thanks @alpeb for responding. I am able to reproduce it consistently: when scaling the nodes from a dozen to several dozen, I hit this issue on 25-40% of the workload. Since I introduced a 5-second delay on the CNI daemonset, the original race condition has not reappeared; essentially I am waiting for GKE to create its own CNI config, and only then does the daemonset append the linkerd configuration. The network validator is a great way to make sure pods don't misbehave. However, these pods get scheduled as soon as the node is up, and the network validator runs as an init container in each of them; unlike my workaround for the linkerd-cni plugin, they do not wait. If the linkerd-cni plugin waited until GKE creates its CNI config, without an explicit hard-coded delay, this issue would be solved. Alternatively, if the whole pod restarted when network validation fails, that would fix it as well.
I need some clarification on the retry logic I added around the config detection, roughly: `config_file_count=0`, `retry_count=$((retry_count + 1))`, `sleep 2`, and `find "${HOST_CNI_NET}" -maxdepth 1 -type f \( -iname '*conflist' -o -iname '*conf' \) -print0`. I want to understand if there is any harm with it.
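For context, a minimal sketch of how those fragments could fit together as a wait-and-retry loop; the `HOST_CNI_NET` default, the retry budget, and the exclusion of linkerd's own config are assumptions, not the plugin's actual install script:

```bash
#!/bin/sh
# Sketch only: wait for another CNI plugin's config to appear before chaining.
# HOST_CNI_NET default, retry budget, and failure handling are assumptions.
HOST_CNI_NET=${HOST_CNI_NET:-/host/etc/cni/net.d}
retry_count=0
config_file_count=0

while [ "$config_file_count" -eq 0 ] && [ "$retry_count" -lt 30 ]; do
  # Count existing .conf/.conflist files, ignoring any linkerd-cni config.
  config_file_count=$(find "${HOST_CNI_NET}" -maxdepth 1 -type f \
    \( -iname '*conflist' -o -iname '*conf' \) ! -iname '*linkerd*' | wc -l)
  if [ "$config_file_count" -eq 0 ]; then
    retry_count=$((retry_count + 1))
    sleep 2
  fi
done

if [ "$config_file_count" -eq 0 ]; then
  echo "No other CNI config found after retries; not installing in interface mode" >&2
  exit 1
fi
```

The open question is what the installer should do once the retry budget is exhausted: bail out (as above) or fall back to interface mode, which is what triggers this issue.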
Right, the "interface" mode, where an empty linkerd-cni config file was created even if no other CNI plugin had a chance to drop its config, has been abandoned. Please try with a more recent linkerd version to test that out. However, there remains a corner case described in #11073.
This was fixed with linkerd/linkerd2-proxy-init#242 and released as part of edge-23.6.1 |
This stable release fixes a regression introduced in stable-2.13.0 which resulted in proxies shedding load too aggressively while under moderate request load to a single service ([#11055]). In addition, it updates the base image for the `linkerd-cni` initcontainer to resolve a CVE in `libdb` ([#11196]), fixes a race condition in the Destination controller that could cause it to crash ([#11163]), as well as fixing a number of other issues.

* Control Plane
  * Fixed a race condition in the destination controller that could cause it to panic ([#11169]; fixes [#11193])
  * Improved the granularity of logging levels in the control plane ([#11147])
  * Replaced incorrect `server_port_subscribers` gauge in the Destination controller's metrics with `server_port_subscribes` and `server_port_unsubscribes` counters ([#11206]; fixes [#10764])
* Proxy
  * Changed the default HTTP request queue capacities for the inbound and outbound proxies back to 10,000 requests ([#11198]; fixes [#11055])
* CLI
  * Updated extension CLI commands to prefer the `--registry` flag over the `LINKERD_DOCKER_REGISTRY` environment variable, making the precedence more consistent (thanks @harsh020!) (see [#11144])
* CNI
  * Updated `linkerd-cni` base image to resolve [CVE-2019-8457] in `libdb` ([#11196])
  * Changed the CNI plugin installer to always run in 'chained' mode; the plugin will now wait until another CNI plugin is installed before appending its configuration ([#10849])
  * Removed `hostNetwork: true` from linkerd-cni Helm chart templates ([#11158]; fixes [#11141]) (thanks @abhijeetgauravm!)
* Multicluster
  * Fixed the `linkerd multicluster check` command failing in the presence of lots of mirrored services ([#10764])

[#10764]: #10764
[#10849]: #10849
[#11055]: #11055
[#11141]: #11141
[#11144]: #11144
[#11147]: #11147
[#11158]: #11158
[#11163]: #11163
[#11169]: #11169
[#11196]: #11196
[#11198]: #11198
[#11206]: #11206
[CVE-2019-8457]: https://avd.aquasec.com/nvd/2019/cve-2019-8457/
What is the issue?
There is nondeterministic behaviour on a GKE cluster when installing the linkerd CNI plugin. During autoscaling, some daemonset pods run fine and don't block pod initialisation because the linkerd CNI plugin is installed in chained mode. However, when the linkerd-cni pod is created before the GKE CNI plugin is installed, i.e. before 10-gke-ptp.conflist is created, it doesn't find that file and writes its own 01-linkerd-cni.conf instead. Pods on that node are then stuck in the init state:
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "": plugin type="linkerd-cni" name="linkerd-cni" failed (add): cannot convert: no valid IP addresses.
Is there a way the daemonset can wait until it finds the Kubernetes CNI conf file and only then add linkerd-cni in chained mode?
How can it be reproduced?
It is nondeterministic: it happens when the linkerd-cni installation runs before the Kubernetes CNI conf file is created in /etc/cni/net.d.
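For what it's worth, one way to make the race more likely to show up is to scale a node pool up sharply while linkerd-cni is installed; the cluster, pool, and zone names below are placeholders:

```bash
# Sketch only: rapidly add nodes so freshly booted nodes schedule linkerd-cni
# before GKE has written 10-gke-ptp.conflist. All names are placeholders.
gcloud container clusters resize my-cluster \
  --node-pool default-pool \
  --num-nodes 40 \
  --zone us-central1-a
```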
Logs, error output, etc
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "": plugin type="linkerd-cni" name="linkerd-cni" failed (add): cannot convert: no valid IP addresses.
When the k8s CNI conf is not found:
No active CNI configuration files found; installing in "interface" mode in /host/etc/cni/net.d/01-linkerd-cni.conf
When found:
Installing CNI configuration in "chained" mode for /host/etc/cni/net.d/10-gke-ptp.conflist
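One way to check which mode each node ended up in is to grep the installer logs across the DaemonSet pods; the namespace and the `k8s-app=linkerd-cni` label selector below are assumptions based on a default install:

```bash
# Sketch only: report whether each linkerd-cni pod installed in "interface" or
# "chained" mode. Namespace and label selector are assumptions.
for pod in $(kubectl get pods -n linkerd-cni -l k8s-app=linkerd-cni -o name); do
  echo "== ${pod}"
  kubectl logs -n linkerd-cni "${pod}" | grep -E '"(interface|chained)" mode' \
    || echo "no install-mode log line found"
done
```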
Output of `linkerd check -o short`:
linkerd-version
‼ can determine the latest version
unexpected versioncheck response: 403 Forbidden
see https://linkerd.io/2.13/checks/#l5d-version-latest for hints
‼ cli is up-to-date
unsupported version channel: stable-2.13.2
see https://linkerd.io/2.13/checks/#l5d-version-cli for hints
control-plane-version
‼ control plane is up-to-date
unsupported version channel: stable-2.13.2
see https://linkerd.io/2.13/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-dd8f7cc48-hs8d5 (stable-2.13.2)
* linkerd-identity-fd6b4d8b7-tv2qk (stable-2.13.2)
* linkerd-proxy-injector-7d79958b59-4jrlw (stable-2.13.2)
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-version for hints
I have disabled the version-check cron.
Environment
Possible solution
The daemonset should wait until it finds the k8s CNI conf file before appending its configuration.
Additional context
No response
Would you like to work on fixing this bug?
None