
Prevent misconfigured pods from coming up when running in CNI mode #8120

Closed
mateiidavid opened this issue Mar 21, 2022 · 13 comments
Comments

@mateiidavid
Member

mateiidavid commented Mar 21, 2022

What problem are you trying to solve?

As described in #8070, the linkerd-cni plugin has two race conditions that usually come into play when a node is drained or the cluster is restarted:

  1. The configuration can be out of sync when running in chained mode (as documented in Watch CNI plugin conf files for changes #8070).
  2. Pods may come up before the CNI daemonset has a chance to configure iptables rules.

This feature request focuses on the second condition. If pods come up before our cni daemonset has had a chance to run, they will come up without any iptables rules. This breaks routing, and all affected pods have to be manually restarted.

How should the problem be solved?

There have been a few suggestions as to how this problem should be fixed, notably #4049 (comment) and #7855 (comment). This approach is also used by other service meshes: an init container validates that netfilter rules were created successfully, and if they were not, the misconfigured pods are labeled and/or terminated.

I think this solution would serve us well too. I propose the following:

  1. In our init container, add a validation mode where we test redirection (inbound/outbound).
  2. Add a new sidecar container to our cni daemonset to monitor pods that are scheduled on the host the daemonset runs on. If pods are misconfigured, restart them.

I'll talk at length about both solutions below.


1) Validate routing

The approach taken by other service meshes is to run this logic in the init container responsible for setting up iptables. The key difference between running in cni mode and "normal" mode is that we run without any capabilities (i.e. NET_ADMIN and NET_RAW); we can drop these capabilities since we do not need to set up routing ourselves.

When the init container runs in "validation" mode, we can create a client and a server and test that a TCP connection can be established between the two. The trick is for the client to connect to an arbitrary port (say, 80) while the server binds to a different port, such that the iptables rules redirect the connection's packets from port 80 to whatever port the server is actually listening on. This in effect validates that redirection works. If redirection does not work, the container exits with a non-zero return code.
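To make the idea concrete, here is a minimal sketch of such a validation mode in Go. It is illustrative only: port 4143 stands in for the proxy's inbound port, and (as discussed below) our stock rules skip loopback traffic on the inbound side, so this exact test would only pass with the extra test rule described later.

```go
// Sketch of an init-container "validation" mode: a server binds where
// the redirect rules should deliver traffic, and a client dials an
// arbitrary port. If iptables redirection is in place, the client's
// connection lands on our server.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Stand-in for the proxy's inbound port.
	ln, err := net.Listen("tcp", "127.0.0.1:4143")
	if err != nil {
		fmt.Fprintln(os.Stderr, "bind failed:", err)
		os.Exit(1)
	}
	defer ln.Close()

	accepted := make(chan struct{})
	go func() {
		if c, err := ln.Accept(); err == nil {
			c.Close()
			close(accepted)
		}
	}()

	// Dial an arbitrary port (80). With redirection configured, the
	// packets never reach :80; they are rewritten to :4143.
	conn, err := net.DialTimeout("tcp", "127.0.0.1:80", 2*time.Second)
	if err != nil {
		fmt.Fprintln(os.Stderr, "validation failed, no redirection:", err)
		os.Exit(1) // non-zero exit code marks the pod as misconfigured
	}
	conn.Close()

	select {
	case <-accepted:
		fmt.Println("redirection OK")
	case <-time.After(2 * time.Second):
		fmt.Fprintln(os.Stderr, "connected, but not to our validation server")
		os.Exit(1)
	}
}
```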

Another approach would be to do the validation in the proxy injector. Instead of checking for redirection, we can check that the cni plugin DaemonSet has already been scheduled and run on the host the pod is assigned to. For example, suppose a pod is scheduled to run on node A: if our cni plugin is already running on node A, we assume iptables has been set up. If it is not, we label the pod as misconfigured.
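A rough sketch of what that injector-side check could look like with client-go. The "linkerd-cni" namespace and the "k8s-app=linkerd-cni" label are assumptions about how the plugin DaemonSet is deployed, not confirmed details of the injector.

```go
// Hypothetical injector-side check: is a CNI plugin pod running on
// the node this workload was assigned to?
package injectorcheck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cniReadyOnNode lists linkerd-cni pods pinned to the given node; if
// none are running, the workload should be labeled as misconfigured.
func cniReadyOnNode(ctx context.Context, cs kubernetes.Interface, nodeName string) (bool, error) {
	pods, err := cs.CoreV1().Pods("linkerd-cni").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=linkerd-cni", // assumed DaemonSet label
		FieldSelector: fmt.Sprintf("spec.nodeName=%s,status.phase=Running", nodeName),
	})
	if err != nil {
		return false, err
	}
	return len(pods.Items) > 0, nil
}
```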

There's a trade-off between the two approaches.

The first approach is more rigorous and correct, but the logic is more complex. It also requires an additional iptables rule, present only when CNI mode is enabled, to test the redirection. Under normal conditions, localhost traffic (or traffic sent over the lo interface, as would be the case for any request from a pod to itself) is not redirected back to the inbound side. We'd either have to test this logic on the outbound side (i.e. verify the outbound iptables redirect chain is hit) or add a special rule specifically for testing purposes (DNAT the packets back to what the proxy's inbound server is supposed to be, use a packet mark, or something similar).

Note: other meshes do not require this iptables change for validation. We skip traffic on the inbound side when a pod sends traffic to itself (over lo) -- this was done to avoid infinite loops when proxying traffic to ORIG_DST instead of localhost -- and this is handled differently in other projects. They bind the inbound proxy to a different loopback address (not localhost), allowing them to treat this type of traffic as a special case. We can either add a dummy validation rule, or make some changes to our current iptables logic. I can provide further details here.

The second approach is less complicated and would avoid making use of any init containers. The downside is that it will not catch CNI misconfiguration itself; there's still a possibility that the iptables rules are not set up correctly, or that the host does not have our cni plugin binary, and so forth. By not explicitly validating that redirection works, we operate under the assumption that the DaemonSet rarely fails to set up the CNI config.

2) Re-create pods

This race condition (or misconfiguration) can be fixed by restarting all pods that fail our validation check (wherever that check happens: init container or injector). The plan here would be straightforward (a rough sketch follows the list):

  1. In the CNI daemonset, add a new container that establishes a watch on all injected pods running on the same host as the daemonset.
  2. Restart pods that fail validation:
    a. If we do the validation in the init container, check the init container's exit code. If we detect a non-zero exit code, label the pod as misconfigured and then restart it.
    b. If we do the validation in the injector, restart the pod if it has been labeled as misconfigured.
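As a sketch only (polling instead of a watch for brevity), the sidecar's loop could look roughly like this; the "linkerd-network-validator" container name and the reliance on pod deletion (letting the owner recreate the pod) are illustrative assumptions, not a settled design.

```go
// Rough sketch of the restart sidecar: find pods on this node whose
// validation init container exited non-zero, and delete them so their
// owners (ReplicaSets, etc.) recreate them with a fresh network
// namespace that goes through the CNI chain again.
package restarter

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func sweepNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if failedValidation(pod) {
			if err := cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}

// failedValidation checks for a non-zero exit from a hypothetical
// "linkerd-network-validator" init container.
func failedValidation(pod *corev1.Pod) bool {
	for _, s := range pod.Status.InitContainerStatuses {
		if s.Name != "linkerd-network-validator" {
			continue
		}
		if t := s.State.Terminated; t != nil && t.ExitCode != 0 {
			return true
		}
		if t := s.LastTerminationState.Terminated; t != nil && t.ExitCode != 0 {
			return true
		}
	}
	return false
}
```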

Any alternatives you've considered?

No, open to suggestions though. There are multiple ways to solve the validation problem.

How would users interact with this feature?

No response

Would you like to work on this feature?

No response

@adleong
Member

adleong commented Mar 31, 2022

Would it be possible for the validator to bind a server on port 4140 and then create a client which attempts to connect to some non-localhost address? The existing iptables rules should cause this to be redirected to localhost:4140 (where the proxy's outbound will eventually be listening) and we can validate that we have connected to our validation server. Once validation is complete, we can tear down the server so as not to interfere with the proxy binding to :4140.

If this works, I think it allows us to validate that the iptables redirects have been set up, and it doesn't require adding any additional rules.

@mateiidavid
Member Author

> Would it be possible for the validator to bind a server on port 4140 and then create a client which attempts to connect to some non-localhost address? The existing iptables rules should cause this to be redirected to localhost:4140 (where the proxy's outbound will eventually be listening) and we can validate that we have connected to our validation server. Once validation is complete, we can tear down the server so as not to interfere with the proxy binding to :4140.
>
> If this works, I think it allows us to validate that the iptables redirects have been set up, and it doesn't require adding any additional rules.

Yup, definitely possible! I can think of three different ways to solve this problem. I think the approach you described would work really well; it also allows us to get by without making any changes to our existing iptables configuration. The downside of this approach is that a connection may be established even if routing rules are not in place, if the target address happens to belong to an existing pod.

For example, suppose we have two pods: one to validate routing rules for, and another pod with the IP address 10.0.0.1. If we bind a server on localhost, we can spin up a client inside the validator process that attempts to open a socket to a random IP address; if the configuration is present, we'd expect the client to establish a TCP connection to our localhost-bound server. If the RNG gods are not in our favour, however, the client may attempt to connect to an IP address that belongs to an actual pod in the cluster. Supposing routing configuration has not been set up properly, we will get a false positive from our validation process: the client will establish a connection to that pod rather than our server, and since the connection succeeded, we will consider validation a success.

I think there are ways to mitigate the risk of a false positive. We could, for example, use a reserved port (to lower the odds of something listening on that port), or have the server write a response back to the client so we can guarantee we are talking to our validation server and not something else.

The second option would be to add additional routing rules, as you pointed out. I'm not very keen on the idea of adding more rules; I'd rather take our chances with a false positive than impact routing and place constraints on the network (which is what a dummy/validation rule would do, in effect).

The third option is a bit more involved: it would flip some assumptions around and have the proxy bind to a different loopback address on the inbound side (I've talked about this briefly off-GitHub).

I'd be interested to hear what you think of the false positive scenario, @adleong: is it something to be cautious of, or should we go ahead with the change and mitigate the risk as much as possible?

@adleong
Member

adleong commented Apr 1, 2022

I think having the validator server respond with a specific message that we check for in the client is the cleanest and simplest way to avoid false positives. To make things even more predictable, we should probably also use a specific IP address rather than a random one: ideally one which is known not to be routable (such as an IP from a reserved subnet) or which routes to a known location, e.g. a public internet IP which we control.
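Putting the two comments above together, here is a minimal sketch of the combined idea, assuming the proxy's outbound port is 4140 and using a documentation-reserved (RFC 5737 TEST-NET-3) address so the dial can only succeed via redirection. The token, target port, and address choices are illustrative, not the shipped implementation.

```go
// Sketch: bind where the proxy's outbound listener will live, dial a
// never-routable address, and verify a token so a stray listener can't
// produce a false positive.
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

const token = "linkerd-validator" // arbitrary handshake message

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:4140")
	if err != nil {
		fmt.Fprintln(os.Stderr, "bind:", err)
		os.Exit(1)
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			c.Write([]byte(token)) // prove the client reached *this* server
			c.Close()
		}
	}()

	// 203.0.113.1 (TEST-NET-3) is reserved for documentation and never
	// routable, so only an iptables OUTPUT redirect can make this succeed.
	conn, err := net.DialTimeout("tcp", "203.0.113.1:555", 2*time.Second)
	if err != nil {
		fmt.Fprintln(os.Stderr, "validation failed, no redirection:", err)
		os.Exit(1)
	}
	defer conn.Close()

	buf := make([]byte, len(token))
	conn.SetReadDeadline(time.Now().Add(2 * time.Second))
	if _, err := io.ReadFull(conn, buf); err != nil || string(buf) != token {
		fmt.Fprintln(os.Stderr, "validation failed: unexpected peer")
		os.Exit(1)
	}
	fmt.Println("redirection OK")
	ln.Close() // tear down so the proxy itself can later bind :4140
}
```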

@olix0r
Member

olix0r commented Aug 4, 2022

Work for this is in flight, but I'm uneasy about dropping it in so late in the process, just before the stable release. We should probably figure out how to make this an optional feature that we can pull into a patch release and then enable by default in 2.13.

@jeremychase jeremychase added this to the stable-2.13.0 milestone Aug 31, 2022
@ron1

ron1 commented Sep 5, 2022

It is disappointing this fix did not make it into stable-2.12.0. I hope it can make it into a 2.12.x release behind a feature flag.

@nathanmcgarvey

nathanmcgarvey commented Sep 8, 2022

I agree with @ron1 regarding making it into a 2.12.x release. This is a fairly major issue for anyone using linkerd2 in production with the CNI plugin: non-deterministic intra-mesh (pod->pod) communication failures that require manual intervention to resolve.

Node reboots, cluster autoscaling, or any other node-level maintenance will eventually trigger this hard-to-diagnose but impactful bug. It manifests in "unusual" ways, like "corrupted message" errors (e.g. pod->pod communication where one side is talking TLS and the other is not). It's non-deterministic, with errors that are very difficult to understand. Even kubelet probes will come back as successful, since they skip the TLS negotiation, so a pod will receive traffic from non-meshed clients but not from meshed ones.

As such, I'm guessing that many folks who use linkerd with the CNI plugin don't even realize they are running into it and are chasing "other" red-herring errors.

EDIT: Issues that are likely related and, if so, would be mitigated if this issue is fixed:

#7855 (comment)
#8343
#8496
#8943 (comment) (and other related comments)
#8934

@adleong adleong modified the milestones: stable-2.13.0, stable-2.12.2 Sep 15, 2022
@sigurdfalk

sigurdfalk commented Oct 1, 2022

I really agree with @ron1 and @nathanmcgarvey here. We have been struggling with automatic node reboots and autoscaling events for quite some time in our AKS clusters, likely due to the issue described here. Hope to see this feature in the lovely Linkerd soon 🙏🏻

@adleong adleong modified the milestones: stable-2.12.2, stable-2.13.0 Oct 6, 2022
mateiidavid pushed a commit that referenced this issue Oct 26, 2022
When users use CNI, we want to ensure that network rewriting inside the pod is set up before allowing linkerd to start. When rewriting isn't happening, we want to exit with a clear error message and enough information in the container log for the administrator to either file a bug report with us or fix their configuration.

This change adds a validator initContainer to all injected workloads when linkerd is installed with "cniEnabled=true". The validator replaces the noop init container and will prevent pods from starting up if iptables is not configured.

Part of #8120

Signed-off-by: Steve Jenson <stevej@buoyant.io>
@mateiidavid
Member Author

Quick update for people following this issue. We've shipped a few features to address misconfiguration at the CNI layer resulting from non-deterministic ordering (both in CNI configuration & installation, and in pod start-up).

stable-2.12.x versions include a CNI plugin change that watches the config files on the host and reacts accordingly. This should ensure that CNI configurations are properly reconciled; linkerd-cni will append itself to any CNI configuration file, regardless of when it is added to the host (e.g. it won't matter whether the linkerd-cni DS runs first or cool-cni runs first, the end result will be the same).

edge-22.10.3 includes a network validator init container that is automatically added to any injected workload if (and only if) the CNI plugin is used (you must install linkerd with cniEnabled=true for this to take effect). The validator checks whether the iptables rules are correctly initialized. If a pod comes up before the DaemonSet has had a chance to run, the validator will fail and block the pod from running. Pods at this point will still need to be manually restarted (the network namespaces have to be re-created so they can be passed through the CNI plugin chain). This ensures pods do not start up misconfigured.

sgrzemski pushed a commit to sgrzemski/linkerd2 that referenced this issue Jan 9, 2023
When users use CNI, we want to ensure that network rewriting inside the pod is set up before allowing linkerd to start. When rewriting isn't happening, we want to exit with a clear error message and enough information in the container log for the administrator to either file a bug report with us or fix their configuration.

This change adds a validator initContainer to all injected workloads when linkerd is installed with "cniEnabled=true". The validator replaces the noop init container and will prevent pods from starting up if iptables is not configured.

Part of linkerd#8120

Signed-off-by: Steve Jenson <stevej@buoyant.io>
Signed-off-by: Szymon Grzemski <sz.grzemski@gmail.com>
@olix0r olix0r removed the priority/P1 Planned for Release label Jan 18, 2023
@mateiidavid
Member Author

A final update. We're on track to release stable-2.13.0. In my previous update, I mentioned that we've introduced two new mechanisms to deal with race conditions for CNI plugins:

  • a file watcher (to ensure linkerd's cni plugin can be installed successfully when other plugins exist in the cluster)
  • a network validator (to ensure iptables has been set up correctly in a network namespace).

For now, these two mechanisms should be enough to significantly reduce any race conditions that arise when nodes are rebooted or restarted. While the network validator will not restart pods, it will block any misconfigured pod from running; moving the failure point to a different layer will allow operators or engineers to diagnose issues better and to intervene manually when required.

While restarting misconfigured pods through a controller was part of the original scope of this issue, it won't be part of the stable release. A new controller needs a bit more thought and design before being introduced. We'll likely open up a separate issue to discuss the approach and design.

I will be closing this issue for now. As of stable-2.13.0 the handling of race conditions will be improved. After we trial out the network validator and get more feedback, we'll figure out where the automation fits in. Hopefully by then the Kubernetes community will have also figured out a better way to bootstrap CNIs before node provisioning.

Any questions, concerns, or feedback, feel free to open up a new issue or ping me in an existing one. Thanks.

@nathanmcgarvey

nathanmcgarvey commented Jan 24, 2023

EDIT: Much of this was due to my misunderstanding of the implementation details. Please read the follow-ups for more clarity:

#8120 (comment)
#8120 (comment)


Thanks for the update and the hard work @mateiidavid. I do have a couple of comments as I'm really not sure this issue should be closed as-is. Comments/questions are in-line:

> A final update. We're on track to release stable-2.13.0. In my previous update, I mentioned that we've introduced two new mechanisms to deal with race conditions for CNI plugins:
>
>   • a file watcher (to ensure linkerd's cni plugin can be installed successfully when other plugins exist in the cluster)

This issue (#8120) is specifically about the second condition: "Pods may come up before the CNI daemonset has a chance to configure iptables rules." #8070 takes care of the first condition (the file watcher) and helps an enormous amount, at least in my testing. I cannot stress enough how much small changes like that have huge production-level impact. However, that is still irrelevant to the argument for closing this issue.

>   • a network validator (to ensure iptables has been set up correctly in a network namespace).
>
> For now, these two mechanisms should be enough to significantly reduce any race conditions that arise when nodes are rebooted or restarted. While the network validator will not restart pods, it will block any misconfigured pod from running; moving the failure point to a different layer will allow operators or engineers to diagnose issues better and to intervene manually when required.

The validating admission webhook doesn't reduce any race condition; it merely detects a symptom of it. Additionally, now that there isn't a misconfigured pod sitting around to delete, it may actually make diagnosis harder for some, depending on your logging/alerting. (Please correct me, as I may misunderstand the effect the validating webhook has when it fails admission. Where does it get logged? Is that in the Linkerd2 controller/identity/destination pod(s) or just the Kubernetes control plane? Not everyone has access to control-plane logs, especially on hosted k8s like GKE, EKS, etc. How does Kubernetes deal with this type of failure if the pod is part of a ReplicaSet/Deployment? What "manual intervention" is now required to rectify the issue? There isn't a pod to delete... it was denied admission.)

My apologies, as my validating-webhook knowledge and practical application experience is limited. Unless I'm thinking about this incorrectly, it feels like this went from a hard-to-diagnose but visible error to an effectively invisible one in many cases.

> While restarting misconfigured pods through a controller was part of the original scope of this issue, it won't be part of the stable release. A new controller needs a bit more thought and design before being introduced. We'll likely open up a separate issue to discuss the approach and design.

Understood, and thank you for the clear update. I would recommend pinging the watchers on this issue when that gets opened (or at least me, as I'm still interested). Perhaps the scope of this issue should be changed before closing it? You closed it as "completed" with the title and description still reading "Restart misconfigured pods when running in CNI mode", but your closing remarks say that is exactly what isn't implemented yet. Change it to "Detect" or "Prevent... from running" or something, for the casual reader.

> I will be closing this issue for now. As of stable-2.13.0 the handling of race conditions will be improved. After we trial out the network validator and get more feedback, we'll figure out where the automation fits in. Hopefully by then the Kubernetes community will have also figured out a better way to bootstrap CNIs before node provisioning.

I, for one, certainly hope so, given the widespread use of CNIs.

> Any questions, concerns, or feedback, feel free to open up a new issue or ping me in an existing one. Thanks.

@mateiidavid
Member Author

@nathanmcgarvey thanks for the comment! I think you've made some great points. I'm not sure all of your concerns apply, though, since we don't run an admission webhook to validate that the CNI plugin has configured pods correctly. I can see why you would think that, given its name.

The network validator we've introduced in a recent edge version is an initContainer. Oversimplifying a bit: it creates a server that listens on the proxy's port and a client that attempts to connect to a random IP address. If the client is redirected to the server we've created, that proves the redirection rules have been configured properly.

When the validator fails, it will force the pod into a crash loop. You'll be able to inspect the logs to see exactly why the tool failed. I'd argue this makes diagnosing easier than before, when you had to figure out everything based on proxy logs; it clarifies assumptions and helps people pinpoint the issue without requiring them to understand how the proxy works (yay for simplicity). This does fix the race, since it ensures we run with minimal undefined behaviour: all pods are subject to the same conditions. Initially, the race was annoying to deal with because it made pods come up without any rules, with barely any indication that this was happening. By blocking the rest of the containers from starting up, we ensure behaviour is uniform across all workloads (a pod either fails fast or works correctly).

To summarise, we've essentially covered the first point from my initial post ((1) Validate routing). You'll be able to diagnose misconfigured pods through the usual kubectl commands (describe, logs). If the validator detects misconfiguration, it will error the pod out. Operators will have to act on that by deleting the pod or rolling out its parent resource, which will force the workload to go through the CNI plugin chain again. Hopefully the validator's output will be easy enough to understand; we'll probably refine it further moving forward.

@mateiidavid mateiidavid changed the title Restart misconfigured pods when running in CNI mode Prevent misconfigured pods from coming up when running in CNI mode Jan 24, 2023
@nathanmcgarvey

@mateiidavid Thanks for the clarification! I had completely misread/misinterpreted the bulk of the applied changes. You are absolutely correct that my mind automatically made the jump from "network validator init container" and "block any misconfigured pod from running" ==> "admission webhook". Your descriptions from start to end always specified it as an init container and I apparently need more sleep and a good reading comprehension course. 📖

For those like myself who want to tune their alerting systems ahead of the upgrade, it looks like this commit contains the latest error messages emitted by the init container (once again, please do correct me if I am wrong, @mateiidavid):

linkerd/linkerd2-proxy-init@62ac9b1

@mateiidavid
Member Author

No probs! We're here to answer questions, after all.

That's the latest commit that touches the validator's logging, yes.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 25, 2023