Prevent misconfigured pods from coming up when running in CNI mode #8120
Comments
Would it be possible for the validator to bind a server on port 4140 and then create a client which attempts to connect to some non-localhost address? The existing iptables rules should cause this to be redirected to localhost:4140 (where the proxy's outbound will eventually be listening) and we can validate that we have connected to our validation server. Once validation is complete, we can tear down the server so as not to interfere with the proxy binding to :4140. If this works, I think it allows us to validate that the iptables redirects have been set up and doesn't require adding any additional rules.
Yup, definitely possible! I can think of three different ways to solve this problem.

I think the approach you described would work really well; it also allows us to get by without making any changes to our existing iptables configuration. The downside of this approach is that a connection may be established even if routing rules are not in place, if the target address belongs to an existing pod. For example, suppose we have two pods: the one whose routing rules we want to validate, and another pod that happens to be listening on the target address; the client could connect to that other pod and we'd report a false positive. I think there are ways to mitigate that risk. We could, for example, use a reserved port (to lower the odds of having listeners on that port), or perhaps have the server write a response back to our client so we can guarantee we are talking to our validation server and not someone else.

The second option would be to add additional routing rules, as you have pointed out. I'm not very keen on the idea of having additional rules added. I'd rather take our chances with a false positive than impact routing and place constraints on the network (which is what a dummy/validation rule would do in effect).

The third option is a bit more involved: it would flip some assumptions around and have the proxy bind to a different (non-localhost) address on the inbound side (I've talked about it briefly off-GitHub).

I'd be interested to hear what you think of the false-positive scenario @adleong, and whether you think that's something to be cautious of, or whether we should go ahead with the change and try to mitigate the risk as much as possible.
I think having the validator server respond with a specific message that we check for in the client is the cleanest and simplest solution to avoid false positives. To make things even more predictable, we should probably also use a specific IP address rather than a random one: ideally one which is known not to be routable (such as an IP from a reserved subnet), or one which routes to a known location, e.g. a public internet IP which we control.
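To make the idea concrete, here is a rough Go sketch of what such a check could look like. This is not the shipped validator: the listen port, the probe address (a reserved TEST-NET address), the token, and the timeouts are illustrative assumptions.

```go
// Hypothetical validation sketch: a local server binds on the proxy's outbound
// port and writes a known token; a client then dials a non-routable address.
// If the CNI plugin has installed the iptables rules, the dial is redirected
// to the local server and the client reads the token back; otherwise we exit
// non-zero so the pod fails fast.
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

const (
	listenAddr = "127.0.0.1:4140"    // where the proxy's outbound listener will eventually bind
	probeAddr  = "203.0.113.10:80"   // TEST-NET-3: reserved, known not to be routable
	token      = "linkerd-validator" // marker proving we reached *our* server, not someone else's
)

func main() {
	// Bind the validation server on the proxy's outbound port.
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to bind validation server: %v\n", err)
		os.Exit(1)
	}
	defer ln.Close()

	// Accept a single connection and write the token back to the client.
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		conn.Write([]byte(token))
	}()

	// Dial the reserved address. If iptables redirection is in place, this
	// connection lands on the validation server above.
	conn, err := net.DialTimeout("tcp", probeAddr, 5*time.Second)
	if err != nil {
		fmt.Fprintf(os.Stderr, "iptables redirection appears to be missing: %v\n", err)
		os.Exit(1)
	}
	defer conn.Close()

	// Verify we are talking to our own server by checking the token.
	conn.SetReadDeadline(time.Now().Add(5 * time.Second))
	buf := make([]byte, len(token))
	if _, err := io.ReadFull(conn, buf); err != nil || string(buf) != token {
		fmt.Fprintf(os.Stderr, "connected, but not to the validation server (read %q, err: %v)\n", buf, err)
		os.Exit(1)
	}
	fmt.Println("iptables redirection validated")
}
```

If the rules are absent, the dial either times out (the probe address is unroutable) or reaches something that doesn't answer with our token, and the container exits non-zero so the rest of the pod never starts.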
Work for this is in-flight, but I'm uneasy about dropping this in late in the process, before the stable release. We should probably figure out how to make this an optional feature that we can pull into a patch release and then enable by default in 2.13.
It is disappointing that this fix did not make it into stable-2.12.0. I hope it can make it into a 2.12.x release behind a feature flag.
I agree with @ron1 regarding making it into a 2.12.x release. This is a fairly major issue for anyone using linkerd2 in production with the CNI plugin: non-deterministic intra-mesh (pod->pod) communication failure that requires manual intervention to resolve. Node reboots, cluster autoscaling, or any other node-level maintenance will eventually trigger this hard-to-diagnose but impactful bug. It manifests in "unusual" ways, like "corrupted message" errors (e.g. pod->pod communication where one side is talking TLS and the other is not). It's non-deterministic, with errors that are very difficult to understand. Even kubelet probes will come back as successful, since they skip the TLS negotiation portion, so a pod will receive traffic from non-meshed clients but not from meshed ones. As such, I'm guessing that many folks that use linkerd with the CNI plugin don't even realize they are running into it and are chasing "other" red-herring errors.

EDIT: Issues that are likely related and, if so, would be mitigated if this issue is fixed: #7855 (comment)
I really agree with @ron1 @nathanmcgarvey here. We have been struggling with automatic node reboots and autoscaling events for quite some time in our AKS clusters, likely due to the issue described here. Hope to see this feature in the lovely Linkerd soon 🙏🏻
When users use CNI, we want to ensure that network rewriting inside the pod is set up before allowing linkerd to start. When rewriting isn't happening, we want to exit with a clear error message and enough information in the container log for the administrator to either file a bug report with us or fix their configuration. This change adds a validator initContainer to all injected workloads when linkerd is installed with "cniEnabled=true". The validator replaces the noop init container, and will prevent pods from starting up if iptables is not configured. Part of #8120 Signed-off-by: Steve Jenson <stevej@buoyant.io>
Quick update for people following this issue. We've shipped a few features to address misconfiguration at the CNI layer resulting from non-deterministic ordering (in both CNI configuration & installation, and in pods).
A final update. We're on track to release these changes.

For now, these two mechanisms should be enough to significantly reduce the race conditions that arise when nodes are rebooted or restarted. While the network validator will not restart pods, it will block any misconfigured pod from running; moving the failure point to a different layer will allow operators and engineers to diagnose issues better and to take manual intervention when required.

While restarting misconfigured pods through a controller was part of the original scope of this issue, it won't be part of the stable release. A new controller needs a bit more thought and design before being introduced. We'll likely open up a separate issue to discuss the approach and design.

I will be closing this issue for now. If you have any questions, concerns, or feedback, feel free to open up a new issue or ping me in an existing one. Thanks.
EDIT: Much of this was due to my misunderstanding of implementation details. Please read the follow-ups for more clarity: #8120 (comment)

Thanks for the update and the hard work @mateiidavid. I do have a couple of comments, as I'm really not sure this issue should be closed as-is. Comments/questions are in-line:
This issue (#8120) is specifically about the "Pods may come up before the CNI daemonset has a chance to configure iptables rules" condition. #8070 takes care of the first condition (the file watcher) and helps an enormous amount, at least in my testing. I cannot stress enough how much small changes like that have huge production-level impact. However, that's still irrelevant to the argument about closing this issue.
The validating admission webhook doesn't reduce any race condition, but merely detects a symptom of it. Additionally, now that there isn't a misconfigured pod sitting around to delete, depending on your logging/alerting, it may actually make diagnosis harder for some. (Please correct me, as I may misunderstand the effect the validating webhook has when it fails admission. Where does it get logged? Is that in the Linkerd2 controller/identity/destination pod(s), or just the Kubernetes control plane? Not everyone has access to control-plane logs, especially for hosted k8s like GKE, EKS, etc. How does Kubernetes deal with this type of failure if the pod is part of a ReplicaSet/Deployment? What "manual intervention" is now required to rectify the issue? There isn't a pod to delete; it was denied admission. My apologies, as my validating-webhook knowledge and practical application experience is limited.) Unless I'm thinking about this incorrectly, it feels like this went from a hard-to-diagnose but visible error to one that is effectively invisible in many cases.
Understood. And thank you for the clear update. I would recommend pinging the watchers on this issue when that gets opened. (Or at least me, as I'm still interested.) Perhaps the scope of this issue should be changed before closing it? (You closed it as "completed" with the title and description still being "Restart misconfigured pods when running in CNI mode", but your closing remarks say that is exactly what isn't implemented yet. Change it to "Detect" or "Prevent ... from running" or something, for the casual reader.)
I, for one, certainly hope so, given the widespread use of CNIs.
@nathanmcgarvey thanks for the comment! I think you've made some great points. I'm not sure all of your concerns apply, though, since we don't run an admission webhook to validate that the CNI plugin has configured pods correctly. I can see why you'd think that, given its name. The network validator we've introduced in a recent edge version is an init container, not a webhook.

When the validator fails, it will force the pod into a crash loop. You'll be able to inspect the logs to see exactly why the tool failed. I'd argue it makes diagnosing easier than before, when you had to figure out everything based on proxy logs; it clarifies assumptions and helps people pinpoint the issue without requiring them to understand how the proxy works (yay for simplicity).

This does fix the race we have, since it ensures we run with minimal undefined behaviour. All pods will be subject to the same conditions. Initially, the race was annoying to deal with because it made pods come up without any rules, and there was barely any indication that this was happening. By blocking the rest of the containers from starting up, we ensure behaviour is uniform across all workloads (a pod either fails fast or works correctly).

To summarise, we've essentially covered the first point from my initial post ((1) Validate routing). You'll be able to diagnose misconfigured pods through the usual kubectl commands (e.g. kubectl describe and kubectl logs).
@mateiidavid Thanks for the clarification! I had completely misread/misinterpreted the bulk of the applied changes. You are absolutely correct that my mind automatically made the jump from "network validator init container" and "block any misconfigured pod from running" to "admission webhook". Your descriptions from start to end always specified it as an init container, and I apparently need more sleep and a good reading comprehension course. 📖

For those like myself that want to tune their alerting systems ahead of time, pre-upgrade, it looks like this commit contains the latest error messages emitted by the init container. (Once again, please do correct me if I am wrong, @mateiidavid.)
No probs! We're here to answer questions, after all. That's the latest commit that touches the validator's logging, yes.
What problem are you trying to solve?
As described in #8070, the `linkerd-cni` plugin has two race conditions that usually come into play when a node is drained or the cluster is restarted:

1. A CNI configuration race when running in `chainedMode` (as documented in Watch CNI plugin conf files for changes #8070).
2. Pods may come up before the CNI daemonset has had a chance to configure iptables rules.

This feature request focuses on the second condition. If we end up in a state where pods come up before our cni daemonset has had a chance to run, pods will come up without any iptables rules. This breaks routing and all pods have to be manually restarted.
How should the problem be solved?
There have been a few suggestions as to how this problem should be fixed, notably #4049 (comment) and #7855 (comment). This approach is also used by other service meshes, whereby an init container validates that netfilter rules were created successfully; if they were not, misconfigured pods are labeled and/or terminated.
I think this solution would serve us well too. I propose the following:

1. Validate routing rules (in an init container or in the proxy injector).
2. Re-create pods that fail validation.

I'll talk at length about both solutions below.
1) Validate routing
The approach taken by other service meshes is to run this logic in the init container responsible for setting up iptables. The key difference when running in cni mode, as opposed to "normal" mode, is that we run without any capabilities (i.e. `NET_ADMIN` and `NET_RAW`). We can drop these capabilities since we do not need to set up routing.

When the init container runs in "validation" mode, we can create a client and a server and test that a TCP connection can be established between the two. The trick is for the client to send to an arbitrary port (say, 80) and have the server bind on a different port, such that iptables rules redirect the TCP connection's packets from port 80 to whatever port the server is running on. This will in effect validate that redirection works. If redirection does not work, we exit with a non-zero return code from the container.
Another approach would be to do validation in the proxy injector. Instead of checking for redirection, we can check that the host the pod has been scheduled to run on is already running the cni plugin DaemonSet pod. For example, suppose a pod is scheduled to run on node A: if our cni plugin is already running on node A, then we assume iptables has been set up. If it's not, we label the pod as being misconfigured.
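For illustration, a check like this could be implemented with client-go roughly as follows. The namespace and label selector are assumptions for the sketch, not the actual chart values, and it glosses over readiness (a Running pod may not have finished writing the CNI config yet).

```go
// Hypothetical sketch of the injector-side check, using client-go.
package validation

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cniReadyOnNode reports whether a running linkerd-cni pod exists on the node
// the workload has been scheduled to.
func cniReadyOnNode(ctx context.Context, client kubernetes.Interface, node string) (bool, error) {
	pods, err := client.CoreV1().Pods("linkerd-cni").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=linkerd-cni", // assumed label on the DaemonSet's pods
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", node),
	})
	if err != nil {
		return false, err
	}
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning {
			return true, nil
		}
	}
	return false, nil
}
```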
There's a trade-off between the two approaches.
The first approach is a bit more rigorous and correct, but the logic is more complex. It also requires an additional iptables rule (only when CNI mode has been enabled) to test the redirection. Under normal conditions, localhost traffic (or traffic sent over the `lo` interface, as would be the case for any request from a pod to itself) is not redirected back to the inbound side. We'd either have to test this logic on the outbound side (i.e. verify the outbound iptables redirect chain is hit) or add a special rule specifically for testing purposes (DNAT it back to what the proxy's inbound server is supposed to be, use a packet mark, or something similar).

Note: other meshes do not require this iptables change for validation. We skip traffic on the inbound side when a pod sends traffic to itself (over `lo`) -- this has been done to avoid infinite loops when proxying traffic to ORIG_DST instead of localhost -- and this is handled differently in other projects. They bind the inbound proxy to a different loopback address (not localhost), allowing them to treat this type of traffic as a special case (see the short sketch below). We can either add a dummy validation rule, or make some changes to our current iptables logic. I can provide further details here.

The second approach is less complicated and would avoid making use of any init containers. The downside is that it will not let us catch every CNI misconfiguration; there's still a possibility that iptables rules are not set up correctly, or that the host does not have our cni plugin binary, and so on. By not explicitly validating that it works, we operate under the assumption that the DaemonSet rarely fails to set up the CNI config.
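As a tiny illustration of the note above about binding the inbound proxy away from localhost: on Linux the whole 127.0.0.0/8 range is loopback, so a listener can bind to an address other than 127.0.0.1. The address and port below are arbitrary choices for the sketch, not what linkerd does today.

```go
// Minimal sketch, assuming an arbitrary non-default loopback address. Traffic
// redirected to this address by iptables could be distinguished from ordinary
// localhost traffic, which is how some other meshes sidestep the pod-to-self
// loop problem described above.
package main

import (
	"fmt"
	"net"
)

func main() {
	// 127.0.0.6 and port 4143 are illustrative; anything in 127.0.0.0/8 works on Linux.
	ln, err := net.Listen("tcp", "127.0.0.6:4143")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Println("inbound listener bound to", ln.Addr())
}
```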
2) Re-create pods
This race condition (or misconfiguration) can be fixed by restarting all pods that fail our validation check (wherever that may happen: init container or injector). The plan here would be straightforward (a rough sketch of the remediation step follows the two options below):
2a. If we do the validation in the init container, check the init container's exit code. If we detect a non-zero exit code, we label the pod as misconfigured and then restart it.
2b. If we do the validation in the injector, we restart the pod if it's been labeled as misconfigured.
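A rough sketch of the remediation step might look like the following, assuming the validator init container is named "linkerd-network-validator" and that misconfigured pods are owned by a controller (ReplicaSet, Deployment, etc.) that recreates them once deleted. The names and overall shape are assumptions, not a shipped implementation.

```go
// Hypothetical remediation sketch: delete pods whose validator init container
// exited non-zero so their owning controller reschedules them.
package remediation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const validatorName = "linkerd-network-validator" // assumed init container name

// restartIfMisconfigured checks the validator's exit code and deletes the pod
// if validation failed, letting its ReplicaSet/Deployment create a fresh one.
func restartIfMisconfigured(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	for _, st := range pod.Status.InitContainerStatuses {
		if st.Name != validatorName {
			continue
		}
		term := st.State.Terminated
		if term == nil {
			term = st.LastTerminationState.Terminated
		}
		if term != nil && term.ExitCode != 0 {
			// A "misconfigured" label could be applied first so the failure
			// stays visible to operators after the pod is gone.
			return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
		}
	}
	return nil
}
```

Whether this logic lives in a standalone controller or in an existing component is exactly the design question that would need more thought.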
Any alternatives you've considered?
No, open to suggestions though. There are multiple ways to solve the validation problem.
How would users interact with this feature?
No response
Would you like to work on this feature?
No response