Kubelet takes a long time to update Pod.Status.PodIP #39113
Comments
@kubernetes/sig-node-misc This is interesting to consider. We get used to the system responding ~INSTANTLY, and when it doesn't we aren't happy.
@thockin And this actually happens for all network plugins. In this case, it should be fixed.
NB, we have already pushed network plugins down into the runtime, so it's not easy to write pod status immediately, because the kubelet doesn't know about CNI.
We could change NetworkPlugin.SetUpPod() to return (*PodNetworkStatus, error), and for plugins other than CNI, just have SetUpPod() end with "return plugin.GetPodNetworkStatus(...)" |
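For illustration, a minimal Go sketch of that idea, using simplified stand-in types (the real NetworkPlugin interface lives in the kubelet's network plugin package; the exact signatures here are assumptions):

```go
package network

import "net"

// PodNetworkStatus is a simplified stand-in for the kubelet's pod network
// status type; it carries the IP assigned during pod setup.
type PodNetworkStatus struct {
	IP net.IP
}

// NetworkPlugin sketches the proposed change: SetUpPod returns the resulting
// network status so the kubelet can report the pod IP immediately instead of
// waiting for a later status sync to call GetPodNetworkStatus.
type NetworkPlugin interface {
	SetUpPod(namespace, name, sandboxID string) (*PodNetworkStatus, error)
	GetPodNetworkStatus(namespace, name, sandboxID string) (*PodNetworkStatus, error)
}

// A plugin that doesn't learn the IP during setup can end SetUpPod by
// delegating to GetPodNetworkStatus, exactly as suggested above.
type legacyPlugin struct{}

func (p *legacyPlugin) SetUpPod(namespace, name, sandboxID string) (*PodNetworkStatus, error) {
	// ... perform the actual network setup for the sandbox here ...
	return p.GetPodNetworkStatus(namespace, name, sandboxID)
}

func (p *legacyPlugin) GetPodNetworkStatus(namespace, name, sandboxID string) (*PodNetworkStatus, error) {
	// ... inspect the pod's network namespace to find its IP ...
	return &PodNetworkStatus{IP: net.ParseIP("10.0.0.5")}, nil
}
```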
@danwinship As feiskyer implies, the kubelet does not know about the network plugin and relies on an independent loop to update pod status, while SetUpPod() is executed in another main loop. We may have to change CRI and build a channel between those loops to update pod status immediately ... Another approach is to improve this reporting path "at best". @caseydavenport Is it possible from your side to figure out which phase causes so much time? Running the kubelet with --v=4 and paying attention to this call would help to track it down. And do you need 100% immediate status reporting, or just the faster the better? (I guess it's the latter, in which case the second approach would be preferred)
@resouer Yep, I'll give that a try. I'm traveling / vacationing right now, so might not get back here until the new year :)
The issue is that if it takes 3-5 seconds for connectivity to appear after the pod has started, some clients that try to connect at startup will fail, and not gracefully. So unless someone has written their app specifically to handle this, you can end up with a broken pod. It essentially needs to be immediate.
IIUC, this updates the pod status to the API server. Probably need to do it in
Can we identify the bottleneck first and define what "acceptable" latency is? "Immediate" seems a little vague to me... AFAIK, there are two known, relatively significant latencies in the critical path.
I am specifically concerned about inserting random calls to force a status update for (1). These arbitrary updates are hard to track and may overwhelm the kubelet/container runtime at times. We had a pretty bad experience in the past where each goroutine individually queried the container runtime, which only made operations slower in general. I think relying on "events" reported by the container runtime would be a better solution, but that requires a significant amount of work.
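As a purely illustrative sketch of that event-based direction (hypothetical, since the comment notes it would require significant work): a single consumer of runtime-reported events replaces per-goroutine polling of the runtime.

```go
package main

import (
	"fmt"
	"time"
)

// containerEvent stands in for an event reported by the container runtime,
// e.g. "sandbox ready", carrying the network status the runtime already knows.
type containerEvent struct {
	podUID string
	podIP  string
}

func main() {
	events := make(chan containerEvent, 100)

	// One consumer updates pod statuses from runtime events, so individual
	// goroutines never need to query the runtime themselves.
	go func() {
		for ev := range events {
			fmt.Printf("update status for pod %s: IP %s\n", ev.podUID, ev.podIP)
		}
	}()

	events <- containerEvent{podUID: "example-pod", podIP: "10.0.0.5"}
	time.Sleep(100 * time.Millisecond)
}
```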
Sorry it's taken me so long to get back to this issue - I did a little bit of digging into the kubelet logs and found what appears to be some strange behavior. I'm following the status_manager logs for the pod. Right after the Pod is started, a status update is queued, including the newly assigned IP address.
One second later, I see a log indicating that the status update was pulled off the queue and is being ignored because it hasn't changed.
About 30 seconds later, I see the status manager get an update in a syncBatch:
Then, I see a log indicating that the status was finally updated:
What I find strange is the second log, indicating that the Pod status hasn't changed and is thus being ignored.
@yujuhong - how do we detect whether the pod status changed? Is it a simple "DeepEqual" operation? If so, that shouldn't happen, right? If not, maybe there is some bug in how we do the comparison?
The status manager uses the apimachinery definition of DeepEqual. Is that the first status update for that pod that you see?
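For reference, a minimal sketch of that comparison, assuming a thin wrapper over apimachinery's semantic DeepEqual (the actual status manager logic is not reproduced here):

```go
package status

import (
	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// statusChanged reports whether a newly generated PodStatus differs from the
// cached one; identical statuses are ignored rather than re-queued.
func statusChanged(cached, generated v1.PodStatus) bool {
	return !apiequality.Semantic.DeepEqual(cached, generated)
}
```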
That means that the kubelet has the up-to-date status stored in the status manager; it doesn't mean that the status manager has already sent the update to the apiserver. The status manager has an internal channel where it stores the updates to send. To cope with overflow, it also performs a syncBatch periodically. It's possible that you hit the overflow situation if many pods had status changes during that window.
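A self-contained sketch of that pattern (channel size, tick interval, and names are illustrative, not the kubelet's actual values):

```go
package main

import (
	"fmt"
	"time"
)

type statusUpdate struct {
	podUID string
	ip     string
}

func main() {
	// Bounded channel: when it is full, sends are dropped and the periodic
	// syncBatch is responsible for reconciling the missed updates later.
	updates := make(chan statusUpdate, 1000)

	queue := func(u statusUpdate) {
		select {
		case updates <- u:
		default:
			// Overflow: fall back to the periodic batch sync.
		}
	}

	go func() {
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case u := <-updates:
				fmt.Println("sync single update for pod", u.podUID)
			case <-ticker.C:
				fmt.Println("syncBatch: reconcile every cached status")
			}
		}
	}()

	queue(statusUpdate{podUID: "example-pod", ip: "10.0.0.5"})
	time.Sleep(time.Second)
}
```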
@wojtek-t you should see |
Right, but I don't see a log anywhere prior to that one indicating that the status has already been sent to the status manager, so how would it have that update in its cache? I'd expect to see either this log or this log if the status was already in the queue (from when it was added to the queue), but I see neither. Or am I missing something?
@dashpole it's not - I do see a status update occur when the kubelet receives the
So I added some additional debug logging, and I think I'm seeing the pod status channel start to back up due to more incoming status updates than are being pulled off. For example, I see one update get queued behind 55 other pod status updates. It eventually gets synced as part of a batch 30s later. Then, the sync thread finally starts pulling all the updates off the channel and notices that the status from the channel is already up-to-date, i.e. I get a LOT of these in a row:
So the next question I had was "why isn't the status thread pulling the updates off the channel quickly enough?" Looks like I'm seeing this order of events:
So, it looks like it's taking a long time because the status update is received in the middle of a pretty long batch sync and thus isn't picked up until the next batch sync. But why is the batch sync taking so long? I'm starting about 300 pods across 10 nodes when I see this happen.
Most likely caused by the QPS limit I mentioned in #39113 (comment)
I thought this as well, but I'm seeing this behavior even when I set the |
You need to set the |
d'oh, thanks. I totally missed those options... I'll give it another run and see if that makes things better. |
I re-tested with all the above flags set to 1000; results look significantly better, but I'm still seeing this take a while. In my tests, the 99th percentile time drops from ~50 seconds to ~10 seconds by adjusting the QPS. A quick look at the logs seems to indicate that the channel is no longer filling up and we're not relying on the batch sync anymore, which is good, but it's still taking almost 10 seconds for the status to reach the status manager in the first place. I'll need to investigate that part now. As far as defining acceptable latency, I think we need at least a 99th percentile of < 5s.
Do we have any sort of priority queue for Kubelet -> API ops?
/reopen
@wreed4: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Is there any likelihood that this will get actioned and fixed? In particular, using https://argoproj.github.io/argo-workflows/, we run hundreds or thousands of workflows at a time, which launch hundreds or thousands of pods onto ~300 nodes. It's very common that >50 of them will get scheduled onto a node at one time. Furthermore, these pods are short-lived, so they are replaced by other pods very quickly. The kubelet starts to slow down very quickly. I've found that setting the QPS limit much higher than the default fixes this case, but it would be good to remove the race condition entirely.
/reopen
@dims: Reopened this issue.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its triage rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue.
Note that the AWS VPC CNI documentation released on 19 Oct 2021 describes a workaround for this issue in versions released in the last six months (1.9.3+). https://github.com/aws/amazon-vpc-cni-k8s#annotate_pod_ip-v193
|
Note to future people who are having this issue: you should probably open a new issue instead of recycling this one, as the original bug filed is from 2016 and the cause of the problem is likely something different. |
@bowei although this is a 2016 issue, that AWS VPC CNI 1.9.3 release was only a few months ago (19 Oct 2021) and specifically cited this issue #39113 as the issue it fixed, so it seems likely to me to be directly relevant for peeps coming here?
|
might be related to #85966 |
@bowei Wearing my skeptic hat: after fighting with this 4 years ago and still seeing it from time to time (different org now, but we track pod bring-up latencies as an internal SLO), there's been no evidence that this has been fixed systematically rather than finding a new way to the same symptoms. Instead, explicitly, this is just bitrotting because no one with enough cycles to tackle it has done so (e.g. a drive-by contributor with time to drive the k8s process, or a core contributor who wants to fix it).
I think that the first step is to add a metric, so users can report more accurate information.
I think this is how it goes: IPs are determined here (kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go, lines 845 to 846 in 11686e1), and the status is being patched here (kubernetes/pkg/kubelet/status/status_manager.go, lines 684 to 685 in 11686e1).
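Roughly, the flow described looks like this (a simplified sketch with stand-in types, not the actual kubelet code):

```go
package main

import "fmt"

// sandboxNetworkStatus is a simplified stand-in for the network portion of
// the CRI PodSandboxStatus the runtime returns to the kubelet.
type sandboxNetworkStatus struct {
	ip            string
	additionalIPs []string
}

// determinePodIPs mirrors the step in kuberuntime_manager.go where the
// kubelet extracts pod IPs from the sandbox status.
func determinePodIPs(netStatus *sandboxNetworkStatus) []string {
	if netStatus == nil || netStatus.ip == "" {
		return nil
	}
	return append([]string{netStatus.ip}, netStatus.additionalIPs...)
}

func main() {
	ips := determinePodIPs(&sandboxNetworkStatus{ip: "10.0.0.5"})
	// Later, the independent status-manager loop patches these IPs into
	// pod.Status.PodIPs on the API server (status_manager.go).
	fmt.Println("pod IPs to report:", ips)
}
```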
@dgrisonnet are you familiar with the metrics on the kubelet? |
I am not familiar enough with the kubelet code to say whether that would be possible, but theoretically it would make sense, in my opinion, to record the timestamp at which the runtime assigned an IP to the pod and convey that information to the status manager, which could then build a histogram of the time elapsed since the IP was assigned.
After looking a bit at the code, in practice it seems a bit difficult to wire that up, since the pod status manager is an independent process and the only way to convey information to it seems to be via the versionPodStatus structure. I am unsure whether that really makes sense, since I am not familiar with that code, so we will need people from node to chime in.
Overall this sounds like a good observability improvement that is worth its own separate issue.
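As a sketch of what such a metric could look like with the Prometheus client library (the metric name, buckets, and wiring are assumptions for illustration, not an existing kubelet metric):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// podIPReportLatency would track the delay between the runtime assigning an
// IP to a pod and the status manager sending that IP to the API server.
// The name and buckets here are illustrative only.
var podIPReportLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "kubelet_pod_ip_report_duration_seconds",
	Help:    "Time from pod IP assignment by the runtime to the status update reaching the API server.",
	Buckets: prometheus.ExponentialBuckets(0.1, 2, 10), // 0.1s .. ~51s
})

func main() {
	prometheus.MustRegister(podIPReportLatency)

	// The runtime-facing code would record when the IP was assigned...
	ipAssignedAt := time.Now()

	// ...and the status manager would observe the elapsed time when the
	// API update is actually sent.
	podIPReportLatency.Observe(time.Since(ipAssignedAt).Seconds())
}
```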
created an issue to track the effort #110815 |
Any update on this? |
There are metrics now; if someone reports the issue with the metrics and the logs, it is possible to investigate it. But as this comment correctly points out (#39113 (comment)), it is better to open a new issue to avoid misunderstanding.
I've seen this when implementing Calico NetworkPolicy against the Kubernetes API (as opposed to via Calico's etcd).
Since Pod status reporting is not synchronous with Pod setup, it often takes a long time for the API to get updated with the IP address of a newly networked Pod, which means it can take multiple seconds before any NetworkPolicy implementation based on the k8s API learns about the new Pod's assigned IP.
I've seen this push the 99th-percentile time to first connectivity into multiple seconds (because it takes seconds for kubelet -> apiserver -> controller to occur).
A naive fix would be to write the pod status immediately after CNI execution completes, based on the result returned by the CNI plugin (though this might have performance impacts due to the increased number of writes to the API).
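A sketch of that naive fix, with simplified stand-ins for the CNI result and the status writer (not the kubelet's actual code paths):

```go
package main

import (
	"context"
	"fmt"
)

// cniResult stands in for the result a CNI plugin returns from ADD; the real
// result type carries IP configurations per interface.
type cniResult struct {
	podIPs []string
}

// statusClient stands in for whatever writes Pod.Status to the API server.
type statusClient interface {
	PatchPodIPs(ctx context.Context, namespace, name string, ips []string) error
}

// setUpPodNetwork runs the CNI plugin and, on success, immediately writes the
// assigned IPs to the API server instead of waiting for the next status sync.
// The extra API write per pod start is the performance trade-off noted above.
func setUpPodNetwork(ctx context.Context, c statusClient, namespace, name string, runCNIAdd func() (*cniResult, error)) error {
	res, err := runCNIAdd()
	if err != nil {
		return fmt.Errorf("CNI ADD failed: %w", err)
	}
	return c.PatchPodIPs(ctx, namespace, name, res.podIPs)
}

type fakeClient struct{}

func (fakeClient) PatchPodIPs(ctx context.Context, namespace, name string, ips []string) error {
	fmt.Printf("PATCH pods/%s/%s status: podIPs=%v\n", namespace, name, ips)
	return nil
}

func main() {
	add := func() (*cniResult, error) { return &cniResult{podIPs: []string{"10.0.0.5"}}, nil }
	_ = setUpPodNetwork(context.Background(), fakeClient{}, "default", "example-pod", add)
}
```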
CC @thockin @freehan @dchen1107 @wojtek-t