Kubelet takes a long time to update Pod.Status.PodIP #39113
Comments
@kubernetes/sig-node-misc This is interesting to consider. We get used to the system responding ~INSTANTLY, and when it doesn't we aren't happy.
@thockin And this actually happens for all network plugins. In this case, it should be fixed.
NB, we have already pushed network plugins down into the runtime, so it's not easy to write pod status immediately, because the kubelet doesn't know about CNI.
We could change NetworkPlugin.SetUpPod() to return (*PodNetworkStatus, error), and for plugins other than CNI, just have SetUpPod() end with "return plugin.GetPodNetworkStatus(...)" |
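For illustration, a minimal Go sketch of that idea, using simplified stand-in types (the real NetworkPlugin interface lives in the kubelet's network plugin package; the exact signatures here are assumptions):

```go
package network

import "net"

// PodNetworkStatus is a simplified stand-in for the kubelet's pod network
// status type; it carries the IP assigned during pod setup.
type PodNetworkStatus struct {
	IP net.IP
}

// NetworkPlugin sketches the proposed change: SetUpPod returns the resulting
// network status so the kubelet can report the pod IP immediately instead of
// waiting for a later status sync to call GetPodNetworkStatus.
type NetworkPlugin interface {
	SetUpPod(namespace, name, sandboxID string) (*PodNetworkStatus, error)
	GetPodNetworkStatus(namespace, name, sandboxID string) (*PodNetworkStatus, error)
}

// A plugin that doesn't learn the IP during setup can end SetUpPod by
// delegating to GetPodNetworkStatus, exactly as suggested above.
type legacyPlugin struct{}

func (p *legacyPlugin) SetUpPod(namespace, name, sandboxID string) (*PodNetworkStatus, error) {
	// ... perform the actual network setup for the sandbox here ...
	return p.GetPodNetworkStatus(namespace, name, sandboxID)
}

func (p *legacyPlugin) GetPodNetworkStatus(namespace, name, sandboxID string) (*PodNetworkStatus, error) {
	// ... inspect the pod's network namespace to find its IP ...
	return &PodNetworkStatus{IP: net.ParseIP("10.0.0.5")}, nil
}
```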
@danwinship As feiskyer implies, the kubelet does not know about the network plugin and relies on an independent loop to update pod status, while SetUpPod() is executed in another main loop. We may have to change CRI and build a channel between those loops to update pod status immediately ... Another approach is to improve this reporting path "at best". @caseydavenport Is it possible from your side to figure out which phase causes so much time? Running the kubelet with --v=4 and paying attention to this call would help to track it down. And do you need 100% immediate status reporting, or just the faster the better? (I guess it's the latter, in which case the second approach would be preferred)
@resouer Yep, I'll give that a try. I'm traveling / vacationing right now, so might not get back here until the new year :)
The issue is that if it takes 3-5 seconds for connectivity to appear after the pod has started, some clients that try to connect at startup will fail, and not gracefully. So unless someone has written their app specifically to handle this, you can end up with a broken pod. It essentially needs to be immediate.
IIUC, this updates the pod status to the API server. Probably need to do it in
Can we identify the bottleneck first and define what "acceptable" latency is? "Immediate" seems a little vague to me... AFAIK, there are two known, relatively significant latencies in the critical path.
I am specifically concerned about inserting random calls to force a status update for (1). These arbitrary updates are hard to track and may overwhelm the kubelet/container runtime at times. We had a pretty bad experience in the past where each goroutine individually queried the container runtime, which only made operations slower in general. I think relying on "events" reported by the container runtime would be a better solution, but that requires a significant amount of work.
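As a purely illustrative sketch of that event-based direction (hypothetical, since the comment notes it would require significant work): a single consumer of runtime-reported events replaces per-goroutine polling of the runtime.

```go
package main

import (
	"fmt"
	"time"
)

// containerEvent stands in for an event reported by the container runtime,
// e.g. "sandbox ready", carrying the network status the runtime already knows.
type containerEvent struct {
	podUID string
	podIP  string
}

func main() {
	events := make(chan containerEvent, 100)

	// One consumer updates pod statuses from runtime events, so individual
	// goroutines never need to query the runtime themselves.
	go func() {
		for ev := range events {
			fmt.Printf("update status for pod %s: IP %s\n", ev.podUID, ev.podIP)
		}
	}()

	events <- containerEvent{podUID: "example-pod", podIP: "10.0.0.5"}
	time.Sleep(100 * time.Millisecond)
}
```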
Sorry it's taken me so long to get back to this issue - I did a little bit of digging into the kubelet logs and found what appears to be some strange behavior. I'm following the status_manager logs for the pod. Right after the Pod is started, a status update is queued, including the newly assigned IP address.
One second later, I see a log indicating that the status update was pulled off the queue and is being ignored because it hasn't changed.
About 30 seconds later, I see the status manager get an update in a syncBatch:
Then, I see a log indicating that the status was finally updated:
What I find strange is the second log, indicating that the Pod status hasn't changed and is thus being ignored.
@yujuhong - how do we detect whether the pod status changed? Is it a simple "DeepEqual" operation? If so, that shouldn't happen, right? If not, maybe there is some bug in how we do the comparison?
The status manager uses the apimachinery definition of DeepEqual. Is that the first status update for that pod that you see?
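For reference, a minimal sketch of that comparison, assuming a thin wrapper over apimachinery's semantic DeepEqual (the actual status manager logic is not reproduced here):

```go
package status

import (
	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// statusChanged reports whether a newly generated PodStatus differs from the
// cached one; identical statuses are ignored rather than re-queued.
func statusChanged(cached, generated v1.PodStatus) bool {
	return !apiequality.Semantic.DeepEqual(cached, generated)
}
```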
That means that the kubelet has the up-to-date status stored in the status manager; it doesn't mean that the status manager has already sent the update to the apiserver. The status manager has an internal channel where it stores the updates to send. To cope with overflow, it also performs a syncBatch periodically. It's possible that you hit the overflow situation if many pods had status changes during that window.
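A self-contained sketch of that pattern (channel size, tick interval, and names are illustrative, not the kubelet's actual values):

```go
package main

import (
	"fmt"
	"time"
)

type statusUpdate struct {
	podUID string
	ip     string
}

func main() {
	// Bounded channel: when it is full, sends are dropped and the periodic
	// syncBatch is responsible for reconciling the missed updates later.
	updates := make(chan statusUpdate, 1000)

	queue := func(u statusUpdate) {
		select {
		case updates <- u:
		default:
			// Overflow: fall back to the periodic batch sync.
		}
	}

	go func() {
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case u := <-updates:
				fmt.Println("sync single update for pod", u.podUID)
			case <-ticker.C:
				fmt.Println("syncBatch: reconcile every cached status")
			}
		}
	}()

	queue(statusUpdate{podUID: "example-pod", ip: "10.0.0.5"})
	time.Sleep(time.Second)
}
```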
@wojtek-t you should see |
Right, but I don't see a log anywhere prior to that one indicating that the status has already been sent to the status manager, so how would it have that update in its cache? I'd expect to see either this log or this log if the status was already in the queue (from when it was added to the queue), but I see neither. Or am I missing something?
@dashpole it's not - I do see a status update occur when the kubelet receives the
So I added some additional debug logging, and I think I'm seeing the pod status channel start to back up due to more incoming status updates than are being pulled off. For example, I see one update get queued behind 55 other pod status updates. It eventually gets synced as part of a batch 30s later. Then, the sync thread finally starts pulling all the updates off the channel and notices that the status from the channel is already up-to-date, i.e. I get a LOT of these in a row:
So the next question I had was "why isn't the status thread pulling the updates off the channel quickly enough?" Looks like I'm seeing this order of events:
So, it looks like it's taking a long time because the status update is received in the middle of a pretty long batch sync and thus isn't picked up until the next batch sync. But why is the batch sync taking so long? I'm starting about 300 pods across 10 nodes when I see this happen.
Most likely caused by the QPS limit I mentioned in #39113 (comment)
I thought this as well, but I'm seeing this behavior even when I set the |
You need to set the |
d'oh, thanks. I totally missed those options... I'll give it another run and see if that makes things better. |
I re-tested with all the above flags set to 1000; results look significantly better, but I'm still seeing this take a while. In my tests, the 99th percentile time drops from ~50 seconds to ~10 seconds by adjusting the QPS. A quick look at the logs seems to indicate that the channel is no longer filling up and we're not relying on the batch sync anymore, which is good, but it's still taking almost 10 seconds for the status to reach the status manager in the first place. I'll need to investigate that part now. As far as defining acceptable latency, I think we need at least a 99th percentile of < 5s.
Do we have any sort of priority queue for Kubelet -> API ops?
/reopen
@wreed4: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Is there any likelihood that this will get actioned and fixed? In particular, using https://argoproj.github.io/argo-workflows/, we run hundreds or thousands of workflows at a time, which launch hundreds or thousands of pods onto ~300 nodes. It's very common that >50 of them will get scheduled onto a node at one time. Furthermore, these pods are short-lived, so they are replaced by other pods very quickly. The kubelet starts to slow down very quickly. I've found that setting the QPS limit much higher than the default fixes this case, but it would be good to remove the race condition entirely.
/reopen
@dims: Reopened this issue.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its triage rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue.
Note that the AWS VPC CNI documentation released on 19 Oct 2021 describes a workaround for this issue in versions released in the last six months (1.9.3+). https://github.com/aws/amazon-vpc-cni-k8s#annotate_pod_ip-v193
|
Note to future people who are having this issue: you should probably open a new issue instead of recycling this one, as the original bug filed is from 2016 and the cause of the problem is likely something different. |
@bowei although this is a 2016 issue, that AWS VPC CNI 1.9.3 release was only a few months ago (19 Oct 2021) and specifically cited this issue #39113 as the issue it fixed, so it seems likely to me to be directly relevant for peeps coming here?
|
might be related to #85966 |
@bowei Wearing my skeptic hat: after fighting with this 4 years ago and still seeing it from time to time (different org now, but we track pod bring-up latencies as an internal SLO), there's been no evidence that this has been fixed systematically rather than finding a new way to the same symptoms. Instead, explicitly, this is just bitrotting because no one with enough cycles to tackle it has done so (e.g. a drive-by contributor with time to drive the k8s process, or a core contributor who wants to fix it).
I think that the first step is to add a metric, so users can report more accurate information.
I think this is how it goes: IPs are determined here (kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go, lines 845 to 846 in 11686e1), and the status is being patched here (kubernetes/pkg/kubelet/status/status_manager.go, lines 684 to 685 in 11686e1).
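Roughly, the flow described looks like this (a simplified sketch with stand-in types, not the actual kubelet code):

```go
package main

import "fmt"

// sandboxNetworkStatus is a simplified stand-in for the network portion of
// the CRI PodSandboxStatus the runtime returns to the kubelet.
type sandboxNetworkStatus struct {
	ip            string
	additionalIPs []string
}

// determinePodIPs mirrors the step in kuberuntime_manager.go where the
// kubelet extracts pod IPs from the sandbox status.
func determinePodIPs(netStatus *sandboxNetworkStatus) []string {
	if netStatus == nil || netStatus.ip == "" {
		return nil
	}
	return append([]string{netStatus.ip}, netStatus.additionalIPs...)
}

func main() {
	ips := determinePodIPs(&sandboxNetworkStatus{ip: "10.0.0.5"})
	// Later, the independent status-manager loop patches these IPs into
	// pod.Status.PodIPs on the API server (status_manager.go).
	fmt.Println("pod IPs to report:", ips)
}
```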
@dgrisonnet are you familiar with the metrics on the kubelet? |
I am not familiar enough with the kubelet code to say whether that would be possible, but theoretically it would make sense, in my opinion, to record the timestamp at which the runtime assigned an IP to the pod and convey that information to the status manager, which could then build a histogram of the time elapsed since the IP was assigned.
After looking a bit at the code, in practice it seems a bit difficult to wire that up, since the pod status manager is an independent process and the only way to convey information to it seems to be via the versionPodStatus structure. I am unsure whether that really makes sense, since I am not familiar with that code, so we will need people from node to chime in.
Overall this sounds like a good observability improvement that is worth its own separate issue.
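As a sketch of what such a metric could look like with the Prometheus client library (the metric name, buckets, and wiring are assumptions for illustration, not an existing kubelet metric):

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// podIPReportLatency would track the delay between the runtime assigning an
// IP to a pod and the status manager sending that IP to the API server.
// The name and buckets here are illustrative only.
var podIPReportLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "kubelet_pod_ip_report_duration_seconds",
	Help:    "Time from pod IP assignment by the runtime to the status update reaching the API server.",
	Buckets: prometheus.ExponentialBuckets(0.1, 2, 10), // 0.1s .. ~51s
})

func main() {
	prometheus.MustRegister(podIPReportLatency)

	// The runtime-facing code would record when the IP was assigned...
	ipAssignedAt := time.Now()

	// ...and the status manager would observe the elapsed time when the
	// API update is actually sent.
	podIPReportLatency.Observe(time.Since(ipAssignedAt).Seconds())
}
```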
created an issue to track the effort #110815 |
Any update on this? |
There are metrics now; if someone reports the issue with the metrics and the logs, it is possible to investigate it. But as this comment correctly points out (#39113 (comment)), it is better to open a new issue to avoid misunderstanding.
I've seen this when implementing Calico NetworkPolicy against the Kubernetes API (as opposed to via Calico's etcd).
Since Pod status reporting is not synchronous with Pod setup, it often takes a long time for the API to get updated with the IP address of a newly networked Pod, which means it can take multiple seconds before any NetworkPolicy implementation based on the k8s API learns about the new Pod's assigned IP.
I've seen this push the 99th-percentile time to first connectivity into multiple seconds (because it takes seconds for kubelet -> apiserver -> controller to occur).
A naive fix would be to write the pod status immediately after CNI execution completes, based on the result returned by the CNI plugin (though this might have performance impacts due to the increased number of writes to the API).
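A sketch of that naive fix, with simplified stand-ins for the CNI result and the status writer (not the kubelet's actual code paths):

```go
package main

import (
	"context"
	"fmt"
)

// cniResult stands in for the result a CNI plugin returns from ADD; the real
// result type carries IP configurations per interface.
type cniResult struct {
	podIPs []string
}

// statusClient stands in for whatever writes Pod.Status to the API server.
type statusClient interface {
	PatchPodIPs(ctx context.Context, namespace, name string, ips []string) error
}

// setUpPodNetwork runs the CNI plugin and, on success, immediately writes the
// assigned IPs to the API server instead of waiting for the next status sync.
// The extra API write per pod start is the performance trade-off noted above.
func setUpPodNetwork(ctx context.Context, c statusClient, namespace, name string, runCNIAdd func() (*cniResult, error)) error {
	res, err := runCNIAdd()
	if err != nil {
		return fmt.Errorf("CNI ADD failed: %w", err)
	}
	return c.PatchPodIPs(ctx, namespace, name, res.podIPs)
}

type fakeClient struct{}

func (fakeClient) PatchPodIPs(ctx context.Context, namespace, name string, ips []string) error {
	fmt.Printf("PATCH pods/%s/%s status: podIPs=%v\n", namespace, name, ips)
	return nil
}

func main() {
	add := func() (*cniResult, error) { return &cniResult{podIPs: []string{"10.0.0.5"}}, nil }
	_ = setUpPodNetwork(context.Background(), fakeClient{}, "default", "example-pod", add)
}
```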
CC @thockin @freehan @dchen1107 @wojtek-t