CPU metrics are missing after running for several hours #2660
I found a simple way to reproduce this problem:
After restarting the container, it becomes normal again.
I found it's related to this issue and can reproduce it using just Docker: moby/moby#20152 (comment)
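The exact commands didn't survive in this copy of the thread; below is a minimal sketch of a Docker-only reproduction in the spirit of the linked moby comment, assuming a systemd host where a daemon reload disturbs the container's cgroups. The container image and commands are illustrative, not the original reporter's.

```bash
# Hypothetical reproduction sketch based on the linked moby thread; not the
# exact commands from the original comment.
CID=$(docker run -d busybox sleep 86400)

# Record the cgroup membership of the container's init process.
PID=$(docker inspect -f '{{.State.Pid}}' "$CID")
cat /proc/"$PID"/cgroup

# Reload systemd's configuration, the trigger discussed in the moby thread.
sudo systemctl daemon-reload

# Compare: the cpu,cpuacct entry may now differ or point at a removed cgroup.
cat /proc/"$PID"/cgroup
```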
@tony612 I am guessing this is not a kops-specific issue then?
I've been trying to track down variants of this issue in multiple EC2 kops clusters for weeks now. It breaks autoscaling and is a major blocker for production. On kops 1.5 I saw some pods come up without CPU metrics and stay that way; on kops 1.6 I'm seeing all pods come up with CPU metrics intact, and then almost all pods lose them by the 24-hour mark. This change happened without a k8s upgrade. (I also have a bare-metal kubeadm cluster that hasn't shown the issue at all, but it's on a newer k8s (1.6), so it's not a good comparison.)
This is a cAdvisor issue; see google/cadvisor#1510 and google/cadvisor#1572. It's already fixed in google/cadvisor#1573, and k8s 1.6 includes the fix.
@chrislovecnm @danopia @shamil It seems to be a problem related to Docker and systemd. See the comment below mine in the moby issue: moby/moby#20152 (comment). So I think we should upgrade the systemd in the Debian image to support [...]. By the way, k8s 1.6 works because kops sets [...].
I found that the CPU metrics of a container become 0 after it has been running for several hours, and that the processes and cgroup of the container change.
I can reproduce this using kops 1.5.3:
Then I saved some (maybe useful) info:
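The captured output was lost in this copy of the thread; here is an illustrative sketch of commands that capture this kind of state. The container ID is a placeholder, and the cgroup path assumes Docker's cgroupfs cgroup driver.

```bash
# Placeholder container ID; substitute the affected container's ID.
CID=abcdef123456

# Processes running in the container, as Docker sees them.
docker top "$CID"

# cgroup membership of the container's init process.
PID=$(docker inspect -f '{{.State.Pid}}' "$CID")
cat /proc/"$PID"/cgroup

# cgroup directories that exist for the container on the host
# (layout assumes the cgroupfs driver).
ls -d /sys/fs/cgroup/*/docker/"$CID"* 2>/dev/null
```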
The cAdvisor result is right:
![cadvisor0](https://cloud.githubusercontent.com/assets/1253659/26621166/5ba7167c-4618-11e7-8c29-1ac41c2dbf48.png)
Then I ran this script to monitor the cgroup changes:
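The script itself didn't survive in this copy of the thread; the following is a minimal sketch in the same spirit, assuming a Slack incoming-webhook URL and a placeholder container ID (both hypothetical, not values from the original post).

```bash
#!/bin/bash
# Hypothetical monitor: alert via Slack when the container's cpu,cpuacct
# cgroup entry changes. SLACK_WEBHOOK and CID are placeholders.
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"
CID=abcdef123456

PID=$(docker inspect -f '{{.State.Pid}}' "$CID")
BASELINE=$(grep cpuacct /proc/"$PID"/cgroup)

while sleep 60; do
  CURRENT=$(grep cpuacct /proc/"$PID"/cgroup)
  if [ "$CURRENT" != "$BASELINE" ]; then
    curl -s -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"cgroup changed for $CID: $CURRENT\"}" "$SLACK_WEBHOOK"
    break
  fi
done
```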
After about 4 hours (May 31 06:50 UTC), I got the Slack notification, and I checked the corresponding info and some logs:
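The logs themselves were also lost in this copy; on a systemd host, checks along these lines would cover the window above (the time range is illustrative and may need adjusting):

```bash
# Illustrative log checks around the time the notification fired.
journalctl -u docker --since "2017-05-31 06:00" --until "2017-05-31 07:00"
journalctl -u kubelet --since "2017-05-31 06:00" | grep -i cgroup
```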
The cAdvisor result is wrong now:
![cadvisor1](https://cloud.githubusercontent.com/assets/1253659/26621174/67761c00-4618-11e7-902d-9b9b022075cc.png)
Any ideas?