CPU metrics are missing after running for several hours #2660

Closed
tony612 opened this issue May 31, 2017 · 6 comments

tony612 commented May 31, 2017

I found that the CPU metrics of a container drop to 0 after it has been running for several hours. When this happens, the container's process list and cgroups change as well.

I can reproduce this using kops 1.5.3:

$ kops version
Version 1.5.3 (git-46364f6)
$ kops create cluster --cloud=aws --zones ap-northeast-1a ${NAME}
# just changed the node count from 2 to 1
$ kops update cluster ${NAME} --yes
(AMI: k8s-1.5-debian-jessie-amd64-hvm-ebs-2017-05-02 (ami-56695f31))

Then I saved some info that may be useful:

$ sudo docker info
Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 3
Server Version: 1.12.3
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.65-k8s
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.862 GiB
Name: ip-172-20-36-154
ID: J2K6:5YGH:LAHZ:DMRU:W2XB:OILL:EH53:WANY:UK4J:7UIY:ULVP:44PQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
WARNING: No kernel memory limit support
Insecure Registries:
 127.0.0.0/8

$ sudo docker ps
CONTAINER ID        IMAGE                                        COMMAND                  CREATED             STATUS              PORTS               NAMES
04ce9bd6939b        gcr.io/google_containers/kube-proxy:v1.5.7   "/bin/sh -c 'echo -99"   16 minutes ago      Up 16 minutes                           k8s_kube-proxy.2c2d203f_kube-proxy-ip-172-20-36-154.ap-northeast-1.compute.internal_kube-system_7b35e1830aa0d2f2e3e3bf9fbae75ac5_3da5329f
$ sudo docker inspect -f '{{.State.Pid}}' 04ce9bd6939b
1510

$ cat /proc/1510/cgroup
10:pids:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
9:perf_event:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
8:net_cls,net_prio:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
7:freezer:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
6:devices:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
5:memory:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
4:blkio:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
3:cpu,cpuacct:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
2:cpuset:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
1:name=systemd:/system.slice/docker.service/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
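
cAdvisor reads per-container CPU usage from the cpu,cpuacct hierarchy above, so it can also be checked by hand (a sketch, assuming cgroupfs is mounted at the usual /sys/fs/cgroup path):

$ cat /sys/fs/cgroup/cpu,cpuacct/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16/cpuacct.usage
# prints the container's cumulative CPU time in nanoseconds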

$ sudo docker top 04ce9bd6939b
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                1510                1495                0                   03:10               ?                   00:00:00            /bin/sh -c echo -998 > /proc/$$/oom_score_adj && kube-proxy --kubeconfig=/var/lib/kube-proxy/kubeconfig --resource-container="" --master=https://api.internal.test0.llstestk8s.com --v=2 1>>/var/log/kube-proxy.log 2>&1
root                1520                1510                0                   03:10               ?                   00:00:02            kube-proxy --kubeconfig=/var/lib/kube-proxy/kubeconfig --resource-container= --master=https://api.internal.test0.llstestk8s.com --v=2

The cAdvisor result is right:
(screenshot: cadvisor0, CPU metrics shown correctly)

Then I ran this script to monitor the cgroup changes:

#!/bin/bash

while :
do
  l=$(grep -c docker /proc/1510/cgroup)
  if [ "$l" -ne 10 ]; then
    # send notification to Slack, then stop polling
    break
  fi
  printf "."
  sleep 60
done

After about 4 hours (May 31 06:50 UTC), I got the Slack notification, so I checked the same info again, along with some logs:

$ cat /proc/1510/cgroup
10:pids:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
9:perf_event:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
8:net_cls,net_prio:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
7:freezer:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
6:devices:/
5:memory:/
4:blkio:/
3:cpu,cpuacct:/
2:cpuset:/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16
1:name=systemd:/system.slice/docker.service/docker/04ce9bd6939b0921b31506cc8ed9e57bb5e85697b93422154957d2ec927cce16

$ sudo docker top 04ce9bd6939b
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD

$ less /var/log/syslog
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Stopping RPC Port Mapper.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Stopped target RPC Port Mapper.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Stopping LSB: RPC portmapper replacement...
May 31 06:50:18 ip-172-20-36-154 rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
May 31 06:50:18 ip-172-20-36-154 rpcbind[23556]: Stopping rpcbind daemon....
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Stopped LSB: RPC portmapper replacement.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:18 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:18 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:19 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:19 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:19 ip-172-20-36-154 systemd[1]: Stopped Provide limited super user privileges to specific users.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reloading.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: [/lib/systemd/system/docker.service:20] Unknown lvalue 'Delegate' in section 'Service'
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Starting LSB: RPC portmapper replacement...
May 31 06:50:20 ip-172-20-36-154 rpcbind[24955]: Starting rpcbind daemon....
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Started LSB: RPC portmapper replacement.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Starting RPC Port Mapper.
May 31 06:50:20 ip-172-20-36-154 systemd[1]: Reached target RPC Port Mapper.

The cAdvisor result is wrong now:
(screenshot: cadvisor1, CPU metrics missing)

Any ideas?

tony612 commented May 31, 2017

I found a simple way to reproduce this problem:

$ sudo service rpcbind stop
$ sudo systemctl daemon-reload
$ sudo service rpcbind start
$ cat /proc/xxx/cgroup

After restarting the container, the cgroups go back to normal.
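
The same check can be wrapped into a script (a sketch; it assumes the container PID is 1510, as above):

#!/bin/bash
# Print the container's cpu,cpuacct cgroup before and after
# a service stop / daemon-reload / service start cycle.
PID=1510
grep 'cpu,cpuacct' /proc/$PID/cgroup   # expect /docker/<container-id>
sudo service rpcbind stop
sudo systemctl daemon-reload
sudo service rpcbind start
grep 'cpu,cpuacct' /proc/$PID/cgroup   # shows just / once the bug triggers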

tony612 commented May 31, 2017

I found it's related to this issue, and it can be reproduced with just Docker: moby/moby#20152 (comment)

chrislovecnm (Contributor) commented

@tony612 I am guessing this is not a kops-specific issue, then?


danopia commented May 31, 2017

I've been trying to track down variants of this issue in multiple EC2 kops clusters for weeks now. It breaks autoscaling and is a major blocker for production. On kops 1.5 I saw some pods come up without CPU metrics and stay that way; on kops 1.6 I'm seeing all pods come up with CPU metrics intact, and then almost all of them lose CPU metrics by the 24-hour mark. This change happened without a k8s upgrade.

(I also have a bare-metal kubeadm cluster that hasn't shown the issue at all, but it runs a newer k8s (1.6), so it's not a good comparison.)


shamil commented May 31, 2017

This is a cAdvisor issue; see google/cadvisor#1510 and google/cadvisor#1572. It's already fixed by google/cadvisor#1573.

k8s 1.6 includes the fix.


tony612 commented Jun 1, 2017

@chrislovecnm @danopia @shamil It seems to be a problem related to Docker and systemd; see the comment below mine in the moby issue: moby/moby#20152 (comment). Jessie's systemd doesn't understand Delegate=true (hence the "Unknown lvalue 'Delegate'" warnings in the syslog above), so on daemon-reload it moves container processes back to the root cgroups. I think we should upgrade systemd in the Debian image to a version that supports Delegate=true. I'll close this issue and wait for #1749.
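
For reference, the docker.service unit already tries to declare delegation (that is what the warning about line 20 of the unit file refers to); on a systemd new enough to understand it, the directive can also be added via a drop-in. A sketch, with an illustrative drop-in file name:

$ sudo mkdir -p /etc/systemd/system/docker.service.d
$ printf '[Service]\nDelegate=yes\n' | sudo tee /etc/systemd/system/docker.service.d/10-delegate.conf
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker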

BTW, k8s 1.6 works because kops sets --cgroup-root to / in 1.6: https://github.com/kubernetes/kops/pull/1955/files
