CPU metrics are missing after running for several hours #2660
I found a simple way to reproduce this problem:
After restarting the container, it becomes normal again.
I found it's related to this issue and can reproduce it using just Docker: moby/moby#20152 (comment)
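The exact commands didn't survive in this copy of the thread; below is a minimal sketch of a Docker-only reproduction in the spirit of the linked moby comment, assuming a systemd host where a daemon reload disturbs the container's cgroups. The container image and commands are illustrative, not the original reporter's.

```bash
# Hypothetical reproduction sketch based on the linked moby thread; not the
# exact commands from the original comment.
CID=$(docker run -d busybox sleep 86400)

# Record the cgroup membership of the container's init process.
PID=$(docker inspect -f '{{.State.Pid}}' "$CID")
cat /proc/"$PID"/cgroup

# Reload systemd's configuration, the trigger discussed in the moby thread.
sudo systemctl daemon-reload

# Compare: the cpu,cpuacct entry may now differ or point at a removed cgroup.
cat /proc/"$PID"/cgroup
```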
@tony612 I am guessing this is not a kops-specific issue then?
I've been trying to track down variants of this issue in multiple EC2 kops clusters for weeks now. It breaks autoscaling and is a major blocker for production. On kops 1.5 I saw some pods come up without CPU metrics and stay that way; on kops 1.6 I'm seeing all pods come up with CPU metrics intact, and then almost all pods lose them by the 24-hour mark. This change happened without a k8s upgrade. (I also have a bare-metal kubeadm cluster that hasn't shown the issue at all, but it's on a newer k8s (1.6), so it's not a good comparison.)
This is a cAdvisor issue; see google/cadvisor#1510 and google/cadvisor#1572. It's already fixed in google/cadvisor#1573, and k8s 1.6 includes the fix.
@chrislovecnm @danopia @shamil It seems to be a problem related to Docker and systemd. See the comment below mine in the moby issue: moby/moby#20152 (comment). So I think we should upgrade the systemd in the Debian image to support [...]. By the way, k8s 1.6 works because kops sets [...].
I found that the CPU metrics of a container become 0 after it has been running for several hours, and that the processes and cgroup of the container change.
I can reproduce this using kops 1.5.3:
Then I saved some (maybe useful) info:
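The captured output was lost in this copy of the thread; here is an illustrative sketch of commands that capture this kind of state. The container ID is a placeholder, and the cgroup path assumes Docker's cgroupfs cgroup driver.

```bash
# Placeholder container ID; substitute the affected container's ID.
CID=abcdef123456

# Processes running in the container, as Docker sees them.
docker top "$CID"

# cgroup membership of the container's init process.
PID=$(docker inspect -f '{{.State.Pid}}' "$CID")
cat /proc/"$PID"/cgroup

# cgroup directories that exist for the container on the host
# (layout assumes the cgroupfs driver).
ls -d /sys/fs/cgroup/*/docker/"$CID"* 2>/dev/null
```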
The cAdvisor result is right:
![cadvisor0](https://cloud.githubusercontent.com/assets/1253659/26621166/5ba7167c-4618-11e7-8c29-1ac41c2dbf48.png)
Then I ran this script to monitor the cgroup changes:
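The script itself didn't survive in this copy of the thread; the following is a minimal sketch in the same spirit, assuming a Slack incoming-webhook URL and a placeholder container ID (both hypothetical, not values from the original post).

```bash
#!/bin/bash
# Hypothetical monitor: alert via Slack when the container's cpu,cpuacct
# cgroup entry changes. SLACK_WEBHOOK and CID are placeholders.
SLACK_WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"
CID=abcdef123456

PID=$(docker inspect -f '{{.State.Pid}}' "$CID")
BASELINE=$(grep cpuacct /proc/"$PID"/cgroup)

while sleep 60; do
  CURRENT=$(grep cpuacct /proc/"$PID"/cgroup)
  if [ "$CURRENT" != "$BASELINE" ]; then
    curl -s -X POST -H 'Content-type: application/json' \
      --data "{\"text\":\"cgroup changed for $CID: $CURRENT\"}" "$SLACK_WEBHOOK"
    break
  fi
done
```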
After about 4 hours (May 31 06:50 UTC), I got the Slack notification, and I checked the corresponding info and some logs:
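The logs themselves were also lost in this copy; on a systemd host, checks along these lines would cover the window above (the time range is illustrative and may need adjusting):

```bash
# Illustrative log checks around the time the notification fired.
journalctl -u docker --since "2017-05-31 06:00" --until "2017-05-31 07:00"
journalctl -u kubelet --since "2017-05-31 06:00" | grep -i cgroup
```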
The cAdvisor result is wrong now:
![cadvisor1](https://cloud.githubusercontent.com/assets/1253659/26621174/67761c00-4618-11e7-902d-9b9b022075cc.png)
Any ideas?