
origin-node excessive logging unable to clean up volumes / orphaned pods #13111

Closed
canit00 opened this issue Feb 25, 2017 · 6 comments
Labels: component/kubernetes, component/storage, kind/bug, lifecycle/rotten, priority/P2


canit00 commented Feb 25, 2017

While troubleshooting a separate issue, we noticed excessive logging on three particular nodes designated to run infra services.

The excessive logging causes minor issues, such as the inability to capture logs via sosreport; more concerning, however, is being unable to address the root cause of these events.

Version

oc v1.4.1+3f9807a
kubernetes v1.4.0+776c994
openshift v1.4.1+3f9807a
kubernetes v1.4.0+776c994

Steps To Reproduce

We have an rsyslog DaemonSet and ipf-routers running on these hosts. Simply rebooting the systems causes this to occur.

Current Result

Excessive logging from origin-node, which appears to be unable to remove volumes left behind by orphaned pods.

Example log entry:

Orphaned pod "ce557650-f6d8-11e6-878f-005056909c57" found, but volumes are still mounted; err: , volumes: [wrapped_cassandra-token-9cn27.deleting~427042738

The volumes can be seen to be unmounted, yet they still exist on disk.

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/atomicos-root 200G 16G 185G 8% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 1.3M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sda1 297M 109M 189M 37% /boot
tmpfs 3.2G 0 3.2G 0% /run/user/1001
tmpfs 16G 32K 16G 1% /var/lib/origin/openshift.local.volumes/pods/a99e2014-fb84-11e6-a034-005056903ca4/volumes/kubernetes.io~secret/default-token-gexr0
tmpfs 16G 8.0K 16G 1% /var/lib/origin/openshift.local.volumes/pods/a1353e0d-fb8e-11e6-a034-005056903ca4/volumes/kubernetes.io~secret/server-certificate
tmpfs 16G 32K 16G 1% /var/lib/origin/openshift.local.volumes/pods/a1353e0d-fb8e-11e6-a034-005056903ca4/volumes/kubernetes.io~secret/router-token-i0wgz

ls -l /var/lib/origin/openshift.local.volumes/pods/
total 0
drwxr-x---. 5 root root 54 Feb 17 14:11 54dcecb5-f54d-11e6-bfbb-005056909c57
drwxr-x---. 5 root root 54 Feb 17 13:58 64242d73-f54b-11e6-ba0e-005056902089
drwxr-x---. 5 root root 54 Feb 25 13:14 a1353e0d-fb8e-11e6-a034-005056903ca4
drwxr-x---. 5 root root 71 Feb 25 12:03 a99e2014-fb84-11e6-a034-005056903ca4
drwxr-x---. 5 root root 71 Feb 19 13:28 ce557650-f6d8-11e6-878f-005056909c57
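
For reference, one way to cross-check which of these directories still correspond to live pods (a rough sketch, not part of the original report; the jsonpath invocation is an assumption and may need adjusting for your environment):

# Compare live pod UIDs against the pod directories on disk; the output is
# UIDs that exist on disk but have no matching live pod (candidate orphans).
oc get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' | sort > /tmp/live-uids
ls /var/lib/origin/openshift.local.volumes/pods/ | sort > /tmp/disk-uids
comm -13 /tmp/live-uids /tmp/disk-uids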

Expected Result
Additional Information

New containerized Origin cluster running on Atomic Host 7.3 (7.3.2), receiving approximately 2k messages a minute on the three designated infra nodes.

Others are experiencing the same issue, as reported in kubernetes/kubernetes#38498.

I tried exporting the journald logs for origin-node for just today, and the export exceeded a couple of gigabytes.
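
For scale, the export was done along these lines (a sketch, assuming the node service unit is named origin-node.service):

# Dump today's origin-node journal to a file and check its size
journalctl -u origin-node --since today > /tmp/origin-node-today.log
du -h /tmp/origin-node-today.log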

Looking for guidance on how to normalize this situation and how to provide more detailed logs.

logs.pdf


canit00 commented Feb 26, 2017

I was able to stop the excessive logging by origin-node by following the comment from @gnufied in kubernetes/kubernetes#38498 (comment).

ls -l ce557650-f6d8-11e6-878f-005056909c57/volumes-old/
total 476
drwxr-x---. 2 root root 6 Feb 23 15:35 kubernetes.io~nfs
drwxr-xr-x. 4829 root root 409600 Feb 25 21:09 kubernetes.io~secret

In my case, the volumes dir still had contents; moving it aside and creating an empty volumes dir was the missing key.
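
In shell terms, the workaround amounts to roughly the following (a sketch; the pod UID is the one from the listing above, and nothing under the pod directory should still be mounted before moving anything):

# Move the populated volumes dir aside and leave an empty one in its place
POD_DIR=/var/lib/origin/openshift.local.volumes/pods/ce557650-f6d8-11e6-878f-005056909c57
mount | grep "$POD_DIR"        # should print nothing if no mounts remain
mv "$POD_DIR/volumes" "$POD_DIR/volumes-old"
mkdir "$POD_DIR/volumes"
# origin-node's orphaned-pod cleanup can then remove the now-empty directory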

Just an afterthought, as I finish recovering a couple of our infra nodes after Docker stopped responding: I appreciate everyone's hard work and passion in the community. I look forward to putting this one behind us. It's a tough one to deal with.

Cheers!


canit00 commented Mar 17, 2017

Good afternoon @gnufied!

Would you be able to provide any suggestions on how to manually clean up orphaned pods and stop the repeated logs?

Earlier I reported that manually moving the volumes directory inside the pod's directory and creating an empty volumes directory would clean up orphaned pods.

This approach does not appear to be the fix for every single occurrence.

Sincerely,
canit00


gnufied commented Mar 24, 2017

@canit00 thanks for the bug report. I am evaluating this from a bunch of angles.

Just to confirm: are the directories that are not getting cleaned up mounted volumes, as one of the commenters reported in the Kubernetes bug, or are they plain directories?

If they are mounted directories, we have to reconsider our approach a bit, because it will require unmounting the volume first.
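
For reference, one way to tell the two cases apart on a node (a sketch; <pod-uid> is a placeholder for an actual leftover directory name):

# Check whether anything is still mounted under a leftover pod directory
POD_DIR=/var/lib/origin/openshift.local.volumes/pods/<pod-uid>
findmnt -rn -o TARGET | grep "$POD_DIR"
# no output  -> plain directories, safe to remove
# any output -> real mounts that must be unmounted first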

gnufied added a commit to gnufied/origin that referenced this issue Mar 28, 2017
…rphanedPodDirs

This is basically a cherry-pick of:

1. kubernetes/kubernetes#38909
2. https://github.com/kubernetes/kubernetes/pull/41626/files

But those commits can't be directly cherry-picked because of a whole lot of dependencies.

Fixes openshift#13111
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Feb 9, 2018
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 13, 2018
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
