
origin-node excessive logging unable to clean up volumes / orphaned pods #13111

Closed
canit00 opened this issue Feb 25, 2017 · 6 comments
Labels: component/kubernetes, component/storage, kind/bug, lifecycle/rotten, priority/P2


canit00 commented Feb 25, 2017

While troubleshooting a separate issue, we noticed excessive logging on three particular nodes designated to run infra services.

The excessive logging causes minor issues, such as the inability to capture logs via sosreport; more concerning, however, is being unable to address the root cause of these events.

Version

oc v1.4.1+3f9807a
kubernetes v1.4.0+776c994
openshift v1.4.1+3f9807a
kubernetes v1.4.0+776c994

Steps To Reproduce

We have an rsyslog DaemonSet and ipf-routers running on these hosts. Simply rebooting the systems causes this to occur.

Current Result

Excessive logging from origin-node, which appears to be unable to remove volumes left behind by orphaned pods.

Example log entry:

Orphaned pod "ce557650-f6d8-11e6-878f-005056909c57" found, but volumes are still mounted; err: , volumes: [wrapped_cassandra-token-9cn27.deleting~427042738

The volumes can be seen to be unmounted, yet they still exist on disk.

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/atomicos-root 200G 16G 185G 8% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 1.3M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sda1 297M 109M 189M 37% /boot
tmpfs 3.2G 0 3.2G 0% /run/user/1001
tmpfs 16G 32K 16G 1% /var/lib/origin/openshift.local.volumes/pods/a99e2014-fb84-11e6-a034-005056903ca4/volumes/kubernetes.io~secret/default-token-gexr0
tmpfs 16G 8.0K 16G 1% /var/lib/origin/openshift.local.volumes/pods/a1353e0d-fb8e-11e6-a034-005056903ca4/volumes/kubernetes.io~secret/server-certificate
tmpfs 16G 32K 16G 1% /var/lib/origin/openshift.local.volumes/pods/a1353e0d-fb8e-11e6-a034-005056903ca4/volumes/kubernetes.io~secret/router-token-i0wgz

ls -l /var/lib/origin/openshift.local.volumes/pods/
total 0
drwxr-x---. 5 root root 54 Feb 17 14:11 54dcecb5-f54d-11e6-bfbb-005056909c57
drwxr-x---. 5 root root 54 Feb 17 13:58 64242d73-f54b-11e6-ba0e-005056902089
drwxr-x---. 5 root root 54 Feb 25 13:14 a1353e0d-fb8e-11e6-a034-005056903ca4
drwxr-x---. 5 root root 71 Feb 25 12:03 a99e2014-fb84-11e6-a034-005056903ca4
drwxr-x---. 5 root root 71 Feb 19 13:28 ce557650-f6d8-11e6-878f-005056909c57
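
For reference, one way to cross-check which of these directories still correspond to live pods (a rough sketch, not part of the original report; the jsonpath invocation is an assumption and may need adjusting for your environment):

# Compare live pod UIDs against the pod directories on disk; the output is
# UIDs that exist on disk but have no matching live pod (candidate orphans).
oc get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.uid}{"\n"}{end}' | sort > /tmp/live-uids
ls /var/lib/origin/openshift.local.volumes/pods/ | sort > /tmp/disk-uids
comm -13 /tmp/live-uids /tmp/disk-uids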

Expected Result
Additional Information

New containerized Origin cluster running on Atomic Host 7.3 (7.3.2), receiving approximately 2k messages a minute on the three designated infra nodes.

Others are experiencing the same issue, as reported in kubernetes/kubernetes#38498.

I tried exporting the journald logs for origin-node for just today, and the export exceeded a couple of gigabytes.
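
For scale, the export was done along these lines (a sketch, assuming the node service unit is named origin-node.service):

# Dump today's origin-node journal to a file and check its size
journalctl -u origin-node --since today > /tmp/origin-node-today.log
du -h /tmp/origin-node-today.log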

Looking for guidance on how to normalize this situation and how to provide more detailed logs.

logs.pdf


canit00 commented Feb 26, 2017

I was able to stop the excessive logging by origin-node by following the comment from @gnufied in kubernetes/kubernetes#38498 (comment).

ls -l ce557650-f6d8-11e6-878f-005056909c57/volumes-old/
total 476
drwxr-x---. 2 root root 6 Feb 23 15:35 kubernetes.io~nfs
drwxr-xr-x. 4829 root root 409600 Feb 25 21:09 kubernetes.io~secret

In my case, the volumes dir still had contents; moving it aside and creating an empty volumes dir was the missing key.
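
In shell terms, the workaround amounts to roughly the following (a sketch; the pod UID is the one from the listing above, and nothing under the pod directory should still be mounted before moving anything):

# Move the populated volumes dir aside and leave an empty one in its place
POD_DIR=/var/lib/origin/openshift.local.volumes/pods/ce557650-f6d8-11e6-878f-005056909c57
mount | grep "$POD_DIR"        # should print nothing if no mounts remain
mv "$POD_DIR/volumes" "$POD_DIR/volumes-old"
mkdir "$POD_DIR/volumes"
# origin-node's orphaned-pod cleanup can then remove the now-empty directory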

Just an afterthought, as I finish recovering a couple of our infra nodes after Docker stopped responding: I appreciate everyone's hard work and passion in the community. I look forward to putting this one behind us. It's a tough one to deal with.

Cheers!


canit00 commented Mar 17, 2017

Good afternoon @gnufied!

Would you be able to provide any suggestions on how to manually clean up orphaned pods and stop the repeated logs?

Earlier I reported that manually moving the volumes directory inside the pod's directory and creating an empty volumes directory would clean up orphaned pods.

This approach does not appear to be the fix for every single occurrence.

Sincerely,
canit00


gnufied commented Mar 24, 2017

@canit00 thanks for the bug report. I am evaluating this from a bunch of angles.

Just to confirm: are the directories that are not getting cleaned up mounted volumes, as one of the commenters reported in the Kubernetes bug, or are they plain directories?

If they are mounted directories, we have to reconsider our approach a bit, because it will require unmounting the volume first.
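
For reference, one way to tell the two cases apart on a node (a sketch; <pod-uid> is a placeholder for an actual leftover directory name):

# Check whether anything is still mounted under a leftover pod directory
POD_DIR=/var/lib/origin/openshift.local.volumes/pods/<pod-uid>
findmnt -rn -o TARGET | grep "$POD_DIR"
# no output  -> plain directories, safe to remove
# any output -> real mounts that must be unmounted first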

gnufied added a commit to gnufied/origin that referenced this issue Mar 28, 2017
…rphanedPodDirs

This is basically a cherry-pick of:

1. kubernetes/kubernetes#38909
2. https://github.com/kubernetes/kubernetes/pull/41626/files

But those commits can't be directly cherry-picked because of a whole lot of dependencies.

Fixes openshift#13111
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Feb 9, 2018
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 13, 2018
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
