cephcsi-cephfs nodeplugin pod got OOM killing when using FUSE mounter #554

Closed
yydzhou opened this issue Aug 14, 2019 · 15 comments

Labels: bug (Something isn't working), component/cephfs (Issues related to CephFS), wontfix (This will not be worked on)

Comments


yydzhou commented Aug 14, 2019

Describe the bug

When using the CephFS FUSE mounter in the CephFS CSI driver, the cephcsi-cephfs nodeplugin pod very easily gets killed due to OOM. I have tried a 256M and then a 1G memory limit (as sketched below), but the issue still happens.
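For reference, this is the kind of memory limit I mean on the nodeplugin container that runs ceph-fuse. The container name and the request value are assumptions following the upstream cephfs nodeplugin DaemonSet, so treat it as an illustrative sketch rather than a recommendation:

  # Illustrative resources stanza for the cephfs nodeplugin DaemonSet container.
  # Container name and request value are assumptions; only the limit values reflect what I tried.
  containers:
    - name: csi-cephfsplugin            # name as in the upstream DaemonSet (assumed)
      image: quay.io/cephcsi/cephfsplugin:v1.0.0
      resources:
        limits:
          memory: 1Gi                   # tried 256Mi first, then 1Gi; the OOM kill happens either way
        requests:
          memory: 512Mi                 # illustrative value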


Environment details

  • Image/version of Ceph CSI driver
    image: quay.io/cephcsi/cephfsplugin:v1.0.0
  • helm chart version
  • Kubernetes cluster version
    1.13.4
  • Logs

Steps to reproduce

Steps to reproduce the behavior:
Deploy Ceph and then the CephFS CSI driver plus a StorageClass with mounter: fuse (a sketch of such a StorageClass follows below).
Then create multiple CephFS PVs/PVCs and consume them with pods.

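As mentioned above, here is a sketch of the StorageClass used. Everything except mounter: fuse is a placeholder, and the provisioner/secret parameter names follow the ceph-csi cephfs examples, so they may differ slightly per chart version:

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: csi-cephfs                    # placeholder name
  provisioner: cephfs.csi.ceph.com      # driver name as in the ceph-csi examples (assumed for this chart)
  parameters:
    monitors: <mon1:6789>,<mon2:6789>   # placeholder
    pool: cephfs_data                   # placeholder
    mounter: fuse                       # the relevant bit: forces ceph-fuse instead of the kernel client
    csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret       # placeholder
    csi.storage.k8s.io/provisioner-secret-namespace: default
    csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret        # placeholder
    csi.storage.k8s.io/node-stage-secret-namespace: default
  reclaimPolicy: Delete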

Actual results

[167475.834265] Hardware name: RDO OpenStack Compute, BIOS 1.11.0-2.el7 04/01/2014
[167475.837017] Call Trace:
[167475.838397] [<ffffffffb390e78e>] dump_stack+0x19/0x1b
[167475.840546] [<ffffffffb390a110>] dump_header+0x90/0x229
[167475.842691] [<ffffffffb34d805b>] ? cred_has_capability+0x6b/0x120
[167475.845070] [<ffffffffb3397c44>] oom_kill_process+0x254/0x3d0
[167475.847349] [<ffffffffb34d813e>] ? selinux_capable+0x2e/0x40
[167475.849595] [<ffffffffb340f326>] mem_cgroup_oom_synchronize+0x546/0x570
[167475.852109] [<ffffffffb340e7a0>] ? mem_cgroup_charge_common+0xc0/0xc0
[167475.854595] [<ffffffffb33984d4>] pagefault_out_of_memory+0x14/0x90
[167475.856997] [<ffffffffb3908232>] mm_fault_error+0x6a/0x157
[167475.859178] [<ffffffffb391b8c6>] __do_page_fault+0x496/0x4f0
[167475.861395] [<ffffffffb391ba06>] trace_do_page_fault+0x56/0x150
[167475.863692] [<ffffffffb391af92>] do_async_page_fault+0x22/0xf0
[167475.865966] [<ffffffffb39177b8>] async_page_fault+0x28/0x30
[167475.868162] Task in /kubepods/burstable/pod6c6d1804-bd30-11e9-9fb1-fa163e63ddad/725f282aadbd6b6c9b540a8d64ce719560cca6dd9db685e52538f4012f08061f killed as a result of limit of /kubepods/burstable/pod6c6d1804-bd30-11e9-9fb1-fa163e63ddad/725f282aadbd6b6c9b540a8d64ce719560cca6dd9db685e52538f4012f08061f
[167475.877505] memory: usage 1048576kB, limit 1048576kB, failcnt 54
[167475.879928] memory+swap: usage 1048576kB, limit 1048576kB, failcnt 0
[167475.882395] kmem: usage 10440kB, limit 9007199254740988kB, failcnt 0
[167475.884797] Memory cgroup stats for /kubepods/burstable/pod6c6d1804-bd30-11e9-9fb1-fa163e63ddad/725f282aadbd6b6c9b540a8d64ce719560cca6dd9db685e52538f4012f08061f: cache:0KB rss:1038136KB rss_huge:51200KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:1038064KB inactive_file:0KB active_file:0KB unevictable:0KB
[167475.975145] [ pid ]   uid  tgid total_vm    rss nr_ptes swapents oom_score_adj name
[167475.977958] [30026]     0 30026    32644   3917      24        0           994 cephcsi-cephfs
[167475.980763] [32479]     0 32479   407518  83034     252        0           994 ceph-fuse
[167475.983480] [19628]     0 19628   525737 176645     467        0           994 ceph-fuse
[167475.986162] Memory cgroup out of memory: Kill process 23228 (ceph-fuse) score 1648 or sacrifice child
[167475.989086] Killed process 19628 (ceph-fuse) total-vm:2102948kB, anon-rss:700684kB, file-rss:5896kB, shmem-rss:0kB

Expected behavior

The nodeplugin should be more stable and should provide an option to control how much memory the FUSE mounts are allowed to use (see the sketch below for the kind of knob I mean).
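For example, a knob that lets the FUSE clients pick up cache limits from a ceph.conf would already help. A minimal sketch, assuming the nodeplugin reads a mounted /etc/ceph/ceph.conf (the ConfigMap name and mount are illustrative and not verified for the v1.0.0 chart); client_cache_size and client_oc_size are standard Ceph client options:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: ceph-config                   # illustrative name
  data:
    ceph.conf: |
      [client]
      client_cache_size = 8192          # cap the metadata (inode) cache; Ceph default is 16384
      client_oc_size = 104857600        # cap the object-cacher data cache to 100 MiB; default is ~200 MiB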


yydzhou changed the title from "cephcsi-cephfs nodeplugin pod got killed when using FUSE mounter" to "cephcsi-cephfs nodeplugin pod got OOM killing when using FUSE mounter" Aug 14, 2019
ShyamsundarR (Contributor) commented:

@yydzhou Is the test only to create pods that consume the PVCs, or do the pods also perform I/O or other filesystem operations?

cc @ajarr, can you help understand the behavior here?

yydzhou (Author) commented Aug 14, 2019

Yes, the pods will also perform I/O and filesystem operations.

Madhu-1 (Collaborator) commented Aug 16, 2019

@ajarr @poornimag PTAL

ajarr added the bug (Something isn't working) and component/cephfs (Issues related to CephFS) labels Aug 16, 2019
ajarr (Contributor) commented Aug 16, 2019

@yydzhou, what is the version of the Ceph cluster and of the FUSE client (ceph-fuse)? The exact version would be helpful to know; is it 14.2.x?

After how many PVCs consumed by pods doing I/O do you hit the OOM?

ajarr (Contributor) commented Aug 16, 2019

@ShyamsundarR, are 256M and 1G typical memory limit settings for the CephFS and RBD node plugins?

ShyamsundarR (Contributor) commented:

> @ShyamsundarR, are 256M and 1G typical memory limit settings for the CephFS and RBD node plugins?

I am not aware of the memory constraints or usage of RBD. I think we need to understand this better.

For example, switching to kernel cephfs and/or krbd instead of rbd-nbd may not charge the memory overhead to the container namespace (in this case the nodeplugin), and would further let the kernel manage space reclamation based on usage. I think this may be a better direction in the longer run (unless I have misunderstood rbd-nbd, i.e. where the blocks are cached).

The issue may come down to how to let CephFS FUSE know how much memory it has available for cached data (beyond the usual must-have memory footprint), whether we can share that budget across the various FUSE mounts, or whether we need to think about a single FUSE mount to control this consumption (i.e. #476).

cc @dillaman
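(For reference: the switch to the kernel CephFS client discussed above is selected with the same StorageClass parameter shown earlier; a minimal sketch with the other fields unchanged, keeping in mind that the reporter later explains a 3.10 kernel is the reason FUSE is used here:)

  parameters:
    mounter: kernel                     # in-kernel CephFS client; its page cache is managed by the
                                        # kernel rather than charged to the nodeplugin pod, as discussed above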

yydzhou (Author) commented Aug 16, 2019

@ajarr The Ceph version is 13.2.4. The OOM happens when mounting the 3rd PV on the node. I am using the cephfs-csi chart release 1.0, so I assume the ceph-fuse version is 14.2.x? Once a PV is mounted, the related pods will issue some I/O against it, but not that much, because the failure happens during the deployment of our cluster.

yydzhou (Author) commented Aug 16, 2019

JFYI, using ceph-fuse mounting is to support CentOS with an old kernel (3.10). The external cephfs provisioner has an option to disable RADOS pool namespace isolation to allow old-kernel mounts (ref kubernetes-retired/external-storage@4fefaf6#diff-3ccb4687fb599e0453570308087f8252), but cephfs-csi does not seem to support that. So we have to use FUSE mounting until we have upgraded all our kernel versions.

ajarr (Contributor) commented Aug 16, 2019

> [quoting @ShyamsundarR's comment above in full]

@batrick FYI

ajarr (Contributor) commented Aug 16, 2019

> @ajarr The Ceph version is 13.2.4. The OOM happens when mounting the 3rd PV on the node.

This is surprising. Not sure how other CephFS CSI v1.0.0 users didn't hit this.

> I am using the cephfs-csi chart release 1.0, so I assume the ceph-fuse version is 14.2.x? Once a PV is mounted, the related pods will issue some I/O against it, but not that much, because the failure happens during the deployment of our cluster.

This is helpful information. I'll try tracking down the issue.

Madhu-1 (Collaborator) commented Sep 11, 2019

I have also seen this issue: with more I/O, the memory consumption of ceph-fuse increases.
@poornimag can you confirm?

ajarr (Contributor) commented Sep 11, 2019

@joscollin can you also take a look?

rochaporto commented:

Any news on this one? We're seeing the same issue; even nodes with no PVs mounted go OOM after a couple of days. If we drop the resource requests, the nodes eventually die.

stale bot commented Oct 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label Oct 4, 2020
stale bot commented Oct 12, 2020

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
