unable to handle kernel paging request #298

Closed
mevzosvlad opened this issue Jun 25, 2019 · 2 comments

Comments

@mevzosvlad

mevzosvlad commented Jun 25, 2019

This started happening after upgrading from 0.13 to the latest master a couple of days ago.

When I provision a new instance in AWS and new metrics start being created, go-carbon gets stuck in the running state after some time. If I try to stop it, the whole instance hangs and I have to restart it.
There is still plenty of free RAM available.

There are about 10M unique metrics in storage. The incoming flow of metrics is about 150k/sec.
Storage: mdraid stripe + 4 ephemeral drives
OS: Ubuntu 14.04
EC2: i3.8xlarge

Related parts of gocarbon.conf:

[common]
max-cpu = 8
[whisper]
workers = 8
max-updates-per-second = 15000
max-creates-per-second = 2000
[carbonserver]
scan-frequency = "60m0s"

Here is the stack trace from /var/log/messages:

 kernel: [ 4363.116940] BUG: unable to handle kernel paging request at ffffeafffffffff0
 kernel: [ 4363.121965] IP: [<ffffffff8173ebee>] _raw_spin_lock+0xe/0x50
 kernel: [ 4363.125916] PGD 0
 kernel: [ 4363.127392] Oops: 0002 [#1] SMP
 kernel: [ 4363.129580] Modules linked in: dm_crypt serio_raw isofs raid10 raid456 async_memcpy async_raid6_recov async_pq async_xor async_tx xor raid6_pq raid1 multipath linear raid0 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse floppy ena nvme
 kernel: [ 4363.149542] CPU: 5 PID: 3201 Comm: go-carbon Not tainted 3.13.0-158-generic #208-Ubuntu
 kernel: [ 4363.154199] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
 kernel: [ 4363.157969] task: ffff883c2f428000 ti: ffff883c2f576000 task.ti: ffff883c2f576000
 kernel: [ 4363.163068] RIP: 0010:[<ffffffff8173ebee>]  [<ffffffff8173ebee>] _raw_spin_lock+0xe/0x50
 kernel: [ 4363.170423] RSP: 0000:ffff883c2f577d90  EFLAGS: 00010206
 kernel: [ 4363.175647] RAX: 0000000000020000 RBX: 000000cea3202000 RCX: 00003ffffffff000
 kernel: [ 4363.182540] RDX: ffff880000000010 RSI: f000c00000000f53 RDI: ffffeafffffffff0
 kernel: [ 4363.189536] RBP: ffff883c2f577d90 R08: f000ff53f000ff53 R09: 00000000000000a9
 kernel: [ 4363.196411] R10: ffff883d7fffbf00 R11: 000000ffffffffc0 R12: ffff883ad300ca80
 kernel: [ 4363.204094] R13: ffff882ac806b8c8 R14: ffff883c17ad2a00 R15: 0000000000000000
 kernel: [ 4363.212002] FS:  00007feb137fe700(0000) GS:ffff883c8a6a0000(0000) knlGS:0000000000000000
 kernel: [ 4363.221488] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: [ 4363.228149] CR2: ffffeafffffffff0 CR3: 0000003c0c67e000 CR4: 0000000000160670
 kernel: [ 4363.236111] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 kernel: [ 4363.244316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 kernel: [ 4363.252503] Stack:
 kernel: [ 4363.255214]  ffff883c2f577e20 ffffffff8118037e 0000000000000079 ffff883c2f577e70
 kernel: [ 4363.265492]  ffffffff8161ce84 ffff883c2f577df8 f000c00000000f53 ffffffff000000a9
 kernel: [ 4363.275390]  f000ff53f000ff53 ffff880000000010 ffffeafffffffff0 0000008000000000
 kernel: [ 4363.286280] Call Trace:
 kernel: [ 4363.289977]  [<ffffffff8118037e>] handle_mm_fault+0x3ee/0xeb0
 kernel: [ 4363.297426]  [<ffffffff8161ce84>] ? sock_aio_read.part.5+0x104/0x120
 kernel: [ 4363.305303]  [<ffffffff81743183>] __do_page_fault+0x183/0x570
 kernel: [ 4363.312145]  [<ffffffff8161cec1>] ? sock_aio_read+0x21/0x30
 kernel: [ 4363.319112]  [<ffffffff811c3950>] ? do_sync_read+0x60/0xa0
 kernel: [ 4363.325444]  [<ffffffff8174358a>] do_page_fault+0x1a/0x70
 kernel: [ 4363.333275]  [<ffffffff8173f6e8>] page_fault+0x28/0x30
 kernel: [ 4363.339188] Code: 00 00 55 48 89 e5 f0 81 07 00 00 10 00 48 89 f7 57 9d 0f 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7
 kernel: [ 4363.372047] RIP  [<ffffffff8173ebee>] _raw_spin_lock+0xe/0x50
 kernel: [ 4363.376840]  RSP <ffff883c2f577d90>
 kernel: [ 4363.379912] CR2: ffffeafffffffff0
 kernel: [ 4363.385724] ---[ end trace fddd0f2f0f756e50 ]---

There are no limits on the process itself:

cat /proc/3038/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1966789              1966789              processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1966789              1966789              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Let me know if you need any extra details.

@Civil
Member

Civil commented Jun 25, 2019

This is a kernel bug, not a go-carbon one.

According to the trace, it happens in socket aio operations. So my advice is to either open a bug at bugzilla.kernel.org (or in your distro's bugzilla), or to try upgrading the kernel to the latest stable version first.

P.S. 3.13 is VERY old (it was released in early 2014, more than 5 years ago), so definitely consider upgrading.
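
For anyone reading a similar oops, here is a minimal sketch of pulling the frames out of the "Call Trace:" section. The helper name `parse_call_trace` and the regex are illustrative assumptions, not part of any kernel tooling. One thing worth knowing: the kernel prefixes a frame with `?` when the address was found on the stack but could not be confirmed as part of the call chain, so the sketch keeps that flag per frame.

```python
import re

# Matches call-trace lines such as:
#   kernel: [ 4363.289977]  [<ffffffff8118037e>] handle_mm_fault+0x3ee/0xeb0
# The optional "? " marks a frame the kernel could not verify.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{16})>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_call_trace(lines):
    """Return (symbol, reliable) pairs for the frames after 'Call Trace:'.

    Hypothetical helper for illustration only.
    """
    frames = []
    in_trace = False
    for line in lines:
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match is None:
            break  # the first non-frame line (e.g. "Code: ...") ends the trace
        reliable = match.group("unreliable") is None
        frames.append((match.group("symbol"), reliable))
    return frames


# Excerpt copied from the trace in this issue.
SAMPLE = """\
 kernel: [ 4363.286280] Call Trace:
 kernel: [ 4363.289977]  [<ffffffff8118037e>] handle_mm_fault+0x3ee/0xeb0
 kernel: [ 4363.297426]  [<ffffffff8161ce84>] ? sock_aio_read.part.5+0x104/0x120
 kernel: [ 4363.305303]  [<ffffffff81743183>] __do_page_fault+0x183/0x570
 kernel: [ 4363.312145]  [<ffffffff8161cec1>] ? sock_aio_read+0x21/0x30
 kernel: [ 4363.319112]  [<ffffffff811c3950>] ? do_sync_read+0x60/0xa0
 kernel: [ 4363.325444]  [<ffffffff8174358a>] do_page_fault+0x1a/0x70
 kernel: [ 4363.333275]  [<ffffffff8173f6e8>] page_fault+0x28/0x30
 kernel: [ 4363.339188] Code: 00 00 55 48 89 e5
""".splitlines()

for symbol, reliable in parse_call_trace(SAMPLE):
    print(f"{symbol:24} {'verified' if reliable else 'unverified (?)'}")
```

On the trace from this issue, the verified frames are the page-fault path (`handle_mm_fault`, `__do_page_fault`, `do_page_fault`, `page_fault`), while the `sock_aio_read` and `do_sync_read` entries carry the `?` marker.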

@mevzosvlad
Author

Upgrading to kernel 4.4 helped.
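
A rough sketch of a pre-flight check that flags hosts still running an older kernel before go-carbon is deployed on them. The function names are hypothetical, and the 4.4 threshold is only an assumption taken from what worked in this thread; it is not a confirmed boundary for the underlying kernel fix.

```python
import platform
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a string such as '3.13.0-158-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=(4, 4)):
    """True if the release is at or above `minimum` (assumed threshold)."""
    return parse_kernel_release(release) >= minimum


def running_kernel_ok(minimum=(4, 4)):
    """Check the kernel of the current host via platform.release()."""
    return kernel_at_least(platform.release(), minimum)


# The kernel from the oops in this issue fails the check.
print(kernel_at_least("3.13.0-158-generic"))
```

Calling `running_kernel_ok()` on the target host applies the same comparison to the live kernel.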
