Sporadic system hang #7425

Closed

nivedita76 opened this issue Apr 11, 2018 · 38 comments
Labels
Status: Stale (No recent activity for issue) · Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@nivedita76
Contributor

nivedita76 commented Apr 11, 2018

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version
Linux Kernel 4.16.1
Architecture x86_64
ZFS Version git 421_gbdc0b4bf1
SPL Version git 32_g581bc01

SPL and ZFS are built into the kernel.

Describe the problem you're observing

The kernel triggers the hung task timeout. The system sort of keeps working after that, but many commands hang.

Describe how to reproduce the problem

This happens when running a full system recompile alongside the Folding@home client; no easy reproducer is known.
hung.log

Include any warning/errors/backtraces from the system logs

@RubenKelevra

Can you provide your settings?

zfs get all $pool/$volume
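
A narrower query also works if the full output is too long; the properties most relevant to the later discussion (sync, checksum) can be listed explicitly, using the same placeholder names:

zfs get sync,checksum,compression,logbias,recordsize $pool/$volume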

@nivedita76
Contributor Author

pool.txt

@RubenKelevra

Might be a skein issue ... did anyone benchmark with skein enabled? I use edonr, which is the fastest apart from fletcher, and fletcher isn't actually very good at detecting multiple bit flips. So edonr should do the trick here.

You might want to provoke the system hang with an I/O benchmark and use atop 1 to see which resource is actually limiting performance.

@rincebrain
Contributor

@nivedita76 Can you share what zpool status looks like? Also, you could try kicking (writing nonzero) to /sys/module/spl/parameters/spl_taskq_kick and it'll try spawning more threads to handle any taskq that's waited for longer than...I forget, 30s?

Is this a regression, or did you have this problem before? If it's a regression, bisecting it would be the most straightforward way to find the problematic area without more of a smoking gun.

SysRq-T's output might also be useful, though that's quite a lot (I think that's the one for all tasks...maybe l or d would be more useful and terse)
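
For reference, a minimal sketch of the suggestions above (run as root; the kernel.sysrq sysctl may need to allow these triggers, and spl_taskq_kick behaves as described, i.e. any nonzero write asks the SPL to spawn extra threads for stalled taskqs):

echo 1 > /sys/module/spl/parameters/spl_taskq_kick   # kick stalled taskqs
echo t > /proc/sysrq-trigger                         # dump state of all tasks to dmesg
echo w > /proc/sysrq-trigger                         # dump only blocked (uninterruptible) tasks
echo l > /proc/sysrq-trigger                         # backtraces of all active CPUs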

@nivedita76
Contributor Author

The docs say skein is supposed to be relatively fast as well.
This is a fairly new pool, only a couple of months old I think, so it's always been on recent ZFS git. I'd been hitting a different race condition that just got fixed, so I only noticed this hang recently.
I'll try to get that other data while it is hung. The pool status itself is clean with no errors. Scrubs don't find anything to fix, so it doesn't seem to corrupt anything.
I'm also trying with lock debugging; I'll post a few things about that once I get home.

@rincebrain
Contributor

@nivedita76 I should have been more clear, sorry. I meant that I wanted to see all of zpool status's output to know whether this is 2 or 50 disks, whether there's a dedicated log device, etc.

@nivedita76
Contributor Author

Ah, sorry. This is a mirror of two NVMe partitions. No dedicated log (SLOG) device.

@RubenKelevra

RubenKelevra commented Apr 12, 2018

Does setting sync=disabled change anything for your issue? Maybe the NVMe drives are slow on synchronous operations.

The docs say skein is supposed to be relatively fast as well.

I know that, but I was more curious whether someone actually tested the implementation; since those are pretty new, that might not be the case. I can only speak for edonr, which is as fast as fletcher in subjectively perceived performance, while sha256 was noticeably slower on slower machines with high throughput.
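
If anyone wants to try the sync suggestion above, a minimal sketch ("tank/data" is a placeholder dataset name; sync=disabled risks losing the most recent writes on a crash, so treat it purely as a diagnostic step):

zfs set sync=disabled tank/data    # test whether synchronous writes are the bottleneck
zfs get sync tank/data             # confirm the setting
zfs set sync=standard tank/data    # revert after testing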

@nivedita76
Contributor Author

This is not a performance problem. The system hung at night and was still hung in the morning. There’s a deadlock or something like that.

@rincebrain
Contributor

@nivedita76 Then you probably want to get the output of SysRq-D and/or L and/or T, though I think your dmesg output says you'd need to build a kernel with CONFIG_LOCKDEP enabled to get D. I guess W would also do pretty well depending on what kind of deadlock it is, since any deadlock held for long enough should log to dmesg about being stuck (more than once), unless there's a "suppressed duplicates" message somewhere.
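
For anyone checking whether their kernel already has this, a quick way to inspect the running config (a sketch; paths assume a distro-style /boot config or /proc/config.gz, and CONFIG_LOCKDEP is normally pulled in indirectly by CONFIG_PROVE_LOCKING or CONFIG_DEBUG_LOCK_ALLOC rather than being set by hand):

grep -E 'CONFIG_(LOCKDEP|PROVE_LOCKING|DEBUG_LOCK_ALLOC)=' /boot/config-$(uname -r)
zcat /proc/config.gz 2>/dev/null | grep -E 'CONFIG_(LOCKDEP|PROVE_LOCKING)='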

@nivedita76
Contributor Author

I'm trying to turn on the lock validator but running into some issues. The first couple I managed to work around with the attached patch.
lockpatch.txt

But I still get a lock inversion warning pretty quickly and then it turns itself off.

kern.log

So for now I turned off the lock validator and am trying to reproduce the hang.

@nivedita76
Contributor Author

I got a list of locks held this time, although I'm not sure I understand it; it shows zl_issuer_lock held by two processes. SysRq-d/w/t/l didn't print anything on the console, not sure why.
hung.log

I forgot to try the taskq kick thing before rebooting unfortunately.

@rincebrain
Contributor

@nivedita76 Hm, it might be useful to have a version of ZFS with --enable-debug and zfs_dbgmsg enabled, since it's not an obvious cause of the deadlock (I mean, other than the tautological one, they're both trying to take the same lock).

@nivedita76
Contributor Author

I do have --enable-debug. Should I just grab the contents of dbgmsg, or do I have to set something to make it more verbose?

@rincebrain
Contributor

You need to set zfs_flags=[integer value corresponding to the bitmask of messages you want], cf. https://github.com/zfsonlinux/zfs/blob/master/include/sys/zfs_debug.h#L46

We could just start with 1 and go from there.
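
Concretely, something like the following should do it (a sketch using the usual ZoL module-parameter and kstat paths; depending on the build, the zfs_dbgmsg_enable parameter may also need to be set if present):

echo 1 > /sys/module/zfs/parameters/zfs_flags   # enable the lowest debug bit (see zfs_debug.h)
cat /proc/spl/kstat/zfs/dbgmsg                  # read the in-kernel debug log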

@rincebrain
Contributor

@nivedita76 also, did you end up needing even more of a patch to get CONFIG_LOCKDEP rolling than the one in #7425 (comment), and if so, could you please share it? I'm helping someone else with a lock issue and would like to use CONFIG_LOCKDEP there too.

@RubenKelevra

This is not a performance problem. The system hung at night and was still hung in the morning. There’s a deadlock or something like that.

Ah, sorry about that. Sounded more like a stall for a minute or several minutes. :)

@nivedita76
Contributor Author

@rincebrain For lockdep, only the bit in zvol.c that changes the zv_suspend_lock to RW_NOLOCKDEP is needed. The other change was to try to get the lock validator (CONFIG_PROVE_LOCKING) to work; it's also nice in that it makes the namelen function consistent with the name function in control flow, but it isn't needed if you don't turn on PROVE_LOCKING. I haven't made any more patches, as I couldn't figure out the next circular dependency it found.

@behlendorf
Contributor

I'd suggest leaving CONFIG_LOCKDEP disabled; the ZFS code base hasn't been updated to fully integrate with the kernel's lock checker, so it incorrectly reports some problems. If anyone's interested in improving this, there are some open issues.

@nivedita76
Contributor Author

hang.log

More detailed log from the hang this time after a few alt-sysrq's.

@nivedita76
Contributor Author

kern.log
dbgmsg.log

Another set of logs. This is the full dmesg from boot-up. kworker/u178:1 is spinning at 100% CPU usage with the following backtrace.

[49375.616381] CPU: 82 PID: 4924 Comm: kworker/u178:1 Tainted: G        W        4.16.2-stable-rani-zfs+ #16
[49375.616386] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D8 WS/Z10PE-D8 WS, BIOS 3407 03/10/2017
[49375.616395] Workqueue: writeback wb_workfn (flush-zfs-1)
[49375.616407] RIP: 0010:reacquire_held_locks+0xc6/0x160
[49375.616409] RSP: 0018:ffffa37ce3ca7640 EFLAGS: 00000086
[49375.616413] RAX: ffffffffbae25460 RBX: ffff9d5bcb81b370 RCX: 0000000000000006
[49375.616415] RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff9d5bcb81a9c0
[49375.616416] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[49375.616418] R10: ffffa37ce3ca7678 R11: 0000000000000000 R12: 0000000000000000
[49375.616420] R13: ffffffffb83a5dbe R14: 0000000000000000 R15: 0000000000000000
[49375.616423] FS:  0000000000000000(0000) GS:ffff9d5d3ea00000(0000) knlGS:0000000000000000
[49375.616425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[49375.616427] CR2: 0000000801450000 CR3: 00000001e0412002 CR4: 00000000003626e0
[49375.616429] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[49375.616431] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[49375.616433] Call Trace:
[49375.616442]  ? find_held_lock+0x8c/0xa0
[49375.616449]  ? dmu_zfetch+0x6d4/0x7d0
[49375.616452]  lock_release+0x113/0x420
[49375.616457]  __mutex_unlock_slowpath+0x4b/0x2b0
[49375.616461]  dmu_zfetch+0x6d4/0x7d0
[49375.616466]  dbuf_read+0xb99/0xfa0
[49375.616470]  dnode_hold_impl+0x193/0xf30
[49375.616475]  dmu_bonus_hold_impl+0x51/0x320
[49375.616482]  zfs_zget+0x12a/0x390
[49375.616487]  zfs_get_data+0x57/0x2a0
[49375.616490]  zil_commit_impl+0x125d/0x1670
[49375.616496]  zpl_writepages+0xc8/0x140
[49375.616503]  do_writepages+0x61/0x80
[49375.616506]  __writeback_single_inode+0x65/0x460
[49375.616508]  writeback_sb_inodes+0x22f/0x520
[49375.616512]  wb_writeback+0x13b/0x390
[49375.616515]  ? wb_workfn+0xd8/0x520
[49375.616517]  wb_workfn+0xd8/0x520
[49375.616522]  ? sched_clock_cpu+0xc/0xb0
[49375.616524]  ? sched_clock_cpu+0xc/0xb0
[49375.616526]  ? __lock_acquire.isra.32+0x169/0x830
[49375.616533]  ? process_one_work+0x20f/0x560
[49375.616535]  process_one_work+0x2a6/0x560
[49375.616539]  worker_thread+0x37/0x3f0
[49375.616541]  ? process_one_work+0x560/0x560
[49375.616545]  kthread+0x121/0x140
[49375.616547]  ? kthread_create_worker_on_cpu+0x70/0x70
[49375.616551]  ? SyS_exit_group+0x10/0x10
[49375.616553]  ret_from_fork+0x35/0x40

@nivedita76
Contributor Author

It seems to be doing something as the backtrace keeps changing a little bit, but it's been at it for hours now.
kern1.log

@rincebrain
Contributor

This looks like the same thing as #7038, maybe. If it's changing, kicking spl_taskq_kick might do something, but I'm not extremely hopeful.

@nivedita76
Contributor Author

Nope, that didn't help. The dbgmsg shows txg_sync_thread waiting on a particular txg?

@nivedita76
Contributor Author

The txg that dbgmsg says it is waiting for is 423376, but that number doesn't show up in the pool's txgs kstat.

dbgmsg:

1523739079   txg.c:510:txg_sync_thread(): waiting; tx_synced=424668 waiting=423376 dp=00000000525bb20a

txgs:

26 0 0x01 100 11200 17826919751 16540461236723
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
424575   15009013092138   C     5220352      0            3850240      0        315      15359895794  124128       52598        24150814    
424576   15024372987932   C     22661120     0            20103168     0        415      15359933573  96968        40741        42284944    
424577   15039732921505   C     22275072     0            17588224     0        673      15359929360  149098       53047        62332157    
424578   15055092850865   C     7988224      0            5668864      0        323      15359917805  105688       41317        27930076    
424579   15070452768670   C     2933760      0            2883584      0        290      15359873452  74497        31344        14819390    
424580   15085812642122   C     2651648      0            2686976      0        273      15359934063  71327        33636        13635385    
424581   15101172576185   C     2640896      0            2473984      0        293      15359890074  68730        29883        13533216    
424582   15116532466259   C     4107776      0            3579904      0        266      15359905940  68232        31542        15676176    
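
For reference, the txg history above comes from the per-pool kstat, with the pool name substituted for the placeholder:

cat /proc/spl/kstat/zfs/<pool>/txgs   # per-txg timing and dirty-data history
cat /proc/spl/kstat/zfs/dbgmsg        # the debug log quoted above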

@AndiDog

AndiDog commented Jul 8, 2018

Is this similar? Using FreeBSD 11.2-RELEASE-p0. Came here because zil_commit_impl shows up in my backtrace as well.

# ps -p 69994
  PID TT  STAT     TIME COMMAND
69994  -  SJ   40:08.73 /usr/local/bin/python3.6 /usr/local/bin/buildbot start --nodaemon /var/buildbot-master

# procstat -kk 69994
  PID    TID COMM                TDNAME              KSTACK
69994 100411 python3.6           -                   mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_timedwait_sig+0x14 _cv_timedwait_sig_sbt+0x18c seltdwait+0x7b kern_poll+0x3fd sys_poll+0x50 amd64_syscall+0xa38 fast_syscall_common+0x101
69994 101219 python3.6           -                   mi_switch+0xe6 sleepq_catch_signals+0x40c sleepq_wait_sig+0xf _sleep+0x231 umtxq_sleep+0x143 do_sem2_wait+0x68a __umtx_op_sem2_wait+0x4b amd64_syscall+0xa38 fast_syscall_common+0x101
69994 101224 python3.6           -                   mi_switch+0xe6 sleepq_wait+0x2c _sx_xlock_hard+0x306 zil_commit_impl+0x11d zfs_freebsd_putpages+0x635 VOP_PUTPAGES_APV+0x82 vnode_pager_putpages+0x8e vm_pageout_flush+0xea vm_object_page_collect_flush+0x213 vm_object_page_clean+0x146 vm_object_terminate+0x93 zfs_freebsd_reclaim+0x1e VOP_RECLAIM_APV+0x82 vgonel+0x208 vrecycle+0x4a zfs_freebsd_inactive+0xd VOP_INACTIVE_APV+0x82 vinactive+0xfc

This happens every few days for me, and luckily it's only a developer-facing CI instance in AWS EC2. The respective process is jailed. Running sync hangs, but other things like SSH and non-disk features of Buildbot are "working".

@avkarenow

11.2-RELEASE-p2, the same problem with zilog->zl_issuer_lock (D process state). Any chance to resolve this issue soon?

@rincebrain
Contributor

For the FBSD people in here, you may be interested in FreeBSD bug 229614, ZFS lockup in zil_commit_impl, which had a fix merged for releng/11.2 in r342226.

For the ZoL people here, I don't necessarily think this is the same bug in need of some kind of port of that patch, I'm only reasonably certain the FBSD users were hitting that specific issue.

@ptempier

ptempier commented Dec 31, 2018

I am getting a similar issue here. The server is:
Linux xxx 4.14.87-1-lts #1 SMP Sun Dec 9 11:37:14 CET 2018 x86_64 GNU/Linux (Arch Linux)
PowerEdge FC430, acting as a Rancher host (Docker), running the UrBackup server and client plus Zabbix (it's a backup server).
The storage is a PERC FD33xD with 8 drives in RAID 6, with no errors in the logs.

ZFS: Loaded module v0.7.12-1, ZFS pool version 5000, ZFS filesystem version 5
spl version: 0.7.12-1

We mostly have this issue on our 2 backup servers; I think I've seen the same issue only twice on other Rancher hosts. But we end up rebooting these 2 servers every month, because apps can't write to disk and become stuck (sync won't return).

The other main symptom is a kworker thread stuck at 100% CPU (for example kworker/u145:0).

echo l > /proc/sysrq-trigger ; echo w > /proc/sysrq-trigger

Some dmesg from a previous hang:
prev.log

same "stuck" , different time
dmesg_next.txt

same "stuck" , different time
dmesg.txt

In one it's stuck in znode hold enter, in another in znode hold exit:
[lun. déc. 31 15:06:52 2018] ? zfs_znode_hold_enter+0x117/0x170 [zfs]
[1648414.428469] zfs_znode_hold_exit+0xbf/0x120 [zfs]

zpool status
pool: zvol
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
zvol ONLINE 0 0 0
sda2 ONLINE 0 0 0
errors: No known data errors

zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
zvol 6,31T 5,78T 541G - 76% 91% 1.00x ONLINE -

Flame graph (perf-kernel):
perf-kernel.svg.gz

top - 14:38:34 up 19 days, 4:37, 1 user, load average: 134,24, 132,96, 130,10
Tasks: 1084 total, 2 running, 1081 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0,0/1,4 1[| ]
GiB Mem : 52,3/62,8 [ ]
GiB Swap: 0,0/0,0 [ ]

PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
68505 root 20 0 0,0m 0,0m 100,0 0,0 494:46.77 R kworker/u145:0
12461 root 20 0 2443,2m 28,6m 1,3 0,0 3:24.79 S scheduler
70990 root 20 0 15,4m 5,6m 1,3 0,0 0:00.42 R top
5658 root 20 0 6040,5m 91,8m 0,7 0,1 141:00.70 S dockerd
1 root 20 0 187,4m 9,5m 0,0 0,0 0:29.29 S systemd

cat /proc/meminfo
MemTotal: 65866980 kB
MemFree: 26736328 kB
MemAvailable: 31372616 kB
Buffers: 467272 kB
Cached: 3538692 kB
SwapCached: 0 kB
Active: 13561692 kB
Inactive: 1159308 kB
Active(anon): 10724036 kB
Inactive(anon): 19044 kB
Active(file): 2837656 kB
Inactive(file): 1140264 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 344 kB
Writeback: 0 kB
AnonPages: 10709208 kB
Mapped: 2089216 kB
Shmem: 28036 kB
Slab: 20487688 kB
SReclaimable: 1288620 kB
SUnreclaim: 19199068 kB
KernelStack: 33776 kB
PageTables: 50392 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 32933488 kB
Committed_AS: 11471484 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 7153664 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 34028272 kB
DirectMap2M: 32985088 kB
DirectMap1G: 2097152 kB

@drescherjm

kworker thread stuck at 100%cpu (example kworker/u145:0 )

I have seen that a few times on 2 different servers, triggered by many simultaneous file operations. One way to trigger it on Gentoo is emerging several big applications on a 16-thread CPU with the zpool on an NVMe drive.

@denizzzka

I'm hitting this roughly every month on different 64-bit systems.

Please provide a way to capture logs! I can capture output over RS232 around the clock.

@rincebrain
Contributor

Everyone commenting in here, could you please include the same information that the issue template requests and is in the first post (so distro name+version, SPL+ZFS version, kernel version and architecture), and ideally the debugging information I suggested the original reporter get, to be sure it's the same or a similar problem?

(I'm not a particularly consistent contributor around here, and I don't think I'll have the time to look into this anytime soon, I'm just trying to make sure that whoever looks at it has enough information to move forward.)
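
As a rough checklist (a hedged sketch; exact paths vary by platform, and on builds where ZFS is compiled into the kernel some of these sysfs files may not exist):

uname -a                                         # kernel version and architecture
cat /sys/module/zfs/version 2>/dev/null          # ZFS module version, if exposed
cat /sys/module/spl/version 2>/dev/null          # SPL module version, if exposed
zpool status -v                                  # pool layout and error counters
dmesg | grep -i -E 'hung task|blocked|zfs|spl'   # hung-task warnings and stack traces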

@nivedita76
Contributor Author

nivedita76 commented Jan 6, 2019 via email

@ptempier

ptempier commented Mar 7, 2019

I am now pretty sure this is linked to multiple syncs happening at the same time. I lowered the parallel-backups parameter in UrBackup from 20 to 10, and I now have 43 days of uptime, which is a record for me.

The way it works is that once a backup is finished you end up with a folder containing all the files and an SQLite file for the file index; to consider a backup finished, it issues a sync on the related files (it may actually be just the SQLite file).

By lowering parallel backups from 20 to 10, it can no longer happen that more than 10 different parts of the filesystem get a sync call at the same time.

@nivedita76
Contributor Author

That bug report does look very similar. Looks like the check they added back was dropped in 9870149#diff-1896218f0934fd36ff5e067c93464143 for ZoL.


Bumping. Did anyone check if the FreeBSD patch helps? It's been a while since I had a hang myself.

@Orfheo

Orfheo commented May 21, 2019

Bumping. Did anyone check if the FreeBSD patch helps? It's been a while since I had a hang myself.

I'm wondering about testing a patch for a bug, if it actually is a ZFS bug, which pops up on my system a couple of times per year ;-(

A question.

The systems I own with ZFS installed that have had a "sync hanging event" are at the same time ZFS storage servers and run a desktop GUI, KDE Plasma 5 in my case. Other pure ZFS storage systems, running under the very same Gentoo stable setup and emerged once a month, have never had a single sync hang.

Just a coincidence? Anyone here share this pattern?

@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 24, 2020
@stale stale bot removed the Status: Stale No recent activity for issue label Aug 29, 2020
@stale

stale bot commented Aug 29, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 29, 2021
@stale stale bot closed this as completed Nov 27, 2021