Sporadic system hang #7425

System information
SPL/ZFS are built into the kernel.

Describe the problem you're observing
The kernel triggers the hung-task timeout. The system sort of works after that, but many commands hang.

Describe how to reproduce the problem
This happens when running a full system recompile along with a Folding@home client; no easy reproducer is known.

Include any warning/errors/backtraces from the system logs
hung.log
Comments
Can you provide your settings?
Might be a skein issue... did anyone benchmark with skein enabled? I use edonr, which is the fastest apart from fletcher, and fletcher isn't actually very good at detecting multiple bitflips, so edonr should do the trick here. You might want to provoke the system hang with an I/O benchmark and take a look at
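As a rough sketch of such a benchmark (the dataset name `tank/bench` and the fio parameters are placeholders, not from this thread):

```sh
# Compare checksum algorithms on a scratch dataset.
zfs create tank/bench
for cksum in fletcher4 sha256 skein edonr; do
    zfs set checksum=$cksum tank/bench
    fio --name=cksum-$cksum --directory=/tank/bench \
        --rw=write --bs=1M --size=2G --numjobs=4 --group_reporting
    rm -f /tank/bench/cksum-$cksum.*
done
zfs destroy tank/bench
```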
@nivedita76 Can you share what zpool status looks like? Also, you could try kicking (writing a nonzero value to) /sys/module/spl/parameters/spl_taskq_kick, and it'll try spawning more threads to handle any taskq that's waited for longer than... I forget, 30s? Is this a regression, or did you have this problem before? If it's a regression, bisecting would be the most straightforward way to find the problematic area without more of a smoking gun. SysRq-T's output might also be useful, though that's quite a lot (I think that's the one for all tasks... maybe l or d would be more useful and terse)
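For reference, a sketch of both suggestions as root-shell commands (the parameter path is the one named above):

```sh
# Full pool layout and health.
zpool status -v

# Kick the SPL taskq logic: any nonzero write asks it to spawn extra
# threads for taskqs whose entries have been waiting too long.
echo 1 > /sys/module/spl/parameters/spl_taskq_kick
```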
The docs say skein is supposed to be relatively fast as well.
@nivedita76 I should have been more clear, sorry. I meant that I wanted to see all of zpool status's output to know whether this is 2 or 50 disks, whether there's a dedicated log device, etc.
Ah, sorry. This is a mirror of two NVMe partitions. No dedicated log device (SLOG).
Does setting sync=disabled change anything for your issue? Maybe the NVMes are slow on synchronous operations.
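A minimal sketch of testing that, assuming a dataset named `tank/data` (sync=disabled risks losing recent writes on power failure, so use it only as a diagnostic):

```sh
zfs get sync tank/data          # note the current value
zfs set sync=disabled tank/data # skip synchronous write semantics entirely
# ...try to reproduce the hang...
zfs set sync=standard tank/data # restore the default behaviour
```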
I know that, but I was more curious whether someone actually tested the implementation. Since those are pretty new, this might not be the case. I can only speak for edonr, which is as fast as fletcher in subjectively perceived performance, while sha256 was noticeably slower on slower machines with high throughput.
This is not a performance problem. The system hung at night and was still hung in the morning. There's a deadlock or something like that.
@nivedita76 Then you probably want to get the output of SysRq-D and/or L and/or T, though I think your dmesg output says you'd need to build a kernel with CONFIG_LOCKDEP enabled to get D. I guess W would also do pretty well depending on what kind of deadlock it is, since any task stuck long enough should log to dmesg about being blocked (more than once), unless there's a "suppressed duplicates" message somewhere.
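A sketch of gathering those dumps from a root shell, using the /proc interface instead of the keyboard:

```sh
echo 1 > /proc/sys/kernel/sysrq   # make sure all SysRq functions are enabled

echo d > /proc/sysrq-trigger      # all locks held (needs CONFIG_LOCKDEP)
echo w > /proc/sysrq-trigger      # tasks in uninterruptible (blocked) state
echo l > /proc/sysrq-trigger      # backtraces of all active CPUs
echo t > /proc/sysrq-trigger      # state of every task (very verbose)

dmesg > sysrq-dump.txt            # the output lands in the kernel log
```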
I'm trying to turn on the lock validator but running into some issues. The first couple I managed to work around with the attached patch, but I still get a lock-inversion warning pretty quickly, and then it turns itself off. So for now I've turned off the lock validator and am trying to reproduce the hang.
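For anyone following along, a sketch of enabling the lock validator in a kernel build; the config option names are the standard upstream ones, and scripts/config ships with the kernel source:

```sh
# From the kernel source tree, with an existing .config:
scripts/config --enable DEBUG_LOCK_ALLOC \
               --enable PROVE_LOCKING \
               --enable LOCK_STAT
make olddefconfig
make -j"$(nproc)"
```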
I got a list of locks held this time, although I'm not sure I understand it; it shows zl_issuer_lock held by two processes. SysRq-d/w/t/l didn't print anything to the console, not sure why. Unfortunately, I forgot to try the taskq-kick thing before rebooting.
@nivedita76 Hm, it might be useful to have a version of ZFS built with --enable-debug and zfs_dbgmsg enabled, since the cause of the deadlock isn't obvious (I mean, other than the tautological one: they're both trying to take the same lock).
I do have enable-debug. Should I just grab the contents of dbgmsg, or do I have to set something to make it more verbose?
You need to set zfs_flags=[integer value corresponding to the bitmask of messages you want]; cf. https://github.com/zfsonlinux/zfs/blob/master/include/sys/zfs_debug.h#L46. We could just start with 1 and go from there.
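Concretely, a sketch using the standard ZoL paths:

```sh
# Bit 0 (ZFS_DEBUG_DPRINTF) turns on general dprintf messages.
echo 1 > /sys/module/zfs/parameters/zfs_flags

# The in-kernel debug ring buffer:
cat /proc/spl/kstat/zfs/dbgmsg
```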
@nivedita76 also, did you end up needing even more of a patch to get CONFIG_LOCKDEP rolling than the one in #7425 (comment), and if so, could you please share it? I'm helping someone else with a lock issue and would like to use CONFIG_LOCKDEP there too.
Ah, sorry about that. Sounded more like a stall for a minute or several minutes. :)
@rincebrain For lockdep, only the bit in zvol.c that changes zv_suspend_lock to RW_NOLOCKDEP is needed. The other change was to try to get the lock validator (CONFIG_PROVE_LOCKING) working; it's also nice in that it makes the namelen function consistent with the name function in control flow, but it isn't needed if you don't turn on PROVE_LOCKING. I haven't made any more patches, as I couldn't figure out the next circular dependency it found.
I'd suggest leaving
More detailed log from the hang this time, after a few Alt-SysRqs.
Another set of logs. This is the full dmesg from boot-up. kworker/u178 is spinning at 100% CPU usage with the following backtrace.
It seems to be doing something, as the backtrace keeps changing a little, but it's been at it for hours now.
This looks like the same thing as #7038, maybe. If it's changing, kicking spl_taskq_kick might do something, but I'm not extremely hopeful.
Nope, didn't help. The dbgmsg shows a txg_sync_thread waiting on a particular txg?
The txg that dbgmsg says is waiting is 423376, but that number doesn't show up in pool/txgs.
dbgmsg:
txgs:
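For anyone comparing the two kstats, a sketch (replace `tank` with the actual pool name; 423376 is the txg from this comment):

```sh
# Debug-log lines mentioning the stuck txg.
grep 423376 /proc/spl/kstat/zfs/dbgmsg

# Recent per-txg history for the pool.
head -40 /proc/spl/kstat/zfs/tank/txgs
grep 423376 /proc/spl/kstat/zfs/tank/txgs
```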
Is this similar? Using FreeBSD 11.2-RELEASE-p0. Came here because
This happens every few days for me, and luckily it's only a developer-facing CI instance in AWS EC2. The respective process is jailed. Running
11.2-RELEASE-p2, same problem with zilog->zl_issuer_lock (D process state). Any chance of resolving this issue soon?
For the FBSD people in here, you may be interested in FreeBSD bug 229614, "ZFS lockup in zil_commit_impl", which had a fix merged for releng/11.2 in r342226. For the ZoL people here, I don't necessarily think this is the same bug in need of some kind of port of that patch; I'm only reasonably certain the FBSD users were hitting that specific issue.
I am hitting a similar issue here. The server is:
ZFS: Loaded module v0.7.12-1, ZFS pool version 5000, ZFS filesystem version 5

We mostly have this issue on our 2 backup servers; I think I saw the same issue only twice on other Rancher hosts. We end up rebooting these 2 servers every month, because apps can't write to disk and become stuck (sync won't return). The other main symptom is a kworker thread stuck at 100% CPU (for example kworker/u145:0).

echo l > /proc/sysrq-trigger, echo w > /proc/sysrq-trigger

Some dmesg from a previous hang: same "stuck", different time; in one it's stuck in node enter, in another in node exit.

zpool status
zpool list

top - 14:38:34 up 19 days, 4:37, 1 user, load average: 134.24, 132.96, 130.10
PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND

cat /proc/meminfo
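To see what a spinning kworker is actually doing, a sketch (the PID 12345 is a placeholder; reading /proc/PID/stack needs root and CONFIG_STACKTRACE):

```sh
# Find the busiest kworker threads (column 9 of batch-mode top is %CPU).
top -b -n 1 | grep kworker | sort -k9 -nr | head -3

# Sample the thread's kernel stack a few times to see whether it moves.
for i in 1 2 3; do
    cat /proc/12345/stack
    sleep 1
done
```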
I have seen this a few times on 2 different servers, triggered by many simultaneous file operations. One way to trigger it on Gentoo is emerging several big applications on a 16-thread CPU with the zpool on an NVMe drive.
I catch this every month or so on different 64-bit systems. Please provide a way to capture logs! I can capture it over RS-232, 24/7.
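One way to capture logs over RS-232, as a sketch (device names and baud rate are assumptions for your setup):

```sh
# On the hanging box: mirror the kernel console to the serial port by
# adding this to the kernel command line (e.g. via GRUB):
#   console=ttyS0,115200 console=tty0

# Keep SysRq enabled so the keys work even when userspace is wedged.
echo "kernel.sysrq = 1" > /etc/sysctl.d/99-sysrq.conf

# On the machine at the other end of the cable: record everything.
stty -F /dev/ttyUSB0 115200 raw
cat /dev/ttyUSB0 | tee "hang-$(date +%F).log"
```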
Everyone commenting in here: could you please include the same information that the issue template requests and that is in the first post (distro name+version, SPL+ZFS version, kernel version and architecture), and ideally the debugging information I suggested the original reporter get, to be sure it's the same or a similar problem? (I'm not a particularly consistent contributor around here, and I don't think I'll have the time to look into this anytime soon; I'm just trying to make sure that whoever looks at it has enough information to move forward.)
That bug report does look very similar. Looks like the check they added back in was dropped for ZoL in 9870149#diff-1896218f0934fd36ff5e067c93464143.
I am now pretty sure it's linked to multiple syncs happening at the same time. The way it works is that once a backup is finished, you end up with a folder with all the files and an SQLite file for the file index; to consider a backup finished, it issues a sync on the related files (it may actually be just the SQLite file). By lowering from 20 to 10 parallel backups, it can no longer happen that more than 10 different parts of the fs get a sync call at the same time.
Bumping. Did anyone check whether the FreeBSD patch helps? It's been a while since I had a hang myself.
Wondering about testing a patch for a bug, if it actually is a ZFS bug, which pops up on my system a couple of times a year ;-( A question: the systems I own with ZFS installed that have had a "sync hanging event" are ZFS storage boxes that also run a desktop GUI, KDE Plasma 5 in my case. Other pure ZFS storage systems, running under the very same Gentoo stable setup and emerged once a month, never had a single hang. Just a coincidence? Does anyone here share this pattern?
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.