
txg_sync thread stuck #3602

Closed
snajpa opened this issue Jul 16, 2015 · 9 comments

Comments

@snajpa
Contributor

snajpa commented Jul 16, 2015

Running the https://github.com/vpsfreecz/zfs version,
kernel OpenVZ 2.6.32-042stab108.5, based roughly on RHEL6 504.16.2.el6.

It happens on (almost?) all of our machines with this kernel and ZFS build.

What more can I provide to help hunt this one down?

[root@node9.prg.vpsfree.cz]
 /sys/module/zfs/parameters # cat /proc/2940/stack
[<ffffffffa025638d>] cv_wait_common+0x11d/0x130 [spl]
[<ffffffffa02563f5>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa02f542b>] metaslab_load_wait+0x2b/0x40 [zfs]
[<ffffffffa02f65e4>] metaslab_sync_done+0xb4/0x3e0 [zfs]
[<ffffffffa031618b>] vdev_sync_done+0x4b/0x80 [zfs]
[<ffffffffa02fe2eb>] spa_sync+0x53b/0xb10 [zfs]
[<ffffffffa0314559>] txg_sync_thread+0x389/0x620 [zfs]
[<ffffffffa02517b8>] thread_generic_wrapper+0x68/0x80 [spl]
[<ffffffff810a847e>] kthread+0x9e/0xc0
[<ffffffff8100c38a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff
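
(A minimal sketch for capturing the same kind of data from all ZFS/SPL kernel threads, assuming root and a kernel with /proc/<pid>/stack; the thread-name pattern is only illustrative, not from the original report:)

# dump kernel stacks of the ZFS/SPL worker threads (adjust the pattern for your system)
for pid in $(pgrep 'txg_sync|txg_quiesce|z_|arc_'); do
    echo "=== PID $pid ($(ps -o comm= -p $pid)) ==="
    cat /proc/$pid/stack
done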
@behlendorf
Contributor

@snajpa this could be the same as #3532, which was fixed by commit 1cd7773. The fix isn't part of 0.6.4.2, but it's a very small patch and easy to cherry-pick.
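
(For reference, backporting it onto a release tree looks roughly like this; a sketch only, assuming the usual zfsonlinux tag naming and that commit 1cd7773 applies cleanly:)

git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout -b 0.6.4.2-plus-fix zfs-0.6.4.2   # assumes the release tag is named zfs-0.6.4.2
git cherry-pick 1cd7773                        # the metaslab fix referenced above
# then rebuild as usual, e.g. ./autogen.sh && ./configure && make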

@odoucet

odoucet commented Jul 16, 2015

If this is the same as #3532, you can work around this bug by setting metaslab_preload_enabled=0.
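
(Something like the following applies the workaround at runtime via the module parameter; persisting it in modprobe.d is optional:)

echo 0 > /sys/module/zfs/parameters/metaslab_preload_enabled                  # takes effect immediately
echo "options zfs metaslab_preload_enabled=0" >> /etc/modprobe.d/zfs.conf     # keep it across module reloads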

@eolson78

@behlendorf I agree with you. We ran into a similar issue and 1cd7773 did solve it for us.

@kernelOfTruth
Contributor

Unless @snajpa is running the "top" branch, which I doubt, the patch should already be included in his stack (master branch).

Providing some more info when this gets triggered could help: any special bootup (kernel append) settings if they differ from the defaults, top output, dmesg, etc.

@bknl

bknl commented Jul 17, 2015

Maybe an unrelated problem, as my trace is different: txg_sync isn't exactly stuck, it just endlessly eats 100% CPU.

Running Git Top (up to 905edb4)

Linux version 3.16.7-21-desktop (geeko@buildhost) (gcc version 4.8.3 20140627 [gcc-4_8-branch revision 212064] (SUSE Linux)) #1 SMP PREEMPT Tue Apr 14 07:11:37 UTC 2015 (93c1539)

The stack trace keeps cycling through lots of functions, mostly sp_kmem_cache_free and extract_buf.

No special module or kernel flags, no dedup on the pool; some processes on the pool get into D state:

top - 11:47:54 up 4 days, 2:23, 7 users, load average: 7.42, 7.71, 9.78
Tasks: 670 total, 3 running, 667 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.6 us, 9.5 sy, 0.0 ni, 81.7 id, 8.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 99123648 total, 98721168 used, 402480 free, 46214812 buffers
KiB Swap: 0 total, 0 used, 0 free. 50868424 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1507 root 0 -20 0 0 0 R 83.721 0.000 54:57.89 txg_sync
25130 d029754 20 0 680892 55380 36772 D 54.817 0.056 36:01.76 mkvmerge-gui
21468 root 20 0 13840 2048 1496 S 32.226 0.002 54:39.92 mount.ntfs-3g
25381 d029754 20 0 13332 2156 1624 D 28.239 0.002 17:54.82 keychain
1330 root 39 19 0 0 0 S 23.588 0.000 17:28.15 z_null_iss/0
1506 root 0 -20 0 0 0 R 22.591 0.000 13:38.00 txg_quiesce
21560 root 20 0 19576 1800 1600 S 10.631 0.002 16:08.86 tar
479 root 0 -20 0 0 0 S 9.302 0.000 6:28.90 dbu_evict/0
764 root 20 0 12036 4792 1084 S 3.987 0.005 4:15.99 haveged
21533 root 20 0 14516 2228 1532 D 3.987 0.002 6:37.45 mount.ntfs-3g
21559 root 20 0 19704 1948 1692 D 2.990 0.002 5:15.09 tar
244 root 20 0 0 0 0 S 1.329 0.000 2:21.19 kswapd1
121 root 20 0 0 0 0 S 0.664 0.000 0:55.17 ksoftirqd/6
243 root 20 0 0 0 0 S 0.664 0.000 1:21.56 kswapd0
21514 root 20 0 0 0 0 D 0.664 0.000 1:47.93 usb-storage

@snajpa
Contributor Author

snajpa commented Jul 21, 2015

It was certainly due to a low-memory situation: somehow the amount of dirty data to be synced blew out of proportion, all the way up to zfs_dirty_data_max. That was 13 GB on that system, so it ran into problems allocating that much memory.

So I've set that limit to 1 GB; it did help, but it didn't resolve the associated performance problems.
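
(For reference, lowering the limit at runtime looks roughly like this; zfs_dirty_data_max is in bytes:)

echo $((1024*1024*1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max   # cap dirty data at 1 GiB
cat /sys/module/zfs/parameters/zfs_dirty_data_max                          # verify the new value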

The version I'm running already includes the fix from PR #3557.

Still, txg sync looks really weird. It has a long way to go to resemble those nice 5s txg sync windows I was used to seeing. My guess (judging from watching zfs iostat) is that ZFS is now overloading some vdevs to the point where the whole txg sync gets delayed by a single vdev (2-way spindle mirrors in my case). That's on some machines; on more idle machines the syncs aren't even waiting for the timeout, nor for dirty_data_sync, before starting to sync out.

Here's what I'm seeing even on a totally idle machine; I've set txg_timeout to 15s to make it even more visible (see the note after the output on reading the time columns):

[root@node1.pgnd.vpsfree.cz]
 ~ # cat /proc/spl/kstat/zfs/vz/txgs 
33 0 0x01 15 1680 267501014477 1712682737583508
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
28147401 1712628627891040 C     26763264     13312        9825280      11       1732     3392797164   6415         3358687049   9676628100  
28147402 1712632020688204 C     41956352     4096         9448448      4        1197     3358699486   6052         9676621444   4071501169  
28147403 1712635379387690 C     36814848     4096         12125696     4        1476     9676631270   4188         4071497571   4974255835  
28147404 1712645056018960 C     11309568     4096         3236352      4        1045     4071506402   5869         4974249377   3347677423  
28147405 1712649127525362 C     12685824     2048         3277824      2        966      4974258565   4379         3347675425   2282312263  
28147406 1712654101783927 C     4748288      2048         2146816      2        756      3347684332   4155         2282307632   2666049623  
28147407 1712657449468259 C     2935296      2048         2544640      2        649      2282315442   4662         2666045705   1928535400  
28147408 1712659731783701 C     7834112      2048         2558464      2        799      2666054026   5141         1928530967   2282664903  
28147409 1712662397837727 C     12192768     4096         2489344      4        783      1928539540   4692         2282660769   3512527866  
28147410 1712664326377267 C     11697664     3072         2738176      3        818      2282669911   4892         3512522510   3174327428  
28147411 1712666609047178 C     6431232      4096         2543616      4        988      3512532014   6177         3174320230   4394941169  
28147412 1712670121579192 C     9940992      6144         5356032      4        866      3174331873   6342         4394932945   3887829943  
28147413 1712673295911065 S     0            0            0            0        0        4394942719   5278         3887825017   0           
28147414 1712677690853784 W     0            0            0            0        0        3887836714   8117         0            0           
28147415 1712681578690498 O     0            0            0            0        0        0            0            0            0    
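
(The time columns in that kstat are reported in nanoseconds and ndirty in bytes; a quick sketch for making them readable, assuming the column layout shown above:)

# print per-txg dirty size and open/wait/sync times in human units
awk 'NR > 2 { printf "txg=%s ndirty=%.1fMB otime=%.2fs wtime=%.2fs stime=%.2fs\n", $1, $4/1048576, $9/1e9, $11/1e9, $12/1e9 }' /proc/spl/kstat/zfs/vz/txgs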

@kernelOfTruth
Contributor

@snajpa some points to look into for you:

#3616 (ZFS IO can stall for a second or more under memory pressure)
dweeezil@9970711 (already included in the pull below):
#3533 (ARC Sync Up)

#3607 (send performance is very bad while all cpu occupy by low priority process),
and some changes to remedy the situation: #3607 (comment)

@sempervictus
Contributor

The comment about uneven distribution of workloads to vdevs caught my eye. I've been seeing similar things on our systems, but without the associated crash; it just results in hiccups on pools running 6+ disks in a RAIDZ2 setup. ZFS's iostat is pretty useful for this, but the way I managed to pin it down was via iostat -xyz $(lsblk | awk '/^sd/{print$1}') 1, which clearly shows several disks in the same raidz2 vdev being pegged at 100% util while others are basically idle (these are identical disks on the same LSI controller). These systems aren't using @rdolbeau's ASM patches yet, as I'm still ironing those out in the lab (fearful of anything that might possibly cause a bad write), but that was to be my next step in testing, as I was hoping it was some raidz layout calculation lag...
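
(For watching the same imbalance from the ZFS side, per-vdev rather than per-disk, something like this works; the pool name is a placeholder:)

# per-vdev / per-leaf ops and bandwidth, refreshed every second ("tank" is a placeholder pool name)
zpool iostat -v tank 1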

@gmelikov
Member

Closing as stale.

If it's still relevant, feel free to reopen.
