ZIO writer threads steal pages from ZONE_DMA #747

ryao · 2012-05-16T08:22:21Z

The following page allocation failure occurred on my home server when testing pull request #726:

[ 5103.843925] z_wr_iss/1: page allocation failure: order:0, mode:0x40d0
[ 5103.843937] Pid: 3825, comm: z_wr_iss/1 Tainted: P O 3.2.17 #1
[ 5103.843943] Call Trace:
[ 5103.843960] [] ? warn_alloc_failed+0xfc/0x150
[ 5103.843970] [] ? __alloc_pages_nodemask+0x580/0x7f0
[ 5103.843982] [] ? new_slab+0x209/0x220
[ 5103.843990] [] ? __slab_alloc.clone.64+0x204/0x290
[ 5103.844030] [] ? kmem_alloc_debug+0x283/0x360 [spl]
[ 5103.844041] [] ? __kmalloc+0x110/0x170
[ 5103.844053] [] ? kmem_alloc_debug+0x283/0x360 [spl]
[ 5103.844062] [] ? __mutex_lock_slowpath+0x1e2/0x2a0
[ 5103.844097] [] ? vdev_queue_io_done+0xf02/0x3760 [zfs]
[ 5103.844119] [] ? zio_nowait+0x9d/0xeb0 [zfs]
[ 5103.844149] [] ? vdev_config_sync+0x9fd/0xca0 [zfs]
[ 5103.844178] [] ? vdev_config_sync+0x190/0xca0 [zfs]
[ 5103.844200] [] ? zio_execute+0x95/0x2a0 [zfs]
[ 5103.844212] [] ? __taskq_dispatch+0x7e0/0xb30 [spl]
[ 5103.844219] [] ? __schedule+0x29e/0x740
[ 5103.844227] [] ? try_to_wake_up+0x270/0x270
[ 5103.844238] [] ? __taskq_dispatch+0x580/0xb30 [spl]
[ 5103.844249] [] ? __taskq_dispatch+0x580/0xb30 [spl]
[ 5103.844258] [] ? kthread+0x96/0xa0
[ 5103.844267] [] ? kernel_thread_helper+0x4/0x10
[ 5103.844274] [] ? kthread_worker_fn+0x180/0x180
[ 5103.844283] [] ? gs_change+0xb/0xb
[ 5103.844287] Mem-Info:
[ 5103.844290] DMA per-cpu:
[ 5103.844296] CPU 0: hi: 0, btch: 1 usd: 0
[ 5103.844301] CPU 1: hi: 0, btch: 1 usd: 0
[ 5103.844306] CPU 2: hi: 0, btch: 1 usd: 0
[ 5103.844310] CPU 3: hi: 0, btch: 1 usd: 0
[ 5103.844315] CPU 4: hi: 0, btch: 1 usd: 0
[ 5103.844319] CPU 5: hi: 0, btch: 1 usd: 0
[ 5103.844323] DMA32 per-cpu:
[ 5103.844328] CPU 0: hi: 186, btch: 31 usd: 4
[ 5103.844333] CPU 1: hi: 186, btch: 31 usd: 21
[ 5103.844338] CPU 2: hi: 186, btch: 31 usd: 21
[ 5103.844343] CPU 3: hi: 186, btch: 31 usd: 19
[ 5103.844347] CPU 4: hi: 186, btch: 31 usd: 63
[ 5103.844352] CPU 5: hi: 186, btch: 31 usd: 25
[ 5103.844355] Normal per-cpu:
[ 5103.844359] CPU 0: hi: 186, btch: 31 usd: 0
[ 5103.844363] CPU 1: hi: 186, btch: 31 usd: 36
[ 5103.844368] CPU 2: hi: 186, btch: 31 usd: 0
[ 5103.844372] CPU 3: hi: 186, btch: 31 usd: 0
[ 5103.844377] CPU 4: hi: 186, btch: 31 usd: 31
[ 5103.844381] CPU 5: hi: 186, btch: 31 usd: 156
[ 5103.844393] active_anon:1444144 inactive_anon:169465 isolated_anon:0
[ 5103.844396] active_file:155 inactive_file:726 isolated_file:0
[ 5103.844398] unevictable:547 dirty:0 writeback:0 unstable:0
[ 5103.844401] free:311 slab_reclaimable:14330 slab_unreclaimable:70666
[ 5103.844403] mapped:1012 shmem:2 pagetables:6077 bounce:0
[ 5103.844418] DMA free:0kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15640kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:336kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 5103.844432] lowmem_reserve[]: 0 3118 15214 15214
[ 5103.844450] DMA32 free:1244kB min:13836kB low:17292kB high:20752kB active_anon:455804kB inactive_anon:136824kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3193664kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:9440kB slab_unreclaimable:82656kB kernel_stack:136kB pagetables:3228kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5103.844463] lowmem_reserve[]: 0 0 12096 12096
[ 5103.844480] Normal free:0kB min:53676kB low:67092kB high:80512kB active_anon:5320772kB inactive_anon:541036kB active_file:620kB inactive_file:2904kB unevictable:2188kB isolated(anon):0kB isolated(file):0kB present:12386304kB mlocked:2188kB dirty:0kB writeback:0kB mapped:4044kB shmem:8kB slab_reclaimable:47880kB slab_unreclaimable:199672kB kernel_stack:2752kB pagetables:21080kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5103.844494] lowmem_reserve[]: 0 0 0 0
[ 5103.844500] DMA: 0_4kB 0_8kB 0_16kB 0_32kB 0_64kB 0_128kB 0_256kB 0_512kB 0_1024kB 0_2048kB 0_4096kB = 0kB
[ 5103.844519] DMA32: 0_4kB 0_8kB 0_16kB 0_32kB 0_64kB 0_128kB 0_256kB 0_512kB 0_1024kB 0_2048kB 0_4096kB = 0kB
[ 5103.844536] Normal: 0_4kB 0_8kB 0_16kB 0_32kB 0_64kB 0_128kB 0_256kB 0_512kB 0_1024kB 0_2048kB 0*4096kB = 0kB
[ 5103.844553] 25959 total pagecache pages
[ 5103.844556] 24594 pages in swap cache
[ 5103.844561] Swap cache stats: add 906137, delete 881543, find 47476/56588
[ 5103.844566] Free swap = 13522028kB
[ 5103.844569] Total swap = 16777212kB
[ 5103.747481] 3964912 pages RAM
[ 5103.747481] 102929 pages reserved
[ 5103.747481] 31305 pages shared
[ 5103.747481] 3312872 pages non-shared
[ 5103.747481] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[ 5103.747481] cache: kmalloc-2048, object size: 2048, buffer size: 2048, default order: 3, min order: 0
[ 5103.747481] node 0: slabs: 135, objs: 2146, free: 0
[ 5103.986906] 3964912 pages RAM
[ 5103.986912] 102929 pages reserved
[ 5103.986916] 31318 pages shared
[ 5103.986919] 3311090 pages non-shared
[ 5103.986926] SLUB: Unable to allocate memory on node -1 (gfp=0xd0)
[ 5103.986933] cache: kmalloc-512, object size: 512, buffer size: 512, default order: 1, min order: 0
[ 5103.986955] node 0: slabs: 473, objs: 7568, free: 1075

That followed the second of the two page allocation failures that I posted in issue #746 by approximately 0.1 seconds. During that time, ZONE_DMA and ZONE_DMA32 were completely exhausted. More than an hour later, /proc/vmstat shows the following:

nr_free_pages 60909
nr_inactive_anon 146961
nr_active_anon 1281075
nr_inactive_file 1101
nr_active_file 163
nr_unevictable 547
nr_mlock 547
nr_anon_pages 163793
nr_mapped 1058
nr_file_pages 17294
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 6059
nr_slab_unreclaimable 45752
nr_page_table_pages 6168
nr_kernel_stack 350
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1070565
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 2
nr_dirtied 1
nr_written 1070566
nr_anon_transparent_hugepages 2461
nr_dirty_threshold 298042
nr_dirty_background_threshold 149021
pgpgin 18488287
pgpgout 102758060
pswpin 156452
pswpout 1070565
pgalloc_dma 3974
pgalloc_dma32 7431300
pgalloc_normal 43409497
pgalloc_movable 0
pgfree 50972105
pgactivate 141692
pgdeactivate 2986602
pgfault 6283991
pgmajfault 31215
pgrefill_dma 0
pgrefill_dma32 2101
pgrefill_normal 198666
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 93959
pgsteal_normal 1244643
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 186073
pgscan_kswapd_normal 2306952
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 13279
pgscan_direct_normal 167872
pgscan_direct_movable 0
pginodesteal 136
slabs_scanned 72779648
kswapd_steal 1306350
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 24
kswapd_high_wmark_hit_quickly 318
kswapd_skip_congestion_wait 119
pageoutrun 28724
allocstall 1004
pgrotated 1068035
compact_blocks_moved 128867
compact_pages_moved 61136
compact_pagemigrate_failed 857
compact_stall 60
compact_fail 42
compact_success 18
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 489
unevictable_pgs_scanned 0
unevictable_pgs_rescued 7920
unevictable_pgs_mlocked 8467
unevictable_pgs_munlocked 7920
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0
thp_fault_alloc 6880
thp_fault_fallback 132
thp_collapse_alloc 356
thp_collapse_alloc_failed 111
thp_split 2086

The pgalloc_dma and pgrefill_dma values together with the earlier backtraces suggest to me that within the span of 0.1 seconds, ZONE_DMA was completely exhausted and more than an hour later, not a single page has been returned. The system has an IOMMU, which is the only way that I can explain the fact that it is still running.

The system is under constant memory pressure, where 4GB of swap is in use and nearly all RAM is occupied. I have 5 rsyncs of the distfiles for major *BSDs and Gentoo running, so the ZFS code is being exercised heavily. Under these circumstances, the ZIO writer thread would steal pages from ZONE_DMA because they use PF_MEMALLOC and my patches have removed all other uses of PF_MEMALLOC from the code. Under that assumption, either made some fairly long lived allocations were made or the ZIO code is leaking memory.

I believe that this is logically independent from the bugs in pull request #746, so I am filing a separate issue for it.

behlendorf · 2012-05-16T19:07:22Z

The system certainly was completely out of memory at the time of the failure. And PF_MEMALLOC will effectively disable the reclaim making this failure more likely. However, I don't quite see what your getting at about ZONE_DMA remaining exhausted after an hour. It looks to me as if the system has reclaimed a significant number of pages but not it's not clear from the debug output which zones they are in.

ryao · 2012-05-16T21:16:42Z

These two lines from dmesg show that ZONE_DMA was exhausted in approximately 0.1 seconds:

[ 5103.739889] DMA: 0_4kB 1_8kB 1_16kB 0_32kB 2_64kB 1_128kB 1_256kB 0_512kB 1_1024kB 1_2048kB 3_4096kB = 15896kB
[ 5103.844500] DMA: 0_4kB 0_8kB 0_16kB 0_32kB 0_64kB 0_128kB 0_256kB 0_512kB 0_1024kB 0_2048kB 0_4096kB = 0kB

15896KB / 4KB/page = 3974 pages

/proc/zoneinfo claims that ZONE_DMA has 3974 free pages when the system has not been heavily loaded. The following two lines from /proc/vmstat more than an hour later show that 3974 pages were allocated and none were returned:

pgalloc_dma 3974
pgrefill_dma 0

As such, I believe that the ZIO writer threads allocated them during the 0.1 second window in the dmesg output and kept them indefinitely. Either they were claimed as part of some long lived allocations (e.g. SLAB) or they were leaked.

ryao · 2012-05-16T23:02:45Z

It would appear that I was wrong that the ZIO writer threads were the only things using PF_MEMALLOC. kv_alloc() in the SPL uses it as well.

With that said, 49be0cc eliminated the deadlock that 21ade34 was meant to solve, so it should be safe to revert it. I have tentatively added this to my pull request for swap.

ryao · 2012-05-17T04:14:27Z

I not seen this issue after several hours of stress testing that patch. Furthermore, no allocations have been made from ZONE_DMA.

ryao · 2012-05-17T04:17:07Z

Issue #746 and Issue #747 appear to both describe symptoms of PF_MEMALLOC. Since pull request #726 appears to address both issues, I am closing this one.

…s#747) Bumps [backtrace](https://github.com/rust-lang/backtrace-rs) from 0.3.66 to 0.3.67. - [Release notes](https://github.com/rust-lang/backtrace-rs/releases) - [Commits](rust-lang/backtrace-rs@0.3.66...0.3.67) --- updated-dependencies: - dependency-name: backtrace dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

ryao closed this as completed May 17, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ZIO writer threads steal pages from ZONE_DMA #747

ZIO writer threads steal pages from ZONE_DMA #747

ryao commented May 16, 2012

behlendorf commented May 16, 2012

ryao commented May 16, 2012

ryao commented May 16, 2012

ryao commented May 17, 2012

ryao commented May 17, 2012

ZIO writer threads steal pages from ZONE_DMA #747

ZIO writer threads steal pages from ZONE_DMA #747

Comments

ryao commented May 16, 2012

behlendorf commented May 16, 2012

ryao commented May 16, 2012

ryao commented May 16, 2012

ryao commented May 17, 2012

ryao commented May 17, 2012