
panic / odd errors on dataset replicated via zfs send --raw #7317

Closed
jh3141 opened this issue Mar 20, 2018 · 7 comments
Assignees
Labels
Status: Stale No recent activity for issue

Comments


jh3141 commented Mar 20, 2018

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 16.04.4
Linux Kernel Linux athena 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Architecture x86_64
ZFS Version 0.7.0-366_g743253d
SPL Version 0.7.0-31_g43983eb

Describe the problem you're observing

I've had two issues trying to use zfs recv to verify a backup archive taken using zfs send -w by receiving into the same pool that the data originated from, just with a different dataset name. The source dataset is encrypted via passphrase and contains approximately 250GB of data. Testing with a new dataset containing only test data did not replicate the issue.

The first time I tried this, the commands I used were:

nohup zfs send --raw storage/source > /mnt/backups/backup1
nohup zfs receive storage/test < /mnt/backups/backup1
zfs mount -l storage/test

I received a kernel panic immediately after entering the passphrase.

I destroyed the test dataset and tried again using a snapshot, and with the commands in a pipe:

zfs send --raw storage/source@snap1 | zfs receive storage/test
zfs mount -l storage/test

This time there was no immediate panic, but when I tried to view the results I got an "Invalid exchange" error message from ls, and unmounting the dataset hung. A panic apparently occurred while I was attempting to shut the system down to remove the dataset (see log below).
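(For reference, "Invalid exchange" is the strerror text for Linux errno 52, EBADE; a quick, illustrative way to confirm the mapping on a Linux box:)

```python
import errno, os

# "Invalid exchange" is the message ls surfaced on the received dataset;
# on Linux it corresponds to errno 52 (EBADE).
code = errno.EBADE
print(errno.errorcode[code], os.strerror(code))
```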

Describe how to reproduce the problem

The commands above apparently fail on every attempt with the given dataset (although testing takes a while, so it has only been tried on these two occasions), but attempts to reproduce the problem with a freshly created dataset have not triggered it.
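(For anyone trying to replicate: the second failing sequence condenses into one script. This is a sketch only: the dataset names storage/source and storage/test are the reporter's, the source must be an encrypted dataset with real data, and the script is guarded so it is a no-op on machines without ZFS.)

```shell
#!/bin/sh
# Reproduction sketch for the raw-send round trip described above.
# Dataset names are the reporter's; adjust them for your own pool.
if command -v zfs >/dev/null 2>&1; then
    zfs snapshot storage/source@snap1
    # Raw (still-encrypted) stream received back into the same pool:
    zfs send --raw storage/source@snap1 | zfs receive storage/test
    # Loading the key and mounting is where the panic / EBADE shows up:
    zfs mount -l storage/test
    ls /storage/test
fi
repro_note="sketch only - requires an encrypted source dataset"
echo "$repro_note"
```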

Include any warning/errors/backtraces from the system logs

Mar 19 15:42:53 athena kernel: [86003.121669] VERIFY3(sa.sa_magic == 0x2F505A) failed (16847124 == 3100762)
Mar 19 15:42:53 athena kernel: [86003.121828] PANIC at zfs_vfsops.c:587:zfs_space_delta_cb()
Mar 19 15:42:53 athena kernel: [86003.121939] Showing stack for process 1065
Mar 19 15:42:53 athena kernel: [86003.121951] CPU: 3 PID: 1065 Comm: z_upgrade Tainted: P           OE   4.4.0-116-generic #140-Ubuntu
Mar 19 15:42:53 athena kernel: [86003.121958] Hardware name: Gigabyte Technology Co., Ltd. AM1M-S2P/AM1M-S2P, BIOS F1 01/09/2014
Mar 19 15:42:53 athena kernel: [86003.121965]  0000000000000286 91e40b3ae4e37467 ffff8800979ab988 ffffffff813ffc13
Mar 19 15:42:53 athena kernel: [86003.121978]  ffffffffc114cc74 000000000000024b ffff8800979ab998 ffffffffc047a4b2
Mar 19 15:42:53 athena kernel: [86003.121988]  ffff8800979abb20 ffffffffc047a585 ffff8800979ab9e0 ffffffff00000030
Mar 19 15:42:53 athena kernel: [86003.121998] Call Trace:
Mar 19 15:42:53 athena kernel: [86003.122018]  [<ffffffff813ffc13>] dump_stack+0x63/0x90
Mar 19 15:42:53 athena kernel: [86003.122055]  [<ffffffffc047a4b2>] spl_dumpstack+0x42/0x50 [spl]
Mar 19 15:42:53 athena kernel: [86003.122082]  [<ffffffffc047a585>] spl_panic+0xc5/0x100 [spl]
Mar 19 15:42:53 athena kernel: [86003.122246]  [<ffffffffc0fbc1dd>] ? abd_iterate_func+0x6d/0x110 [zfs]
Mar 19 15:42:53 athena kernel: [86003.122271]  [<ffffffffc0476af8>] ? spl_kmem_cache_free+0x138/0x1c0 [spl]
Mar 19 15:42:53 athena kernel: [86003.122476]  [<ffffffffc109d258>] ? zio_buf_free+0x58/0x60 [zfs]
Mar 19 15:42:53 athena kernel: [86003.122489]  [<ffffffff8184cdb2>] ? mutex_lock+0x12/0x30
Mar 19 15:42:53 athena kernel: [86003.122652]  [<ffffffffc0fee9bd>] ? dmu_zfetch+0x15d/0x520 [zfs]
Mar 19 15:42:53 athena kernel: [86003.122798]  [<ffffffffc0fcee1e>] ? dbuf_rele_and_unlock+0x27e/0x520 [zfs]
Mar 19 15:42:53 athena kernel: [86003.123005]  [<ffffffffc1084ee2>] zfs_space_delta_cb+0xe2/0x230 [zfs]
Mar 19 15:42:53 athena kernel: [86003.123019]  [<ffffffff811f4266>] ? __kmalloc_node+0x1a6/0x300
Mar 19 15:42:53 athena kernel: [86003.123177]  [<ffffffffc0fe0fc4>] dmu_objset_userquota_get_ids+0x184/0x440 [zfs]
Mar 19 15:42:53 athena kernel: [86003.123340]  [<ffffffffc0ff1218>] dnode_setdirty+0x38/0x100 [zfs]
Mar 19 15:42:53 athena kernel: [86003.123486]  [<ffffffffc0fd125e>] dbuf_dirty+0x44e/0x890 [zfs]
Mar 19 15:42:53 athena kernel: [86003.123632]  [<ffffffffc0fd17c0>] dmu_buf_will_dirty_impl+0x120/0x130 [zfs]
Mar 19 15:42:53 athena kernel: [86003.123778]  [<ffffffffc0fd17e6>] dmu_buf_will_dirty+0x16/0x20 [zfs]
Mar 19 15:42:53 athena kernel: [86003.123932]  [<ffffffffc0fddf5c>] dmu_objset_space_upgrade+0x15c/0x1e0 [zfs]
Mar 19 15:42:53 athena kernel: [86003.124089]  [<ffffffffc0fe141d>] dmu_objset_id_quota_upgrade_cb+0xcd/0x150 [zfs]
Mar 19 15:42:53 athena kernel: [86003.124243]  [<ffffffffc0fdd136>] dmu_objset_upgrade_task_cb+0x86/0xf0 [zfs]
Mar 19 15:42:53 athena kernel: [86003.124269]  [<ffffffffc0478714>] taskq_thread+0x2b4/0x4b0 [spl]
Mar 19 15:42:53 athena kernel: [86003.124281]  [<ffffffff810aec00>] ? wake_up_q+0x70/0x70
Mar 19 15:42:53 athena kernel: [86003.124308]  [<ffffffffc0478460>] ? task_done+0xb0/0xb0 [spl]
Mar 19 15:42:53 athena kernel: [86003.124319]  [<ffffffff810a30f7>] kthread+0xe7/0x100
Mar 19 15:42:53 athena kernel: [86003.124330]  [<ffffffff810a3010>] ? kthread_create_on_node+0x1e0/0x1e0
Mar 19 15:42:53 athena kernel: [86003.124340]  [<ffffffff8184f422>] ret_from_fork+0x42/0x80
Mar 19 15:42:53 athena kernel: [86003.124349]  [<ffffffff810a3010>] ? kthread_create_on_node+0x1e0/0x1e0

The panic from the first attempt didn't get logged, and unfortunately the shot I tried to take of the screen with my phone camera was too blurry to be legible, but the failure appeared to have occurred in spl_kmem_cache_free, which was called from a recursively-called function that may have been dbuf_sync_list. Unfortunately the message showing the cause of the failure scrolled off the top of my screen due to an extensive stack trace.


jh3141 commented Mar 20, 2018

No, the system is using typical desktop memory.


jh3141 commented Mar 20, 2018

The pool in question was created with, and has only ever been used on, the same version of ZFS (although it has been used to receive datasets sent from older systems), so I presume the format changes aren't relevant to this issue?


jh3141 commented Mar 21, 2018

I just repeated the second test, and the results were similar though not identical: the incorrect value reported on the VERIFY3 line differs (so it doesn't seem as though a single bad value is being copied from the source dataset), as do some of the entries in the stack trace:

Mar 21 09:28:17 athena kernel: [150104.010980] VERIFY3(sa.sa_magic == 0x2F505A) failed (16912659 == 3100762)
Mar 21 09:28:17 athena kernel: [150104.011140] PANIC at zfs_vfsops.c:587:zfs_space_delta_cb()
Mar 21 09:28:17 athena kernel: [150104.011253] Showing stack for process 1104
Mar 21 09:28:17 athena kernel: [150104.011265] CPU: 2 PID: 1104 Comm: z_upgrade Tainted: P           OE   4.4.0-116-generic #140-Ubuntu
Mar 21 09:28:17 athena kernel: [150104.011272] Hardware name: Gigabyte Technology Co., Ltd. AM1M-S2P/AM1M-S2P, BIOS F1 01/09/2014
Mar 21 09:28:17 athena kernel: [150104.011279]  0000000000000286 af7331cb821f6b01 ffff88023272f988 ffffffff813ffc13
Mar 21 09:28:17 athena kernel: [150104.011292]  ffffffffc110ac74 000000000000024b ffff88023272f998 ffffffffc04db4b2
Mar 21 09:28:17 athena kernel: [150104.011302]  ffff88023272fb20 ffffffffc04db585 ffff88023272f9e0 ffffffff00000030
Mar 21 09:28:17 athena kernel: [150104.011313] Call Trace:
Mar 21 09:28:17 athena kernel: [150104.011333]  [<ffffffff813ffc13>] dump_stack+0x63/0x90
Mar 21 09:28:17 athena kernel: [150104.011369]  [<ffffffffc04db4b2>] spl_dumpstack+0x42/0x50 [spl]
Mar 21 09:28:17 athena kernel: [150104.011396]  [<ffffffffc04db585>] spl_panic+0xc5/0x100 [spl]
Mar 21 09:28:17 athena kernel: [150104.011559]  [<ffffffffc0f7a1dd>] ? abd_iterate_func+0x6d/0x110 [zfs]
Mar 21 09:28:17 athena kernel: [150104.011584]  [<ffffffffc04d7af8>] ? spl_kmem_cache_free+0x138/0x1c0 [spl]
Mar 21 09:28:17 athena kernel: [150104.011597]  [<ffffffff811f18f9>] ? ___slab_alloc+0x1e9/0x470
Mar 21 09:28:17 athena kernel: [150104.011609]  [<ffffffff811f18f9>] ? ___slab_alloc+0x1e9/0x470
Mar 21 09:28:17 athena kernel: [150104.011632]  [<ffffffffc04d6103>] ? spl_kmem_zalloc+0xa3/0x180 [spl]
Mar 21 09:28:17 athena kernel: [150104.011840]  [<ffffffffc1042ee2>] zfs_space_delta_cb+0xe2/0x230 [zfs]
Mar 21 09:28:17 athena kernel: [150104.011999]  [<ffffffffc0f9efc4>] dmu_objset_userquota_get_ids+0x184/0x440 [zfs]
Mar 21 09:28:17 athena kernel: [150104.012162]  [<ffffffffc0faf218>] dnode_setdirty+0x38/0x100 [zfs]
Mar 21 09:28:17 athena kernel: [150104.012308]  [<ffffffffc0f8f25e>] dbuf_dirty+0x44e/0x890 [zfs]
Mar 21 09:28:17 athena kernel: [150104.012455]  [<ffffffffc0f8f7c0>] dmu_buf_will_dirty_impl+0x120/0x130 [zfs]
Mar 21 09:28:17 athena kernel: [150104.012600]  [<ffffffffc0f8f7e6>] dmu_buf_will_dirty+0x16/0x20 [zfs]
Mar 21 09:28:17 athena kernel: [150104.012755]  [<ffffffffc0f9bf5c>] dmu_objset_space_upgrade+0x15c/0x1e0 [zfs]
Mar 21 09:28:17 athena kernel: [150104.012911]  [<ffffffffc0f9f41d>] dmu_objset_id_quota_upgrade_cb+0xcd/0x150 [zfs]
Mar 21 09:28:17 athena kernel: [150104.013065]  [<ffffffffc0f9b136>] dmu_objset_upgrade_task_cb+0x86/0xf0 [zfs]
Mar 21 09:28:17 athena kernel: [150104.013091]  [<ffffffffc04d9714>] taskq_thread+0x2b4/0x4b0 [spl]
Mar 21 09:28:17 athena kernel: [150104.013103]  [<ffffffff810aec00>] ? wake_up_q+0x70/0x70
Mar 21 09:28:17 athena kernel: [150104.013131]  [<ffffffffc04d9460>] ? task_done+0xb0/0xb0 [spl]
Mar 21 09:28:17 athena kernel: [150104.013142]  [<ffffffff810a30f7>] kthread+0xe7/0x100
Mar 21 09:28:17 athena kernel: [150104.013152]  [<ffffffff810a3010>] ? kthread_create_on_node+0x1e0/0x1e0
Mar 21 09:28:17 athena kernel: [150104.013163]  [<ffffffff8184f422>] ret_from_fork+0x42/0x80
Mar 21 09:28:17 athena kernel: [150104.013173]  [<ffffffff810a3010>] ? kthread_create_on_node+0x1e0/0x1e0


tcaputi commented Apr 4, 2018

This issue is actually probably not caused by memory corruption. We have seen this issue a couple of times before (see issue #7117). We thought we had solved it, but if you are on the version of ZFS you say you are, then we may not have....

If you can reproduce this easily, we may have a good shot at debugging it. Can you run echo 4294967262 > /sys/module/zfs/parameters/zfs_flags, trigger the crash again, and then provide the last 1000 lines or so of /proc/spl/kstat/zfs/dbgmsg?

@tcaputi tcaputi self-assigned this Apr 4, 2018

tcaputi commented Apr 4, 2018

@jh3141
Oh, also run echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable before causing the crash. Both of these settings may already be set, but it doesn't hurt to be sure.
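(Taken together, the two comments above amount to the following capture procedure. A sketch only: the sysfs writes need root and a loaded zfs module, so they are guarded here, and /tmp/zfs-dbgmsg.txt is an arbitrary output path.)

```shell
#!/bin/sh
# Enable ZFS internal debug logging before reproducing the crash,
# then capture the tail of the debug ring buffer afterwards.
params=/sys/module/zfs/parameters
if [ -w "$params/zfs_dbgmsg_enable" ]; then
    echo 1 > "$params/zfs_dbgmsg_enable"
    echo 4294967262 > "$params/zfs_flags"
fi
# ... trigger the crash here (receive and mount the raw stream) ...
if [ -r /proc/spl/kstat/zfs/dbgmsg ]; then
    tail -n 1000 /proc/spl/kstat/zfs/dbgmsg > /tmp/zfs-dbgmsg.txt
fi
capture_note="dbgmsg capture sketch"
echo "$capture_note"
```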


tcaputi commented Apr 9, 2018

@jh3141 any updates?


stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 25, 2020
@stale stale bot closed this as completed Nov 24, 2020