zfs-2.1.7 zfs recv panic in dsl_dataset_deactivate_feature_impl #14252

peterska · 2022-12-02T21:22:01Z

System information

Type	Version/Name
Distribution Name	Fedora
Distribution Version	36
Kernel Version	6.0.10-200.fc36.x86_64
Architecture	Intel x86_64
OpenZFS Version	zfs-2.1.7-1

Describe the problem you're observing

When using zfs send to make a backup on a remote machine, the receiver throws a PANIC in one of the zfs functions and the file system deadlocks.

Describe how to reproduce the problem

zfs send -RLec -I snapshot_name | ssh remote_host "zfs receive -duF remote_zvol"

Include any warning/errors/backtraces from the system logs

System Events
=-=-=-=-=-=-=
Dec  2 15:32:14 raid3 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
Dec  2 15:32:14 raid3 kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Dec  2 15:32:14 raid3 kernel: Showing stack for process 1604
Dec  2 15:32:14 raid3 kernel: CPU: 3 PID: 1604 Comm: txg_sync Tainted: P           OE      6.0.10-200.fc36.x86_64 #1
Dec  2 15:32:14 raid3 kernel: Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD3H/Z87X-UD3H-CF, BIOS 10b 06/12/2014
Dec  2 15:32:14 raid3 kernel: Call Trace:
Dec  2 15:32:14 raid3 kernel: <TASK>
Dec  2 15:32:14 raid3 kernel: dump_stack_lvl+0x44/0x5c
Dec  2 15:32:14 raid3 kernel: spl_panic+0xf0/0x108 [spl]
Dec  2 15:32:14 raid3 kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs]
Dec  2 15:32:14 raid3 kernel: dsl_dataset_clone_swap_sync_impl+0x88e/0xa70 [zfs]
Dec  2 15:32:14 raid3 kernel: ? _raw_spin_unlock+0x15/0x30
Dec  2 15:32:14 raid3 kernel: ? dsl_dataset_hold_flags+0xb0/0x230 [zfs]
Dec  2 15:32:14 raid3 kernel: dmu_recv_end_sync+0x180/0x560 [zfs]
Dec  2 15:32:14 raid3 kernel: ? preempt_count_add+0x6a/0xa0
Dec  2 15:32:14 raid3 kernel: dsl_sync_task_sync+0xa5/0xf0 [zfs]
Dec  2 15:32:14 raid3 kernel: dsl_pool_sync+0x3d3/0x4e0 [zfs]
Dec  2 15:32:14 raid3 kernel: spa_sync+0x555/0xf60 [zfs]
Dec  2 15:32:14 raid3 kernel: ? _raw_spin_unlock+0x15/0x30
Dec  2 15:32:14 raid3 kernel: ? spa_txg_history_init_io+0xf3/0x110 [zfs]
Dec  2 15:32:14 raid3 kernel: txg_sync_thread+0x227/0x3e0 [zfs]
Dec  2 15:32:14 raid3 kernel: ? txg_fini+0x260/0x260 [zfs]
Dec  2 15:32:14 raid3 kernel: ? __thread_exit+0x20/0x20 [spl]
Dec  2 15:32:14 raid3 kernel: thread_generic_wrapper+0x57/0x70 [spl]
Dec  2 15:32:14 raid3 kernel: kthread+0xe6/0x110
Dec  2 15:32:14 raid3 kernel: ? kthread_complete_and_exit+0x20/0x20
Dec  2 15:32:14 raid3 kernel: ret_from_fork+0x1f/0x30
Dec  2 15:32:14 raid3 kernel: </TASK>

The text was updated successfully, but these errors were encountered:

ryao · 2022-12-02T21:42:00Z

ZFS is trying to deactivate a feature that is not active and it very much does not like that.

Do any variations of the send command without -L, -e or -c avoid the panic? Knowing if turning any of them off makes this work would help in debugging this.

peterska · 2022-12-02T21:52:30Z

The error message talks about the large_blocks feature, although it is only enabled and not active at both the sender and receiver. The -L flag is probably responsible for triggering this. Could it be related to #13782 ?

I will not be able to try anything until tomorrow. The machine is at the office and has currently experienced this issue and I am unable to ssh in.

mabod · 2022-12-03T22:34:16Z

I can confirm that this happens also to me. zfs core dump during send/receive:

It happens here if sender has resordsize 128k and receivers has recordsize 1M:

# zfs get recordsize zHome/home zstore/data/BACKUP/rakete_home/home 
NAME                                 PROPERTY    VALUE    SOURCE
zHome/home                           recordsize  128K     inherited from zHome
zstore/data/BACKUP/rakete_home/home  recordsize  1M       inherited from zstore

    syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
    NEWEST SNAPSHOT: 2022-12-03--23:23
    Sending incremental zHome/home@2022-11-29--08:05 ... 2022-12-03--23:23 (~ 34 KB):
0,00 B 0:00:00 [0,00 B/s] [>                                                                   ]  0%            
    cannot receive: failed to read from stream
CRITICAL ERROR:  zfs send -L  -I 'zHome/home'@'2022-11-29--08:05' 'zHome/home'@'2022-12-03--23:23' | mbuffer  -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 35264 |  zfs receive  -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.

peterska · 2022-12-04T00:15:40Z

I tried various permuations of zfs send flags and here are the results

zfs send -RLec -I snapshot_name : crashes
zfs send -Rec -I snapshot_name : crashes
zfs send -Rc -I snapshot_name : crashes
zfs send -R -I snapshot_name : works

I use zstd-1 compression on all my datasets. compression is inherited from the top dataset.
I use the default recordsize for all datasets except for one that uses 8K

ryao · 2022-12-04T00:50:05Z

Does zfs send -RLe -I snapshot_name work? What I am really wondering is which flags (just 1 I hope) cause crashes to narrow the scope that needs to be debugged.

peterska · 2022-12-04T01:32:04Z

I have left the office now and won't be there for a while. Most of us work from home.
Next time someone goes to the office I can try it for you.

peterska · 2022-12-04T05:55:20Z

just got the same panic using only zfs send -R -I snapshot_name so trying send -RLe will not shed any light on the issue.
Here is the full stacktrace:

[Sun Dec  4 16:52:10 2022] VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
[Sun Dec  4 16:52:10 2022] PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
[Sun Dec  4 16:52:10 2022] Showing stack for process 1627
[Sun Dec  4 16:52:10 2022] CPU: 7 PID: 1627 Comm: txg_sync Tainted: P           OE      6.0.10-200.fc36.x86_64 #1
[Sun Dec  4 16:52:10 2022] Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD3H/Z87X-UD3H-CF, BIOS 10b 06/12/2014
[Sun Dec  4 16:52:10 2022] Call Trace:
[Sun Dec  4 16:52:10 2022]  <TASK>
[Sun Dec  4 16:52:10 2022]  dump_stack_lvl+0x44/0x5c
[Sun Dec  4 16:52:10 2022]  spl_panic+0xf0/0x108 [spl]
[Sun Dec  4 16:52:10 2022]  dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs]
[Sun Dec  4 16:52:10 2022]  dsl_dataset_clone_swap_sync_impl+0x88e/0xa70 [zfs]
[Sun Dec  4 16:52:10 2022]  ? _raw_spin_unlock+0x15/0x30
[Sun Dec  4 16:52:10 2022]  ? dsl_dataset_hold_flags+0xb0/0x230 [zfs]
[Sun Dec  4 16:52:10 2022]  dmu_recv_end_sync+0x180/0x560 [zfs]
[Sun Dec  4 16:52:10 2022]  ? preempt_count_add+0x6a/0xa0
[Sun Dec  4 16:52:10 2022]  dsl_sync_task_sync+0xa5/0xf0 [zfs]
[Sun Dec  4 16:52:10 2022]  dsl_pool_sync+0x3d3/0x4e0 [zfs]
[Sun Dec  4 16:52:10 2022]  spa_sync+0x555/0xf60 [zfs]
[Sun Dec  4 16:52:10 2022]  ? _raw_spin_unlock+0x15/0x30
[Sun Dec  4 16:52:10 2022]  ? spa_txg_history_init_io+0xf3/0x110 [zfs]
[Sun Dec  4 16:52:10 2022]  txg_sync_thread+0x227/0x3e0 [zfs]
[Sun Dec  4 16:52:10 2022]  ? txg_fini+0x260/0x260 [zfs]
[Sun Dec  4 16:52:10 2022]  ? __thread_exit+0x20/0x20 [spl]
[Sun Dec  4 16:52:10 2022]  thread_generic_wrapper+0x57/0x70 [spl]
[Sun Dec  4 16:52:10 2022]  kthread+0xe6/0x110
[Sun Dec  4 16:52:10 2022]  ? kthread_complete_and_exit+0x20/0x20
[Sun Dec  4 16:52:10 2022]  ret_from_fork+0x1f/0x30
[Sun Dec  4 16:52:10 2022]  </TASK>

ryao · 2022-12-04T05:57:09Z

I was afraid that might happen. Thanks for confirming it.

mabod · 2022-12-05T06:49:53Z

Am Sonntag, dem 04.12.2022 um 05:38 -0800 schrieb George Amanakis:

As far as I can see #13782 is not included in 2.1.7. Is it possible to retry with current master or cherry-pick that commit on 2.1.7 and try again? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: ***@***.***>

This patch is already part of zfs 2.1.7 See at the bottom of this page: zfs-2.1.6...zfs-2.1.7

gamanakis · 2022-12-05T21:11:13Z

Is this reproducible on 2.1.6?

peterska · 2022-12-05T21:32:44Z

No, I've been running 2.1.6 with zfs send -RLec since it's release. The script has been running unchanged for about 5 years and I have never experienced this issue. It happened instantly once I updated to zfs 2.1.7

ryao · 2022-12-06T00:03:34Z

Could you try reverting c8d2ab0 to see if that makes it stop? I do not see any thing else that changed in 2.1.7 that could possibly be related to this.

peterska · 2022-12-06T00:05:29Z

Thanks, I wil try reverting c8d2ab0 It might be a couple of days before I can try it.

peterska · 2022-12-08T09:45:59Z

I've reverted c8d2ab0 and tried the zfs send again. Unfortunately my pool seems to be in a weird state and fails with lots of errors about snapshots not existing. After the failure, on the recv end the receive dataset has been renamed to xxxrecv-xxxxx-1. I tried renaming the dataset back to raid3/raid2 but each time I did the zfs send it was renamed to raid3/raid2recv-xxxxx-1 again.

I am running a pool scrub and will try again once it has finished.

The sending pool is raid2 and it is recevied into raid3/raid2 on the receiving end.

All the snapshots mentioned in the errors below exist on the receving end, they've just be renamed to raid3/raid2recv-65018-1

Here is an extract of some of the errors.

cannot open 'raid3/raid2/git-migration@20221204-10.53.25-Sun-backup': dataset does not exist
cannot open 'raid3/raid2/git-migration': dataset does not exist
cannot receive: specified fs (raid3/raid2) does not exist

attempting destroy raid3/raid2
failed - trying rename raid3/raid2 to raid3/raid2recv-65018-1

local fs raid3/raid2/home does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/git does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/databases does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/products does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring

mabod · 2022-12-09T10:51:48Z

I did a full test again, but reverting the patch did not help. At least, what I found, reverting this patch prevents a complete freeze of the pool. But I still see coredumps during send/receive

This is the patch I am reverting:

From 1af2632e439156f5edd404b35e5c9e444b607733 Mon Sep 17 00:00:00 2001
From: George Amanakis <gamanakis@gmail.com>
Date: Tue, 30 Aug 2022 22:15:56 +0200
Subject: [PATCH] Fix setting the large_block feature after receiving a
 snapshot

We are not allowed to dirty a filesystem when done receiving
a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
not be set on that filesystem since the filesystem is not on
dp_dirty_datasets, and a subsequent encrypted raw send will fail.
Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature
needs to be activated and do so if appropriate.

Signed-off-by: George Amanakis <gamanakis@gmail.com>
---
 module/zfs/dsl_dataset.c                      | 15 ++++
 tests/runfiles/common.run                     |  2 +-
 tests/zfs-tests/tests/Makefile.am             |  1 +
 .../rsend/send_raw_large_blocks.ksh           | 78 +++++++++++++++++++
 4 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100755 tests/zfs-tests/tests/functional/rsend/send_raw_large_blocks.ksh

diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index c7577fc584a..4da4effca60 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1760,6 +1760,21 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
 		}
 	}
 
+	/*
+	 * We are not allowed to dirty a filesystem when done receiving
+	 * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
+	 * not be set and a subsequent encrypted raw send will fail. Hence
+	 * activate this feature if needed here.
+	 */
+	for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+		if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+		    !(zfeature_active(f, ds->ds_feature[f]))) {
+			dsl_dataset_activate_feature(dsobj, f,
+			    ds->ds_feature_activation[f], tx);
+			ds->ds_feature[f] = ds->ds_feature_activation[f];
+		}
+	}
+
 	ASSERT3U(ds->ds_prev != 0, ==,
 	    dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
 	if (ds->ds_prev) {

With stock zfs 2.1.7 (incl. this patch) my pool completely freezes when doing send / receive:

syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
    NEWEST SNAPSHOT: 2022-12-09--10:47
    Sending incremental zHome/home@2022-12-09--10:32 ... 2022-12-09--10:47 (~ 4 KB):
0,00 B 0:00:00 [0,00 B/s] [>                                                                                                                                                                                                                                             ]  0%            
    cannot receive: failed to read from stream
CRITICAL ERROR:  zfs send -L  -I 'zHome/home'@'2022-12-09--10:32' 'zHome/home'@'2022-12-09--10:47' | mbuffer  -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 4720 |  zfs receive  -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.

gives the following in the log:

Dez 09 10:35:47 rakete kernel: INFO: task txg_sync:3044 blocked for more than 122 seconds.
Dez 09 10:35:47 rakete kernel:       Tainted: P           OE      6.0.11-zen1-1-zen #1
Dez 09 10:35:47 rakete kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dez 09 10:35:47 rakete kernel: task:txg_sync        state:D stack:    0 pid: 3044 ppid:     2 flags:0x00004000
Dez 09 10:35:47 rakete kernel: Call Trace:
Dez 09 10:35:47 rakete kernel:  <TASK>
Dez 09 10:35:47 rakete kernel:  __schedule+0xb43/0x1350
Dez 09 10:35:47 rakete kernel:  schedule+0x5e/0xd0
Dez 09 10:35:47 rakete kernel:  spl_panic+0x10a/0x10c [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel:  dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  dsl_dataset_clone_swap_sync_impl+0x90b/0xe30 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  dsl_dataset_rollback_sync+0x109/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  dsl_sync_task_sync+0xac/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  dsl_pool_sync+0x40d/0x5c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  spa_sync+0x56c/0xf90 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  ? spa_txg_history_init_io+0x193/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  txg_sync_thread+0x22b/0x3f0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  ? txg_wait_open+0xf0/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  ? __thread_exit+0x20/0x20 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel:  thread_generic_wrapper+0x5e/0x70 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel:  kthread+0xde/0x110
Dez 09 10:35:47 rakete kernel:  ? kthread_complete_and_exit+0x20/0x20
Dez 09 10:35:47 rakete kernel:  ret_from_fork+0x22/0x30
Dez 09 10:35:47 rakete kernel:  </TASK>

I see multiple of those blocked task messages for various tasks. The pool is then frozen. Reboot takes ages and I had to force reboot with REISUB.

When I reverted the patch I did not experience freezes but the same error and coredumps. Unfortuanlety I can not provide a coredump because my system says: Resource limits disable core dumping for process 82335 (zfs) And I do not know why that is.

Anyways, interesting is also that this error resp. coredump does not happen every time. Approximately 1 out of 4 send/receive processes execute fine.

And just to reassure you: This is a new bug in 2.1.7. I am using this backup process with syncoid since many years without any problem. Reverting back to zfs 2.1.6 gives me a stable system again.

gamanakis · 2022-12-09T10:56:25Z

Could you do a git bisect with good being 2.1.6 and bad 2.1.7?

…

On Fri, Dec 9, 2022, 11:51 AM mabod ***@***.***> wrote: I did a full test again, but reverting the patch did not help. At least, what I found, reverting this patch prevents a complete freeze of the pool. But I still see coredumps during send/receive This is the patch I am reverting: From 1af2632 Mon Sep 17 00:00:00 2001 From: George Amanakis ***@***.***> Date: Tue, 30 Aug 2022 22:15:56 +0200 Subject: [PATCH] Fix setting the large_block feature after receiving a snapshot We are not allowed to dirty a filesystem when done receiving a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will not be set on that filesystem since the filesystem is not on dp_dirty_datasets, and a subsequent encrypted raw send will fail. Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature needs to be activated and do so if appropriate. Signed-off-by: George Amanakis ***@***.***> --- module/zfs/dsl_dataset.c | 15 ++++ tests/runfiles/common.run | 2 +- tests/zfs-tests/tests/Makefile.am | 1 + .../rsend/send_raw_large_blocks.ksh | 78 +++++++++++++++++++ 4 files changed, 95 insertions(+), 1 deletion(-) create mode 100755 tests/zfs-tests/tests/functional/rsend/send_raw_large_blocks.ksh diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c index c7577fc584a..4da4effca60 100644 --- a/module/zfs/dsl_dataset.c +++ b/module/zfs/dsl_dataset.c @@ -1760,6 +1760,21 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname, } } + /* + * We are not allowed to dirty a filesystem when done receiving + * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will + * not be set and a subsequent encrypted raw send will fail. Hence + * activate this feature if needed here. + */ + for (spa_feature_t f = 0; f < SPA_FEATURES; f++) { + if (zfeature_active(f, ds->ds_feature_activation[f]) && + !(zfeature_active(f, ds->ds_feature[f]))) { + dsl_dataset_activate_feature(dsobj, f, + ds->ds_feature_activation[f], tx); + ds->ds_feature[f] = ds->ds_feature_activation[f]; + } + } + ASSERT3U(ds->ds_prev != 0, ==, dsl_dataset_phys(ds)->ds_prev_snap_obj != 0); if (ds->ds_prev) { With stock zfs 2.1.7 (incl. this patch) my pool completely freezes when doing send / receive: syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home NEWEST SNAPSHOT: 2022-12-09--10:47 Sending incremental ***@***.***:32 ... 2022-12-09--10:47 (~ 4 KB): 0,00 B 0:00:00 [0,00 B/s] [> ] 0% cannot receive: failed to read from stream CRITICAL ERROR: zfs send -L -I 'zHome/home'@'2022-12-09--10:32' 'zHome/home'@'2022-12-09--10:47' | mbuffer -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 4720 | zfs receive -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817. gives the following in the log: Dez 09 10:35:47 rakete kernel: INFO: task txg_sync:3044 blocked for more than 122 seconds. Dez 09 10:35:47 rakete kernel: Tainted: P OE 6.0.11-zen1-1-zen #1 Dez 09 10:35:47 rakete kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dez 09 10:35:47 rakete kernel: task:txg_sync state:D stack: 0 pid: 3044 ppid: 2 flags:0x00004000 Dez 09 10:35:47 rakete kernel: Call Trace: Dez 09 10:35:47 rakete kernel: <TASK> Dez 09 10:35:47 rakete kernel: __schedule+0xb43/0x1350 Dez 09 10:35:47 rakete kernel: schedule+0x5e/0xd0 Dez 09 10:35:47 rakete kernel: spl_panic+0x10a/0x10c [spl 29da18a0fede4076df583dd8fb83790522bfe897] Dez 09 10:35:47 rakete kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: dsl_dataset_clone_swap_sync_impl+0x90b/0xe30 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: dsl_dataset_rollback_sync+0x109/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: dsl_sync_task_sync+0xac/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: dsl_pool_sync+0x40d/0x5c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: spa_sync+0x56c/0xf90 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: ? spa_txg_history_init_io+0x193/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: txg_sync_thread+0x22b/0x3f0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: ? txg_wait_open+0xf0/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85] Dez 09 10:35:47 rakete kernel: ? __thread_exit+0x20/0x20 [spl 29da18a0fede4076df583dd8fb83790522bfe897] Dez 09 10:35:47 rakete kernel: thread_generic_wrapper+0x5e/0x70 [spl 29da18a0fede4076df583dd8fb83790522bfe897] Dez 09 10:35:47 rakete kernel: kthread+0xde/0x110 Dez 09 10:35:47 rakete kernel: ? kthread_complete_and_exit+0x20/0x20 Dez 09 10:35:47 rakete kernel: ret_from_fork+0x22/0x30 Dez 09 10:35:47 rakete kernel: </TASK> I see multiple of those blocked task messages for various tasks. The pool is then frozen. Reboot takes ages and I had to force reboot with REISUB. When I reverted the patch I did not experience freezes but the same error and coredumps. Unfortuanlety I can not provide a coredump because my system says: Resource limits disable core dumping for process 82335 (zfs) And I do not know why that is. Anyways, interesting is also that this error resp. coredump does not happen every time. Approximately 1 out of 4 send/receive processes execute fine. And just to reassure you: This is a new bug in 2.1.7. I am using this backup process with syncoid since many years without any problem. Reverting back to zfs 2.1.6 gives me a stable system again. — Reply to this email directly, view it on GitHub <#14252 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB2Y2IOXPMFSF6EKS3ZV4KLWMMFM5ANCNFSM6AAAAAASSJ2TW4> . You are receiving this because you commented.Message ID: ***@***.***>

mabod · 2022-12-09T11:11:11Z

I could do that but I need some help. I am not a programmer resp. git expert. I need guidance on this.

I have a couple of questions to start with:

do I need to do the bisect for zfs-dkms and zfs-utils or just for zfs-utils? (I am on endeavouros)
How could I incorporate the git bisect into the pkgbuild process so that I get a new package for each bisect? Or should I just compile the soruce separetly and then only use the resulting binary?
What would be last good commit to use?

gamanakis · 2022-12-09T13:45:45Z

The straightforward way would be to compile manually:

git bisect start
git bisect good zfs-2.1.6
git bisect bad zfs-2.1.7
compile
load the modules with "./scripts/zfs.sh -v" and set the PATH to the main OpenZFS source directory
test, enter accordingly "git bisect good" or "git bisect bad"
goto step 4

mabod · 2022-12-09T14:39:13Z

step 5 does not work for me. I have the module running and I have active pools. How do I reload the zfs module in a running live environment?
And what do you mean with "set the PATH to the main OpenZFS source directory" ?

gamanakis · 2022-12-09T15:04:46Z

./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools.

peterska · 2022-12-09T21:05:08Z

I have done some comprehensive testing after reverting c8d2ab0 and can report that I am no longer getting the zfs recv panic in dsl_dataset_deactivate_feature_impl. I tried lots incremental sends using the full zfs send -RLec and also a full send. That definitely fixed it. @mabod your coredump issue is probably related to another commit.

gamanakis · 2022-12-09T22:20:15Z

@peterska Would you be willing to try a possible fix?

peterska · 2022-12-10T00:34:39Z

@gamanakis Let me review the patch and if it looks ok to me I will test it. Will not be able to test it until Monday Sydney time.

mabod · 2022-12-10T08:17:41Z

./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools.

I exported all pools. I rebooted. I verfied with the 'zpool status' command that no polls are importedt. But when I excecute "./scripts/zfs.sh -vu" it fails by saying "module is still in use". Why is that?

mabod · 2022-12-10T08:38:29Z

@mabod your coredump issue is probably related to another commit.

yes, must be. I checked with the reverted pacth again. The coredump does exist. no doubt.
May be it has to do with encryption. The datasets are encrypted.

gamanakis · 2022-12-11T17:45:41Z

@peterska I am thinking that in your case the code inserted in c8d2ab0 runs in open context, which shouldn't be the case. I am trying to reproduce this on 2.1.7 with no success. This is my latest iteration:

 64 log_must zpool create -f pool_lb $TESTDIR/vdev_a
 65
 66 log_must zfs create pool_lb/fs
 67 log_must dd if=/dev/urandom of=$file bs=1024 count=1024
 68 log_must zfs snapshot pool_lb/fs@snap1
 69
 70 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 71 log_must zfs snapshot pool_lb/fs@snap2
 72 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 73 log_must zfs snapshot pool_lb/fs@snap3
 74 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 75 log_must zfs snapshot pool_lb/fs@snap4
 76 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 77 log_must zfs snapshot pool_lb/fs@snap5
 78 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 79 log_must zfs snapshot pool_lb/fs@snap6
 80 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 81 log_must zfs snapshot pool_lb/fs@snap7
 82 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 83 log_must zfs snapshot pool_lb/fs@snap8
 84 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 85 log_must zfs snapshot pool_lb/fs@snap9
 86
 87 log_must eval "zfs send -RLec pool_lb/fs@snap5 > $backup"
 88 log_must eval "zfs recv pool_lb/testfs5 < $backup"
 89
 90 log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
 91 log_must zfs snapshot pool_lb/testfs5@snap10
 92 log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
 93 log_must zfs snapshot pool_lb/testfs5@snap20
 94 log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
 95 log_must zfs snapshot pool_lb/testfs5@snap30
 96 log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
 97 log_must zfs snapshot pool_lb/testfs5@snap40
 98 log_must eval "zfs send -RLec -I @snap5 pool_lb/fs@snap9 > $backup"
 99 log_must eval "zfs recv -F pool_lb/testfs5 < $backup"
100
101 log_must eval "zfs send -RLec pool_lb/testfs5@snap9 > $raw_backup"

The patch I was thinking about is the following:

diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index 71d876cae..da49ca94b 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1753,13 +1753,17 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
         * not be set and a subsequent encrypted raw send will fail. Hence
         * activate this feature if needed here.
         */
-       for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
-               if (zfeature_active(f, ds->ds_feature_activation[f]) &&
-                   !(zfeature_active(f, ds->ds_feature[f]))) {
-                       dsl_dataset_activate_feature(dsobj, f,
-                           ds->ds_feature_activation[f], tx);
-                       ds->ds_feature[f] = ds->ds_feature_activation[f];
+       if (dmu_tx_is_syncing(tx)) {
+               for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+                       if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+                           !(zfeature_active(f, ds->ds_feature[f]))) {
+                               dsl_dataset_activate_feature(dsobj, f,
+                                   ds->ds_feature_activation[f], tx);
+                               ds->ds_feature[f] = ds->ds_feature_activation[f];
+                       }
                }
+       } else {
+               cmn_err(CE_NOTE, "not syncing context!");
        }

        ASSERT3U(ds->ds_prev != 0, ==,

This runs the code inserted in c8d2ab0 only in syncing context, otherwise it dumps a note in dmesg. It would help if you could end up with a script reproducing the bug in a pool I have access to, or if you could run patch above.

peterska · 2022-12-11T22:13:20Z

@gamanakis the patch looks ok to me. I will try it later today and let you know the results.

To help you reproduce the panic, here are the non default settings of my pool and dataset:

pool:
NAME   PROPERTY                       VALUE                          SOURCE
raid2  ashift                         12                             local
raid2  autotrim                       on                             local
raid2  feature@async_destroy          enabled                        local
raid2  feature@empty_bpobj            active                         local
raid2  feature@lz4_compress           active                         local
raid2  feature@multi_vdev_crash_dump  enabled                        local
raid2  feature@spacemap_histogram     active                         local
raid2  feature@enabled_txg            active                         local
raid2  feature@hole_birth             active                         local
raid2  feature@extensible_dataset     active                         local
raid2  feature@embedded_data          active                         local
raid2  feature@bookmarks              enabled                        local
raid2  feature@filesystem_limits      enabled                        local
raid2  feature@large_blocks           enabled                        local
raid2  feature@large_dnode            enabled                        local
raid2  feature@sha512                 enabled                        local
raid2  feature@skein                  enabled                        local
raid2  feature@edonr                  enabled                        local
raid2  feature@userobj_accounting     active                         local
raid2  feature@encryption             enabled                        local
raid2  feature@project_quota          active                         local
raid2  feature@device_removal         enabled                        local
raid2  feature@obsolete_counts        enabled                        local
raid2  feature@zpool_checkpoint       enabled                        local
raid2  feature@spacemap_v2            active                         local
raid2  feature@allocation_classes     enabled                        local
raid2  feature@resilver_defer         enabled                        local
raid2  feature@bookmark_v2            enabled                        local
raid2  feature@redaction_bookmarks    enabled                        local
raid2  feature@redacted_datasets      enabled                        local
raid2  feature@bookmark_written       enabled                        local
raid2  feature@log_spacemap           active                         local
raid2  feature@livelist               enabled                        local
raid2  feature@device_rebuild         enabled                        local
raid2  feature@zstd_compress          active                         local
raid2  feature@draid                  enabled                        local


dataset:

NAME   PROPERTY              VALUE                  SOURCE
raid2  compression           zstd-1                 local
raid2  atime                 off                    local
raid2  xattr                 sa                     local
raid2  relatime              on                     local

peterska · 2022-12-13T00:25:13Z

@gamanakis I finally got a chance to test out your patch today. I could not apply it directly, I guess github changed some spaces/tabs, so I patched it manually. The bottom of this comment shows the surrounding code so I can convince you I applied it correctly.

I ran about 8 zfs sned -RLec commands and there were no lockups/panics. I have left the new code running on my server so we can detect any issues as it goes about it daily backup tasks.

        for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
                if (zfeature_active(f, ds->ds_feature[f])) {
                        dsl_dataset_activate_feature(dsobj, f,
                            ds->ds_feature[f], tx);
                }
        }

        /*
         * We are not allowed to dirty a filesystem when done receiving
         * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
         * not be set and a subsequent encrypted raw send will fail. Hence
         * activate this feature if needed here.
         */
        if (dmu_tx_is_syncing(tx)) {
               for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
                       if (zfeature_active(f, ds->ds_feature_activation[f]) &&
                           !(zfeature_active(f, ds->ds_feature[f]))) {
                               dsl_dataset_activate_feature(dsobj, f,
                                   ds->ds_feature_activation[f], tx);
                               ds->ds_feature[f] = ds->ds_feature_activation[f];
                       }
                }
        } else {
               cmn_err(CE_NOTE, "not syncing context!");
        }

        ASSERT3U(ds->ds_prev != 0, ==,
            dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);

Clete2 · 2022-12-28T19:51:17Z

Just to summarize my findings so far:

The last couple of days I was running 2.1.7 with the PR from @gamanakis : #14304 At first it looked really promising. No crashes at first. But then I had another freeze 2 days ago.

Then I assumed that zstd compression could eventually cause the issue. I have lz4 on the sender side and zstd on the receiver side. I moved all the receiver side to lz4. But then I got another freeze. So zstd is not the issue.

May be it is encryption related. Sender and receiver are encrypted with a keyfile. But I do not know.

Now I am back to square one and I have stock 2.1.7 running with the large-block commit reverted: c8d2ab0

So far I have no freezes.

I don’t think it’s encryption related. I got a crash when receiving into 2.1.7 from 0.8.x. Neither side has any encryption enabled.

Sender has lz4 whereas receiver is zstd-19. I’ll try reverting to lz4 receiver when I get a chance. Of note, the first full send (using syncoid) worked without issue. It was the next, incremental send which crashed my system.

rincebrain · 2022-12-28T19:59:13Z

I could certainly believe it's not encryption related, I've just seen the quota feature and encryption interact poorly before.

My guess would be that it's a timing issue where something doesn't actually block on something else happening first, and if you use encryption or high compression with large blocks or the like, one task takes long enough that the other hits this.

rincebrain · 2022-12-29T01:29:58Z

Quickly, I added two log lines in the actual deactivate_impl, and got:

[  466.783588] NOTICE: Feature 18 deactivation attempting
[  466.784023] NOTICE: Feature 18 deactivation completed
[  466.790930] NOTICE: Feature 16 deactivation attempting
[  466.791353] NOTICE: Feature 16 deactivation completed
[  466.791726] NOTICE: Feature 18 deactivation attempting
[  466.792083] NOTICE: Feature 18 deactivation completed
[  468.010089] NOTICE: Feature 16 deactivation attempting
[  468.010451] NOTICE: Feature 16 deactivation completed
[  468.010761] NOTICE: Feature 16 deactivation attempting
[  468.011065] VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)

So it seems it tries to deactivate, if I can count, SPA_FEATURE_USEROBJ_ACCOUNTING twice, and boom goes the dynamite...

gamanakis · 2022-12-29T20:31:56Z

I think it has to do with the code here:

diff --git a/module/zfs/dmu_objset.c b/module/zfs/dmu_objset.c
index c17c829a0..0e8427f19 100644
--- a/module/zfs/dmu_objset.c
+++ b/module/zfs/dmu_objset.c
@@ -2408,13 +2408,6 @@ dmu_objset_id_quota_upgrade_cb(objset_t *os)
            dmu_objset_userobjspace_present(os))
                return (SET_ERROR(ENOTSUP));

-       if (dmu_objset_userobjused_enabled(os))
-               dmu_objset_ds(os)->ds_feature_activation[
-                   SPA_FEATURE_USEROBJ_ACCOUNTING] = (void *)B_TRUE;
-       if (dmu_objset_projectquota_enabled(os))
-               dmu_objset_ds(os)->ds_feature_activation[
-                   SPA_FEATURE_PROJECT_QUOTA] = (void *)B_TRUE;
-
        err = dmu_objset_space_upgrade(os);
        if (err)
                return (err);

If I remove that then it doesn't panic.

infinet · 2022-12-30T04:06:45Z

The above change to dmu_objset.c works. It fixes my syncoid backup, which send from zfs 2.0 pool to an encrypted zfs 2.1 pool. feature@userobj_accounting and feature@project_quota are activated in both pools.

mabod · 2023-01-05T12:30:32Z

@gamanakis :
I am testing the new PR #14304 since a couple of days. (with patch for dmu_objset.c)
I want to confirm here that it solves my send/receive issues.

Clete2 · 2023-01-05T16:08:25Z

Is there a workaround that doesn’t require a code change? Maybe a flag I can pass into the send or receive command to avoid the code path that causes panic?

I’m in a situation where one server is offsite and unmanaged and if I happen to break it by injecting my own zfs modules then recovery will be tricky.

When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14304 Closes #14252

When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes openzfs#14304 Closes openzfs#14252

phreaker0 · 2023-01-28T18:50:00Z

I just got this line (the only one which I could read of the screen) and needed to reboot:

VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)

with zfs 2.1.9:

$ zfs version
zfs-2.1.9-1
zfs-kmod-2.1.9-1
$ uname -r
5.15.0-58-generic

gamanakis · 2023-01-28T19:53:44Z

@phreaker0 can you provide additional information? were you sending/receiving when this happened?

phreaker0 · 2023-01-28T21:14:41Z

@gamanakis i wasn't around when it happened, what I did before I left:

upgrade from 2.1.6 to 2.1.9
reboot
destroy all snapshots listed in the data errors section of my ssd pool (the <0x0> ones from ZFS corruption related to snapshots post-2.0.x upgrade #12014)
zpool scrub ssd pool to clear the data errors
there is a regular replication of all datasets from the encrypted ssd pool to the local encrypted hdd pool (replication stream is unencrypted) each hour
i left

but I guess the crash happened pretty soon, because my influxdb logging stuff didn't record anything any more. The ssd pool is used as ROOT. Therefore I couldn't login any more after the panic to gather more information, I only had the screen output.

I'm trying to reproduce it now, but so far it works, I will keep an active ssh session with the kernel log in case of a crash and will report back.

phreaker0 · 2023-02-06T19:15:37Z

FYI: I can't reproduce it anymore, it ran for several days.

Issue: openzfs/zfs#14252 Pull request: openzfs/zfs#14304

scratchings · 2023-02-17T09:29:28Z

I've just had this happen on RHEL 9 with 2.1.9, pool comprised from LUKS encrypted devices (which I'm using because of the send/receive issues with native encryption #11679 - aargh!).
A receive from a Nexenta 4 host to this system (unencrypted) (1M record size/lz4 on the destination, 128K record size on the source) using syncoid+mbuffer on destination reported:

VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)

With /var/log/messages reporting:

Feb 17 08:51:32 fs6 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
Feb 17 08:51:32 fs6 kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Feb 17 08:51:32 fs6 kernel: Showing stack for process 1123981
Feb 17 08:51:32 fs6 kernel: CPU: 4 PID: 1123981 Comm: txg_sync Kdump: loaded Tainted: P OE --------- --- 5.14.0-162.12.1.el9_1.x86_64 #1
Feb 17 08:51:32 fs6 kernel: Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.9.3 08/05/2022
Feb 17 08:51:32 fs6 kernel: Call Trace:
Feb 17 08:51:32 fs6 kernel: dump_stack_lvl+0x34/0x48
Feb 17 08:51:32 fs6 kernel: spl_panic+0xd1/0xe9 [spl]
Feb 17 08:51:32 fs6 kernel: ? dbuf_rele_and_unlock+0x387/0x6b0 [zfs]
Feb 17 08:51:32 fs6 kernel: ? zap_remove_impl+0xb3/0x120 [zfs]
Feb 17 08:51:32 fs6 kernel: ? kfree+0xac/0x3f0
Feb 17 08:51:32 fs6 kernel: ? spl_kmem_alloc+0xb2/0x100 [spl]
Feb 17 08:51:32 fs6 kernel: dsl_dataset_deactivate_feature_impl+0xfa/0x100 [zfs]
Feb 17 08:51:32 fs6 kernel: dsl_dataset_clone_swap_sync_impl+0x888/0xb50 [zfs]
Feb 17 08:51:32 fs6 kernel: dsl_dataset_rollback_sync+0xf0/0x1d0 [zfs]
Feb 17 08:51:32 fs6 kernel: ? dsl_dataset_hold_flags+0x9d/0x230 [zfs]
Feb 17 08:51:32 fs6 kernel: ? dsl_dataset_rollback_check+0x2f9/0x430 [zfs]
Feb 17 08:51:32 fs6 kernel: dsl_sync_task_sync+0xaa/0xf0 [zfs]
Feb 17 08:51:32 fs6 kernel: dsl_pool_sync+0x40c/0x520 [zfs]
Feb 17 08:51:32 fs6 kernel: spa_sync_iterate_to_convergence+0xf0/0x2f0 [zfs]
Feb 17 08:51:32 fs6 kernel: spa_sync+0x471/0x930 [zfs]
Feb 17 08:51:32 fs6 kernel: txg_sync_thread+0x27a/0x400 [zfs]
Feb 17 08:51:32 fs6 kernel: ? txg_fini+0x260/0x260 [zfs]
Feb 17 08:51:32 fs6 kernel: thread_generic_wrapper+0x59/0x70 [spl]
Feb 17 08:51:32 fs6 kernel: ? __thread_exit+0x20/0x20 [spl]
Feb 17 08:51:32 fs6 kernel: kthread+0x149/0x170
Feb 17 08:51:32 fs6 kernel: ? set_kthread_struct+0x50/0x50
Feb 17 08:51:32 fs6 kernel: ret_from_fork+0x22/0x30

As this is part of a HA pair, the host was fenced shortly afterwards and service resumed on the peer host, where the pool imported with no errors. Pushing the service back to the original host and restarting the transfer has so far not resulted in a repetition.

Load on the system is minimal during the transfer, ca 1.3.

duramuss · 2023-02-18T19:30:25Z

Same here on zfs 2.1.9 when doing a rollback on an unmounted dataset:

Linux duranux2 6.0.19 #1 SMP PREEMPT_DYNAMIC Thu Feb 9 01:12:28 CET 2023 x86_64 GNU/Linux

févr. 18 20:07:32 duranux2 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
févr. 18 20:07:32 duranux2 kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
févr. 18 20:07:32 duranux2 kernel: Showing stack for process 436
févr. 18 20:07:32 duranux2 kernel: CPU: 2 PID: 436 Comm: txg_sync Tainted: P O 6.0.19 #1
févr. 18 20:07:32 duranux2 kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
févr. 18 20:07:32 duranux2 kernel: Call Trace:
févr. 18 20:07:32 duranux2 kernel:
févr. 18 20:07:32 duranux2 kernel: dump_stack_lvl+0x37/0x47
févr. 18 20:07:32 duranux2 kernel: spl_panic+0xbc/0xd4 [spl]
févr. 18 20:07:32 duranux2 kernel: ? dbuf_rele_and_unlock+0xf1/0x660 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? zap_cursor_advance+0x20/0x20 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? avl_find+0x52/0x90 [zavl]
févr. 18 20:07:32 duranux2 kernel: ? mze_find+0xbb/0xe0 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? zap_name_alloc+0xba/0x1c0 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? zap_lockdir+0x5e/0x90 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_dataset_deactivate_feature_impl+0xdf/0xf0 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_dataset_clone_swap_sync_impl+0x825/0xa10 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_dataset_rollback_sync+0xd1/0x170 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? dsl_dataset_hold_flags+0x8b/0x210 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? dsl_dataset_rollback_check+0x2a4/0x3e0 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_sync_task_sync+0xa4/0xf0 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_pool_sync+0x3a6/0x4c0 [zfs]
févr. 18 20:07:32 duranux2 kernel: spa_sync+0x542/0xf30 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? spa_txg_history_init_io+0xde/0xf0 [zfs]
févr. 18 20:07:32 duranux2 kernel: txg_sync_thread+0x210/0x3d0 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? txg_rele_to_sync+0x90/0x90 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? __thread_exit+0x20/0x20 [spl]
févr. 18 20:07:32 duranux2 kernel: thread_generic_wrapper+0x59/0x70 [spl]
févr. 18 20:07:32 duranux2 kernel: kthread+0xd9/0x100
févr. 18 20:07:32 duranux2 kernel: ? kthread_exit+0x30/0x30
févr. 18 20:07:32 duranux2 kernel: ret_from_fork+0x22/0x30
févr. 18 20:07:32 duranux2 kernel:

gamanakis · 2023-02-18T21:40:34Z

@scratchings @duramuss There is a more suitable fix, see 34ce4c4 and #14502. Hopefully the upcoming 2.1.10 will include them.

When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes openzfs#14304 Closes openzfs#14252

slavanap · 2023-03-08T12:24:48Z

Same here. After 30 days of uptime. No ZFS commands issued during last 2 weeks, just normal ops.
ZFS has 2 normal sets, 1 zvol and 1 encrypted set. LZ4 compression, all other settings are set to defaults.

mar 04 21:15:12 sl kernel: VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
mar 04 21:15:12 sl kernel: PANIC at dsl_dataset.c:1129:dsl_dataset_deactivate_feature_impl()
mar 04 21:15:12 sl kernel: Showing stack for process 40697
mar 04 21:15:12 sl kernel: CPU: 2 PID: 40697 Comm: txg_sync Tainted: P           OE     5.13.0-39-generic #44~20.04.1-Ubuntu
mar 04 21:15:12 sl kernel: Hardware name: System manufacturer System Product Name/PRIME Z270-P, BIOS 0610 05/11/2017
mar 04 21:15:12 sl kernel: Call Trace:
mar 04 21:15:12 sl kernel:  <TASK>
mar 04 21:15:12 sl kernel:  dump_stack+0x7d/0x9c
mar 04 21:15:12 sl kernel:  spl_dumpstack+0x29/0x2b [spl]
mar 04 21:15:12 sl kernel:  spl_panic+0xd4/0xfc [spl]
mar 04 21:15:12 sl kernel:  ? spl_kmem_free+0x28/0x30 [spl]
mar 04 21:15:12 sl kernel:  ? kfree+0xd8/0x2a0
mar 04 21:15:12 sl kernel:  ? dbuf_rele+0x3d/0x50 [zfs]
mar 04 21:15:12 sl kernel:  ? dmu_buf_rele+0xe/0x10 [zfs]
mar 04 21:15:12 sl kernel:  ? zap_unlockdir+0x3f/0x60 [zfs]
mar 04 21:15:12 sl kernel:  ? zap_remove_norm+0x76/0xa0 [zfs]
mar 04 21:15:12 sl kernel:  dsl_dataset_deactivate_feature_impl+0x101/0x110 [zfs]
mar 04 21:15:12 sl kernel:  dsl_dataset_clone_swap_sync_impl+0x874/0x9d0 [zfs]
mar 04 21:15:12 sl kernel:  ? dbuf_rele+0x3d/0x50 [zfs]
mar 04 21:15:12 sl kernel:  ? dmu_buf_rele+0xe/0x10 [zfs]
mar 04 21:15:12 sl kernel:  ? dsl_dir_rele+0x30/0x40 [zfs]
mar 04 21:15:12 sl kernel:  dmu_recv_end_sync+0x26f/0x5a0 [zfs]
mar 04 21:15:12 sl kernel:  dsl_sync_task_sync+0xb6/0x100 [zfs]
mar 04 21:15:12 sl kernel:  dsl_pool_sync+0x3d6/0x4f0 [zfs]
mar 04 21:15:12 sl kernel:  spa_sync+0x55e/0xfd0 [zfs]
mar 04 21:15:12 sl kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
mar 04 21:15:12 sl kernel:  txg_sync_thread+0x229/0x3b0 [zfs]
mar 04 21:15:12 sl kernel:  ? txg_quiesce_thread+0x340/0x340 [zfs]
mar 04 21:15:12 sl kernel:  thread_generic_wrapper+0x79/0x90 [spl]
mar 04 21:15:12 sl kernel:  ? spl_taskq_fini+0x80/0x80 [spl]
mar 04 21:15:12 sl kernel:  kthread+0x12b/0x150
mar 04 21:15:12 sl kernel:  ? set_kthread_struct+0x40/0x40
mar 04 21:15:12 sl kernel:  ret_from_fork+0x22/0x30
mar 04 21:15:12 sl kernel:  </TASK>

$ zfs --version
zfs-2.1.99-1641_gc935fe2e9
zfs-kmod-2.1.99-1641_gc935fe2e9

Will rebuild from master.

gamanakis · 2023-03-08T15:04:06Z

@slavanap this should be fixed in current master and in the upcoming 2.1.10 release.

scratchings · 2023-03-09T16:27:31Z

I patched 2.1.9 using 34ce4c4 and #14502 and deployed this yesterday, I'm sorry to say I'm still getting panics - two today:

Message from syslogd@fs7 at Mar 9 16:24:12 ...
kernel:VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)

Message from syslogd@fs7 at Mar 9 16:24:12 ...
kernel:PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_featur

rincebrain · 2023-03-09T17:26:06Z

I'd probably try eee9362 instead of 0f32b1f if that's still failing for you. If it still fails with 34ce4c4 and eee9362 , that's differently interesting. If not, then we know 0f32b1f either needs more work or there's something about master that makes it behave differently than 2.1 here, depending.

gamanakis · 2023-03-09T21:40:05Z

To everyone interested: give zfs-2.1.10-staging a try. It contains already all the fixes.

When activating filesystem features after receiving a snapshot, do so only in syncing context. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes openzfs#14304 Closes openzfs#14252 (cherry picked from commit eee9362)

peterska added the Type: Defect Incorrect behavior (e.g. crash, hang) label Dec 2, 2022

mabod mentioned this issue Dec 12, 2022

coredumps with zfs send/receive with 2.1.7 #14281

Closed

behlendorf closed this as completed in #14304 Jan 12, 2023

archlinux-github pushed a commit to archlinux/aur that referenced this issue Feb 9, 2023

Backport fix for issue related to send/receive.

adff772

Issue: openzfs/zfs#14252 Pull request: openzfs/zfs#14304

archlinux-github pushed a commit to archlinux/aur that referenced this issue Feb 9, 2023

Backport fix for issue related to send/receive.

184e232

Issue: openzfs/zfs#14252 Pull request: openzfs/zfs#14304

siilike mentioned this issue Feb 18, 2023

ZFS corruption related to snapshots post-2.0.x upgrade #12014

Open

siilike mentioned this issue Feb 21, 2023

permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload #11688

Open

zfs-2.1.7 zfs recv panic in dsl_dataset_deactivate_feature_impl #14252

zfs-2.1.7 zfs recv panic in dsl_dataset_deactivate_feature_impl #14252

Comments

peterska commented Dec 2, 2022

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

ryao commented Dec 2, 2022

peterska commented Dec 2, 2022

mabod commented Dec 3, 2022 • edited Loading

peterska commented Dec 4, 2022

ryao commented Dec 4, 2022

peterska commented Dec 4, 2022

peterska commented Dec 4, 2022

ryao commented Dec 4, 2022

mabod commented Dec 5, 2022 via email

gamanakis commented Dec 5, 2022

peterska commented Dec 5, 2022

ryao commented Dec 6, 2022

peterska commented Dec 6, 2022

peterska commented Dec 8, 2022

mabod commented Dec 9, 2022

gamanakis commented Dec 9, 2022 via email

mabod commented Dec 9, 2022

gamanakis commented Dec 9, 2022 • edited Loading

mabod commented Dec 9, 2022

gamanakis commented Dec 9, 2022

peterska commented Dec 9, 2022

gamanakis commented Dec 9, 2022

peterska commented Dec 10, 2022

mabod commented Dec 10, 2022

mabod commented Dec 10, 2022

gamanakis commented Dec 11, 2022 • edited Loading

peterska commented Dec 11, 2022

peterska commented Dec 13, 2022

Clete2 commented Dec 28, 2022 • edited Loading

rincebrain commented Dec 28, 2022 • edited Loading

rincebrain commented Dec 29, 2022

gamanakis commented Dec 29, 2022

infinet commented Dec 30, 2022

mabod commented Jan 5, 2023

Clete2 commented Jan 5, 2023

phreaker0 commented Jan 28, 2023

gamanakis commented Jan 28, 2023

phreaker0 commented Jan 28, 2023

phreaker0 commented Feb 6, 2023

scratchings commented Feb 17, 2023

duramuss commented Feb 18, 2023

gamanakis commented Feb 18, 2023

slavanap commented Mar 8, 2023 • edited Loading

gamanakis commented Mar 8, 2023

scratchings commented Mar 9, 2023

rincebrain commented Mar 9, 2023

gamanakis commented Mar 9, 2023

mabod commented Dec 3, 2022 •

edited

Loading

gamanakis commented Dec 9, 2022 •

edited

Loading

gamanakis commented Dec 11, 2022 •

edited

Loading

Clete2 commented Dec 28, 2022 •

edited

Loading

rincebrain commented Dec 28, 2022 •

edited

Loading

slavanap commented Mar 8, 2023 •

edited

Loading