Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs-2.1.7 zfs recv panic in dsl_dataset_deactivate_feature_impl #14252

Closed
peterska opened this issue Dec 2, 2022 · 73 comments · Fixed by #14304
Closed

zfs-2.1.7 zfs recv panic in dsl_dataset_deactivate_feature_impl #14252

peterska opened this issue Dec 2, 2022 · 73 comments · Fixed by #14304
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@peterska
Copy link

peterska commented Dec 2, 2022

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 36
Kernel Version 6.0.10-200.fc36.x86_64
Architecture Intel x86_64
OpenZFS Version zfs-2.1.7-1

Describe the problem you're observing

When using zfs send to make a backup on a remote machine, the receiver throws a PANIC in one of the zfs functions and the file system deadlocks.

Describe how to reproduce the problem

zfs send -RLec -I snapshot_name | ssh remote_host "zfs receive -duF remote_zvol"

Include any warning/errors/backtraces from the system logs

System Events
=-=-=-=-=-=-=
Dec  2 15:32:14 raid3 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
Dec  2 15:32:14 raid3 kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Dec  2 15:32:14 raid3 kernel: Showing stack for process 1604
Dec  2 15:32:14 raid3 kernel: CPU: 3 PID: 1604 Comm: txg_sync Tainted: P           OE      6.0.10-200.fc36.x86_64 #1
Dec  2 15:32:14 raid3 kernel: Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD3H/Z87X-UD3H-CF, BIOS 10b 06/12/2014
Dec  2 15:32:14 raid3 kernel: Call Trace:
Dec  2 15:32:14 raid3 kernel: <TASK>
Dec  2 15:32:14 raid3 kernel: dump_stack_lvl+0x44/0x5c
Dec  2 15:32:14 raid3 kernel: spl_panic+0xf0/0x108 [spl]
Dec  2 15:32:14 raid3 kernel: dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs]
Dec  2 15:32:14 raid3 kernel: dsl_dataset_clone_swap_sync_impl+0x88e/0xa70 [zfs]
Dec  2 15:32:14 raid3 kernel: ? _raw_spin_unlock+0x15/0x30
Dec  2 15:32:14 raid3 kernel: ? dsl_dataset_hold_flags+0xb0/0x230 [zfs]
Dec  2 15:32:14 raid3 kernel: dmu_recv_end_sync+0x180/0x560 [zfs]
Dec  2 15:32:14 raid3 kernel: ? preempt_count_add+0x6a/0xa0
Dec  2 15:32:14 raid3 kernel: dsl_sync_task_sync+0xa5/0xf0 [zfs]
Dec  2 15:32:14 raid3 kernel: dsl_pool_sync+0x3d3/0x4e0 [zfs]
Dec  2 15:32:14 raid3 kernel: spa_sync+0x555/0xf60 [zfs]
Dec  2 15:32:14 raid3 kernel: ? _raw_spin_unlock+0x15/0x30
Dec  2 15:32:14 raid3 kernel: ? spa_txg_history_init_io+0xf3/0x110 [zfs]
Dec  2 15:32:14 raid3 kernel: txg_sync_thread+0x227/0x3e0 [zfs]
Dec  2 15:32:14 raid3 kernel: ? txg_fini+0x260/0x260 [zfs]
Dec  2 15:32:14 raid3 kernel: ? __thread_exit+0x20/0x20 [spl]
Dec  2 15:32:14 raid3 kernel: thread_generic_wrapper+0x57/0x70 [spl]
Dec  2 15:32:14 raid3 kernel: kthread+0xe6/0x110
Dec  2 15:32:14 raid3 kernel: ? kthread_complete_and_exit+0x20/0x20
Dec  2 15:32:14 raid3 kernel: ret_from_fork+0x1f/0x30
Dec  2 15:32:14 raid3 kernel: </TASK>
@peterska peterska added the Type: Defect Incorrect behavior (e.g. crash, hang) label Dec 2, 2022
@ryao
Copy link
Contributor

ryao commented Dec 2, 2022

ZFS is trying to deactivate a feature that is not active and it very much does not like that.

Do any variations of the send command without -L, -e or -c avoid the panic? Knowing if turning any of them off makes this work would help in debugging this.

@peterska
Copy link
Author

peterska commented Dec 2, 2022

The error message talks about the large_blocks feature, although it is only enabled and not active at both the sender and receiver. The -L flag is probably responsible for triggering this. Could it be related to #13782 ?

I will not be able to try anything until tomorrow. The machine is at the office and has currently experienced this issue and I am unable to ssh in.

@mabod
Copy link

mabod commented Dec 3, 2022

I can confirm that this happens also to me. zfs core dump during send/receive:

It happens here if sender has resordsize 128k and receivers has recordsize 1M:

# zfs get recordsize zHome/home zstore/data/BACKUP/rakete_home/home 
NAME                                 PROPERTY    VALUE    SOURCE
zHome/home                           recordsize  128K     inherited from zHome
zstore/data/BACKUP/rakete_home/home  recordsize  1M       inherited from zstore
    syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
    NEWEST SNAPSHOT: 2022-12-03--23:23
    Sending incremental zHome/home@2022-11-29--08:05 ... 2022-12-03--23:23 (~ 34 KB):
0,00 B 0:00:00 [0,00 B/s] [>                                                                   ]  0%            
    cannot receive: failed to read from stream
CRITICAL ERROR:  zfs send -L  -I 'zHome/home'@'2022-11-29--08:05' 'zHome/home'@'2022-12-03--23:23' | mbuffer  -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 35264 |  zfs receive  -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.

@peterska
Copy link
Author

peterska commented Dec 4, 2022

I tried various permuations of zfs send flags and here are the results

zfs send -RLec -I snapshot_name : crashes
zfs send -Rec -I snapshot_name : crashes
zfs send -Rc -I snapshot_name : crashes
zfs send -R -I snapshot_name : works

I use zstd-1 compression on all my datasets. compression is inherited from the top dataset.
I use the default recordsize for all datasets except for one that uses 8K

@ryao
Copy link
Contributor

ryao commented Dec 4, 2022

Does zfs send -RLe -I snapshot_name work? What I am really wondering is which flags (just 1 I hope) cause crashes to narrow the scope that needs to be debugged.

@peterska
Copy link
Author

peterska commented Dec 4, 2022

I have left the office now and won't be there for a while. Most of us work from home.
Next time someone goes to the office I can try it for you.

@peterska
Copy link
Author

peterska commented Dec 4, 2022

just got the same panic using only zfs send -R -I snapshot_name so trying send -RLe will not shed any light on the issue.
Here is the full stacktrace:

[Sun Dec  4 16:52:10 2022] VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
[Sun Dec  4 16:52:10 2022] PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
[Sun Dec  4 16:52:10 2022] Showing stack for process 1627
[Sun Dec  4 16:52:10 2022] CPU: 7 PID: 1627 Comm: txg_sync Tainted: P           OE      6.0.10-200.fc36.x86_64 #1
[Sun Dec  4 16:52:10 2022] Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD3H/Z87X-UD3H-CF, BIOS 10b 06/12/2014
[Sun Dec  4 16:52:10 2022] Call Trace:
[Sun Dec  4 16:52:10 2022]  <TASK>
[Sun Dec  4 16:52:10 2022]  dump_stack_lvl+0x44/0x5c
[Sun Dec  4 16:52:10 2022]  spl_panic+0xf0/0x108 [spl]
[Sun Dec  4 16:52:10 2022]  dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs]
[Sun Dec  4 16:52:10 2022]  dsl_dataset_clone_swap_sync_impl+0x88e/0xa70 [zfs]
[Sun Dec  4 16:52:10 2022]  ? _raw_spin_unlock+0x15/0x30
[Sun Dec  4 16:52:10 2022]  ? dsl_dataset_hold_flags+0xb0/0x230 [zfs]
[Sun Dec  4 16:52:10 2022]  dmu_recv_end_sync+0x180/0x560 [zfs]
[Sun Dec  4 16:52:10 2022]  ? preempt_count_add+0x6a/0xa0
[Sun Dec  4 16:52:10 2022]  dsl_sync_task_sync+0xa5/0xf0 [zfs]
[Sun Dec  4 16:52:10 2022]  dsl_pool_sync+0x3d3/0x4e0 [zfs]
[Sun Dec  4 16:52:10 2022]  spa_sync+0x555/0xf60 [zfs]
[Sun Dec  4 16:52:10 2022]  ? _raw_spin_unlock+0x15/0x30
[Sun Dec  4 16:52:10 2022]  ? spa_txg_history_init_io+0xf3/0x110 [zfs]
[Sun Dec  4 16:52:10 2022]  txg_sync_thread+0x227/0x3e0 [zfs]
[Sun Dec  4 16:52:10 2022]  ? txg_fini+0x260/0x260 [zfs]
[Sun Dec  4 16:52:10 2022]  ? __thread_exit+0x20/0x20 [spl]
[Sun Dec  4 16:52:10 2022]  thread_generic_wrapper+0x57/0x70 [spl]
[Sun Dec  4 16:52:10 2022]  kthread+0xe6/0x110
[Sun Dec  4 16:52:10 2022]  ? kthread_complete_and_exit+0x20/0x20
[Sun Dec  4 16:52:10 2022]  ret_from_fork+0x1f/0x30
[Sun Dec  4 16:52:10 2022]  </TASK>

@ryao
Copy link
Contributor

ryao commented Dec 4, 2022

I was afraid that might happen. Thanks for confirming it.

@mabod
Copy link

mabod commented Dec 5, 2022 via email

@gamanakis
Copy link
Contributor

Is this reproducible on 2.1.6?

@peterska
Copy link
Author

peterska commented Dec 5, 2022

No, I've been running 2.1.6 with zfs send -RLec since it's release. The script has been running unchanged for about 5 years and I have never experienced this issue. It happened instantly once I updated to zfs 2.1.7

@ryao
Copy link
Contributor

ryao commented Dec 6, 2022

Could you try reverting c8d2ab0 to see if that makes it stop? I do not see any thing else that changed in 2.1.7 that could possibly be related to this.

@peterska
Copy link
Author

peterska commented Dec 6, 2022

Thanks, I wil try reverting c8d2ab0 It might be a couple of days before I can try it.

@peterska
Copy link
Author

peterska commented Dec 8, 2022

I've reverted c8d2ab0 and tried the zfs send again. Unfortunately my pool seems to be in a weird state and fails with lots of errors about snapshots not existing. After the failure, on the recv end the receive dataset has been renamed to xxxrecv-xxxxx-1. I tried renaming the dataset back to raid3/raid2 but each time I did the zfs send it was renamed to raid3/raid2recv-xxxxx-1 again.

I am running a pool scrub and will try again once it has finished.

The sending pool is raid2 and it is recevied into raid3/raid2 on the receiving end.

All the snapshots mentioned in the errors below exist on the receving end, they've just be renamed to raid3/raid2recv-65018-1

Here is an extract of some of the errors.

cannot open 'raid3/raid2/git-migration@20221204-10.53.25-Sun-backup': dataset does not exist
cannot open 'raid3/raid2/git-migration': dataset does not exist
cannot receive: specified fs (raid3/raid2) does not exist

attempting destroy raid3/raid2
failed - trying rename raid3/raid2 to raid3/raid2recv-65018-1

local fs raid3/raid2/home does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/git does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/databases does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring
local fs raid3/raid2/products does not have fromsnap (20221204-16.51.32-Sun-backup in stream); must have been deleted locally; ignoring

@mabod
Copy link

mabod commented Dec 9, 2022

I did a full test again, but reverting the patch did not help. At least, what I found, reverting this patch prevents a complete freeze of the pool. But I still see coredumps during send/receive

This is the patch I am reverting:

From 1af2632e439156f5edd404b35e5c9e444b607733 Mon Sep 17 00:00:00 2001
From: George Amanakis <gamanakis@gmail.com>
Date: Tue, 30 Aug 2022 22:15:56 +0200
Subject: [PATCH] Fix setting the large_block feature after receiving a
 snapshot

We are not allowed to dirty a filesystem when done receiving
a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
not be set on that filesystem since the filesystem is not on
dp_dirty_datasets, and a subsequent encrypted raw send will fail.
Fix this by checking in dsl_dataset_snapshot_sync_impl() if the feature
needs to be activated and do so if appropriate.

Signed-off-by: George Amanakis <gamanakis@gmail.com>
---
 module/zfs/dsl_dataset.c                      | 15 ++++
 tests/runfiles/common.run                     |  2 +-
 tests/zfs-tests/tests/Makefile.am             |  1 +
 .../rsend/send_raw_large_blocks.ksh           | 78 +++++++++++++++++++
 4 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100755 tests/zfs-tests/tests/functional/rsend/send_raw_large_blocks.ksh

diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index c7577fc584a..4da4effca60 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1760,6 +1760,21 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
 		}
 	}
 
+	/*
+	 * We are not allowed to dirty a filesystem when done receiving
+	 * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
+	 * not be set and a subsequent encrypted raw send will fail. Hence
+	 * activate this feature if needed here.
+	 */
+	for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+		if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+		    !(zfeature_active(f, ds->ds_feature[f]))) {
+			dsl_dataset_activate_feature(dsobj, f,
+			    ds->ds_feature_activation[f], tx);
+			ds->ds_feature[f] = ds->ds_feature_activation[f];
+		}
+	}
+
 	ASSERT3U(ds->ds_prev != 0, ==,
 	    dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);
 	if (ds->ds_prev) {

With stock zfs 2.1.7 (incl. this patch) my pool completely freezes when doing send / receive:

syncoid --sendoptions="L" --mbuffer-size=512M --no-sync-snap zHome/home zstore/data/BACKUP/rakete_home/home
    NEWEST SNAPSHOT: 2022-12-09--10:47
    Sending incremental zHome/home@2022-12-09--10:32 ... 2022-12-09--10:47 (~ 4 KB):
0,00 B 0:00:00 [0,00 B/s] [>                                                                                                                                                                                                                                             ]  0%            
    cannot receive: failed to read from stream
CRITICAL ERROR:  zfs send -L  -I 'zHome/home'@'2022-12-09--10:32' 'zHome/home'@'2022-12-09--10:47' | mbuffer  -q -s 128k -m 512M 2>/dev/null | pv -p -t -e -r -b -s 4720 |  zfs receive  -s -F 'zstore/data/BACKUP/rakete_home/home' 2>&1 failed: 256 at /usr/bin/syncoid line 817.

gives the following in the log:

Dez 09 10:35:47 rakete kernel: INFO: task txg_sync:3044 blocked for more than 122 seconds.
Dez 09 10:35:47 rakete kernel:       Tainted: P           OE      6.0.11-zen1-1-zen #1
Dez 09 10:35:47 rakete kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dez 09 10:35:47 rakete kernel: task:txg_sync        state:D stack:    0 pid: 3044 ppid:     2 flags:0x00004000
Dez 09 10:35:47 rakete kernel: Call Trace:
Dez 09 10:35:47 rakete kernel:  <TASK>
Dez 09 10:35:47 rakete kernel:  __schedule+0xb43/0x1350
Dez 09 10:35:47 rakete kernel:  schedule+0x5e/0xd0
Dez 09 10:35:47 rakete kernel:  spl_panic+0x10a/0x10c [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel:  dsl_dataset_deactivate_feature_impl+0xfb/0x100 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  dsl_dataset_clone_swap_sync_impl+0x90b/0xe30 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  dsl_dataset_rollback_sync+0x109/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  dsl_sync_task_sync+0xac/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  dsl_pool_sync+0x40d/0x5c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  spa_sync+0x56c/0xf90 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  ? spa_txg_history_init_io+0x193/0x1c0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  txg_sync_thread+0x22b/0x3f0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  ? txg_wait_open+0xf0/0xf0 [zfs ee2da272eb7c5d953c9392bd55d5f2734ee05f85]
Dez 09 10:35:47 rakete kernel:  ? __thread_exit+0x20/0x20 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel:  thread_generic_wrapper+0x5e/0x70 [spl 29da18a0fede4076df583dd8fb83790522bfe897]
Dez 09 10:35:47 rakete kernel:  kthread+0xde/0x110
Dez 09 10:35:47 rakete kernel:  ? kthread_complete_and_exit+0x20/0x20
Dez 09 10:35:47 rakete kernel:  ret_from_fork+0x22/0x30
Dez 09 10:35:47 rakete kernel:  </TASK>

I see multiple of those blocked task messages for various tasks. The pool is then frozen. Reboot takes ages and I had to force reboot with REISUB.

When I reverted the patch I did not experience freezes but the same error and coredumps. Unfortuanlety I can not provide a coredump because my system says: Resource limits disable core dumping for process 82335 (zfs) And I do not know why that is.

Anyways, interesting is also that this error resp. coredump does not happen every time. Approximately 1 out of 4 send/receive processes execute fine.

And just to reassure you: This is a new bug in 2.1.7. I am using this backup process with syncoid since many years without any problem. Reverting back to zfs 2.1.6 gives me a stable system again.

@gamanakis
Copy link
Contributor

gamanakis commented Dec 9, 2022 via email

@mabod
Copy link

mabod commented Dec 9, 2022

I could do that but I need some help. I am not a programmer resp. git expert. I need guidance on this.

I have a couple of questions to start with:

  1. do I need to do the bisect for zfs-dkms and zfs-utils or just for zfs-utils? (I am on endeavouros)
  2. How could I incorporate the git bisect into the pkgbuild process so that I get a new package for each bisect? Or should I just compile the soruce separetly and then only use the resulting binary?
  3. What would be last good commit to use?

@gamanakis
Copy link
Contributor

gamanakis commented Dec 9, 2022

The straightforward way would be to compile manually:

  1. git bisect start
  2. git bisect good zfs-2.1.6
  3. git bisect bad zfs-2.1.7
  4. compile
  5. load the modules with "./scripts/zfs.sh -v" and set the PATH to the main OpenZFS source directory
  6. test, enter accordingly "git bisect good" or "git bisect bad"
  7. goto step 4

@mabod
Copy link

mabod commented Dec 9, 2022

step 5 does not work for me. I have the module running and I have active pools. How do I reload the zfs module in a running live environment?
And what do you mean with "set the PATH to the main OpenZFS source directory" ?

@gamanakis
Copy link
Contributor

./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools.

@peterska
Copy link
Author

peterska commented Dec 9, 2022

I have done some comprehensive testing after reverting c8d2ab0 and can report that I am no longer getting the zfs recv panic in dsl_dataset_deactivate_feature_impl. I tried lots incremental sends using the full zfs send -RLec and also a full send. That definitely fixed it. @mabod your coredump issue is probably related to another commit.

@gamanakis
Copy link
Contributor

@peterska Would you be willing to try a possible fix?

@peterska
Copy link
Author

@gamanakis Let me review the patch and if it looks ok to me I will test it. Will not be able to test it until Monday Sydney time.

@mabod
Copy link

mabod commented Dec 10, 2022

./scripts/zfs.sh -vu should unload the modules. Make sure you have exported all pools.

I exported all pools. I rebooted. I verfied with the 'zpool status' command that no polls are importedt. But when I excecute "./scripts/zfs.sh -vu" it fails by saying "module is still in use". Why is that?

@mabod
Copy link

mabod commented Dec 10, 2022

@mabod your coredump issue is probably related to another commit.

yes, must be. I checked with the reverted pacth again. The coredump does exist. no doubt.
May be it has to do with encryption. The datasets are encrypted.

@gamanakis
Copy link
Contributor

gamanakis commented Dec 11, 2022

@peterska I am thinking that in your case the code inserted in c8d2ab0 runs in open context, which shouldn't be the case. I am trying to reproduce this on 2.1.7 with no success. This is my latest iteration:

 64 log_must zpool create -f pool_lb $TESTDIR/vdev_a
 65
 66 log_must zfs create pool_lb/fs
 67 log_must dd if=/dev/urandom of=$file bs=1024 count=1024
 68 log_must zfs snapshot pool_lb/fs@snap1
 69
 70 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 71 log_must zfs snapshot pool_lb/fs@snap2
 72 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 73 log_must zfs snapshot pool_lb/fs@snap3
 74 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 75 log_must zfs snapshot pool_lb/fs@snap4
 76 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 77 log_must zfs snapshot pool_lb/fs@snap5
 78 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 79 log_must zfs snapshot pool_lb/fs@snap6
 80 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 81 log_must zfs snapshot pool_lb/fs@snap7
 82 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 83 log_must zfs snapshot pool_lb/fs@snap8
 84 log_must dd if=/dev/urandom of=$file1 bs=1024 count=1024
 85 log_must zfs snapshot pool_lb/fs@snap9
 86
 87 log_must eval "zfs send -RLec pool_lb/fs@snap5 > $backup"
 88 log_must eval "zfs recv pool_lb/testfs5 < $backup"
 89
 90 log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
 91 log_must zfs snapshot pool_lb/testfs5@snap10
 92 log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
 93 log_must zfs snapshot pool_lb/testfs5@snap20
 94 log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
 95 log_must zfs snapshot pool_lb/testfs5@snap30
 96 log_must dd if=/dev/urandom of=/pool_lb/testfs5/mine bs=1024 count=1024
 97 log_must zfs snapshot pool_lb/testfs5@snap40
 98 log_must eval "zfs send -RLec -I @snap5 pool_lb/fs@snap9 > $backup"
 99 log_must eval "zfs recv -F pool_lb/testfs5 < $backup"
100
101 log_must eval "zfs send -RLec pool_lb/testfs5@snap9 > $raw_backup"

The patch I was thinking about is the following:

diff --git a/module/zfs/dsl_dataset.c b/module/zfs/dsl_dataset.c
index 71d876cae..da49ca94b 100644
--- a/module/zfs/dsl_dataset.c
+++ b/module/zfs/dsl_dataset.c
@@ -1753,13 +1753,17 @@ dsl_dataset_snapshot_sync_impl(dsl_dataset_t *ds, const char *snapname,
         * not be set and a subsequent encrypted raw send will fail. Hence
         * activate this feature if needed here.
         */
-       for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
-               if (zfeature_active(f, ds->ds_feature_activation[f]) &&
-                   !(zfeature_active(f, ds->ds_feature[f]))) {
-                       dsl_dataset_activate_feature(dsobj, f,
-                           ds->ds_feature_activation[f], tx);
-                       ds->ds_feature[f] = ds->ds_feature_activation[f];
+       if (dmu_tx_is_syncing(tx)) {
+               for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
+                       if (zfeature_active(f, ds->ds_feature_activation[f]) &&
+                           !(zfeature_active(f, ds->ds_feature[f]))) {
+                               dsl_dataset_activate_feature(dsobj, f,
+                                   ds->ds_feature_activation[f], tx);
+                               ds->ds_feature[f] = ds->ds_feature_activation[f];
+                       }
                }
+       } else {
+               cmn_err(CE_NOTE, "not syncing context!");
        }

        ASSERT3U(ds->ds_prev != 0, ==,

This runs the code inserted in c8d2ab0 only in syncing context, otherwise it dumps a note in dmesg. It would help if you could end up with a script reproducing the bug in a pool I have access to, or if you could run patch above.

@peterska
Copy link
Author

@gamanakis the patch looks ok to me. I will try it later today and let you know the results.

To help you reproduce the panic, here are the non default settings of my pool and dataset:

pool:
NAME   PROPERTY                       VALUE                          SOURCE
raid2  ashift                         12                             local
raid2  autotrim                       on                             local
raid2  feature@async_destroy          enabled                        local
raid2  feature@empty_bpobj            active                         local
raid2  feature@lz4_compress           active                         local
raid2  feature@multi_vdev_crash_dump  enabled                        local
raid2  feature@spacemap_histogram     active                         local
raid2  feature@enabled_txg            active                         local
raid2  feature@hole_birth             active                         local
raid2  feature@extensible_dataset     active                         local
raid2  feature@embedded_data          active                         local
raid2  feature@bookmarks              enabled                        local
raid2  feature@filesystem_limits      enabled                        local
raid2  feature@large_blocks           enabled                        local
raid2  feature@large_dnode            enabled                        local
raid2  feature@sha512                 enabled                        local
raid2  feature@skein                  enabled                        local
raid2  feature@edonr                  enabled                        local
raid2  feature@userobj_accounting     active                         local
raid2  feature@encryption             enabled                        local
raid2  feature@project_quota          active                         local
raid2  feature@device_removal         enabled                        local
raid2  feature@obsolete_counts        enabled                        local
raid2  feature@zpool_checkpoint       enabled                        local
raid2  feature@spacemap_v2            active                         local
raid2  feature@allocation_classes     enabled                        local
raid2  feature@resilver_defer         enabled                        local
raid2  feature@bookmark_v2            enabled                        local
raid2  feature@redaction_bookmarks    enabled                        local
raid2  feature@redacted_datasets      enabled                        local
raid2  feature@bookmark_written       enabled                        local
raid2  feature@log_spacemap           active                         local
raid2  feature@livelist               enabled                        local
raid2  feature@device_rebuild         enabled                        local
raid2  feature@zstd_compress          active                         local
raid2  feature@draid                  enabled                        local


dataset:

NAME   PROPERTY              VALUE                  SOURCE
raid2  compression           zstd-1                 local
raid2  atime                 off                    local
raid2  xattr                 sa                     local
raid2  relatime              on                     local


@peterska
Copy link
Author

@gamanakis I finally got a chance to test out your patch today. I could not apply it directly, I guess github changed some spaces/tabs, so I patched it manually. The bottom of this comment shows the surrounding code so I can convince you I applied it correctly.

I ran about 8 zfs sned -RLec commands and there were no lockups/panics. I have left the new code running on my server so we can detect any issues as it goes about it daily backup tasks.

        for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
                if (zfeature_active(f, ds->ds_feature[f])) {
                        dsl_dataset_activate_feature(dsobj, f,
                            ds->ds_feature[f], tx);
                }
        }

        /*
         * We are not allowed to dirty a filesystem when done receiving
         * a snapshot. In this case the flag SPA_FEATURE_LARGE_BLOCKS will
         * not be set and a subsequent encrypted raw send will fail. Hence
         * activate this feature if needed here.
         */
        if (dmu_tx_is_syncing(tx)) {
               for (spa_feature_t f = 0; f < SPA_FEATURES; f++) {
                       if (zfeature_active(f, ds->ds_feature_activation[f]) &&
                           !(zfeature_active(f, ds->ds_feature[f]))) {
                               dsl_dataset_activate_feature(dsobj, f,
                                   ds->ds_feature_activation[f], tx);
                               ds->ds_feature[f] = ds->ds_feature_activation[f];
                       }
                }
        } else {
               cmn_err(CE_NOTE, "not syncing context!");
        }

        ASSERT3U(ds->ds_prev != 0, ==,
            dsl_dataset_phys(ds)->ds_prev_snap_obj != 0);

@Clete2
Copy link

Clete2 commented Dec 28, 2022

Just to summarize my findings so far:

The last couple of days I was running 2.1.7 with the PR from @gamanakis : #14304 At first it looked really promising. No crashes at first. But then I had another freeze 2 days ago.

Then I assumed that zstd compression could eventually cause the issue. I have lz4 on the sender side and zstd on the receiver side. I moved all the receiver side to lz4. But then I got another freeze. So zstd is not the issue.

May be it is encryption related. Sender and receiver are encrypted with a keyfile. But I do not know.

Now I am back to square one and I have stock 2.1.7 running with the large-block commit reverted: c8d2ab0

So far I have no freezes.

I don’t think it’s encryption related. I got a crash when receiving into 2.1.7 from 0.8.x. Neither side has any encryption enabled.

Sender has lz4 whereas receiver is zstd-19. I’ll try reverting to lz4 receiver when I get a chance. Of note, the first full send (using syncoid) worked without issue. It was the next, incremental send which crashed my system.

@rincebrain
Copy link
Contributor

rincebrain commented Dec 28, 2022

I could certainly believe it's not encryption related, I've just seen the quota feature and encryption interact poorly before.

My guess would be that it's a timing issue where something doesn't actually block on something else happening first, and if you use encryption or high compression with large blocks or the like, one task takes long enough that the other hits this.

@rincebrain
Copy link
Contributor

Quickly, I added two log lines in the actual deactivate_impl, and got:

[  466.783588] NOTICE: Feature 18 deactivation attempting
[  466.784023] NOTICE: Feature 18 deactivation completed
[  466.790930] NOTICE: Feature 16 deactivation attempting
[  466.791353] NOTICE: Feature 16 deactivation completed
[  466.791726] NOTICE: Feature 18 deactivation attempting
[  466.792083] NOTICE: Feature 18 deactivation completed
[  468.010089] NOTICE: Feature 16 deactivation attempting
[  468.010451] NOTICE: Feature 16 deactivation completed
[  468.010761] NOTICE: Feature 16 deactivation attempting
[  468.011065] VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)

So it seems it tries to deactivate, if I can count, SPA_FEATURE_USEROBJ_ACCOUNTING twice, and boom goes the dynamite...

@gamanakis
Copy link
Contributor

I think it has to do with the code here:

diff --git a/module/zfs/dmu_objset.c b/module/zfs/dmu_objset.c
index c17c829a0..0e8427f19 100644
--- a/module/zfs/dmu_objset.c
+++ b/module/zfs/dmu_objset.c
@@ -2408,13 +2408,6 @@ dmu_objset_id_quota_upgrade_cb(objset_t *os)
            dmu_objset_userobjspace_present(os))
                return (SET_ERROR(ENOTSUP));

-       if (dmu_objset_userobjused_enabled(os))
-               dmu_objset_ds(os)->ds_feature_activation[
-                   SPA_FEATURE_USEROBJ_ACCOUNTING] = (void *)B_TRUE;
-       if (dmu_objset_projectquota_enabled(os))
-               dmu_objset_ds(os)->ds_feature_activation[
-                   SPA_FEATURE_PROJECT_QUOTA] = (void *)B_TRUE;
-
        err = dmu_objset_space_upgrade(os);
        if (err)
                return (err);

If I remove that then it doesn't panic.

@infinet
Copy link

infinet commented Dec 30, 2022

The above change to dmu_objset.c works. It fixes my syncoid backup, which send from zfs 2.0 pool to an encrypted zfs 2.1 pool. feature@userobj_accounting and feature@project_quota are activated in both pools.

@mabod
Copy link

mabod commented Jan 5, 2023

@gamanakis :
I am testing the new PR #14304 since a couple of days. (with patch for dmu_objset.c)
I want to confirm here that it solves my send/receive issues.

@Clete2
Copy link

Clete2 commented Jan 5, 2023

Is there a workaround that doesn’t require a code change? Maybe a flag I can pass into the send or receive command to avoid the code path that causes panic?

I’m in a situation where one server is offsite and unmanaged and if I happen to break it by injecting my own zfs modules then recovery will be tricky.

behlendorf pushed a commit that referenced this issue Jan 12, 2023
When activating filesystem features after receiving a snapshot, do 
so only in syncing context.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14304 
Closes #14252
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Jan 13, 2023
When activating filesystem features after receiving a snapshot, do
so only in syncing context.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes openzfs#14304
Closes openzfs#14252
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Jan 18, 2023
When activating filesystem features after receiving a snapshot, do
so only in syncing context.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes openzfs#14304
Closes openzfs#14252
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Jan 19, 2023
When activating filesystem features after receiving a snapshot, do
so only in syncing context.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes openzfs#14304
Closes openzfs#14252
@phreaker0
Copy link

I just got this line (the only one which I could read of the screen) and needed to reboot:

VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)

with zfs 2.1.9:

$ zfs version
zfs-2.1.9-1
zfs-kmod-2.1.9-1
$ uname -r
5.15.0-58-generic

@gamanakis
Copy link
Contributor

@phreaker0 can you provide additional information? were you sending/receiving when this happened?

@phreaker0
Copy link

@gamanakis i wasn't around when it happened, what I did before I left:

  • upgrade from 2.1.6 to 2.1.9
  • reboot
  • destroy all snapshots listed in the data errors section of my ssd pool (the <0x0> ones from ZFS corruption related to snapshots post-2.0.x upgrade #12014)
  • zpool scrub ssd pool to clear the data errors
  • there is a regular replication of all datasets from the encrypted ssd pool to the local encrypted hdd pool (replication stream is unencrypted) each hour
  • i left

but I guess the crash happened pretty soon, because my influxdb logging stuff didn't record anything any more. The ssd pool is used as ROOT. Therefore I couldn't login any more after the panic to gather more information, I only had the screen output.

I'm trying to reproduce it now, but so far it works, I will keep an active ssh session with the kernel log in case of a crash and will report back.

@phreaker0
Copy link

FYI: I can't reproduce it anymore, it ran for several days.

archlinux-github pushed a commit to archlinux/aur that referenced this issue Feb 9, 2023
archlinux-github pushed a commit to archlinux/aur that referenced this issue Feb 9, 2023
@scratchings
Copy link

I've just had this happen on RHEL 9 with 2.1.9, pool comprised from LUKS encrypted devices (which I'm using because of the send/receive issues with native encryption #11679 - aargh!).
A receive from a Nexenta 4 host to this system (unencrypted) (1M record size/lz4 on the destination, 128K record size on the source) using syncoid+mbuffer on destination reported:

VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)

With /var/log/messages reporting:

Feb 17 08:51:32 fs6 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
Feb 17 08:51:32 fs6 kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
Feb 17 08:51:32 fs6 kernel: Showing stack for process 1123981
Feb 17 08:51:32 fs6 kernel: CPU: 4 PID: 1123981 Comm: txg_sync Kdump: loaded Tainted: P OE --------- --- 5.14.0-162.12.1.el9_1.x86_64 #1
Feb 17 08:51:32 fs6 kernel: Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.9.3 08/05/2022
Feb 17 08:51:32 fs6 kernel: Call Trace:
Feb 17 08:51:32 fs6 kernel: dump_stack_lvl+0x34/0x48
Feb 17 08:51:32 fs6 kernel: spl_panic+0xd1/0xe9 [spl]
Feb 17 08:51:32 fs6 kernel: ? dbuf_rele_and_unlock+0x387/0x6b0 [zfs]
Feb 17 08:51:32 fs6 kernel: ? zap_remove_impl+0xb3/0x120 [zfs]
Feb 17 08:51:32 fs6 kernel: ? kfree+0xac/0x3f0
Feb 17 08:51:32 fs6 kernel: ? spl_kmem_alloc+0xb2/0x100 [spl]
Feb 17 08:51:32 fs6 kernel: dsl_dataset_deactivate_feature_impl+0xfa/0x100 [zfs]
Feb 17 08:51:32 fs6 kernel: dsl_dataset_clone_swap_sync_impl+0x888/0xb50 [zfs]
Feb 17 08:51:32 fs6 kernel: dsl_dataset_rollback_sync+0xf0/0x1d0 [zfs]
Feb 17 08:51:32 fs6 kernel: ? dsl_dataset_hold_flags+0x9d/0x230 [zfs]
Feb 17 08:51:32 fs6 kernel: ? dsl_dataset_rollback_check+0x2f9/0x430 [zfs]
Feb 17 08:51:32 fs6 kernel: dsl_sync_task_sync+0xaa/0xf0 [zfs]
Feb 17 08:51:32 fs6 kernel: dsl_pool_sync+0x40c/0x520 [zfs]
Feb 17 08:51:32 fs6 kernel: spa_sync_iterate_to_convergence+0xf0/0x2f0 [zfs]
Feb 17 08:51:32 fs6 kernel: spa_sync+0x471/0x930 [zfs]
Feb 17 08:51:32 fs6 kernel: txg_sync_thread+0x27a/0x400 [zfs]
Feb 17 08:51:32 fs6 kernel: ? txg_fini+0x260/0x260 [zfs]
Feb 17 08:51:32 fs6 kernel: thread_generic_wrapper+0x59/0x70 [spl]
Feb 17 08:51:32 fs6 kernel: ? __thread_exit+0x20/0x20 [spl]
Feb 17 08:51:32 fs6 kernel: kthread+0x149/0x170
Feb 17 08:51:32 fs6 kernel: ? set_kthread_struct+0x50/0x50
Feb 17 08:51:32 fs6 kernel: ret_from_fork+0x22/0x30

As this is part of a HA pair, the host was fenced shortly afterwards and service resumed on the peer host, where the pool imported with no errors. Pushing the service back to the original host and restarting the transfer has so far not resulted in a repetition.

Load on the system is minimal during the transfer, ca 1.3.

@duramuss
Copy link

Same here on zfs 2.1.9 when doing a rollback on an unmounted dataset:

Linux duranux2 6.0.19 #1 SMP PREEMPT_DYNAMIC Thu Feb 9 01:12:28 CET 2023 x86_64 GNU/Linux

févr. 18 20:07:32 duranux2 kernel: VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
févr. 18 20:07:32 duranux2 kernel: PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_feature_impl()
févr. 18 20:07:32 duranux2 kernel: Showing stack for process 436
févr. 18 20:07:32 duranux2 kernel: CPU: 2 PID: 436 Comm: txg_sync Tainted: P O 6.0.19 #1
févr. 18 20:07:32 duranux2 kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 3603 11/09/2012
févr. 18 20:07:32 duranux2 kernel: Call Trace:
févr. 18 20:07:32 duranux2 kernel:
févr. 18 20:07:32 duranux2 kernel: dump_stack_lvl+0x37/0x47
févr. 18 20:07:32 duranux2 kernel: spl_panic+0xbc/0xd4 [spl]
févr. 18 20:07:32 duranux2 kernel: ? dbuf_rele_and_unlock+0xf1/0x660 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? zap_cursor_advance+0x20/0x20 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? avl_find+0x52/0x90 [zavl]
févr. 18 20:07:32 duranux2 kernel: ? mze_find+0xbb/0xe0 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? zap_name_alloc+0xba/0x1c0 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? zap_lockdir+0x5e/0x90 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_dataset_deactivate_feature_impl+0xdf/0xf0 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_dataset_clone_swap_sync_impl+0x825/0xa10 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_dataset_rollback_sync+0xd1/0x170 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? dsl_dataset_hold_flags+0x8b/0x210 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? dsl_dataset_rollback_check+0x2a4/0x3e0 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_sync_task_sync+0xa4/0xf0 [zfs]
févr. 18 20:07:32 duranux2 kernel: dsl_pool_sync+0x3a6/0x4c0 [zfs]
févr. 18 20:07:32 duranux2 kernel: spa_sync+0x542/0xf30 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? spa_txg_history_init_io+0xde/0xf0 [zfs]
févr. 18 20:07:32 duranux2 kernel: txg_sync_thread+0x210/0x3d0 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? txg_rele_to_sync+0x90/0x90 [zfs]
févr. 18 20:07:32 duranux2 kernel: ? __thread_exit+0x20/0x20 [spl]
févr. 18 20:07:32 duranux2 kernel: thread_generic_wrapper+0x59/0x70 [spl]
févr. 18 20:07:32 duranux2 kernel: kthread+0xd9/0x100
févr. 18 20:07:32 duranux2 kernel: ? kthread_exit+0x30/0x30
févr. 18 20:07:32 duranux2 kernel: ret_from_fork+0x22/0x30
févr. 18 20:07:32 duranux2 kernel:

@gamanakis
Copy link
Contributor

@scratchings @duramuss There is a more suitable fix, see 34ce4c4 and #14502. Hopefully the upcoming 2.1.10 will include them.

lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 3, 2023
When activating filesystem features after receiving a snapshot, do 
so only in syncing context.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes openzfs#14304 
Closes openzfs#14252
@slavanap
Copy link

slavanap commented Mar 8, 2023

Same here. After 30 days of uptime. No ZFS commands issued during last 2 weeks, just normal ops.
ZFS has 2 normal sets, 1 zvol and 1 encrypted set. LZ4 compression, all other settings are set to defaults.

mar 04 21:15:12 sl kernel: VERIFY0(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)
mar 04 21:15:12 sl kernel: PANIC at dsl_dataset.c:1129:dsl_dataset_deactivate_feature_impl()
mar 04 21:15:12 sl kernel: Showing stack for process 40697
mar 04 21:15:12 sl kernel: CPU: 2 PID: 40697 Comm: txg_sync Tainted: P           OE     5.13.0-39-generic #44~20.04.1-Ubuntu
mar 04 21:15:12 sl kernel: Hardware name: System manufacturer System Product Name/PRIME Z270-P, BIOS 0610 05/11/2017
mar 04 21:15:12 sl kernel: Call Trace:
mar 04 21:15:12 sl kernel:  <TASK>
mar 04 21:15:12 sl kernel:  dump_stack+0x7d/0x9c
mar 04 21:15:12 sl kernel:  spl_dumpstack+0x29/0x2b [spl]
mar 04 21:15:12 sl kernel:  spl_panic+0xd4/0xfc [spl]
mar 04 21:15:12 sl kernel:  ? spl_kmem_free+0x28/0x30 [spl]
mar 04 21:15:12 sl kernel:  ? kfree+0xd8/0x2a0
mar 04 21:15:12 sl kernel:  ? dbuf_rele+0x3d/0x50 [zfs]
mar 04 21:15:12 sl kernel:  ? dmu_buf_rele+0xe/0x10 [zfs]
mar 04 21:15:12 sl kernel:  ? zap_unlockdir+0x3f/0x60 [zfs]
mar 04 21:15:12 sl kernel:  ? zap_remove_norm+0x76/0xa0 [zfs]
mar 04 21:15:12 sl kernel:  dsl_dataset_deactivate_feature_impl+0x101/0x110 [zfs]
mar 04 21:15:12 sl kernel:  dsl_dataset_clone_swap_sync_impl+0x874/0x9d0 [zfs]
mar 04 21:15:12 sl kernel:  ? dbuf_rele+0x3d/0x50 [zfs]
mar 04 21:15:12 sl kernel:  ? dmu_buf_rele+0xe/0x10 [zfs]
mar 04 21:15:12 sl kernel:  ? dsl_dir_rele+0x30/0x40 [zfs]
mar 04 21:15:12 sl kernel:  dmu_recv_end_sync+0x26f/0x5a0 [zfs]
mar 04 21:15:12 sl kernel:  dsl_sync_task_sync+0xb6/0x100 [zfs]
mar 04 21:15:12 sl kernel:  dsl_pool_sync+0x3d6/0x4f0 [zfs]
mar 04 21:15:12 sl kernel:  spa_sync+0x55e/0xfd0 [zfs]
mar 04 21:15:12 sl kernel:  ? spa_txg_history_init_io+0x106/0x110 [zfs]
mar 04 21:15:12 sl kernel:  txg_sync_thread+0x229/0x3b0 [zfs]
mar 04 21:15:12 sl kernel:  ? txg_quiesce_thread+0x340/0x340 [zfs]
mar 04 21:15:12 sl kernel:  thread_generic_wrapper+0x79/0x90 [spl]
mar 04 21:15:12 sl kernel:  ? spl_taskq_fini+0x80/0x80 [spl]
mar 04 21:15:12 sl kernel:  kthread+0x12b/0x150
mar 04 21:15:12 sl kernel:  ? set_kthread_struct+0x40/0x40
mar 04 21:15:12 sl kernel:  ret_from_fork+0x22/0x30
mar 04 21:15:12 sl kernel:  </TASK>
$ zfs --version
zfs-2.1.99-1641_gc935fe2e9
zfs-kmod-2.1.99-1641_gc935fe2e9

Will rebuild from master.

@gamanakis
Copy link
Contributor

@slavanap this should be fixed in current master and in the upcoming 2.1.10 release.

@scratchings
Copy link

I patched 2.1.9 using 34ce4c4 and #14502 and deployed this yesterday, I'm sorry to say I'm still getting panics - two today:

Message from syslogd@fs7 at Mar 9 16:24:12 ...
kernel:VERIFY3(0 == zap_remove(mos, dsobj, spa_feature_table[f].fi_guid, tx)) failed (0 == 2)

Message from syslogd@fs7 at Mar 9 16:24:12 ...
kernel:PANIC at dsl_dataset.c:1116:dsl_dataset_deactivate_featur

@rincebrain
Copy link
Contributor

I'd probably try eee9362 instead of 0f32b1f if that's still failing for you. If it still fails with 34ce4c4 and eee9362 , that's differently interesting. If not, then we know 0f32b1f either needs more work or there's something about master that makes it behave differently than 2.1 here, depending.

@gamanakis
Copy link
Contributor

To everyone interested: give zfs-2.1.10-staging a try. It contains already all the fixes.

allanjude pushed a commit to allanjude/zfs that referenced this issue Apr 7, 2023
When activating filesystem features after receiving a snapshot, do
so only in syncing context.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes openzfs#14304
Closes openzfs#14252
(cherry picked from commit eee9362)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

Successfully merging a pull request may close this issue.