
Permanent zpool errors after a few days of making natively encrypted zvol snapshots with sanoid #15837

Open
rjycnfynby opened this issue Jan 29, 2024 · 59 comments
Labels
Component: Encryption "native encryption" feature Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@rjycnfynby

rjycnfynby commented Jan 29, 2024

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version default/linux/amd64/17.1 (stable) profile
Kernel Version 6.6.13-gentoo-dist
Architecture x86_64
OpenZFS Version zfs-2.2.2-r1-gentoo / zfs-kmod-2.2.2-r0-gentoo

Describe the problem you're observing

After several days of making hourly autosnapshots of slightly over twenty natively encrypted zvols, the scheduled sanoid cron job starts producing "cannot iterate filesystems: I/O error" errors while taking snapshots. The "zpool status -vx" command outputs "errors: Permanent errors have been detected in the following files:" followed by a blank list. I get the same "cannot iterate filesystems: I/O error" when trying to list snapshots on some of the zvols. Currently I have about 16 zvols in this error state. The total number of snapshots is slightly less than a thousand.

"zfs list -t snap -r tank0 | wc -l" command gives me 23 "cannot iterate filesystems: I/O error" lines and the result of 991 to be exact. No errors in dmesg found. This particular pool was made out of three mirrored WD SA500 SSDs which support trim/unmap on LSI controllers but similar results were observed on a different servers and disks.

At least four different servers with a similar setup failed the same way. Previously I was getting similar issues with ZFS version 2.1.14 and kernel 6.1.69-gentoo-dist, with the slight difference that the "zpool status -vx" command gave more detailed output listing the exact failing snapshots, and I could also see an increasing kcf_ops_failed counter using the command "grep . /proc/spl/kstat/kcf/*". Later Gentoo marked ZFS 2.2.2 as stable and I decided to try it one more time with the newer version.

Pools on different servers started to fail after 3 to 5 days of uptime. All of them had fewer than a thousand snapshots.

Describe how to reproduce the problem

  1. Install the latest ZFS and a binary distribution kernel;
  2. Create a zpool with autotrim enabled (probably irrelevant, but that's what I do on SSD pools);
  3. Enable LZ4 compression on the root dataset (probably irrelevant);
  4. Create a dataset with autotrim enabled (probably irrelevant);
  5. Create an encrypted dataset named "encrypted" which will hold the rest of the datasets and zvols (a command sketch for steps 2-5 follows this list);
  6. Attach the zvols to VMs using the Xen 4.16.5 hypervisor (might be irrelevant);
  7. Install sanoid and configure it to take and keep the last 36 "hourly", 4 "weekly" and 2 "monthly" snapshots for almost every zvol;
  8. Configure syncoid on a remote server to replicate snapshots from the source server (probably irrelevant);
  9. Wait a few days until the issue starts to appear on more and more zvols, generating "cannot iterate filesystems: I/O error" errors during snapshot operations.
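A minimal command sketch for steps 2 to 5 (disk names, pool name and key handling are assumptions, not taken from this report):

zpool create -o autotrim=on tank0 mirror /dev/sda /dev/sdb /dev/sdc           # step 2: pool with autotrim
zfs set compression=lz4 tank0                                                  # step 3: LZ4 on the root dataset
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank0/encrypted   # step 5: encrypted parent dataset
zfs create -V 32G tank0/encrypted/vm-disk0                                     # an example zvol handed to a VM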

Include any warning/errors/backtraces from the system logs

I couldn't find any related errors in system logs.

@rjycnfynby rjycnfynby added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jan 29, 2024
@rincebrain rincebrain added the Component: Encryption "native encryption" feature label Jan 30, 2024
@ckruijntjens

ckruijntjens commented Aug 10, 2024

I have the same issue on Debian 12 with ZFS version 2.2.4-1.

Any update on this zfs bug?

@rjycnfynby
Author

I had to migrate data back to an unencrypted pool since it was a constant problem.

@ckruijntjens

I had to migrate data back to an unencrypted pool since it was a constant problem.

I understand, but this is not the solution. There is a bug in ZFS that is causing this. I hope they are looking at this problem soon, as it is really annoying.

@IvanVolosyuk
Contributor

Is it the same as #12014? Can you try ZFS 2.2.5 or later?

@ckruijntjens

Is it the same as #12014? Can you try ZFS 2.2.5 or later?

No, it is not fully the same. After some time I also get corruption errors in my encrypted pool. However, when I look for the files that are corrupted I get an empty list.

If I then start a scrub, cancel it after 1 percent, reboot and run a scrub again, everything is normal again.

With ZFS version 2.2.5 I had the same issue. I upgraded to the latest ZFS version. Hopefully it is gone with the latest version.
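The scrub workaround described above, as a hedged sketch (the pool name is an assumption):

zpool scrub tank0        # first scrub
zpool scrub -s tank0     # cancel it after roughly 1 percent
# reboot, then run a full scrub
zpool scrub tank0
zpool status -v tank0    # the error list should be gone if the workaround applied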

@ckruijntjens

With version 2.2.6 I still get the same errors after a couple of days with a natively encrypted ZFS pool.

This bug is really annoying.

@rjycnfynby
Author

With version 2.2.6 I still get the same errors after a couple of days with a natively encrypted ZFS pool.

This bug is really annoying.

I would say that it blocks us from using native encryption in production environments, or any environment that is going to run longer than a couple of days with multiple replicating snapshots.

Has anybody managed to replicate this bug in TrueNAS on Linux by any chance?

@ckruijntjens

ckruijntjens commented Sep 26, 2024

With version 2.2.6 I still get the same errors after a couple of days with a natively encrypted ZFS pool.
This bug is really annoying.

I would say that it blocks us from using native encryption in production environments, or any environment that is going to run longer than a couple of days with multiple replicating snapshots.

Has anybody managed to replicate this bug in TrueNAS on Linux by any chance?

I agree. I think I am going to create a new pool without native encryption.

@aaltonenp

Is it known in what version this started? I'm still running Ubuntu 20.04 LTS with ZFS 0.8.3 and not seeing this problem. But I'm currently sending only one encrypted dataset to another host, and from there to a third host where there are multiple encrypted datasets. I'm doing it with pyznap. Maybe my use case is simple enough, with only a single dataset, to not trigger this, or it started in a later version.

I dread updating to a newer version when LTS updates end.

@tweax-sarl

I read an article a couple of months ago mentioning a bug in recent OpenZFS related to encrypted datasets. In the meantime I forgot about it... I installed two servers running Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-45-generic x86_64) two months ago. OpenZFS is 2.2.2. The machines run a couple of VMs in qemu-kvm. Everything seemed to work fine, until a week ago I started cross-copying the encrypted volumes between the two machines using sanoid / syncoid for a failover scenario. There is process control to ensure that only one syncoid job runs at a time. Syncoid does not create snapshots itself but only shuffles the latest one to the other server. It took less than a day for the first "cannot iterate filesystems: I/O error" to appear on one of the two servers. No problem on Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-177-generic x86_64) with OpenZFS 0.8.3, though. Since I was not really keen on reinstalling the servers with Ubuntu 20.04.6 LTS, I migrated the volumes onto non-encrypted datasets and hope that this will do the job.

@aaltonenp I think for the moment you are very well advised to stay with Ubuntu 20.04 LTS.

@ckruijntjens

ckruijntjens commented Oct 18, 2024

When will this issue be fixed? The latest version still has this problem.

@ckruijntjens

@zfs team: why are critical bugs like this one open for months with no fix, while new versions of ZFS keep being pushed?

@Germano0

Cannot reproduce on a RHEL 9 system currently running ZFS 2.1.15 and Sanoid 2.2.0; this OS has been running since September 2022.

@ckruijntjens

Cannot reproduce on a RHEL 9 system currently running ZFS 2.1.15 and Sanoid 2.2.0; this OS has been running since September 2022.

Are you also using syncoid to replicate to another machine?

I am having this issue every 5 days.

@Germano0

Are you also using syncoid to replicate to another machine?

No, I am replicating only via zfs send/recv

@ckruijntjens

Are you also using syncoid to replicate to another machine?

No, I am replicating only via zfs send/recv

Ok,

I am using the following versions and OS.

Debian Bookworm
Using backports kernel - Debian 6.9.7-1~bpo12+1

ZFS version also from backports:

zfs-2.2.6-1~bpo12+3
zfs-kmod-2.2.6-1~bpo12+3

sanoid version
/usr/sbin/sanoid version 2.1.0

syncoid version
/usr/sbin/syncoid version 2.1.0

Every 5 days syncoid stops replicating to the other machine because of I/O errors. My pool shows errors, but when I try to list the affected files with the -v option the list is empty. If I scrub twice the errors are gone (the first scrub needs to be canceled after 1%).

And of course I am using native ZFS encryption. I don't understand why some people, like me, are having problems with this and others don't.
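For context, the kind of cron wiring this setup typically uses (a hedged sketch; the schedule, dataset and target host are assumptions, not taken from this comment):

# /etc/cron.d/zfs-replication (assumed layout)
*/15 * * * *  root  /usr/sbin/sanoid --cron
0 * * * *     root  /usr/sbin/syncoid --recursive incuspool/encrypted backuphost:backup/encrypted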

@mcmilk
Contributor

mcmilk commented Oct 27, 2024

Debian Bookworm Using backports kernel - Debian 6.9.7-1~bpo12+1

ZFS version also from backports:

zfs-2.2.6-1~bpo12+3 zfs-kmod-2.2.6-1~bpo12+3

sanoid version /usr/sbin/sanoid version 2.1.0

syncoid version /usr/sbin/syncoid version 2.1.0

Every 5 days syncoid stops replicating to the other machine because of I/O errors. My pool shows errors, but when I try to list the affected files with the -v option the list is empty. If I scrub twice the errors are gone (the first scrub needs to be canceled after 1%).

And of course I am using native ZFS encryption. I don't understand why some people, like me, are having problems with this and others don't.

What encryption and checksumming algo do you use?

@ckruijntjens

Debian Bookworm Using backports kernel - Debian 6.9.7-1~bpo12+1
ZFS version also from backports:
zfs-2.2.6-1~bpo12+3 zfs-kmod-2.2.6-1~bpo12+3
sanoid version /usr/sbin/sanoid version 2.1.0
syncoid version /usr/sbin/syncoid version 2.1.0
Every 5 days syncoid stops replicating to the other machine because of I/O errors. My pool shows errors, but when I try to list the affected files with the -v option the list is empty. If I scrub twice the errors are gone (the first scrub needs to be canceled after 1%).
And of course I am using native ZFS encryption. I don't understand why some people, like me, are having problems with this and others don't.

What encryption and checksumming algo do you use?

Hi, I use encryption:

aes-256-gcm

Checksum is default.
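For reference, these settings can be confirmed with the following (a hedged sketch; the dataset name is an assumption):

zfs get encryption,keyformat,checksum,compression incuspool/encrypted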

@mcmilk
Contributor

mcmilk commented Oct 27, 2024

Okay, I will try to dig deeper into this thing. AES for ARM is also ongoing ;-)

@ckruijntjens

Okay, I will try to dig deeper into this thing. AES for ARM is also ongoing ;-)

That would be great. It would be super if we get a fix for this problem. Thank you very much.

@ckruijntjens

Okay, I will try to dig deeper into this thing. AES for ARM is also ongoing ;-)

PS: if you need info from my system, or if I can help with some info of some sort, please feel free to ask. I am very willing to help.

@ckruijntjens

Okay, I will try to dig deeper into this thing. AES for ARM is also ongoing ;-)

Hi @mcmilk

Did you find anything? Do you need any help with log files or so?

Kind regards.

@ckruijntjens

Okay, I will try to dig deeper into this thing. AES for ARM is also ongoing ;-)

Strangest thing here: normally my system would produce ZFS I/O errors after a few days, always between 2 and 5 days. Now I have set --source-bwlimit=50M (syncoid) and my system is not producing these errors.

My system has now been running for almost 7 days without ZFS errors. I will keep track of it and let you know whether the issue returns.

@ckruijntjens

Okay, I will try to dig deeper into this thing. AES for ARM is also ongoing ;-)

Strangest thing here: normally my system would produce ZFS I/O errors after a few days, always between 2 and 5 days. Now I have set --source-bwlimit=50M (syncoid) and my system is not producing these errors.

My system has now been running for almost 7 days without ZFS errors. I will keep track of it and let you know whether the issue returns.

Never mind,

now at 7 days of uptime the errors are here again.

Time to reboot and scrub twice. I am reverting everything to an unencrypted pool. Encryption is too buggy.

@mcmilk
Contributor

mcmilk commented Nov 16, 2024

@ckruijntjens - I am sorry, my time for OpenZFS is very short again.

@ckruijntjens

@ckruijntjens - I am sorry, my time for OpenZFS is very short again.

No problem. I am reverting to an unencrypted pool as we speak. Encryption is not trustworthy with ZFS. One thing I noticed (maybe it will help with troubleshooting): the errors only happen on encrypted pools where snapshotting (sanoid) and syncoid are used. If sanoid or syncoid is not used, the pool stays healthy with no I/O errors.

@Germano0

#12014 (comment)

@IvanVolosyuk
Contributor

@robn FYI

@ckruijntjens

Looks pretty ordinary, so more details can be very useful.

  • What Linux kernel version are you using?
  • What ZFS properties and parameters did you change?
  • What kind of load is the machine handling?
  • Is the system overcommitted in terms of RAM?
  • How many reads and writes are happening, and how much free space is in the pool?
  • What kind of hardware / architecture do you have? Does it use ECC RAM?

The more details the better. Are you willing to enable debugging and memory sanitizer kernel / compiler options?

Sure,

I am willing to help in every way you need.

  • What Linux kernel version are you using? Linux ESX 6.9.7+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.9.7-1~bpo12+1 (2024-07-03) x86_64 GNU/Linux

  • What ZFS properties and parameters did you change?
    How can I see the properties you need? Which command?

  • What kind of load is the machine handling?
    I am running multiple virtual machines and containers on this system.

  • Is the system overcommitted in terms of RAM? I have lots of RAM. I also did a RAM test and it is all good.
    total used free shared buff/cache available
    Mem: 125Gi 104Gi 16Gi 23Gi 28Gi 20Gi

  • How many reads and writes are happening, and how much free space is in the pool?
    capacity operations bandwidth
    pool alloc free read write read write


incuspool 1.03T 2.59T 1.12K 191 43.8M 4.92M

What kind of hardware / architecture do you have? Does it use ECC RAM?
i9 processor, lots of non-ECC RAM, NVMe disks.
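A minimal sketch of the commands that would answer the properties/parameters question above (the pool name is taken from the iostat output; the rest is generic):

zfs get -s local all incuspool                    # dataset properties explicitly set by the admin
zpool get all incuspool                           # pool-level properties, including autotrim
grep . /sys/module/zfs/parameters/* 2>/dev/null   # current ZFS module parameters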

Kind regards,

Chris

@Germano0

@ckruijntjens can you please post your sanoid conf file?

@ckruijntjens

ckruijntjens commented Dec 11, 2024

Yes of course. This is the config file:

root@ESX:~# cat /etc/sanoid/sanoid.conf
######################################
# This is a sample sanoid.conf file. #
# It should go in /etc/sanoid.       #
######################################

## name your backup modules with the path to their ZFS dataset - no leading slash.
#[zpoolname/datasetname]
#       # pick one or more templates - they're defined (and editable) below. Comma separated, processed in order.
#       # in this example, template_demo's daily value overrides template_production's daily value.
#       use_template = production,demo
#
#       # if you want to, you can override settings in the template directly inside module definitions like this.
#       # in this example, we override the template to only keep 12 hourly and 1 monthly snapshot for this dataset.
#       hourly = 12
#       monthly = 1
#
## you can also handle datasets recursively.
#[zpoolname/parent]
#       use_template = production
#       recursive = yes
#       # if you want sanoid to manage the child datasets but leave this one alone, set process_children_only.
#       process_children_only = yes
#
## you can selectively override settings for child datasets which already fall under a recursive definition.
#[zpoolname/parent/child]
#       # child datasets already initialized won't be wiped out, so if you use a new template, it will
#       # only override the values already set by the parent template, not replace it completely.
#       use_template = demo


# you can also handle datasets recursively in an atomic way without the possibility to override settings for child datasets.
[rpool]
        use_template = production
        recursive = yes

[incuspool]
        use_template = incuspool
        recursive = yes

#############################
# templates below this line #
#############################

# name your templates template_templatename. you can create your own, and use them in your module definitions above.

[template_demo]
        daily = 60

[template_incuspool]
        hourly = 24
        daily = 31
        monthly = 6
        autosnap = yes
        autoprune = yes

[template_production]
        frequently = 0
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes

[template_backup]
        autoprune = yes
        frequently = 0
        hourly = 30
        daily = 90
        monthly = 12
        yearly = 0

        ### don't take new snapshots - snapshots on backup
        ### datasets are replicated in from source, not
        ### generated locally
        autosnap = no

        ### monitor hourlies and dailies, but don't warn or
        ### crit until they're over 48h old, since replication
        ### is typically daily only
        hourly_warn = 2880
        hourly_crit = 3600
        daily_warn = 48
        daily_crit = 60

[template_hotspare]
        autoprune = yes
        frequently = 0
        hourly = 30
        daily = 90
        monthly = 3
        yearly = 0

        ### don't take new snapshots - snapshots on backup
        ### datasets are replicated in from source, not
        ### generated locally
        autosnap = no

        ### monitor hourlies and dailies, but don't warn or
        ### crit until they're over 4h old, since replication
        ### is typically hourly only
        hourly_warn = 4h
        hourly_crit = 6h
        daily_warn = 2d
        daily_crit = 4d

[template_scripts]
        ### information about the snapshot will be supplied as environment variables,
        ### see the README.md file for details about what is passed when.
        ### run script before snapshot
        pre_snapshot_script = /path/to/script.sh
        ### run script after snapshot
        post_snapshot_script = /path/to/script.sh
        ### run script after pruning snapshot
        pruning_script = /path/to/script.sh
        ### don't take an inconsistent snapshot (skip if pre script fails)
        #no_inconsistent_snapshot = yes
        ### run post_snapshot_script when pre_snapshot_script is failing
        #force_post_snapshot_script = yes
        ### limit allowed execution time of scripts before continuing (<= 0: infinite)
        script_timeout = 5

[template_ignore]
        autoprune = no
        autosnap = no
        monitor = no

@openzfs openzfs deleted a comment from Germano0 Dec 11, 2024
@openzfs openzfs deleted a comment from ckruijntjens Dec 11, 2024
@openzfs openzfs deleted a comment from Germano0 Dec 11, 2024
@ckruijntjens

Thank you @amotin

And sorry.

@ckruijntjens

@Germano0 in the other GitHub issue someone posted that he switched from sanoid to zfsbackup and the issue is gone. Could the problem be sanoid / syncoid? That something there is not going right?
See post:

#12014 (comment)

kind regards,

Chris.

@Germano0

@Germano0 in the other GitHub issue someone posted that he switched from sanoid to zfsbackup and the issue is gone. Could the problem be sanoid / syncoid? That something there is not going right?

I am creating 2 VMs with virtual disks, trying to reproduce the same environment.
sanoid / syncoid are OpenZFS wrappers, so I don't think that stopping using them is a feasible fix

@ckruijntjens

@Germano0 in the other GitHub issue someone posted that he switched from sanoid to zfsbackup and the issue is gone. Could the problem be sanoid / syncoid? That something there is not going right?

I am creating 2 VMs with virtual disks, trying to reproduce the same environment. sanoid / syncoid are OpenZFS wrappers, so I don't think that stopping using them is a feasible fix

Ok,

I will wait for your findings with the 2 VMs and sanoid / syncoid. The errors show up after 4 to 5 days of using sanoid and syncoid.

@IvanVolosyuk
Contributor

I noticed mention of bookmarks in addition to snapshots. I wonder if that is the edge case which is causing the problems.

@ckruijntjens

I noticed mention of bookmarks in addition to snapshots. I wonder if that is the edge case which is causing the problems.

I am indeed always using bookmarks.

@IvanVolosyuk
Contributor

I noticed mention of bookmarks in addition to snapshots. I wonder if that is the edge case which is causing the problems.

I am indeed always using bookmarks.

It would be nice if other people who experience the issue could mention whether they use bookmarks, to confirm or rule this out.

If the issue only manifests on the sender machine we don't actually need the receiver VM to reproduce it, unless resumable send is at play, which looks like the case. That means that to reproduce it we would need to simulate broken connections between the VMs. Are there indications that there was a broken connection and a resumed send before the corruption happened?

I tried to reproduce this with two VMs without much luck. I wonder if something else is needed as well. Memory pressure on the ARC? Gang blocks?
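A hedged sketch of how an interrupted, resumable send could be simulated between the two VMs (dataset, snapshot and host names are assumptions):

# deliberately truncate the stream so the receive aborts partway through
zfs send tank0/encrypted/vol@snap1 | head -c 100M | ssh receiver zfs receive -s backup/vol

# the aborted receive (-s) leaves a resume token behind
ssh receiver zfs get -H -o value receive_resume_token backup/vol

# resume the send from that token (substitute the value printed above)
zfs send -t <token> | ssh receiver zfs receive -s backup/vol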

@ckruijntjens

I noticed mention of bookmarks in addition to snapshots. I wonder if that is the edge case which is causing the problems.

I am indeed always using bookmarks.

It would be nice if other people who experience the issue could mention whether they use bookmarks, to confirm or rule this out.

If the issue only manifests on the sender machine we don't actually need the receiver VM to reproduce it, unless resumable send is at play, which looks like the case. That means that to reproduce it we would need to simulate broken connections between the VMs. Are there indications that there was a broken connection and a resumed send before the corruption happened?

I tried to reproduce this with two VMs without much luck. I wonder if something else is needed as well. Memory pressure on the ARC? Gang blocks?

Hi,

I am using bookmarks. Now I am trying without the bookmark option to see if it changes anything. I am using resume on zfs sends. I never see broken connections or resent sends on my system. Please bear in mind that the corruption only happens after 4 to 5 days, no matter how much data load there is on the pool.

@mcmilk
Contributor

mcmilk commented Dec 13, 2024

Is it possible to upload the zpool history output of both pools?
If they are very long, maybe truncate them a bit or post a link to some webspace.
I think this could also help.

@tweax-sarl

Hi,

It would be nice if other people who experience the issue could mention whether they use bookmarks, to confirm or rule this out.

I never used bookmarks and experienced the issue. I was also using sanoid / syncoid, though. And always on the sending side. Unfortunately, I am not in a position right now where I can try to reproduce the issue. I hope that at the beginning of next year I'll find some time and resources to give it another look. My short-term fix was just to create an unencrypted filesystem in the same pool and move all volumes onto it, and it has never happened again. So it is difficult for me to imagine how hardware could trigger this.

@ckruijntjens

Hi,

It would be nice if other people who experience the issue could mention whether they use bookmarks, to confirm or rule this out.

I never used bookmarks and experienced the issue. I was also using sanoid / syncoid, though. And always on the sending side. Unfortunately, I am not in a position right now where I can try to reproduce the issue. I hope that at the beginning of next year I'll find some time and resources to give it another look. My short-term fix was just to create an unencrypted filesystem in the same pool and move all volumes onto it, and it has never happened again. So it is difficult for me to imagine how hardware could trigger this.

I agree. I don't think it is a problem with the bookmarks.

@ckruijntjens

Hi,

It would be nice if other people who experience the issue could mention whether they use bookmarks, to confirm or rule this out.

I never used bookmarks and experienced the issue. I was also using sanoid / syncoid, though. And always on the sending side. Unfortunately, I am not in a position right now where I can try to reproduce the issue. I hope that at the beginning of next year I'll find some time and resources to give it another look. My short-term fix was just to create an unencrypted filesystem in the same pool and move all volumes onto it, and it has never happened again. So it is difficult for me to imagine how hardware could trigger this.

Indeed, now I am not using bookmarks and the error returned after 4 days. So bookmarks are not the issue.

@ckruijntjens

Is it possible to upload the zpool history output of both pools? If they are very long, maybe truncate them a bit or post a link to some webspace. I think this could also help.

Could you do anything with the logs?

@ckruijntjens

Hi All,

I read that if you are using raw sends the issue does not happen. I am now testing with raw sends to see if this works. If raw sends work, then I think there is something going on with the decryption before sending the data. I will keep you informed of the status.

@amotin
Member

amotin commented Dec 20, 2024

On my last look into this area I noticed that the dbuf layer, unlike the ARC, is generally incapable of handling an encrypted and an unencrypted version of a buffer at the same time. I haven't found how it may happen, but I still worry it might be an issue. Should somebody find a repeatable way to trigger it, I'd be happy to dig in.

@ckruijntjens
Copy link

Hi All,

I read that if you are using raw sends the issue does not happen. I am now testing with raw sends to see if this works. If raw sends work, then I think there is something going on with the decryption before sending the data. I will keep you informed of the status.

I now use a different approach with sanoid/syncoid: I use sendoptions to send the snapshots recursively. Now I don't get errors on the encrypted pool. It has been running for 6 days without errors; before, I always had errors within 5 days.

I will keep you informed.

@ckruijntjens

Guys,

I can confirm: if I use the following I get encryption errors within 5 days.

syncoid --recursive

But when I use it like this the errors do not happen.

syncoid --sendoptions="R"

I am now using it to send in raw mode, so the errors are not happening anymore.
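For clarity, the two syncoid invocations being compared (a hedged sketch; dataset and host names are assumptions):

# original setup: syncoid walks the children itself and sends each dataset separately
syncoid --recursive tank0/encrypted backuphost:backup/encrypted

# changed setup: -R is passed through to zfs send, producing a single replication stream
syncoid --sendoptions="R" tank0/encrypted backuphost:backup/encrypted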

@ckruijntjens

ckruijntjens commented Jan 2, 2025

Hi All,

I have one question. Now that I am sending raw and with the sendoptions, I cannot see the speed of the sending pool. How can I see the speed it is going at?

What I mean is: when I check journalctl it is not showing any progress bar. If I do it manually it shows a progress bar.
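When syncoid runs non-interactively there is no terminal for a progress bar, so one hedged alternative is to watch the pool itself (pool name assumed):

zpool iostat incuspool 5    # read/write bandwidth printed every 5 seconds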

@usrlocalben

A zpool scrub seems to "clear" the corruption -- but is that a strong indicator? If scrub can make sense of it, why wouldn't all I/O calls just do whatever scrub is doing? It seems suspicious.

@rjycnfynby
Author

A zpool scrub seems to "clear" the corruption -- but is that a strong indicator? If scrub can make sense of it, why wouldn't all I/O calls just do whatever scrub is doing? It seems suspicious.

In my case zpool scrub was not always "clearing" the corruptions.

@usrlocalben

usrlocalben commented Jan 27, 2025

A zpool scrub seems to "clear" the corruption -- but is that a strong indicator? If scrub can make sense of it, why wouldn't all I/O calls just do whatever scrub is doing? It seems suspicious.

In my case zpool scrub was not always "clearing" the corruptions.

Mine did on the first try, but it didn't take long to re-enter the corrupt state. I ultimately destroyed/created it and switched to raw send/recv streams and so far it has been fine.

I just happened to be at the terminal to observe it fail the second time:

Message from syslogd@pve at Jan 17 17:56:39 ...
 kernel:[514501.585534] VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)

Message from syslogd@pve at Jan 17 17:56:39 ...
 kernel:[514501.593196] PANIC at dmu_recv.c:2093:receive_object()
[514501.585534] VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
[514501.593196] PANIC at dmu_recv.c:2093:receive_object()
[514501.598340] Showing stack for process 3402428
[514501.598342] CPU: 2 PID: 3402428 Comm: receive_writer Tainted: P        W  O       6.8.12-5-pve #1
[514501.598362] Hardware name: Hewlett-Packard HP Z840 Workstation/2129, BIOS M60 v02.61 03/23/2023
[514501.598364] Call Trace:
[514501.598366]  <TASK>
[514501.598370]  dump_stack_lvl+0x76/0xa0
[514501.598379]  dump_stack+0x10/0x20
[514501.598383]  spl_dumpstack+0x29/0x40 [spl]
[514501.598395]  spl_panic+0xfc/0x120 [spl]
[514501.598404]  ? dbuf_rele+0x3b/0x50 [zfs]
[514501.598573]  receive_object+0xd54/0xff0 [zfs]
[514501.598716]  ? __slab_free+0xdf/0x310
[514501.598722]  ? spl_kmem_free+0x31/0x40 [spl]
[514501.598731]  ? kfree+0x240/0x2f0
[514501.598734]  receive_writer_thread+0x2f5/0xa90 [zfs]
[514501.598876]  ? spl_kmem_free+0x31/0x40 [spl]
[514501.598885]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[514501.598893]  ? kfree+0x240/0x2f0
[514501.598896]  ? __pfx_receive_writer_thread+0x10/0x10 [zfs]
[514501.599036]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[514501.599045]  thread_generic_wrapper+0x5f/0x70 [spl]
[514501.599053]  kthread+0xf2/0x120
[514501.599057]  ? __pfx_kthread+0x10/0x10
[514501.599060]  ret_from_fork+0x47/0x70
[514501.599064]  ? __pfx_kthread+0x10/0x10
[514501.599066]  ret_from_fork_asm+0x1b/0x30
[514501.599070]  </TASK>

@rjycnfynby
Author

I can confirm: if I use the following I get encryption errors within 5 days.

syncoid --recursive

But when I use it like this the errors do not happen.

syncoid --sendoptions="R"

I am now using it to send in raw mode, so the errors are not happening anymore.

According to the zfs-send manual, "-R" stands for "--replicate" and not raw, which would be the "-w" or "--raw" option. Does that mean you are also adding the "w" flag along with "R"?
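The distinction in zfs send terms (a hedged sketch; the dataset name is an assumption):

zfs send -R tank0/encrypted@snap1      # -R / --replicate: one stream covering the dataset and its descendants
zfs send -w tank0/encrypted@snap1      # -w / --raw: blocks are sent still encrypted, the receiver never needs the key
zfs send -R -w tank0/encrypted@snap1   # both combined: a raw replication stream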

@rjycnfynby
Author

On my last look into this area I noticed that the dbuf layer, unlike the ARC, is generally incapable of handling an encrypted and an unencrypted version of a buffer at the same time. I haven't found how it may happen, but I still worry it might be an issue. Should somebody find a repeatable way to trigger it, I'd be happy to dig in.

Are you talking about how to trigger the bug right away? In all of my attempts it always started to happen after a few days of sending a dozen snapshots every hour.
