PANIC at dmu_recv.c:2092:receive_object() when receiving to encrypted dataset with 2.2.6 #16623
Comments
@scratchings Have you checked to which replication the process 594865 triggering the panic belongs? Is it the remote raw/encrypted one, or the local unencrypted one? The error 5 (EIO) comes from receive_object(), per the panic message. Is the target pool that your local replication receives into also encrypted, or plaintext? I wonder if the panic may be happening on that plain-text receive, since it, unlike a raw receive, does decrypt on receive, and may produce EIO there as part of decryption or authentication. If those guesses give nothing, I wonder if it is possible to try bpftrace.

PS: Just as a random thought: have you observed random disconnects and replication restarts just before the panics, by chance? I suppose for the remote system that may happen more often than for the local one.
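A couple of standard commands could answer those questions; the pool name `tank` below is a placeholder for the actual receive target:

```sh
# Which replication command does the PID from the panic belong to?
ps -o pid,ppid,args -p 594865

# Is the receive target encrypted, and which dataset is its encryption root?
zfs get -r encryption,keystatus,encryptionroot tank
```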
Sorry, the host has long since been rebooted, but I was convinced at the time that this related to a receive on the old pool, as the send/receive to the new pool was still happily progressing (although even that eventually stopped working a few days later). The transfers that cause the issue are:
As to the PS, it's not always even a panic; often it's just that the 'zfs' commands syncoid issues hang indefinitely, e.g. rollbacks or snapshot enumeration. Worst case, the kernel crashes or the watchdog process reboots the host. This continuation of operation was unusual; I don't think I've seen that before, but then I haven't had the second pool for very long. As this usually occurs while the machine isn't being actively watched and the syncoid process runs quietly, we often don't notice for hours or even a day. I've still got about a week's worth of transfers before I'm completely on the LUKS pool, so if there's anything I can do to capture data, let me know. I'd need guidance on how to use bpftrace.
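Since guidance was requested: a minimal bpftrace sketch along the lines suggested above, assuming the receive_object symbol from dmu_recv.c is visible to kprobes on this kernel build (it is a static function, so whether it can be probed may vary):

```sh
# Print the kernel stack whenever receive_object() returns EIO (5).
sudo bpftrace -e '
kretprobe:receive_object
/retval == 5/
{
    printf("%s (pid %d): receive_object returned EIO\n", comm, pid);
    print(kstack);
}'
```

Left running across a replication cycle, this should at least confirm which process and which receive path hits the error.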
Further update: I forgot to mention that, post-hang, the forced reboot went badly wrong. The boot loader was unable to activate any SAS devices and thus could not load the OS. Reverting to the previous kernel (427.33) fixed everything, and upgrading to the latest kernel was similarly fine. At this point I noticed that, under the old kernel, the pool had lost one of its drives. Are there conditions where a disk write failure would cause an EIO and a hang?
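On the drive-failure question, a generic way to check whether ZFS recorded device-level I/O errors around the hang (a sketch, not specific to this system):

```sh
# Any unhealthy pools, with per-device error counts?
zpool status -xv

# Recent ZFS kernel events, including I/O error reports
zpool events -v | tail -n 50

# Block-layer errors in the kernel log
dmesg | grep -iE 'i/o error|blk_update_request'
```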
System information
Describe the problem you're observing
A panic is reported on the pool receiving snapshots into an encrypted dataset.
Describe how to reproduce the problem
Use Syncoid/Sanoid to receive (and prune) snapshots from remote servers on a cron schedule. Wait some time; a crash will occur. A representative setup is sketched below.
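For illustration only: the hostname, pool, and dataset names below are hypothetical, and the exact syncoid flags used on the affected system are not shown in this report:

```sh
# /etc/cron.d/syncoid-pull: pull snapshots from a remote server every 15 minutes.
# Pruning is handled separately by sanoid according to its retention policy.
*/15 * * * * root /usr/sbin/syncoid --no-sync-snap backup@remote.example.com:tank/data pool0/backups/remote/data
```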
As requested, this is a new ticket reporting an issue related to the closed ticket #11679.
As described in #11679, this WAS using my exclusive-lock python script to prevent multiple receives into the same pool. However, since I am transferring data out of this pool to a LUKS-encrypted pool on a different JBOD, the affected pool is also being used as the source of another internal send/receive via syncoid (and of several rsync jobs that likewise transfer data out); the source dataset is not the same as the dataset being received into.
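The python lock script itself isn't reproduced here; the same serialization can be sketched with flock(1), assuming one lock file per destination pool (paths and dataset names hypothetical):

```sh
# Serialize all receives into pool0: a second invocation waits for the
# lock instead of running a concurrent receive into the same pool.
flock /run/lock/zfs-recv-pool0.lock \
    syncoid --no-sync-snap backup@remote.example.com:tank/data pool0/backups/remote/data
```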
These other egress syncoids continue to run, and the syncoid that most likely caused the panic remains in the process table, but the dataset is not increasing in size:
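A hedged way to confirm such a hang from the shell, with <PID> standing in for the stuck receive process:

```sh
# Uninterruptible sleep (state D) plus the kernel wait channel are the
# usual signature of a receive stuck in the kernel.
ps -o pid,stat,wchan:32,args -p <PID>
sudo cat /proc/<PID>/stack

# Machine-readable 'used' for the target dataset; rerun to see if it grows.
zfs get -Hp used pool0/backups/remote/data
```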
Include any warning/errors/backtraces from the system logs