VERIFY3(0 == zap_remove_int(zfsvfs->z_os, zfsvfs->z_unlinkedobj, zp->z_id, tx)) failed (0 == 2) #6812
Comments
I also got that error, on a filesystem with deduplication enabled. A scrub was running, and a rollback on one filesystem was also running, so there was some baseline load on the pool. Otherwise only normal, low disk activity was in progress. Then the removal of one large file took a very long time. It is worth noting that the file I was removing was also in use the last time the kernel panicked, so there may be some sort of corruption affecting that specific file. All other processes finished, but the rm process is stuck and the stack trace is below:
And dmesg:
We've just had an occurrence of this issue (details below). Note that we do not have deduplication enabled.
System information
Describe the problem you're observing
We've just had a crash similar to this, which was preceded by discovering a directory that we could not remove: it reported that the directory wasn't empty, despite ls showing that it was.
Describe how to reproduce the problem
The directory shows as empty and you can't remove it:
[root@server01 child]# rm -rf /mnt/parent/child/baddir/
[root@server01 childl]# mv baddir/ ..
When you attempt to then unmount the filesystem to destroy it:
Include any warning/errors/backtraces from the system logs
Strange thing about this one: after rebooting, the directory suddenly has contents again:
[root@server01 child]# ls baddir/ -la
Prior to this issue we were diagnosing why, after destroying a snapshot, the contents of the directory we had snapshotted had gone missing. Before this crash we believed that was an application issue; we now believe that ZFS got itself into such a state that it didn't know what the actual state of the folder was.
I hit this failed assertion on my system. It's Debian stretch, running 0.7.12-2+deb10u1~bpo9+1.
I tried to reboot the system cleanly, but it was done for: there were lots of hung tasks, so I pretty much had to reset the computer.
I just had the same issue on 3 different hosts. First 2 hosts: Ubuntu 18.04.3 with ZoL 0.7.5-1ubuntu16.6. In all cases we rolled back a dataset and the process got stuck on zfs umount. The error was:
The servers were completely stuck and nothing could be done to unstick them.
Since we more or less have a way to reproduce this, we will try to track it down on a test system. What more can we do to diagnose the issue?
@AceSlash I believe I understand how this issue can occur. The critical thing here is that an online rollback was performed which makes this corner case possible. I've gone ahead and opened PR #9739 with a proposed fix and explanation.
Unfortunately, I wasn't able to reproduce this issue locally on master with some quick manual testing. If you're able to semi-reliably trigger this it would be very helpful if you could test the proposed fix on your test system. In order to rigorously test it, you'll need to do a few things.
@behlendorf Unfortunately we tried to reproduce it on a test system (identical to the prod system) but it didn't trigger; all rollbacks worked normally. The test system has a very low load, so if the problem is load related, it may never trigger there. If we can reproduce it, we'll go ahead and try the patch, no problem.
I don't see how it could be directly caused by using LXD. However, it could be timing related, which is why it's tricky to reproduce. The initial reports of this issue say that the failure occurred when removing a large file and performing an online rollback at the same time. Is this scenario possible in your production environment?
If a rollback has occurred while a file is open and unlinked, then when the file is closed after the rollback it will not exist in the rolled-back version of the unlinked object. Therefore, the call to zap_remove_int() may correctly return ENOENT, and this should be allowed. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#6812 Closes openzfs#9739
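The reasoning in this commit message can be illustrated with a small stand-alone C sketch. It is not the ZFS source: the unlinked set is modelled as a plain array, and every name here (unlinked_set_t, unlinked_add, unlinked_remove, rollback, strict_remove_ok, tolerant_remove_ok, close_after_rollback) is hypothetical. The only behaviour taken from the thread is that the removal can return ENOENT (the "0 == 2" in the failed assertion) once a rollback has discarded the entry, and that the check should accept that result.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-filesystem unlinked set. */
#define UNLINKED_MAX 16

typedef struct unlinked_set {
    uint64_t ids[UNLINKED_MAX];
    size_t count;
} unlinked_set_t;

/* Record an object that was unlinked while still open. */
static int unlinked_add(unlinked_set_t *s, uint64_t id)
{
    if (s->count == UNLINKED_MAX)
        return ENOSPC;
    s->ids[s->count++] = id;
    return 0;
}

/*
 * Assumed return contract, mirroring what the thread describes for
 * zap_remove_int(): 0 on success, ENOENT if the entry is absent.
 */
static int unlinked_remove(unlinked_set_t *s, uint64_t id)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->ids[i] == id) {
            s->ids[i] = s->ids[--s->count]; /* swap in the last entry */
            return 0;
        }
    }
    return ENOENT;
}

/* Model an online rollback: the live set reverts to the snapshot copy. */
static void rollback(unlinked_set_t *live, const unlinked_set_t *snap)
{
    *live = *snap;
}

/* The strict check corresponds to the assertion that failed. */
static bool strict_remove_ok(int error)
{
    return error == 0;
}

/* The tolerant check also accepts ENOENT, as the commit message proposes. */
static bool tolerant_remove_ok(int error)
{
    return error == 0 || error == ENOENT;
}

/* Unlink an open file, roll back, then close the file. */
static int close_after_rollback(void)
{
    unlinked_set_t snap = { .count = 0 }; /* snapshot predates the unlink */
    unlinked_set_t live = snap;

    if (unlinked_add(&live, 42) != 0)     /* file unlinked while open */
        return -1;
    rollback(&live, &snap);               /* entry for 42 is discarded */
    return unlinked_remove(&live, 42);    /* final close: entry is gone */
}
```

In this model close_after_rollback() yields ENOENT, which the strict check rejects and the tolerant check accepts. Other errors, such as EIO, are still rejected by both.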
System information
Describe the problem you're observing
spl_panic() caused the zfs process to hang forever. 'reboot' also hung, so the machine had to be power reset. The same issue also occurred on CentOS 7.4.1708.
Describe how to reproduce the problem
I don't know how to simply reproduce the issue.
Include any warning/errors/backtraces from the system logs