zfs diff fails to report removed files #2081
Comments
So I poked around a tiny bit, and the issue is reproducible (I just didn't notice it earlier).
In the second call to zfs diff, libzfs_diff.c:write_free_diffs(), ZFS_IOC_NEXT_OBJ is returning ESRCH, and so describe_free is not being called. I'm scared of kernel code, so I'm not going to poke any deeper. |
I looked into this and tested under both ZoL (trying various commits from current master back to 0.6.2) and FreeBSD (9-STABLE), and got the same results every time. This looks like it may be a bug in all ZFS implementations. The problem doesn't happen if there's a change to the file system between the creation of the two snapshots. For example, if you add a "touch b" between the snapshots, both diffs properly report the deletion of "a". I haven't had the need to look into the dmu_diff code, but I suspect that's where the problem might lie. I'll look into it further when I get a chance. In the meantime, it would be nice if someone with access to Illumos, Solaris, SmartOS, etc. could try this. I will take a look at Illumos to see whether there are any outstanding issues regarding |
It is the same on SmartOS, tested just now. |
So, I really don't know anything about the internals of ZFS, or the kernel, but I am particularly flummoxed by zfs_ioc_next_obj. I'll mention that it is only used once within ZFS tree, in libzfs_diff.c
Just simply replacing that argument with zero alleviates my bug (but I'm not arrogant enough to test this on a live system until I get some expert feedback). (I obviously recognize that if there is another consumer for this interface, the solution will be more complicated.) |
Confirmed repro by #2081 (comment) on 0.6.3 x86-64. |
Confirmed on OpenIndiana 151a8, and comment added to (previously closed) Illumos issue here: |
So, I've been running aerusso/zfs@0285fbf for the past year. I just rebased it to address @d683ddbb7272a179da3918cc4f922d92a2195ba2; Illumos bug 5314 doesn't look like it has anything to do with this diff issue. Let me summarize my (certainly incomplete) understanding of the issue here. I apologize for what is probably incorrect usage of terminology.
Per the documentation for dnode_next_offset in dnode.c, the "txg" parameter specifies a lower bound on which transaction the dnode can be found in. We are interested in all dnodes that are removed between the first and last transaction in the snapshot. A dnode doesn't need to have been created in that snapshot to correspond to a removed file. In fact, the behavior of zfs diff in the test case exactly matches this: the transaction that created the data that was deleted in snapshot "2" was produced before, in snapshot "1", definitely predating the first transaction in snapshot "2".

If my read of this is correct, it's somewhat of a miracle that any useful information is being extracted from zfs diff: only files modified inside of the transaction they are deleted in will have a transaction post-dating the first transaction of the snapshot, and therefore appear in the diff list.

Grepping the zfs source tree, zfs_ioc_next_obj is only used by write_free_diffs. The change in the patch I mentioned above should therefore not conflict with anything (can someone who knows more please tell me if zfs_ioc_next_obj is consumed by any other interface?).

Furthermore, I think the severity of this bug may be understated: if you are relying on zfs diff to determine whether snapshots are safe to delete, by checking to see if any files are removed in that snapshot, you will miss files. A worst-case scenario could lead to data loss because of snapshot removals. |
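To make the reasoning above concrete, here is a toy model in Python. It is only a sketch of the argument as stated in this comment, not of the real dnode_next_offset logic or on-disk format: `ToyObject`, `freed_between`, and the txg numbers are all hypothetical names and values chosen for illustration. It assumes the search for freed objects skips anything whose last write is not newer than a `min_txg` lower bound.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyObject:
    """Hypothetical stand-in for a file's object; not a real ZFS structure."""
    name: str
    last_write_txg: int               # last transaction group that wrote the object
    freed_txg: Optional[int] = None   # transaction group that removed it, if any


def freed_between(objects: List[ToyObject], from_txg: int, to_txg: int,
                  min_txg: int) -> List[str]:
    """Names of objects freed in (from_txg, to_txg] that survive the lower bound.

    min_txg models the "txg" lower bound described in the comment: objects
    whose last write is not newer than min_txg are never visited.
    """
    reported = []
    for obj in objects:
        if obj.freed_txg is None:
            continue                  # still exists, nothing to report
        if not (from_txg < obj.freed_txg <= to_txg):
            continue                  # freed outside the diffed interval
        if obj.last_write_txg <= min_txg:
            continue                  # filtered out by the lower bound
        reported.append(obj.name)
    return reported


# "a" is written at txg 5, snapshot 1 is taken at txg 10,
# "a" is removed at txg 15, snapshot 2 is taken at txg 20.
objects = [ToyObject("a", last_write_txg=5, freed_txg=15)]

# Lower bound set to the first snapshot's txg: the removal is missed.
with_bound = freed_between(objects, from_txg=10, to_txg=20, min_txg=10)

# Lower bound of zero, as in the proposed change: the removal is reported.
without_bound = freed_between(objects, from_txg=10, to_txg=20, min_txg=0)
```

In this model the file is dropped from the diff purely because its last write predates the first snapshot, which is the situation the test case in this thread sets up.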
@aerusso first off, thanks for digging into this. Quite honestly, no one else I'm aware of has had the time to make it a priority. I suspect As for your analysis, it looks very reasonable to me and this is absolutely something which I think could have gone overlooked. The Since you've been running with this change for a year now, I'm guessing it did resolve the issue for you? And specifically, it does address the known reproducers which have been posted? If so, it would be very helpful if you could open a new pull request with the proposed fix. That way we may be able to get some additional eyes to review the patch, and we can properly attribute the fix to you when it gets merged. When you do so, it would be best if you could include a detailed description in the commit message, just like you posted to this issue. Thanks for looking into this! |
FWIW, I looked into this over a year ago and came to essentially the same conclusion as @aerusso did, but for whatever reason, the issue totally fell off my radar. I also agree that |
I had a plan to look at some datasets using zfs diff, but used up my energy tracking this issue down (and then abandoned the project). This patch resolves the only two test cases I am aware of, those in this thread. |
Per the documentation for dnode_next_offset in dnode.c, the "txg" parameter specifies a lower bound on which transaction the dnode can be found in. We are interested in all dnodes that are removed between the first and last transaction in the snapshot. It doesn't need to be created in that snapshot to correspond to a removed file. In fact, the behavior of zfs diff in the test case exactly matches this: the transaction that created the data that was deleted in snapshot "2" was produced before, in snapshot "1", definitely predating the first transaction in snapshot "2".
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@onlight.com>
Closes openzfs#2081
I think I am running into the same issue; I am running a few containers (OpenVZ) on top of ZFS (simfs, not ploop/zvol). I wrote a (more or less) simple script which creates snapshots only if there have been changes. For that functionality I use zfs diff a lot.

My argument is: why should I keep/create/have/hold a snapshot if it does not contain any valuable information? Even if I do have a snapshot of last week 14:00, I do not need that snapshot if the same data is available in the snapshot of last week 12:00. You need a way to check if there have been any modifications (valuable data, as I call it) between the last snapshot and the current state of the dataset. zfs diff is the answer. And since zfs diff is - apart from running some sort of hash script all the time or using inotifywait - the only useful way to detect such changes/modifications, I wouldn't say that this is a rarely used function. I might be totally wrong, though. Just my two cents.

Back to the problem itself:
Then I restarted the container and tried again:
Then I stopped the container:
Then once again, stopped:

The interesting part is that this only happens for this one container, and the object id/number is always different. I am not sure if this is really an issue or problem at all. The container is a mail filtering gateway, which means there are lots of temporary files. So my question would be: is this message harmful, or can I simply ignore it? |
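For readers unfamiliar with this pattern, a minimal sketch of such a "snapshot only if something changed" check might look like the following. This is not the script described in the comment: `has_changes` and its `run` parameter are hypothetical names, and the sketch assumes `zfs diff -H <snapshot> <dataset>` prints one line per change and nothing when there are no changes.

```python
import subprocess


def has_changes(snapshot: str, dataset: str, run=subprocess.run) -> bool:
    """Return True if `zfs diff` reports any change since `snapshot`.

    `run` is injectable so the function can be exercised without ZFS;
    by default it is subprocess.run.
    """
    result = run(
        ["zfs", "diff", "-H", snapshot, dataset],
        capture_output=True,   # collect stdout instead of printing it
        text=True,             # decode output as str
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    # Any non-blank line in the output counts as a change.
    return any(line.strip() for line in result.stdout.splitlines())
```

Given the bug discussed in this thread, an empty diff from an affected version may still hide file removals, so a check like this should not be the only thing deciding whether a snapshot can be skipped or destroyed.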
@chani there was a recent As for the message, I don't believe it's harmful, but it shouldn't be happening either, so that needs to be understood. |
Just wanted to comment, I'm on 0.6.5.6 at the moment and just ran into this on a pool of mine.
edit: Sorry, I had presumed that 616a57b would have made it into 0.6.5-release between January and now, but it doesn't seem it did. I'll try pulling it in when I update to 0.6.5.8 and see if the issue persists. |
Having applied 616a57b to 0.6.5.8, I still get the same output as in the previous comment for zfs diff and zdb -dddd. |
I am using 0.6.2 from the zfsonlinux apt repositories. This pool has no data errors, and is scrubbed fairly regularly.
The problem is that zfs diff sometimes fails to mention when files are deleted. The following example illustrates (this is real output that I have changed the names in):
Shouldn't this mention that "/tank/fs/oldit/Track01.mp3" specifically has been removed?
Let's check if file creations are affected:
What about this new file? Will it be properly reported if removed?
For the heck of it, what happens if we remove "Track01.mp3" now?
This is strange, right? What happens if we do not create "ab"?
I don't understand what is going on here. Every file I have checked seems to be affected in this way (i.e., Track01.mp3 isn't special here).
Also, this problem is somehow new: I was able to identify file deletions at least a week ago (the snapshots do show the deletions specifically now). EDIT: If the directory is modified in the "fromsnap" the issue does not occur (also see my comment). I don't believe the issue is new, I just don't think I noticed it before. I'll mention that I have at least one snapshot affected by issue #948, though that snapshot is almost a year old.
Thanks