xattr=sa: listxattr EFAULT #1978
@chrisrd Could you run …
@chrisrd Thanks for the detailed debugging, I'd also be very curious about the …
Zdb output below. I had this occur whilst trying to dump:
I then tried to do exactly the same under strace, and it succeeded. And running it again without strace also succeeded. The other dumps each took around 1m30 - 1m45, so the time taken for the failed dump was also an outlier. There were no kernel messages.
cf. non-corrupt:
I hadn't noticed before, but looking at the inode numbers shows they come in groups: I imagine that's because they're allocated in sequential order (are they?) and so the groups are time-correlated. Perhaps this relates to activity on the system when they were being created, e.g. a higher than normal load (the box is generally relatively lightly loaded). If so, this may be pointing to some failure under memory pressure, or a locking problem? All corrupt inodes:
Also worthy of note, I recreated the entire filesystem last night in exactly the same manner as the original, and the new version doesn't have any xattr corruption.
@chrisrd Could you please try "zdb -dddddd" (6 d's) from dweeezil/zfs@803578e on some of the bad ones?
@chrisrd As a follow-on to my request, since it's not even trying to print any SA xattrs on the bad objects, the nvlist is likely corrupted and I'm hoping we can figure out the nature of the corruption by dumping its contents. Unfortunately, this simplistic debugging code assumes we are getting a reasonable length, but it's a start.

One other useful piece of information from your zdb output is that we don't seem to have any spill blocks in use (which I'd not expect to see with the relatively short xattrs used by rsync). It looks like all rsync ought to be doing with fake-super is a whole bunch of lsetxattr() calls. I do find it very interesting that you couldn't reproduce the problem after trying the exact same thing again.

As to the inode numbers, on a clean filesystem with xattr=sa, I'd expect them to be allocated pretty much sequentially. The fact that your bad objects were in clusters is interesting and likely relevant, but it's not much to go on at this point. Can you think of anything else your system may have been doing at the time when you ran the first rsync that created the corrupted filesystem? Also, was that first rsync writing to a brand new clean filesystem?
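For context, a fake-super xattr write of the kind referred to above amounts to a call like the following. This is an illustration only: the path and value are made-up examples, and only the user.rsync.%stat name and value format are taken from this report.

```c
/*
 * Illustration only: roughly what rsync --fake-super does for each
 * transferred file.  The path and value below are placeholders; the
 * value format is "mode major,minor uid:gid" as seen in this report.
 */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    const char *path  = "some/dest/file";           /* placeholder */
    const char *value = "100644 0,0 65534:65534";   /* mode rdev uid:gid */

    if (lsetxattr(path, "user.rsync.%stat", value, strlen(value), 0) == -1) {
        perror("lsetxattr");
        return 1;
    }
    return 0;
}
```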
G'day @dweeezil - Interestingly (scarily?!?), I got an error the first time dumping the first inode (and there was no other output):
The next dump of the same inode worked. For each rsync the filesystem was freshly created just prior to the rsync (it's part of a script so it's consistent). The box only receives these rsyncs, and runs ceph on some other disks (4 x osds, with the storage attached to VMs running on a separate box). The first (corrupted) rsync was running at 8am Sunday - there wasn't much else going on!
@chrisrd That's good information and shows that the xattr SA is being truncated in some manner. The part that is there appears to be totally correct but it's cut off where the actual string array should begin. I'll poke around the xattr SA code a bit to see whether I can identify any place where this might happen. Since you were able to perform the rsync again from scratch with no problems, however, it sounds like we're looking at a race condition or some other odd corner case.

One other interesting thing is that the two corrupted objects you looked at have 0676 perms which is a bit odd. I don't suppose that, by some chance, all the corrupted objects have that mode? There's no reason that a particular mode should cause this problem but it might be a clue that points us in the right direction.
@dweeezil Good pickup. All those files on the source, and only those files, that have mode 0676 or 0670 (…) are the ones with corrupted xattrs on the destination. (Note: the "source" files have been dumped onto an ext4 fs from external sources, using ….)

The plot thickens... I started seeing corruption on the filesystem reported above as having been recreated and with no corruption afterwards. I thought I might have gone nuts somewhere along the line and reported the wrong thing somehow, so I once again recreated that filesystem (zfs create + rsync), along with 2 other filesystems also experiencing the corruption. After these filesystems were filled I checked for corruption (using …) and found none. Then, after a reboot, all of those filesystems are exhibiting corrupt xattrs! I.e. something in memory is not making it to disk, or is only partly making it to disk.

Hmmm, that's ringing some bells... I vaguely recall some other issue where this was happening. @behlendorf - do you remember something like this?

More weirdness... The first-reported filesystem is still exhibiting the same symptoms after the recreation and reboot, i.e. all and only source files with mode 0676 or 0670 have corrupted xattrs on the destination. One of the other filesystems is the same (except there are 9 source files with mode 0676 or 0670 without xattrs, so these files don't have any xattrs on the destination to be corrupt). For the remaining filesystem, it's all and only files with mode 0650 that are corrupt, and files with 0676 or 0670 aren't corrupt!

It doesn't appear to be the length of the xattr. I investigated one of the filesystems (with mode 0676 or 0670 corruptions) and found (counting the entire length of name and value dumped by …):
Where to from here??
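(The commands used for the checks above aren't preserved in this capture. Roughly, the per-file name-plus-value length count amounts to the following sketch; the path is a placeholder and the buffer sizes are arbitrary but ample for the short rsync xattrs involved.)

```c
/* Sketch: sum the xattr name and value lengths for one file. */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    const char *path = "some/dest/file";    /* placeholder */
    char names[4096], value[4096];

    ssize_t n = llistxattr(path, names, sizeof(names));
    if (n < 0) {
        perror("llistxattr");    /* an EFAULT here is the symptom in this issue */
        return 1;
    }

    size_t total = 0;
    for (char *p = names; p < names + n; p += strlen(p) + 1) {
        ssize_t vlen = lgetxattr(path, p, value, sizeof(value));
        if (vlen < 0) {
            perror("lgetxattr");
            return 1;
        }
        total += strlen(p) + (size_t)vlen;
    }
    printf("%s: total xattr name+value length %zu\n", path, total);
    return 0;
}
```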
@chrisrd That's all very good information. After noticing the file mode issue, I did actually try recreating the problem by hand but wasn't able to. I think the next step is to strace rsync to find its exact sequence of system calls involving the creation of the file, the setting of the xattr and any other inode-changing operations such as chmod, etc. I gather the first rsync (with fake-super) is running as a non-root user. Is the second (-XX) rsync running as root?

Your observation about seeing the corruption following a reboot certainly indicates cached-but-not-flushed metadata. You could check the output of …

I'm going to try to reproduce this today.
@dweeezil Yup, first rsync is non-root, second is root. I removed one of the files ("UPS.txt") with the corruption, then straced an rsync of just that file using the same args as for the full set (…):
OK, looks like I can consistently reproduce the problem with that file. I'm not too sure what to look at in the …

...actually, the zdb before the remount shows it's already stuffed before the remount.
@chrisrd Thanks for the detailed information. I've made a trivial reproducer and am looking into it now.
@chrisrd I found the cause of the problem: the xattr SA is clobbered when a mode change occurs that requires additional (or possibly fewer) DACL entries. Unfortunately, the stock zdb program doesn't show them, but my WIP enhanced version does (at least the count). A mode 0644 file looks like this:
but changing the mode to 0606 requires two additional DACL entries:
which is why the bonus size increases from 168 to 184 (presumably the two extra 8-byte ACEs account for the additional 16 bytes). I've not yet figured out why this breaks the xattr, but it's most certainly the cause. EDIT: This is likely the cause of some of the other issues related to a corrupted bonus buffer when SA xattrs are in use.
I've got a fix for this. Pull request to come shortly.
When updating a SA, the sa_lengths array is used to track the sizes of all variable-sized SAs. It had not been properly updated when copying an already-existing variable-sized SA. This didn't come up too often because variable-sized SAs are relatively rare. Fixes openzfs#1978.
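As a rough illustration of the bookkeeping being described (a simplified sketch with invented names, not the actual sa_modify_attrs() code): while walking the existing attributes, the cursor into the old sa_lengths[] array has to advance for every variable-sized entry, including one that is being replaced, or every later variable-sized entry is paired with the wrong stored length.

```c
/* Simplified sketch of the indexing pattern; names are invented. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct attr {
    int    variable;      /* variable-sized attribute?             */
    int    replaced;      /* being replaced with a new value?      */
    size_t new_length;    /* length of the replacement value       */
    size_t fixed_length;  /* length used for fixed-size attributes */
};

static size_t
attr_out_length(const struct attr *a, const uint16_t *old_lengths, size_t *idx)
{
    if (!a->variable)
        return a->fixed_length;
    if (a->replaced) {
        (*idx)++;                      /* the missing increment: consume the old slot */
        return a->new_length;
    }
    return old_lengths[(*idx)++];      /* kept entry reuses its recorded length */
}

int main(void)
{
    /* Three variable-sized entries; the middle one is being replaced. */
    struct attr attrs[] = {
        { 1, 0,  0, 0 },
        { 1, 1, 80, 0 },
        { 1, 0,  0, 0 },
    };
    uint16_t old_lengths[] = { 60, 64, 24 };
    size_t idx = 0;

    /* Prints 60, 80, 24.  Without the increment the last entry would be
     * written with length 64, i.e. the stale slot of the replaced entry. */
    for (size_t i = 0; i < 3; i++)
        printf("entry %zu: length %zu\n", i,
               attr_out_length(&attrs[i], old_lengths, &idx));
    return 0;
}
```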
@dweeezil Fantastic! I can stop trying to find my way through the twisty little passages of the ZFS SA code and on-disk layout. For what it's worth, a simple reproducer I've been using, with your zdb patch dweeezil/zfs@803578e, is:
During the update process in sa_modify_attrs(), the sizes of existing variably-sized SA entries are obtained from sa_lengths[]. The case where a variably-sized SA was being replaced neglected to increment the index into sa_lengths[], so subsequent variable-length SAs would be rewritten with the wrong length. This patch adds the missing increment operation so all variably-sized SA entries are stored with their correct lengths.

Previously, a size-changing update of a variably-sized SA that occurred when there were other variably-sized SAs in the bonus buffer would cause the subsequent SAs to be corrupted. The most common case in which this would occur is when a mode change caused the ZPL_DACL_ACES entry to change size when a ZPL_DXATTR (SA xattr) entry already existed. The following sequence would have caused a failure when xattr=sa was in force and would corrupt the bonus buffer:

open(filename, O_WRONLY | O_CREAT, 0600);
...
lsetxattr(filename, ...);   /* create xattr SA */
chmod(filename, 0650);      /* enlarges the ACL */
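The failing sequence from the commit message, written out as a standalone program. The filename and xattr name/value are placeholders; it would need to be run against a dataset mounted with xattr=sa.

```c
/*
 * Standalone version of the failing sequence: create a file, give it an
 * SA xattr, then make a mode change that enlarges the ACL.  Placeholder
 * filename and xattr; run on a dataset with xattr=sa.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
    const char *path  = "testfile";                  /* placeholder */
    const char *value = "100600 0,0 65534:65534";

    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd == -1) { perror("open"); return 1; }
    close(fd);

    if (lsetxattr(path, "user.rsync.%stat", value, strlen(value), 0) == -1)
        perror("lsetxattr");             /* create xattr SA */

    if (chmod(path, 0650) == -1)
        perror("chmod");                 /* enlarges the ACL */

    return 0;
}
```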
@behlendorf I'm kinda hoping that might be the last of them!
@chrisrd Me too.
During the update process in sa_modify_attrs(), the sizes of existing variably-sized SA entries are obtained from sa_lengths[]. The case where a variably-sized SA was being replaced neglected to increment the index into sa_lengths[], so subsequent variable-length SAs would be rewritten with the wrong length. This patch adds the missing increment operation so all variably-sized SA entries are stored with their correct lengths.

Previously, a size-changing update of a variably-sized SA that occurred when there were other variably-sized SAs in the bonus buffer would cause the subsequent SAs to be corrupted. The most common case in which this would occur is when a mode change caused the ZPL_DACL_ACES entry to change size when a ZPL_DXATTR (SA xattr) entry already existed. The following sequence would have caused a failure when xattr=sa was in force and would corrupt the bonus buffer. This is a long standing issue but it was exposed under Linux and FreeBSD where the use of multiple variable length SAs is common.

open(filename, O_WRONLY | O_CREAT, 0600);
...
lsetxattr(filename, ...);   /* create xattr SA */
chmod(filename, 0650);      /* enlarges the ACL */

openzfs/zfs#1978

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Unfortunately it seems there's still something wrong with xattr=sa :-(
I had given up on using xattrs at all for a while, but with the recent commits to resolve #1890 and #1648, I was optimistic that xattr=sa would be trouble free. However I'm running into #503 again.
On linux-3.10.22 w/ c1ab64d + cherry-picked nedbass/zfs@915aa8e and nedbass/zfs@00edcdc (created before these commits made it into master), I'm getting:
All the xattrs on the filesystem are the result of rsync --fake-super, which means each file and directory has a single xattr which looks like user.rsync.%stat="100666 0,0 65534:65534". The filesystem was created after installing the above ZoL, and with xattr=sa set at create. There are no symlinks in the filesystem, as shown by:
Per #503, I used this stap script:
...plus this patch ('cos I couldn't get stap to dump the info itself):
And kern.log has:
The filesystem was all written in a single rsync session, and has ended up with 713474 files and 109375 directories each with a single xattr per above, and 88 files (no dirs) with corrupt xattrs. I have other filesystems created exactly the same way which have all good xattrs.
86 of the 88 bad files have size=64 per the above debug message, 2 of them have size=56. I'll put the filesystem aside for the moment in case there's anything else useful to be dug out of it.
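One way to locate files like these from userspace (not necessarily how the counts above were produced) is to walk the tree and flag any file whose xattr listing fails; the root path below is a placeholder.

```c
/* Sketch: report every file in a tree whose llistxattr() call fails. */
#define _XOPEN_SOURCE 500
#include <errno.h>
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/xattr.h>

static int check(const char *path, const struct stat *sb,
                 int typeflag, struct FTW *ftwbuf)
{
    char names[4096];   /* ample for the single short xattr per file here */

    (void)sb; (void)typeflag; (void)ftwbuf;
    if (llistxattr(path, names, sizeof(names)) < 0 && errno != ENOTSUP)
        printf("%s: llistxattr: %s\n", path, strerror(errno));
    return 0;           /* keep walking */
}

int main(void)
{
    /* FTW_PHYS: do not follow symlinks while walking. */
    return nftw("/path/to/dataset", check, 16, FTW_PHYS) == -1;
}
```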